Detecting Duplicates over Sliding Windows with RAM-Efficient Detached Counting Bloom Filter Arrays

Jiansheng Wei*1, Hong Jiang#2, Ke Zhou*3, Dan Feng*4, Hua Wang*5

* School of Computer, Huazhong University of Science and Technology, Wuhan, China; Wuhan National Laboratory for Optoelectronics, Wuhan, China
# Dept. of Computer Science and Engineering, University of Nebraska-Lincoln, Lincoln, NE, USA
3 Corresponding author: [email protected]
1 [email protected], [email protected], [email protected]
2 [email protected]

Abstract— Detecting duplicates over sliding windows is an important technique for monitoring and analysing data streams. Since recording the exact information of elements in a sliding window can be RAM-resource-intensive and introduce unacceptable search complexities, several approximate membership representation schemes have been proposed to build in-memory fast indices. However, various challenges in RAM utilization and scalability remain. This paper proposes a Detached Counting Bloom filter Array (DCBA) to flexibly and efficiently detect duplicates over sliding windows. A DCBA consists of an array of detached counting Bloom filters (DCBFs), where each DCBF is essentially a Bloom filter that is associated with a detached timer (counter) array. The DCBA scheme functions as a circular FIFO queue and keeps a filling DCBF for accommodating fresh elements and a decaying DCBF for evicting stale elements. DCBA allows the timer arrays belonging to fully filled DCBFs to be offloaded to disks to greatly improve memory space efficiency. The fully filled DCBFs remain stable until their elements become stale, which allows a DCBA to be efficiently replicated for the purpose of data reliability or information sharing. Further, a DCBA can be cooperatively maintained by clustered nodes, which provides a scalable solution for mining massive data streams. Mathematical analysis and experimental results show that a DCBA (containing 64 DCBFs) requires less than 10% of its components to be kept in RAM while maintaining more than 95% of its query performance, which significantly outperforms existing schemes in memory efficiency and scalability.

I. INTRODUCTION
Duplicate detection is an important technique for monitoring and analysing data streams. For example, an advertiser may want to know how many different users have accessed a specific advertisement by monitoring the click streams, which requires duplicate clicks to be identified and eliminated in a timely manner. In some network applications, such as sensor reading [1], telephone call recording [2], and web content analysis [3], the data streams can be too large for the system to hold the entire element sequences in RAM. As a result, an in-memory sliding window [4] [5] that keeps a fixed number of the most recent elements becomes an ideal model for analysing data streams. In general, a sliding window with a large size can accommodate more elements and thus benefit the analysis results. However, both the memory consumption and the search complexity can increase accordingly as the window

size grows. To this end, a highly efficient, low-cost approximate membership representation scheme is needed to summarize and query elements in a sliding window in real time.
One of the most widely used membership representation schemes is the Bloom filter (BF) [6], a compact data structure that can record or query an element in constant time by hashing. A BF representing a static set S = {e1, e2, …, en} of n elements consists of an array of m bits and a group of k independent hash functions h1, …, hk with the range {1, …, m}. Initially, all the bits of the array are set to 0; then the bits hi(e) (1≤i≤k) are set to 1 for each element e∈S. To determine whether an item x exists in S, one simply checks whether all the bits hi(x) (1≤i≤k) are set to 1. If not, x is definitely not an element of S. Otherwise, x can be assumed to be a member of S with some error probability. This kind of false positive is caused by hash collisions, which occur when all of the bits hi(x) happen to have been set by other elements actually existing in S. The probability that a BF falsely identifies a non-member item as an element of the represented set is called the false positive rate or error rate. Fortunately, no false negatives can occur, and the false positive rate can be controlled through mathematical methods.
A BF can achieve high space efficiency and high query performance while representing a static set. However, it faces challenges when dealing with a dynamic set, such as the elements monitored by a sliding window. When a stale element is removed from a dynamic set, the corresponding bits in the associated BF cannot simply be reset to 0, as they may be shared by other elements, and reset operations can introduce false negatives. To represent the changing membership of a sliding window, several BF-based schemes have been proposed.
Metwally et al. [7] propose to use counting Bloom filters to detect click fraud through a sliding window in pay-per-click advertisement systems. A counting Bloom filter uses counters instead of single bits as representing cells, and each counter records the number of elements that are associated with the corresponding cell. If a fresh element is inserted, the corresponding counters are incremented; conversely, the corresponding counters are decremented if a stale element is deleted. However, this method is effective only when the stale element to be removed is known,

which is impossible when there is insufficient memory to keep the exact element sequence. Shen and Zhang [8] introduce decaying Bloom filters to support stale element eviction without requiring the exact element sequence to be kept in memory. A decaying Bloom filter consists of an array of m counters, where each counter takes d = ⌈log2(n+1)⌉ bits to count from 0 to n, with n being the size of the sliding window and 0 representing an empty cell. Before inserting an element x, all the m cells must be checked to decrease the non-zero counters by 1, and then the k cells corresponding to x are set to the maximum value n. Obviously, x will not be evicted until n more fresh elements have been inserted, which means that x has slid out of the window. The problem with this approach is that the time complexity of inserting an element is O(m+k), which can be unacceptable for many real-time applications. Zhang and Guan [9] solve this problem by constructing a similar timing Bloom filter with each counter containing ⌈log2(n+c+1)⌉ bits, where n is the window size and c is a constant. In particular, each counter is initialized with all '1's instead of all '0's to represent an empty cell. To insert an element, the k corresponding cells are selected and set to the current timestamp t, which is generated by a timer that counts from 0 to n+c−1 in a circular manner. The timer is incremented by 1 once an element has been inserted. Among all the possible n+c values, only n consecutive values, ranging from t−n+1 to t, represent the elements that belong to the current window. Since there are m cells in the timing Bloom filter, only m/c cells need to be checked to reinitialize expired counters once an element is inserted, so that all the cells will have been checked to remove stale elements before the timer cycles back. The key idea of the timing Bloom filter is to utilize the additional counting space c to buy extra time for removing stale timestamps, at the cost of more memory for each counter.
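To make these mechanics concrete, the timing Bloom filter described above can be sketched in a few dozen lines. This is our own illustrative rendering, not the implementation of [9]: the double-hashing construction and the sweep-pointer cleanup schedule are simplifying assumptions of this sketch.

```python
import hashlib

class TimingBloomFilter:
    """Sketch of a timing Bloom filter: m cells store timestamps,
    the window holds n elements, and extra counting space c buys
    time for lazily resetting expired cells."""
    EMPTY = -1

    def __init__(self, m, k, n, c):
        self.m, self.k, self.n, self.c = m, k, n, c
        self.cells = [self.EMPTY] * m   # each cell holds a timestamp
        self.t = 0                      # base timer: 0 .. n+c-1, circular
        self.sweep = 0                  # next cell to examine for cleanup

    def _hashes(self, x):
        # k positions via double hashing derived from one SHA-1 digest
        d = hashlib.sha1(x.encode()).digest()
        h1 = int.from_bytes(d[:8], 'big')
        h2 = int.from_bytes(d[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def _expired(self, stamp):
        # only the n most recent timestamps before the timer are valid
        age = (self.t - 1 - stamp) % (self.n + self.c)
        return stamp == self.EMPTY or age >= self.n

    def insert(self, x):
        for pos in self._hashes(x):
            self.cells[pos] = self.t
        self.t = (self.t + 1) % (self.n + self.c)
        # lazily reset ceil(m/c) cells per insert, so every cell is
        # examined once before the timer cycles back
        for _ in range(-(-self.m // self.c)):
            if self._expired(self.cells[self.sweep]):
                self.cells[self.sweep] = self.EMPTY
            self.sweep = (self.sweep + 1) % self.m

    def contains(self, x):
        return all(not self._expired(self.cells[p])
                   for p in self._hashes(x))
```

With n = 4 and c = 4, an element is reported as a duplicate only while it remains among the last four insertions; afterwards the circular age test classifies it as expired.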
The timing Bloom filter scheme achieves better time efficiency than the other two approaches, but its space overhead can still be a potential bottleneck. For example, consider representing a sliding window with a capacity of 100 million elements, a common situation when tracing network packets: a timing Bloom filter with an error rate threshold of 1/10^4 requires over 6.58GB of RAM, in contrast to a standard BF that consumes only 241MB of memory (see Section II-C for more details). Obviously, allocating 28 bits (i.e., ⌈log2(n+c+1)⌉ with n=10^8 and c=n−1) for each counter can greatly increase the RAM overhead of the timing Bloom filter. Most importantly, none of the above techniques is scalable as data streams increase in size. To address the above challenges, we propose in this paper a Detached Counting Bloom filter Array (DCBA) that aims to more efficiently represent membership for sliding windows over large data streams. A DCBA consists of an array of detached counting Bloom filters (DCBFs), where each DCBF is essentially a Bloom filter that is associated with a detached timer (counter) array. To represent elements in a sliding window, a DCBA logically functions as a circular FIFO queue and maintains a filling DCBF to accommodate fresh elements and a decaying DCBF to evict stale elements. The DCBA scheme provides three salient features to achieve good

flexibility. First, timer arrays belonging to fully filled DCBFs can be optionally offloaded to disks to greatly improve RAM space efficiency. Second, a DCBA can be easily replicated to improve data reliability or to share information, since all the fully filled DCBFs are set to the query-only mode until their accommodated elements become stale. Third, a DCBA can be cooperatively maintained by clustered nodes, which provides scalability for analysing massive data streams. Mathematical analysis reveals that DCBA can offload over 90% of its building blocks to disks, which significantly improves its RAM space efficiency and enables it to outperform existing schemes. Experimental results based on real-world datasets show that DCBA can also achieve high query performance comparable to the existing timing Bloom filter scheme [9] while maintaining a predefined query accuracy.
The rest of this paper is organized as follows. In the next section, we provide the necessary background information to further motivate our research. The DCBA design is detailed in Section III. Section IV presents mathematical analysis of the DCBA scheme and experimental results from our DCBA prototype. Finally, we review related work in Section V and conclude the paper in Section VI.

II. MOTIVATION AND BACKGROUND
In this section, we first briefly discuss duplicate detection in data streams to further motivate our research. We then present the necessary background information about decaying window models and the mathematical principles behind Bloom filters to prepare for our DCBA design.

A. Duplicate Detection in Data Streams
Duplicate detection has been deployed in many network applications. In content distribution networks (CDNs) [10], if a requested data object can be quickly identified as a duplicate in the local cache or a nearby shared cache, the content can be delivered to the user end immediately and network bandwidth can be saved.
Another important application that has been well studied is fraud detection in click streams. Specifically, a publisher gets paid by an advertiser if a user clicks through the published link to the target advertisement. The publisher can prevent hit shaving [11], which can be performed by the advertiser, by inviting a commissioner as a third party to log the click-through behaviour from the customer to the advertiser's web site. However, it becomes a challenge for the commissioner to detect hit inflation [12], which can be performed by the publisher with automated scripts to increase the number of clicks and thus earn more revenue. A proposed solution is to monitor the click stream through a decaying window model [7] and filter out duplicate clicks generated from the same client script. Even so, buffering a circular FIFO queue that contains the exact click sequence can consume too much memory, and the performance can be bottlenecked by the time complexity of sequential search. Recently, Zhang and Guan [9] propose timing Bloom filters to summarize the click sequence in a sliding window. The information of each click is encoded into an array of counters

through a group of hash functions. Thus querying a click requires only constant time. Each counter records the timestamps of associated clicks to enable stale element eviction, and the counting range is much larger than the window size to gain extra time for searching and resetting expired counters. However, a timing Bloom filter can become a memory bottleneck with extremely large data streams. For example, recent studies estimated that about 294 billion emails were sent per day worldwide in 2010 [13] and that around 89.1% of them were spam [14]. A large email service provider may have to send and receive hundreds of millions of emails per day, and many emails can be replicated and forwarded multiple times within a few hours or days. Considering that all the emails must be scanned for the purposes of anti-spam, anti-virus or homeland security, it is important for the email server to quickly identify duplicates and analyse only unique emails. The limitations of using a timing Bloom filter to represent such large data streams are threefold. First, a timing Bloom filter can consume too much memory to maintain the counters (as illustrated in Section I). Second, it is inefficient to periodically synchronize a timing Bloom filter with an on-disk copy to prevent accidental data loss, since the data structure can be updated randomly by the hash functions. Third, a timing Bloom filter can only be maintained by a single server, and thus lacks the scalability to extend the monitoring range of a data stream. The demands of monitoring massive data streams and the limitations of existing approaches motivate us to develop a more flexible membership representation scheme.

B. Decaying Window Models
Since real-world data streams can be updated frequently and grow to very large sizes, it is impractical to trace and analyse all the elements in such data streams.
Alternatively, existing approaches usually employ a decaying window model to evict expired items and record the most recent elements. In what follows, we briefly review the most widely used models.
1) Landmark window model: The landmark window model [7] breaks a data stream into disjoint segments corresponding to equal time intervals (e.g., an hour) or with equal size N. Only one segment needs to be kept or summarized in memory for data analysis at a time. When a new segment begins, the data container is reinitialized, discarding all expired elements at once. However, this model lacks the ability to mine element relationships across different windows.
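As a minimal illustration (ours, not the paper's), duplicate detection under the landmark model reduces to resetting the summary container at every segment boundary; the exact set used here could equally be replaced by a Bloom filter:

```python
class LandmarkDeduplicator:
    """Landmark window model sketch: the data stream is cut into
    disjoint segments of N elements, and the summary container is
    reinitialized whenever a new segment begins."""

    def __init__(self, segment_size):
        self.N = segment_size
        self.count = 0
        self.seen = set()   # exact summary; a BF could be used instead

    def is_duplicate(self, x):
        if self.count % self.N == 0:   # a new segment begins
            self.seen.clear()          # discard all expired elements at once
        self.count += 1
        if x in self.seen:
            return True
        self.seen.add(x)
        return False
```

Note that in the stream a, a, b, a with N = 3, the fourth element duplicates the first but is not detected, since the two fall into different segments — precisely the cross-window limitation mentioned above.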

2) Jumping window model: The jumping window model [15] divides a data stream into smaller disjoint sub-windows, where each sub-window corresponds to a fixed time interval or contains a fixed number of elements. A jumping window covers a certain number of sub-windows and slides in jumps as the data flows. Once all of the sub-windows are filled, the eldest sub-window is removed and a new sub-window is started to accommodate fresh elements. Since the number of monitored elements oscillates as the window jumps, it is difficult for the analysis results to appropriately reflect the actual state of a data stream.
3) Sliding window model: In this model, a sliding window [4] only maintains the most recent N elements, removing an expired element immediately once a fresh element arrives. Obviously, the sliding window model is an ideal scheme for analysing the real-time state of a data stream, and it is natural to keep the latest N elements in a circular FIFO queue. However, if N is too large, the search complexity can become unacceptable for most online analysis applications, such as duplicate detection and frequency estimation. We focus on this ideal model in this paper and aim to design a membership representation scheme that supports fast search, insertion and deletion of time-ordered elements with low RAM consumption.

C. The Principles of Bloom Filters
Consider a Bloom filter (BF) with an array of m bits and a group of k hash functions that represents a static set of n elements. A false positive occurs if all the k bits corresponding to a queried item x are accidentally set to 1 by some of the n elements when x is not in the set, and the false positive probability of the BF can be derived as:

fBF = (1 − (1 − 1/m)^(kn))^k ≈ (1 − e^(−kn/m))^k.

It has been proven that fBF can be minimized to (1/2)^(ln2·(m/n)) when k = ln2·(m/n) [16]. Further, Kirsch et al. [17] show that the computational efficiency of the k hash functions can be improved by using the technique of double hashing. To construct a BF for a static set with known cardinality n, if fBF must be restricted to a threshold ε with minimal space overhead, the optimal number of hash functions is

kopt = ⌈log2(1/ε)⌉,    (1)

and the required minimal space (in bits) is

mmin = ⌈log2(e)·kopt·n⌉.    (2)

Clearly, fBF will not reach ε until all the n elements are inserted; thus n is also called the designed capacity of a BF, and ε is also referred to as the target error rate, denoted as FBF.

III. THE DCBA SCHEME
This section presents the design and implementation of our Detached Counting Bloom filter Array (DCBA) scheme, which efficiently represents membership for large data streams over sliding windows. DCBA can be flexibly deployed in either a single node or a distributed environment to detect duplicates in massive data streams.

A. Overview of DCBA
A DCBA consists of an array of detached counting Bloom filters (DCBFs) that are homogeneous and share the same hash functions. Specifically, a DCBF is a Bloom filter (BF) with each of its bits being associated with a counter [18], which differs from the scheme of Metwally et al. [7] that replaces each bit in a BF with a counter. In our design, each counter functions as a timer to record the timestamp of the represented element. All the timers of a DCBF are further grouped into a timer array (TA) to improve access efficiency.
Fig. 1 shows the process flow of using DCBA to represent a sliding window in a single node. A window with size N slides over a data stream, and all the monitored elements are represented by an array of g DCBFs, each having a capacity of N/(g−1) elements. Specifically, when N elements have been inserted into the DCBA, the first DCBF, which contains the oldest N/(g−1) elements, will begin to decay, the (g−1)th DCBF contains the youngest elements, and the gth (last) DCBF continues to record incoming elements. In general, a fresh item will always be inserted into the last (filling) DCBF, and the first (decaying) DCBF always keeps the oldest elements that are about to be removed. If the filling DCBF becomes full, it will be retained for query only until its elements become stale, and the corresponding timer array (TA), which can consume a great amount of memory space, may optionally be offloaded to hard disks or flash storage to save RAM resources. On the other hand, if the decaying DCBF becomes empty, it will logically rotate to the last position and function as a new filling DCBF to accommodate fresh elements. When a DCBF begins to decay as the window slides, its corresponding TA will be loaded again to help determine whether a queried element has expired. Considering that g DCBFs are used to represent the sliding window, at most g−2 out of the g TAs can be offloaded to disks, and the memory space efficiency can be greatly improved.

Fig. 1 Using DCBA to represent a sliding window in a single node

Fig. 2 shows the process flow of using DCBA to represent a large sliding window in a decentralized (clustered) environment. While monitoring high-speed streams, all the TAs in a DCBA may have to be kept in RAM to avoid I/O overhead, and the limited RAM resources of a single node can still become a bottleneck for representing a large sliding window. To solve this problem, we propose to decentralize a DCBA by using clustered nodes that cooperate with each other. As shown in Fig. 2, a DCBA is split and maintained by r nodes, where each node holds a group of g DCBFs. All the r×g DCBFs logically function as a circular FIFO queue, and there is a filling DCBF and a decaying DCBF to accommodate fresh elements and evict stale elements, respectively.

Fig. 2 Using DCBA to represent a sliding window in a decentralized (clustered) system

B. The DCBA Structure
As described in Section III-A, g−2 out of g TAs in a DCBA can be optionally offloaded to disks to save RAM resources. Fig. 3 shows an array of g BFs and 2 separate TAs that are associated with the decaying BF and the filling BF, respectively. In particular, bits belonging to different BFs but with the same offset are mapped and stored in the same bit vector so that they can be read or written simultaneously in a single memory access. In general, it is recommended that the size g of a bit vector be a power of 2 and preferably an integer multiple of 32 or 64, so that a group of bits can be accessed in parallel by judiciously leveraging the memory bandwidth and the CPU cache mechanism. If g exceeds the bit width of a CPU register, the bit vector can be vectorized to utilize the vector processing units [19] of modern CPUs to achieve high computational efficiency.

Fig. 3 The in-memory structure of a DCBA

To determine whether an element x has already appeared in the sliding window, the g BFs are queried in parallel. Since k hash functions are used and shared among the DCBFs, the memory access complexity of querying an item is O(k) rather than O(k×g). Further, if a positive is produced by the decaying BF, the associated decaying TA will be queried to check whether the item has already expired. The answer will be (possibly) 'yes' if any of the g DCBFs finally generates a valid positive result; otherwise, x is definitely a fresh element. Clearly, the DCBA scheme can generate false positives for the elements being queried, as any of its constituent DCBFs can produce erroneous results with a certain probability. Let FDCBF denote the target error rate of each DCBF; the overall target error rate of the DCBA can be derived as FDCBA = 1 − (1 − FDCBF)^g. Conversely, if the total false positive rate must be constrained to a predefined threshold εDCBA, it can be inferred that the error rate threshold of each constituent DCBF should be εDCBF = 1 − (1 − εDCBA)^(1/g). Thus, we can further determine the number of required hash functions k and the number of representing cells m for each DCBF by using Formula (1) and Formula (2) in Section II-C. The bit width d of each timer is fundamentally determined by the capacity of the DCBA. Suppose that the capacity of a DCBA is N; each DCBF will be designed to hold N/(g−1) elements, and each timer will contain d = ⌈log2(N/(g−1))⌉ bits to count from 0 to N/(g−1)−1, where 0 denotes the oldest timestamp. For example, if a DCBF is designed to accommodate 1M (2^20) elements, then each timer will consume 20 bits to count from 0 to 2^20−1.
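The derivation above, together with Formulas (1) and (2) from Section II-C, can be folded into a small sizing helper. This is a sketch under the paper's formulas; the function and variable names are ours:

```python
import math

def size_dcba(N, g, eps_dcba):
    """Derive per-DCBF parameters for a DCBA with capacity N,
    g DCBFs, and overall target error rate eps_dcba."""
    cap = N // (g - 1)                          # designed capacity of each DCBF
    eps_dcbf = 1 - (1 - eps_dcba) ** (1 / g)    # per-DCBF error threshold
    k = math.ceil(math.log2(1 / eps_dcbf))      # Formula (1): hash functions
    m = math.ceil(math.log2(math.e) * k * cap)  # Formula (2): bits per BF
    d = math.ceil(math.log2(cap))               # timer width: counts 0 .. cap-1
    return {"cap": cap, "eps_dcbf": eps_dcbf, "k": k, "m": m, "d": d}
```

For instance, a DCBA with g = 64 DCBFs, a capacity of 63·2^20 elements, and εDCBA = 10^−4 yields a per-DCBF capacity of 2^20 elements and 20-bit timers, matching the 1M example above.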

C. Detecting Duplicates over Sliding Windows
While representing a sliding window of size N, the DCBA scheme maintains a base timer that counts from 0 to N/(g−1)−1 in a circular manner to generate timestamps for the monitored elements. To insert an element x, k bits in the filling BF are chosen and set to 1 according to the hash functions hi(x) (1≤i≤k), and the k associated timers in the filling TA are set to the value of the base timer. Then, the base timer is incremented by 1. On the other hand, an element in the decaying DCBF is considered expired once its timestamp becomes smaller than the base timer. As a result, there are theoretically at most N/(g−1) valid elements in the decaying DCBF and the filling DCBF combined, and the decaying DCBF will become empty at the same time the filling DCBF becomes full. Since a representing bit as well as its associated timer can be shared by multiple elements in a DCBF, we determine the timestamp of an element according to a count-min policy. Specifically, the minimal value t among all the k timers corresponding to an element x is considered its timestamp. The probability that all the k timers corresponding to x are accidentally shared and set by other elements with larger timestamps than x is very small and can be constrained by restricting the target error rate of each DCBF.

Algorithm 1 Insert(x) in a DCBA for Duplicate Detection
Input: fresh element x, which cannot be null
Note: the results of hashi(x) can be cached and shared
1: for (i = 1; i ≤ k; i++) do
2:     set the bit at position hashi(x) of the filling BF to 1
3:     set the timer at position hashi(x) of the filling TA to the base timer
4: end for
5: increment the base timer by 1 (modulo N/(g−1))
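The insertion and query procedures, including the count-min timestamp policy, can be sketched as follows. This is our illustrative Python rendering, which collapses the array to a single filling/decaying pair, so the intermediate query-only DCBFs, the transposed bit-vector layout, and TA offloading are all omitted; the class and method names are ours:

```python
import hashlib

class DCBF:
    """One detached counting Bloom filter: a bit array plus a
    detached timer array (TA) recording per-bit timestamps."""
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = [0] * m
        self.ta = [0] * m

    def positions(self, x):
        # k positions via double hashing; shared by all DCBFs
        d = hashlib.sha1(x.encode()).digest()
        h1 = int.from_bytes(d[:8], 'big')
        h2 = int.from_bytes(d[8:16], 'big') | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

class DCBAPair:
    """Filling/decaying DCBF pair with a base timer counting
    0 .. cap-1 circularly, where cap = N/(g-1)."""
    def __init__(self, m, k, cap):
        self.cap = cap
        self.base = 0
        self.filling = DCBF(m, k)
        self.decaying = DCBF(m, k)

    def insert(self, x):                    # Algorithm 1
        for p in self.filling.positions(x):
            self.filling.bits[p] = 1
            self.filling.ta[p] = self.base
        self.base += 1
        if self.base == self.cap:           # filling full, decaying empty: rotate
            self.base = 0
            self.decaying = self.filling
            self.filling = DCBF(self.decaying.m, self.decaying.k)

    def seen(self, x):
        pos = self.filling.positions(x)
        if all(self.filling.bits[p] for p in pos):
            return True
        if all(self.decaying.bits[p] for p in pos):
            # count-min policy: the minimum of the k timers is the
            # timestamp; a decaying element expires once its
            # timestamp falls below the base timer
            if min(self.decaying.ta[p] for p in pos) >= self.base:
                return True
        return False
```

With cap = 4, inserting a fifth element advances the base timer past the oldest timestamp, so the oldest element in the decaying DCBF is reported as expired while the younger ones remain visible.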
