
IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 11, NO. 3, MARCH 2016

Detection of Superpoints Using a Vector Bloom Filter

Weijiang Liu, Wenyu Qu, Jian Gong, and Keqiu Li, Senior Member, IEEE

Abstract— Internet attacks, such as distributed denial-of-service attacks and worm attacks, are increasing in severity and frequency. Identifying and mitigating real-time attacks is an important and challenging task for network administrators. An infected host can make a large number of connections to distinct destinations during a short time. Such a host is called a superpoint. Detecting superpoints can be utilized for traffic engineering and anomaly detection. This paper proposes a novel data streaming method for detecting superpoints and proves guarantees on its accuracy with low memory requirements. The superior performance of this method comes from a new data structure, called vector bloom filter (VBF), which is a variant of the standard Bloom filter (BF). The VBF consists of six hash functions, four of which each take some consecutive bits from the input string as the hash value. The information of superpoints is obtained by using the overlapping of hash bit strings of the VBF. Theoretical analysis and experimental results show that the proposed method detects superpoints precisely and efficiently in comparison with other existing approaches.

Index Terms— IP flow, superpoint, bloom filter, cardinality estimation, random variable.

I. INTRODUCTION

Measurement of flow-level statistics and traffic classification are essential for network accounting and billing, service provision, or security [1]–[9]. Some networks in the Internet are plagued by various malicious activities such

Manuscript received June 10, 2015; revised September 18, 2015 and November 5, 2015; accepted November 10, 2015. Date of publication November 24, 2015; date of current version December 24, 2015. This work was supported in part by the National Science Foundation for Distinguished Young Scholars of China under Grant 61225010, in part by the State Key Program of National Natural Science of China under Grant 61432002, in part by the National Natural Science Foundation of China under Grant 61272417, Grant 61370187, Grant 61370198, and Grant 61300199, in part by the Scientific Research Fund of Liaoning Provincial Education Department under Grant L2013195 and Grant 3632010220, in part by the Fundamental Research Funds for the Central Universities Grant 3132014325, in part by the Jiangsu Provincial Key Laboratory of Network and Information Security, and in part by the Key Laboratory of Computer Network Technology in Jiangsu. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Wanlei Zhou. W. Liu is with the School of Information Technology, Dalian Maritime University, Dalian 116026, China, and also with the Key Laboratory of Computer Network and Information Integration, Ministry of Education, Southeast University, Nanjing 210096, China (e-mail: [email protected]). W. Qu is with the School of Software, Tianjin University, Tianjin 300072, China (e-mail: [email protected]). J. Gong is with the School of Computer Science and Engineering, Southeast University, Nanjing 210096, China (e-mail: [email protected]). K. Li is with the School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China, and also with the School of Computer Science and Technology, Tianjin University, Tianjin 300072, China (e-mail: [email protected]). Digital Object Identifier 10.1109/TIFS.2015.2503269

as spam, DoS attacks, and so on [10]. Various types of worms have been a persistent security threat to the Internet [11]. Today, Internet worms including the Storm worm, the Code Red worm, and the Morris worm may be used to perform DDoS attacks or exhaust system resources [12]. An infected host often sends packets to a great number of distinct destinations in a short time for worm propagation. It has been observed that a host infected by the Slammer worm sends up to 26,000 scans a second [13]. Such a host is called a superpoint. A superpoint is a source (destination) that connects to at least a given threshold $m^*$ of distinct destinations (sources) within a measurement period. Detecting superpoints is useful for many network operations such as traffic classification and intrusion detection, as superpoints are often an indication of various security problems, e.g., port scan attacks, DDoS attacks, or worm propagation.

Superpoints can be divided into super sources and super destinations. Finding super destinations is analogous to finding super sources. In this paper, the proposed algorithm is described only for identifying super sources. The number of distinct destinations contacted by a host is called the cardinality of the host. Therefore, a superpoint (source) is a host that has a high cardinality within a short time.

A naive method for finding superpoints is to track all distinct destinations that each source contacts, using a hash table. For example, Flowscan, which maintains per-flow state, requires large quantities of DRAM for operation [14], so it is infeasible for monitoring high-speed links. Recently, substantial attention has been paid to the problem of detecting super sources and destinations. Previous work based on flow sampling provides a way of detecting superpoints [15], [16]. However, to keep up with the link speed in high-speed networks, a low sampling rate must be adopted, which leads to poor accuracy.
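As a point of reference, the naive hash-table method mentioned above can be sketched in a few lines of Python. This is only an illustrative baseline, not part of the proposed method; the function name `naive_superpoints` and the toy packet list are assumptions made for the example. It keeps one set of destinations per source, which is exactly the per-flow state that becomes too expensive on high-speed links.

```python
from collections import defaultdict


def naive_superpoints(packets, threshold):
    """Return {source: cardinality} for sources that contact at least
    `threshold` distinct destinations (exact, but memory-hungry)."""
    destinations = defaultdict(set)  # per-source state: one set per source
    for src, dst in packets:
        destinations[src].add(dst)
    return {
        src: len(dsts)
        for src, dsts in destinations.items()
        if len(dsts) >= threshold
    }


# Toy trace (hypothetical addresses): one scanner and one heavy but benign host.
packets = [("10.0.0.1", f"192.168.0.{i}") for i in range(50)]
packets += [("10.0.0.2", "192.168.0.1")] * 100  # many packets, one destination
result = naive_superpoints(packets, threshold=20)
```

Note that the second host sends more packets but has cardinality 1, so it is not reported; cardinality counts distinct destinations, not packets.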
Data streaming methods [17], [18] can also be used to detect superpoints. However, a module must then be operated to store and process the IP addresses of the superpoint candidates, and storing and looking up these addresses efficiently is a hard problem. To solve this problem, the reversible sketches in [19] are designed to locate hosts with large connection degree. The reversible sketches are built on number theory: the hash functions are designed based on the Chinese Remainder Theorem. However, since an IP address is viewed as a 32-bit integer, evaluating a hash function involves arithmetic operators including multiplication and modulus of integers. In particular, some large integers may be generated in the process of reconstructing IP addresses; thus the computation cost is high.

1556-6013 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

LIU et al.: DETECTION OF SUPERPOINTS USING A VBF

In this paper, we use the simplest way to solve this problem. Specifically, viewing an IP address as a bit string, we use the simplest hash function, Bit-Extraction, to implicitly store partial information about the source IPs. Bit-Extraction simply extracts some consecutive bits from the source string as its function value. This hash function requires an extremely short time to evaluate. Moreover, the use of Bit-Extraction makes the reconstruction of a source IP very simple: concatenating a few substrings reconstructs the IP address.

Our work aims to provide a new method for detecting superpoints that is more efficient than the state-of-the-art work. The proposed method does not explicitly maintain any host identifier (ID), but can reconstruct the IDs of superpoints and estimate their cardinalities. We design a new data structure, the Vector Bloom Filter (VBF), for detecting superpoints. The reversible hash set of the VBF consists of four Bit-Extraction functions, each of which simply extracts some consecutive bits from the input string as its hash value. The VBF has two good computational properties: (1) Using Bit-Extraction to hash each incoming packet makes recording data extremely fast, because Bit-Extraction is the simplest hash function. (2) Reconstructing a source IP is extremely simple. Four Bit-Extractions map a source IP into four substrings and, conversely, the four substrings can be concatenated to form the source IP very easily.

The VBF consists of five $n \times m$ two-dimensional bit arrays and six hash functions, where each bit array can be viewed as $n$ bitmaps of size $m$. We develop a new method that constructs host information, identifies superpoints, and estimates the cardinalities of the superpoints. When a packet arrives, 5 bits selected in the bit arrays are set to one.
The row indexes of these 5 bits are obtained by hashing the source IP of the packet with the first five hash functions; the column index is the same for all 5 bits and is the value to which the sixth hash function maps the flow label. Four of the first five hash functions are used to reconstruct host addresses (the identifiers of superpoints); the other one is used to filter the superpoint candidates.

The main contributions of this paper are summarized as follows:

• A new data structure, called Vector Bloom Filter (VBF), is designed to detect superpoints. A VBF is a variant of the standard Bloom Filter (BF). The VBF is the key to reconstructing host addresses (superpoint IDs) and detecting superpoints, and its hash functions have low computational complexity.

• An algorithm for reconstructing the information of superpoints is presented. Superpoint IDs are reconstructed by using the overlapping of hash bit strings of the VBF. This avoids storing superpoint IDs explicitly and looking them up, which leads to a large reduction in the number of memory accesses.

• Theoretical analysis shows that only very few phantoms can be identified as superpoints, and that both the estimation bias and the relative mean error of a superpoint's cardinality are very low. The experimental results based on real traces show that the proposed method performs better than other existing approaches.

The basic idea of VBF was presented in the previous conference version [20]. This paper gives more theoretical analysis of the number of phantoms, estimation bias and relative


mean error. In addition, new experimental results and an analysis section are added.

The rest of this paper is organized as follows. Section II presents a brief review of related work with a discussion on the context of our work. In Section III, we propose the data streaming method and describe how to construct a VBF and identify superpoints. Section IV presents the analysis of the performance of VBF. In Section V, we evaluate its performance by using some public traces in experiments. The paper is concluded in Section VI.

II. EXISTING WORK ON DETECTING SUPERPOINTS

Much work has been done on detecting superpoints in recent years. In general, three types of approaches have been proposed in the literature.

A. Flow Sampling

Venkataraman et al. [21] presented two filtering algorithms based on flow sampling for detecting superspreaders. Two distinct adaptive methods based on flow sampling were proposed in [15] and [16]. Cao et al. [22] designed an online sampling approach for identifying high-cardinality hosts, consisting of two-phase filtering, a thresholded bitmap, and bias correction. Zhao et al. [23] provided two algorithms for the detection of superpoints, i.e., the simple scheme and the advanced scheme. The simple scheme is based on flow sampling. It improves the traditional hash-based flow sampling techniques so that it can achieve a higher sampling rate. Li et al. [24] proposed an efficient spreader classification (ESC) scheme, which combines sampling, bitmaps, and maximum likelihood estimation.

Flow sampling can significantly reduce the amount of data that is stored and processed, so it has become an efficient and scalable tool for collecting network traffic. However, measurement accuracy is limited by the low sampling rate in high-speed networks. Since our method does not employ flow sampling, the VBF overcomes this limitation.

B. Bitmap Method

Counting the number of distinct flows is well studied and is equivalent to counting the number of distinct database records.
Thus, the bitmap method [25] from the database context has been applied in the networking context. Estan et al. [26] developed a family of extended bitmap algorithms for counting active flows on high-speed links. The direct bitmap [26] uses a hash function on the flow ID to map each flow to a bit of the bitmap. Yoon et al. [17] designed a new method for data storage, called virtual vectors, to estimate the spread (cardinality) of sources. Each source is assigned a virtual vector in which a bit is set for each destination that the source contacts. The advanced scheme in [23] combines data streaming and flow sampling. There, flow sampling is used to obtain candidate superpoint addresses, and a two-dimensional bit array is used to record the number of flows. In practice, the two-dimensional array is equivalent to multiple bitmaps. Yoon and Chen [27] designed the random aging streaming filter, a two-dimensional bit array, for detecting


stealthy spreaders. Wang et al. [18] designed the Virtual Connection Degree Sketch (VCDS), consisting of H virtual vectors, for measuring host connection degrees. To reduce the noise, bit-ANDing the H virtual vectors generates a filtered bitmap. Hence, VCDS can achieve a much more accurate estimate because it uses a few virtual vectors. Cheng and Tang [28] proposed a superspreader identification system, which uses a bitmap to identify new flows and a counting Bloom filter to record the flow number of each source. Several error compensation mechanisms are designed to dynamically compensate the system while its state changes. Xiao et al. [29] presented multi-virtual bitmaps, which are used to detect long-term stealthy attackers. Wellem et al. [30] designed a system to detect superspreaders. In the system, a bitmap is used to filter out sources that are superspreaders; a Bloom filter is adopted to identify a new flow; a hash table records superspreaders and their fan-out.

However, in the above-mentioned methods the actual IDs need to be stored in extra space and may be looked up when necessary. The time consumed in storing and looking up the candidates is still a serious issue. Since our VBF does not store the IDs explicitly, it avoids this cost.

C. Reversible Sketches

Schweller et al. [31] designed reversible sketches for monitoring and analyzing network traffic. However, these sketches cannot be directly used to detect superpoints. Guan et al. [19] proposed a method based on a reversible connection degree sketch for locating the hosts with large connection degrees. This method was extended in [32] to detect the hosts with significant changes in connection degrees. Furthermore, Wang et al. designed the reversible virtual connection degree sketch data structure to estimate node connection degrees [33]. Subsequently, Qin et al. [34] employed the reversible degree sketch to detect abnormal hosts for traffic management and network monitoring.
All of the above-mentioned methods use the Chinese Remainder Theorem to reconstruct host addresses without preserving any host address information in the sketch. However, the time consumption of hash calculation is high because some large integers may be generated during the computation. Liu et al. [35] introduced a new mergeable and reversible sketch based on noisy group testing for identifying high-cardinality hosts. The sketch can recover high-cardinality hosts, but it requires a considerable number of bit update operations for each incoming packet. Liu et al. [36] proposed two schemes for detecting superpoints. Both schemes use a standard BF to filter out duplicate contacts, which increases the number of memory accesses. Moreover, since the effect of hash collisions is neglected, estimation accuracy degrades significantly as the number of flows increases.

In this paper, we use the simplest hash function, Bit-Extraction, to implicitly store partial information about the source IPs. Bit-Extraction requires only an extremely short time to evaluate. In particular, the reconstruction of source IPs becomes very simple because of the use of Bit-Extraction.

TABLE I: THE HASH FUNCTIONS OF THE VBF

III. DESIGN OF DATA STREAMING METHOD

A. Vector Bloom Filter

A VBF is a variant of the Bloom filter that consists of $(K + L + 1)$ hash functions. The first $K$ hash functions, which take some consecutive bits from the input string as function values, are used to reconstruct host addresses. They form the reversible hash set of the VBF. The middle $L$ hash functions, which hash source IPs uniformly into the value range, are used to filter out some host candidates. The last function is used to generate the column index of the bit for storing an arriving flow. The word "vector" means that each entry in the VBF is not a single bit but a bit vector. Instead of having one array of size $n$ shared by the $(K + L)$ hash functions, each hash function has a range of $n$ consecutive vector locations disjoint from all others. In this work, $K$ and $L$ are set to 4 and 1, respectively. The meaning of these hash functions is shown in Table I, where SIP denotes the source IP of the arriving packet. The functions $h_i$ $(1 \le i \le 5)$: $\{0, 1, \cdots, 2^{32} - 1\} \to \{0, 1, \cdots, 2^{12} - 1\}$ take SIPs as input. The function $f$: $\{0, 1, \cdots, 2^{64} - 1\} \to \{0, 1, \cdots, m - 1\}$ maps flow labels into bitmaps of size $m$. Since the functions $h_1$, $h_2$, $h_3$, and $h_4$ only select some consecutive bits as their output, they can be evaluated quickly.

Ideally, $h_5$ and $f$ have the following desired properties: (i) Speed: the hash function must be computed quickly because it has to be applied to each arriving packet. (ii) Uniformity: the output must be sufficiently uniform on the hash range, i.e., the function behaves like a truly random function. These are also the first two properties of hash functions used for packet selection [37]. In practice, these two properties are in opposition. In general, a more complex hash function can generate more uniform outputs, but it also needs more computation time. The specific design of a hash function represents a trade-off between complexity and ease of implementation.
Indeed, there exist some functions with these two properties. For example, Pagh and Pagh [38] construct a family of hash functions that behave like a truly random function with high probability and can be evaluated in constant time. In addition, extensive experiments show that simple hash functions perform very well in terms of uniformity. Chung et al. [39] explain that this phenomenon arises from a combination of the randomness of the hash function and the randomness of the data. The Bob hash function, which does mostly XOR and Shift operations, shows good


TABLE II: SOME FREQUENTLY-USED NOTATIONS IN THIS PAPER

Fig. 1. Update procedure of a VBF.

performance in both speed and uniformity [40]. Cheng et al. [41] further prove that XOR operations in hash functions help to improve uniformity. Therefore, a simple implementation consisting mostly of XOR and Shift operations may be a good choice. In our experiments, we design $h_5$ and $f$ in this way.
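To make the idea of an XOR-and-Shift hash concrete, here is a minimal Python sketch. It is not the authors' $h_5$ or $f$ (the paper does not give their definitions); `xorshift_mix`, `xor_fold`, the shift constants 13/7/17, and the restriction of $m$ to a power of two are all assumptions chosen for illustration.

```python
MASK64 = (1 << 64) - 1


def xorshift_mix(x: int, rounds: int = 3) -> int:
    """Scramble a 64-bit integer using only XOR and shift operations."""
    x &= MASK64
    for _ in range(rounds):
        x ^= (x << 13) & MASK64
        x ^= x >> 7
        x ^= (x << 17) & MASK64
    return x


def xor_fold(x: int, bits: int) -> int:
    """Fold an integer down to `bits` bits by XOR-ing its `bits`-wide chunks."""
    mask = (1 << bits) - 1
    folded = 0
    while x:
        folded ^= x & mask
        x >>= bits
    return folded


def h5(sip: int) -> int:
    """Hypothetical stand-in for h5: 32-bit source IP -> row in [0, 4096)."""
    return xor_fold(xorshift_mix(sip), 12)


def f(sip: int, dip: int, column_bits: int = 9) -> int:
    """Hypothetical stand-in for f: 64-bit flow label -> column in
    [0, 2**column_bits), i.e. m is assumed to be a power of two."""
    flow_label = (sip << 32) | dip
    return xor_fold(xorshift_mix(flow_label), column_bits)
```

A pure XOR/shift mixer like this one is linear over GF(2) and maps 0 to 0, so it only illustrates the cost profile (a handful of register operations per packet); whether its uniformity is adequate would have to be checked on real traces.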

Algorithm 1 Updating VBF
 1  Initialize
 2      A_i[j][k] = 0, i = 1, ..., 5, j = 0, ..., 4095, k = 0, ..., m − 1
 3  Update
 4      Upon the arrival of a packet (s_x, d_x)
 5      col = f(s_x, d_x)
 6      for i = 1; i ≤ 5; i++ do
 7          row = h_i(s_x)
 8          A_i[row][col] = 1

Let the IP address $u = s_1 s_2 s_3 s_4 s_5 s_6 s_7 s_8$ be a superpoint identifier, where each $s_i$ $(1 \le i \le 8)$ is a 4-bit string and an IP address is a 32-bit string. Then $h_1(u) = s_1 s_2 s_3$, $h_2(u) = s_3 s_4 s_5$, $h_3(u) = s_5 s_6 s_7$, and $h_4(u) = s_7 s_8 s_1$. The data structure used in the VBF is denoted as $A = (A_1, A_2, \cdots, A_5)$. Each $A_i$ $(1 \le i \le 5)$ is a $4096 \times m$ bit array $A_i[j][k]$ $(0 \le j < 4096, 0 \le k < m)$ associated with a function $h_i$. All $A_i$ $(1 \le i \le 5)$ share $f$. When a packet $p_x = (s_x, d_x)$ arrives, each $A_i$ is updated by setting the bit in its row $h_i(s_x)$ and column $f(s_x, d_x)$, as illustrated in Fig. 1, where $s_x$ and $d_x$ are the source IP and destination IP of $p_x$. The bits in the VBF, $A$, are all set to 0 at the beginning of measurement. Algorithm 1 describes the process of updating the VBF.

Why is the VBF a variant of the BF? The VBF can be viewed as a standard Bloom Filter in the following way. For each $p_x = (s_x, d_x)$, define hash functions $h_i^+(s_x, d_x) = (h_i(s_x), f(s_x, d_x))$ $(1 \le i \le 5)$. Thus, the VBF is a standard Bloom Filter consisting of the five hash functions $h_i^+$. For each arriving packet, the five bits $A_i[h_i^+(s_x, d_x)]$ $(1 \le i \le 5)$ are set to 1. When a source IP is considered as a processing unit, each array contains a vector corresponding to the source IP. Consequently, the structure is called a Vector Bloom Filter. For the convenience of reading, some frequently-used notations in this paper are summarized in Table II.

Algorithm 2 MergeString(Input1, Input2, Output)
 1  Output ← ∅
 2  for each (u, V[u]) ∈ Input1 do
 3      for each (w, V[w]) ∈ Input2 do
 4          if tail(u) == head(w) then
 5              v = merge1(u, w);
 6              V[v] = V[u] & V[w]; // & bitwise-and
 7              if C(V[v]) ≤ Z* then
 8                  Output ← Output ∪ (v, V[v])
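The update procedure of Algorithm 1, together with the four Bit-Extraction functions $h_1, \ldots, h_4$, can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each row of a bit array is stored as a Python integer used as an $m$-bit bitmap, and `h5` and `f` are stand-ins built on `hashlib.blake2b` because the paper's own XOR/Shift functions are not specified; the class name `VBF` and helper names are hypothetical.

```python
import hashlib

ROWS = 4096  # 2**12 rows per bit array


def bit_extraction_hashes(sip: int) -> list:
    """h1..h4: the 12-bit windows s1s2s3, s3s4s5, s5s6s7, s7s8s1 of a
    32-bit source IP u = s1 s2 ... s8 (each s_i is one 4-bit nibble)."""
    nibbles = [(sip >> (28 - 4 * i)) & 0xF for i in range(8)]  # nibbles[0] = s1

    def join(a, b, c):
        return (nibbles[a] << 8) | (nibbles[b] << 4) | nibbles[c]

    return [join(0, 1, 2), join(2, 3, 4), join(4, 5, 6), join(6, 7, 0)]


def _hash64(data: bytes) -> int:
    # Stand-in hash (assumption): any fast, uniform 64-bit hash would do.
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def h5(sip: int) -> int:
    return _hash64(sip.to_bytes(4, "big")) % ROWS


def f(sip: int, dip: int, m: int) -> int:
    return _hash64(sip.to_bytes(4, "big") + dip.to_bytes(4, "big")) % m


class VBF:
    def __init__(self, m: int):
        self.m = m
        # Five arrays A1..A5; each row is an m-bit bitmap held in an int.
        self.arrays = [[0] * ROWS for _ in range(5)]

    def update(self, sip: int, dip: int) -> None:
        col = f(sip, dip, self.m)
        rows = bit_extraction_hashes(sip) + [h5(sip)]
        for array, row in zip(self.arrays, rows):
            array[row] |= 1 << col  # A_i[row][col] = 1


vbf = VBF(m=64)
vbf.update(0x12345678, 0x0A000001)
```

Because all five arrays share the column index `col`, a repeated packet of the same flow sets the same five bits again and leaves the structure unchanged, which is what makes the bitmaps count distinct flows rather than packets.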

B. Algorithm of Detection

According to the cardinality threshold $m^*$, we can give a zero-bit-number threshold $Z^*$. How to obtain $Z^*$ from $m^*$ is described in the next subsection. The measurement proceeds in periods. Each period may last between 20 seconds and 300 seconds. In practical applications, the period may be selected according to the objective of the measurement. For example, in order to detect a host that is scanning at a high speed, the period can be set to a small value; otherwise, a relatively large value is chosen. At the end of each measurement period, we collect only the hash strings whose corresponding vectors contain at most $Z^*$ zero bits. Denote a collected string by $(u, V[u])$, where $u$ is a string of 12 bits and $V[u]$ is the bit vector $A_i[u]$ of row $u$. Thus, we get five sets $H_1$, $H_2$, $H_3$, $H_4$, and $H_5$ composed of pairs $(u, V[u])$ from the five different hash spaces, respectively.

Now, we introduce the string functions tail, head, merge1, and merge2. For a string $w$, tail($w$) returns the low 4 bits of $w$ and head($w$) returns the high 4 bits of $w$. For example, if $w = s_1 s_2 s_3 s_4$, then tail($w$) $= s_4$ and head($w$) $= s_1$. Functions merge1 and merge2 merge two overlapped strings into one. For example, for two strings abc and cde, a new string merge1(abc, cde) = abcde can be obtained. Similarly, for two strings acdefgb and bha, an IP address merge2(acdefgb, bha) = acdefgbh may be generated. The process of merging strings is described in Algorithm 2. The basic idea is that two overlapped strings are merged into one.
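The string functions and the merging step can be sketched as follows. This is an illustrative Python sketch, assuming each hash string is written as a hexadecimal string with one character per 4-bit group (so a 12-bit row index is three characters) and each bit vector is an integer bitmap of size `m`; the function names mirror the text, while `zero_bits` and the dictionary representation are assumptions.

```python
def tail(w: str) -> str:
    """Low 4 bits of w (the last hex character)."""
    return w[-1]


def head(w: str) -> str:
    """High 4 bits of w (the first hex character)."""
    return w[0]


def merge1(u: str, w: str) -> str:
    """Merge two strings that overlap tail-to-head: abc + cde -> abcde."""
    return u + w[1:]


def merge2(u: str, w: str) -> str:
    """Merge when w also wraps around to the head of u: acdefgb + bha -> acdefgbh."""
    return u + w[1:-1]


def zero_bits(vector: int, m: int) -> int:
    """C(V): number of 0 bits in an m-bit bitmap stored as an int."""
    return m - bin(vector).count("1")


def merge_string(input1: dict, input2: dict, m: int, z_star: int) -> dict:
    """Sketch of MergeString: inputs map hash strings to bit vectors."""
    output = {}
    for u, vu in input1.items():
        for w, vw in input2.items():
            if tail(u) == head(w):
                v = merge1(u, w)
                vv = vu & vw  # bitwise-and removes bits set by only one side
                if zero_bits(vv, m) <= z_star:
                    output[v] = vv
    return output


# Toy example with m = 8 and Z* = 4 (values chosen only for illustration).
h1 = {"123": 0b11111100}
h2 = {"345": 0b01111110, "346": 0b00000011, "945": 0b11111111}
h12 = merge_string(h1, h2, m=8, z_star=4)
```

In the toy input, "346" overlaps "123" but its vector shares no set bits with it, so the pair is rejected by the zero-bit test; "945" is rejected because its head does not match the tail of "123".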


Algorithm 3 GenerateIP(Input1, Input2, Output)
 1  Output ← ∅
 2  for each (u, V[u]) ∈ Input1 do
 3      for each (w, V[w]) ∈ Input2 do
 4          if tail(u) == head(w) and tail(w) == head(u) then
 5              v = merge2(u, w);
 6              V[v] = V[u] & V[w]; // & bitwise-and
 7              if C(V[v]) ≤ Z* then
 8                  Output ← Output ∪ (v, V[v])
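Algorithm 3 differs from Algorithm 2 only in the double overlap test and in the use of merge2. A minimal Python sketch, under the same assumptions as before (hash strings as hexadecimal strings, bit vectors as integer bitmaps of size `m`), is given below; the conversion to dotted-decimal form with the standard `ipaddress` module is an addition for readability and is not part of the algorithm.

```python
import ipaddress


def zero_bits(vector: int, m: int) -> int:
    """C(V): number of 0 bits in an m-bit bitmap stored as an int."""
    return m - bin(vector).count("1")


def generate_ip(input1: dict, input2: dict, m: int, z_star: int) -> dict:
    """Sketch of GenerateIP: input1 holds 28-bit strings s1..s7 (7 hex chars),
    input2 holds 12-bit strings s7 s8 s1 (3 hex chars)."""
    output = {}
    for u, vu in input1.items():
        for w, vw in input2.items():
            # tail(u) == head(w) and tail(w) == head(u)
            if u[-1] == w[0] and w[-1] == u[0]:
                v = u + w[1:-1]  # merge2: append only the new nibble s8
                vv = vu & vw
                if zero_bits(vv, m) <= z_star:
                    output[v] = vv
    return output


def dotted(ip_hex: str) -> str:
    """Render an 8-hex-character address in dotted-decimal notation."""
    return str(ipaddress.ip_address(int(ip_hex, 16)))


# Toy example with m = 4 and Z* = 1 (values chosen only for illustration).
h123 = {"0a00000": 0b1111}
h4 = {"010": 0b0111, "011": 0b1111}
ips = generate_ip(h123, h4, m=4, z_star=1)
addresses = [dotted(v) for v in ips]
```

The wrap-around test on $s_1$ is what lets $h_4$ act as a consistency check: the string "011" matches on $s_7$ but not on $s_1$, so it cannot complete the address.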

Algorithm 4 DetectSuperpoints
 1  Mergence
 2      MergeString(H1, H2, H12)
 3      MergeString(H12, H3, H123)
 4      GenerateIP(H123, H4, IP)
 5  Filtering
 6      Output ← ∅
 7      for each (v, V[v]) ∈ IP do
 8          V[v] = V[v] & A5[h5(v)]
 9          if C(V[v]) ≤ Z* then
10              Output ← Output ∪ (v, V[v])
11  Estimation
12      for each (u, V[u]) ∈ Output do
13          u is identified as a superpoint with cardinality Cardinality(u) = −m ln(C(V[u])/m) − δ
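The Filtering and Estimation phases of Algorithm 4 can be sketched in Python as below. The sketch assumes the Mergence phase has already produced a dictionary of candidate IPs with their bit vectors; `a5` is the list of rows of $A_5$ (integer bitmaps), `h5` is passed in as a callable, and `delta` is the correction term $\delta$. Clamping a fully saturated vector to one zero bit is an assumption added here to keep the logarithm finite; the paper does not discuss that case.

```python
import math


def zero_bits(vector: int, m: int) -> int:
    """C(V): number of 0 bits in an m-bit bitmap stored as an int."""
    return m - bin(vector).count("1")


def filter_and_estimate(candidates, a5, h5, m, z_star, delta):
    """Return {ip: estimated cardinality} for candidates that survive A5."""
    superpoints = {}
    for ip, vector in candidates.items():
        vector &= a5[h5(ip)]            # Filtering: V[v] = V[v] & A5[h5(v)]
        zeros = zero_bits(vector, m)
        if zeros <= z_star:
            zeros = max(zeros, 1)       # assumption: avoid ln(0) on saturation
            superpoints[ip] = -m * math.log(zeros / m) - delta  # Estimation
    return superpoints


# Toy example: m = 8, Z* = 4, a 4-row stand-in for A5, and a trivial h5.
a5 = [0, 0b01111110, 0, 0]
candidates = {1: 0b11111110, 2: 0b11111111}
found = filter_and_estimate(
    candidates, a5, h5=lambda ip: ip % 4, m=8, z_star=4, delta=0.5
)
```

Candidate 2 looks saturated after merging, but its row in the stand-in $A_5$ is empty, so the bitwise-and clears it and it is discarded as a phantom.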

Line 4 through Line 8 of Algorithm 2 show that two strings may be merged into one if they are tail-head overlapped and the number of zero bits of the corresponding vector obtained from the bit-and operation is not more than $Z^*$. The process of generating IP addresses is shown in Algorithm 3. According to Algorithm 3, some IP addresses (superpoint IDs) can be produced. However, in some cases, the algorithms may merge strings into superpoint IDs that do not really exist in the network. These falsely merged superpoints are called phantoms. Finally, Algorithm 4 summarizes the detailed process of detecting superpoints. How to compute $\delta$ in Line 13 of Algorithm 4 is also presented in the next subsection.

In this paper, for ease of description, the output of $h_5$ is fixed at 12 bits. In fact, the output may be longer or shorter, in which case $\delta$ needs to be recomputed in a similar way: $\delta$ decreases as the output becomes longer and increases as the output becomes shorter. In general, a longer output requires more memory, but it also leads to higher accuracy.

How should one determine the optimal value of $m$ for the function $f$? This is an important practical issue. The choice of $m$ is based on the following factors in the measurement period: the number of total flows, the number of superpoints, the cardinality range of superpoints, the requirement of

measurement accuracy, and the available memory. The measurement accuracy of the VBF improves as $m$ increases; however, the time and space complexity of the VBF also increase. We recommend selecting a value of $m$ such that $\alpha$ is in the interval [0.02, 0.2], where $\alpha = s/(4096m)$ is the load factor of $A_5$. Naturally, the number $s$ of flows needs to be estimated by rule of thumb in advance.

C. Cardinality Estimation

For a host identifier $w$, let $|w|$ denote the cardinality of $w$, i.e., the number of all flows generated by $w$. Let $F(w) = \{(w, d_i) \mid 1 \le i \le |w|\}$ denote the set of all flows generated by $w$. According to the update algorithm of the VBF, $F(w)$ determines the column positions of the updated bits, i.e., the column positions form a multiset $f(F(w)) = \{f(w, d_i) \mid (w, d_i) \in F(w)\}$ (a multiset is analogous to a set, but elements may appear more than once). Let $C_{bit}(w)$ denote the number of distinct elements in $f(F(w))$. Note that $A_i[j]$ is the $j$th row of $A_i$ (viewed as a bit vector). Let $C(A_i[j])$ denote the number of bits in $A_i[j]$ that are 0s. If no other sources are hashed to the same rows as $w$, then $C(A_1[h_1(w)]) = C(A_2[h_2(w)]) = \cdots = C(A_5[h_5(w)])$, simply denoted as $C(w)$. Certainly, $C(w) = m - C_{bit}(w)$. A fairly accurate estimate of $|w|$ [25] is

$$|w| = -m \ln \frac{C(w)}{m}. \tag{1}$$

Property 1: $\max\{C(A_i[h_i(w)]) \mid 1 \le i \le 5\} \le C(A_1[h_1(w)] \,\&\, A_2[h_2(w)] \,\&\, \cdots \,\&\, A_5[h_5(w)]) \le C(w)$.

Proof: Recall that $C(A_i[h_i(w)])$ denotes the number of "0" bits in $A_i[h_i(w)]$. The bitwise-and operator & can only generate more "0" bits. Thus, $C(A_i[h_i(w)]) \le C(A_1[h_1(w)] \,\&\, A_2[h_2(w)] \,\&\, \cdots \,\&\, A_5[h_5(w)])$ for each $1 \le i \le 5$. It follows that $\max\{C(A_i[h_i(w)]) \mid 1 \le i \le 5\} \le C(A_1[h_1(w)] \,\&\, A_2[h_2(w)] \,\&\, \cdots \,\&\, A_5[h_5(w)])$. Because other sources may be hashed to the rows that $w$ is hashed to (that is, hash collisions), $C(A_1[h_1(w)] \,\&\, A_2[h_2(w)] \,\&\, \cdots \,\&\, A_5[h_5(w)]) \le C(w)$. □

Denote $C(A_1[h_1(w)] \,\&\, A_2[h_2(w)] \,\&\, \cdots \,\&\, A_5[h_5(w)])$ by $C(V[w])$.
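Estimator (1) and Property 1 can be checked numerically with a short simulation. The Python sketch below is illustrative only: it assumes uniformly random column positions (standing in for $f$), uses a fixed random seed, and models hash collisions by OR-ing extra random bits into each of five copies of the row; the parameter values ($m = 1024$, 300 flows, 200 noise flows per row) are arbitrary choices.

```python
import math
import random


def zero_bits(vector: int, m: int) -> int:
    """C(V): number of 0 bits in an m-bit bitmap stored as an int."""
    return m - bin(vector).count("1")


def linear_counting(zeros: int, m: int) -> float:
    """Estimator (1): |w| = -m ln(C(w) / m)."""
    return -m * math.log(zeros / m)


m = 1024
rng = random.Random(0)

# Row of a source w with 300 distinct flows and no collisions.
clean = 0
for _ in range(300):
    clean |= 1 << rng.randrange(m)
estimate = linear_counting(zero_bits(clean, m), m)


def with_noise(row: int, extra_flows: int) -> int:
    """Simulate other sources colliding into the same row."""
    for _ in range(extra_flows):
        row |= 1 << rng.randrange(m)
    return row


# Five rows A_i[h_i(w)], each polluted independently, then bitwise-and.
rows = [with_noise(clean, 200) for _ in range(5)]
merged = rows[0]
for row in rows[1:]:
    merged &= row
```

Here `zero_bits(merged, m)` plays the role of $C(V[w])$: it is at least the zero count of every individual noisy row and at most the zero count $C(w)$ of the collision-free row, as Property 1 states.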
Since we obtain only $C(V[w])$ instead of $C(w)$ in the measurement, $C(V[w])$ is used to estimate the cardinality of $w$. Consider the following estimator

$$|w| = -m \ln \frac{C(V[w])}{m}. \tag{2}$$

Although the estimator (2) is not unbiased, it has some valuable features. For example, according to Property 1, it is always larger than or equal to the estimator (1). Therefore, the deviation must be corrected so that it becomes a fair estimator. The value $C(V[w])$ comes from two parts: one is $C(w)$, from $w$ itself; the other is contributed by flows other than those of $w$. The effect of the latter part (the noise) can be removed. In general, the first four hash functions of the VBF are not truly random. Nonrandom functions slightly affect the value because


many vectors are saturated by ones. Martinez et al. [42] show that using the bits with smaller $d$ values ($d$ being the absolute difference between the number of 0s and 1s) for hashing leads to a better, more random hash distribution. Since hash functions $h_3$ and $h_4$ contain the bits with small $d$ values (the active low-bit portion) of an IP address, they can be regarded as one random function from the perspective of performance. Moreover, $h_5$ may be truly random. Therefore, the five functions can be logically considered as two random functions. Equation (1) can be rewritten as

$$C(w) = m e^{-\frac{|w|}{m}}. \tag{3}$$

Let $s$ be the number of distinct flows in the measurement period. On average, each vector in a hash space receives $\frac{s}{4096}$ flows. According to (3), they will occupy

$$t = m\left(1 - e^{-\frac{s}{4096m}}\right) \tag{4}$$

bit positions (set to 1) of the vector in one hash space. If the corresponding bit positions of some vector in the other hash space are occupied by another flow, a deviation occurs (note that we consider only two hash spaces). On average, the number of flows coming into the $t$ bit positions is $\delta_f = \frac{t}{m} \cdot \frac{s}{4096} = \frac{st}{4096m}$. If the $\delta_f$ flows are uniformly hashed into the $t$ bit positions (considered as a bitmap vector of size $t$), then by Equation (3) the number of zero bits in the $t$ bit positions is $U_t = t e^{-\frac{\delta_f}{t}} = t e^{-\frac{s}{4096m}}$. As a bitmap vector of size $m$, the number of "0" bits is $U = m - t + U_t = m - t + t e^{-\frac{s}{4096m}}$. Hence, after performing the bit-and operation the number of remaining flows is

$$\delta = -m \ln \frac{U}{m} = -m \ln \frac{m - t + t e^{-\frac{s}{4096m}}}{m}. \tag{5}$$

Since $h_5$ is truly random, we use $A_5$ to estimate $s$ as follows:

$$s = -4096m \ln V_*, \quad \text{where } V_* = \frac{\sum_{i=0}^{4095} C(A_5[i])}{4096m}. \tag{6}$$

Substituting (6) into (4) yields

$$t = m(1 - V_*). \tag{7}$$

Substituting (6) and (7) into (5) gives

$$\delta = -m \ln \frac{m - t + t e^{-\frac{s}{4096m}}}{m} = -m \ln\left(V_*(2 - V_*)\right). \tag{8}$$

Denote the actual cardinality of $w$ by $k$. Then, the estimated cardinality $\hat{k}$ of $w$ is computed as follows:

$$\hat{k} = -m \ln\left(\frac{C(V[w])}{m}\right) - \delta = -m \ln\left(\frac{C(V[w])}{m}\right) + m \ln\left(V_*(2 - V_*)\right). \tag{9}$$

Replacing $\hat{k}$ and $C(V[w])$ with $m^*$ and $Z^*$ in the above equation, respectively, we have

$$m^* = -m \ln\left(\frac{Z^*}{m}\right) - \delta. \tag{10}$$

Solving Equation (10) yields

$$Z^* = m e^{-\frac{m^* + \delta}{m}}. \tag{11}$$
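Equations (6), (8), (9), and (11) translate directly into a few lines of Python. The sketch below is a plain transcription under stated assumptions: the per-row zero counts of $A_5$ are supplied as a list, the example values ($m = 512$, 460 zero bits in every row, $m^* = 200$) are invented for illustration, and all function names are hypothetical.

```python
import math

ROWS = 4096


def zero_fraction(a5_zero_counts, m):
    """V* of Eq. (6): fraction of zero bits over all rows of A5."""
    return sum(a5_zero_counts) / (len(a5_zero_counts) * m)


def flow_count(v_star, m, rows=ROWS):
    """s of Eq. (6): estimated number of distinct flows."""
    return -rows * m * math.log(v_star)


def noise_delta(v_star, m):
    """delta of Eq. (8): -m ln(V*(2 - V*))."""
    return -m * math.log(v_star * (2 - v_star))


def estimate_cardinality(zeros, m, delta):
    """k-hat of Eq. (9) from C(V[w]) = zeros."""
    return -m * math.log(zeros / m) - delta


def zero_bit_threshold(m_star, m, delta):
    """Z* of Eq. (11) for a cardinality threshold m*."""
    return m * math.exp(-(m_star + delta) / m)


m = 512
v_star = zero_fraction([460] * ROWS, m)   # illustrative: 460 zeros per row
delta = noise_delta(v_star, m)
z_star = zero_bit_threshold(m_star=200, m=m, delta=delta)
```

Since $V_*(2 - V_*) = 1 - (1 - V_*)^2$, the correction $\delta$ vanishes when $A_5$ is empty ($V_* = 1$) and grows as the arrays fill up; feeding `z_star` back into `estimate_cardinality` recovers $m^*$, which is just Equation (10).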

Note that the VBF addresses hash conflicts in the three phases of Algorithm 4: Mergence, Filtering, and Estimation. In the Mergence and Filtering phases some noise is removed by bitwise-and operations. In the Estimation phase the remaining noise is removed by subtracting $\delta$ from the estimator.

IV. ANALYSIS

A. Number of Phantoms

Let $S = \{w \mid C_{bit}(w) \ge m - Z^*\}$. Denote the number of elements in $S$ by $N$. Then $N$ approximately equals the number of superpoints. Hence, $N$ is also used to denote the number of superpoints. Denote the number of elements of $H_i$ $(1 \le i \le 5)$ by $\sigma_i N$. Let $\sigma = \max\{\max_i\{\sigma_i\}, 1\}$; then $\sigma_i N \le \sigma N$ $(\sigma \ge 1)$.

Property 2: Let $V_1$ be a bit vector of size $m$ in which $n$ randomly selected bits are set to 1 and the other bits are set to 0. Let $V_2$ be a bit vector of size $m$ in which $r$ randomly selected bits are set to 1 and the other bits are set to 0. Define $V = V_1 \,\&\, V_2$, where & is the bitwise-and operator. Let $x$ be the random variable for the number of "1" bits in $V$. Then, the random variable $x$ follows a hypergeometric distribution:

$$Prob(x = k) = \frac{\binom{n}{k}\binom{m-n}{r-k}}{\binom{m}{r}}, \tag{12}$$

where $k = 0, 1, 2, \ldots, t = \min(n, r)$.

Proof: Consider $V_1$ as a population of $m$ elements where $n$ elements are 1's and $m - n$ are 0's. That the number of "1" bits in $V$ is $k$ means that $k$ of the $r$ "1" bits in $V_2$ fall exactly in the $n$ "1" bit positions of $V_1$ (the other $r - k$ fall in "0" bit positions of $V_1$). The problem can be reduced to the following form. A group of $r$ elements is chosen at random from $V_1$. We seek the probability that the chosen group contains exactly $k$ "1" bits. Note that the chosen group contains $k$ "1" bits and $r - k$ "0" bits. The "1" bits can be chosen in $\binom{n}{k}$ different ways and the "0" bits in $\binom{m-n}{r-k}$ ways. Since any choice of $k$ "1" bits may be combined with any choice of "0" bits, the probability $Prob(x = k)$ is $\binom{n}{k}\binom{m-n}{r-k} / \binom{m}{r}$. Since the probability is defined only for $k$ not exceeding $n$ or $r$, we have $k = 0, 1, 2, \ldots, t = \min(n, r)$. □
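Equation (12) is easy to evaluate exactly with integer binomial coefficients. The following Python sketch (the function name `overlap_pmf` is an assumption) computes the probability that the bitwise-and of two random bit vectors of size $m$, with $n$ and $r$ ones respectively, has exactly $k$ ones, and checks on a small case that the probabilities sum to one.

```python
from math import comb


def overlap_pmf(k: int, m: int, n: int, r: int) -> float:
    """Prob(x = k) of Eq. (12): hypergeometric probability that exactly k of
    the r one-bits of V2 land on the n one-bits of V1 (vectors of size m)."""
    if k < 0 or k > min(n, r):
        return 0.0
    # comb(m - n, r - k) is 0 when r - k > m - n, i.e. an impossible overlap.
    return comb(n, k) * comb(m - n, r - k) / comb(m, r)


# Small worked case: m = 4 positions, two ones in each vector.
pmf = [overlap_pmf(k, m=4, n=2, r=2) for k in range(3)]
```

For this case there are $\binom{4}{2} = 6$ equally likely placements of the ones of $V_2$: one avoids both ones of $V_1$, four hit exactly one, and one hits both, so the probabilities are $1/6$, $4/6$, and $1/6$.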
Corollary 1: For $n \ge y$ and $r \ge y$, the probability that $x$ is not less than $y$ is

$$\mathrm{Prob}(x \ge y) = \frac{\sum_{k=y}^{\min(n,r)} \binom{n}{k}\binom{m-n}{r-k}}{\binom{m}{r}}. \tag{13}$$

Here, $n$ and $r$ are also random variables with independent identical distributions. Let $p(n)$ and $p(n|y)$ respectively denote the density function of $n$ and the conditional density function of $n$ given $n \ge y$. We rewrite $\mathrm{Prob}(x \ge y)$ as $prob(y, n, r, m)$. For fixed $y$ and $m$, $prob(y, n, r, m)$ is also a function of the random variables $n$ and $r$. Its expectation $\rho(y, m)$ is given by

$$\rho(y, m) = \sum_{n=y}^{m}\sum_{r=y}^{m} prob(y, n, r, m)\, p(n|y)\, p(r|y).$$

Corollary 2: Assume $n$ and $r$ are independent random variables with the same Pareto distribution, i.e., $p(n|y) = b\frac{1}{n^{a}} I_{[y,m]}(n)$ and $p(r|y) = b\frac{1}{r^{a}} I_{[y,m]}(r)$, where $a > 0$ and $b = 1 / \sum_{k=y}^{m} \frac{1}{k^{a}}$. Then

$$\rho(y, m) = \sum_{n=y}^{m}\sum_{r=y}^{m} prob(y, n, r, m)\, \frac{b^{2}}{n^{a} r^{a}}. \tag{14}$$

As an example, for fixed $m = 512$ and varying $a$, $\rho(y, m)$ is shown in Table III. We simply denote $\rho(y, m)$ by $\rho$ whenever there is no confusion.

TABLE III: $\rho(y, m)$ WITH VARYING $a$ AND GIVEN $m = 512$

Property 3: Let $X$ be the number of phantoms that are merged in the mergence phase of Algorithm 4. The mean of $X$ is bounded by

$$E(X) \le \frac{\rho\sigma^{2}N^{2}}{16^{2}}\left(1 + \frac{\rho\sigma N}{16}\right)^{2}. \tag{15}$$

Proof: First consider the merged $H_{12}$ from $H_1$ and $H_2$. If no strings are falsely merged, at most $N$ strings of length 20 bits are produced. Two strings being merged means that the last four bits of one string are the same as the first four bits of the other. The probability that a string of $H_1$ is falsely merged with a string of $H_2$ is $\frac{\rho}{2^{4}} = \frac{\rho}{16}$. Note that strings $u \in H_1$ and $v \in H_2$ can be concatenated into a string $w$ if and only if $tail(u) = head(v)$ and $C_{bit}(V[u] \,\&\, V[v]) \ge m - Z^{*}$. For each string $s_i \in H_1$, let $x_{1i}$ denote the number of strings falsely merged from $s_i$ and the strings of $H_2$. Then $x_{1i}$ is a binomial random variable with parameters $(\sigma N, \frac{\rho}{16})$. Hence, the total number of strings of $H_{12}$ is at most

$$X_1 = N + \sum_{i=1}^{\sigma N} x_{1i}.$$

Consider the merged $H_{123}$ from $H_{12}$ and $H_3$. For each string $s_i \in H_{12}$, let $x_{2i}$ denote the number of strings falsely merged from $s_i$ and the strings of $H_3$. Similarly, the total number of strings of $H_{123}$ is at most

$$X_2 = N + \sum_{i=1}^{X_1} x_{2i}.$$

Now, consider the merged IP from $H_{123}$ and $H_4$. For each string $s_i \in H_{123}$, let $x_{3i}$ denote the number of strings falsely merged from $s_i$ and the strings of $H_4$. The probability that a string of $H_{123}$ is falsely merged with a string of $H_4$ is $\frac{\rho}{2^{8}} = \frac{\rho}{256}$. Then $x_{3i}$ is a binomial random variable with parameters $(\sigma N, \frac{\rho}{256})$. Hence, the total number of phantoms that are merged in the mergence phase of Algorithm 4 is

$$X = \sum_{i=1}^{X_2} x_{3i}.$$

Then $E(X) = E(X_2)E(x_{31})$ by the independence of the $x_{3i}$ and $X_2$. Thus, $E(X) = E(X_2)\frac{\sigma\rho N}{256}$. Similarly,

$$E(X_2) = N + E(X_1)E(x_{21}) \le \sigma N + E(X_1)\frac{\sigma\rho N}{16} = \sigma N\left(1 + E(X_1)\frac{\rho}{16}\right) = \sigma N\left(1 + \left(N + \sigma N\frac{\rho\sigma N}{16}\right)\frac{\rho}{16}\right) \le \sigma N\left(1 + \frac{\rho\sigma N}{16}\right)^{2}.$$

Rearranging now produces inequality (15). ∎

To reduce the impact of false merging, we filter out falsely merged hosts in the filtering phase of the detection algorithm illustrated in Algorithm 4. For each host $w$ in $S$, its hash value in every hash space attains the threshold, so it must pass the filtering phase. However, some falsely merged hosts may be filtered out. Finally, all hosts that pass the filtering phase are identified as superpoints. Although some phantoms may be filtered out, the others still pass the filter and are identified as superpoints. Next, we analyze the number of phantoms identified as superpoints.

Theorem 1: The mean of the number $Y$ of phantoms identified as superpoints is bounded by

$$E(Y) \le \frac{\rho^{2}\sigma^{3}N^{3}}{16^{5}}\left(1 + \frac{\rho\sigma N}{16}\right)^{2}. \tag{16}$$

Proof: Let $w_1, w_2, \ldots, w_X$ be phantoms. There are $\sigma N$ strings of length 12 bits in the hash set $H_5$. Thus, the probability that any $w_i$ passes the filtering phase is $p = \frac{\rho\sigma N}{16^{3}}$. Let $Y_i$ be 1 if $w_i$ passes the filtering phase, and 0 otherwise. Then $Y = \sum_{i=1}^{X} Y_i$. Since $X$ and $Y_i$ are independent, $E(Y) = pE(X)$. Combining this with Property 3, we obtain (16). ∎

The comparison of the actual false positives and the phantoms based on (16) for distinct traces is shown in Table IV. It can be seen that the results computed by (16) are close to those of the actual detection.

B. Mean of $\hat{k}$

TABLE IV: FALSE POSITIVES BY OUR METHOD FOR DISTINCT TRACES

Now, we again consider the first four hash functions as a single random function, denoted $h^{*}$ (a virtual function). For a host $w$, the flows that belong to $w$ can be considered to be hashed into $A^{*}[h^{*}(w)] = A_1[h_1(w)] \,\&\, A_2[h_2(w)] \,\&\, A_3[h_3(w)] \,\&\, A_4[h_4(w)]$. Thus, for a host $w$, its corresponding bitmap vector $V[w]$ is considered as $A^{*}[h^{*}(w)] \,\&\, A_5[h_5(w)]$, that is, $V[w] = A^{*}[h^{*}(w)] \,\&\, A_5[h_5(w)]$. Recall that $s$ denotes the number of distinct flows during the measurement period. Let the random variables $U_{*}$ and $V_{*}$ respectively denote the number and fraction of "0" bits in $A_5$. Let $U_m$ and $V_m$ denote the number and fraction of "0" bits in the vector $V[w]$, respectively. Clearly, $V_m = U_m/m$ and $V_{*} = U_{*}/(4096m)$. Hence, Equation (9) can be written as

$$\hat{k} = -m\ln V_m + m\ln\big(V_{*}(2 - V_{*})\big) = -m\ln V_m + m\ln V_{*} + m\ln(2 - V_{*}). \tag{17}$$

Let $\hat{k}_1 = -m\ln V_m$, $\hat{k}_2 = -m\ln V_{*}$ and $\hat{k}_3 = -m\ln(2 - V_{*})$. Equation (17) becomes

$$\hat{k} = \hat{k}_1 - \hat{k}_2 - \hat{k}_3. \tag{18}$$
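To make the decomposition in (17) and (18) concrete, the following Python sketch computes the estimate from the counts of "0" bits. It is only an illustration under stated assumptions: the function name `estimate_cardinality` and its parameters are hypothetical, and the default of 4096 rows mirrors the factor $4096m$ used in the text for the size of $A_5$.

```python
import math


def estimate_cardinality(u_m: int, m: int, u_star: int, rows: int = 4096) -> float:
    """Cardinality estimate following Equations (17) and (18).

    u_m:    number of "0" bits in the merged vector V[w] (U_m in the text)
    m:      number of bits in one vector
    u_star: number of "0" bits in the whole array A5 (U_* in the text)
    rows:   number of vectors in A5 (assumed 4096, as in V_* = U_* / (4096 m))
    """
    if u_m <= 0 or u_star <= 0:
        # ln(0) is undefined: a completely full vector gives no finite estimate.
        raise ValueError("no '0' bits left; the estimator is undefined")
    v_m = u_m / m
    v_star = u_star / (rows * m)
    k1 = -m * math.log(v_m)           # raw estimate from V[w]
    k2 = -m * math.log(v_star)        # noise term from the load of A5
    k3 = -m * math.log(2.0 - v_star)  # correction for the AND of two vectors
    return k1 - k2 - k3
```

When the array is empty apart from the host itself ($V_{*} = 1$), both noise terms vanish and the function returns the plain estimate $-m \ln V_m$.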

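Let $B_j$ denote the event that the $j$th bit in $V[w]$ remains "0", and let $1_{B_j}$ be the corresponding indicator random variable, which is 1 if $B_j$ occurs and 0 otherwise. Let $C_j^{*}$ denote the event that the $j$th bit in $A^{*}[h^{*}(w)]$ is not set to 1 by $w$ (that is, none of the $k$ contacts generated by $w$ is hashed onto the $j$th column), and let $D_j^{*}$ denote the event that the $j$th bit in $A^{*}[h^{*}(w)]$ is not set to 1 by any source other than $w$. Likewise, let $C_j$ and $D_j$ denote the corresponding events in $A_5[h_5(w)]$. Since the same hash function $f$ is used, $C_j^{*} = C_j$. Hence,

$$\begin{aligned}
\mathrm{Prob}\{B_j\} &= \mathrm{Prob}\{C_j^{*}D_j^{*} \cup C_jD_j\} = \mathrm{Prob}\{C_j\}\,\mathrm{Prob}\{D_j^{*} \cup D_j\} \\
&= \left(1 - \frac{1}{m}\right)^{k}\left[2\left(1 - \frac{1}{4096m}\right)^{s-k} - \left(1 - \frac{1}{4096m}\right)^{2(s-k)}\right] \\
&\approx \left(1 - \frac{1}{m}\right)^{k} e^{-\frac{s}{4096m}}\left(2 - e^{-\frac{s}{4096m}}\right) \quad \text{as } (s-k), m, k \to \infty,\ k \ll s.
\end{aligned} \tag{19}$$

Since $U_m$ is the number of "0" bits in the vector $V[w]$, $U_m = \sum_{j=0}^{m-1} 1_{B_j}$. Hence,

$$E(V_m) = \frac{1}{m}E(U_m) = \frac{1}{m}E\Big(\sum_{j=0}^{m-1} 1_{B_j}\Big) = \frac{1}{m}\sum_{j=0}^{m-1}\mathrm{Prob}\{B_j\} \approx \left(1 - \frac{1}{m}\right)^{k} e^{-\frac{s}{4096m}}\left(2 - e^{-\frac{s}{4096m}}\right). \tag{20}$$

Let $\alpha = \frac{s}{4096m}$. According to the estimator in [25], we have

$$E(V_{*}) = e^{-\frac{s}{4096m}} = e^{-\alpha}, \tag{21}$$

$$Var(V_{*}) = \frac{1}{4096m}\, e^{-\alpha}\big(1 - (1 + \alpha)e^{-\alpha}\big). \tag{22}$$

We expand $\hat{k}_2$ and $\hat{k}_3$ by their Taylor series at $p = E(V_{*}) = e^{-\alpha}$:

$$\hat{k}_2 = -m\ln V_{*} = m\left(\alpha - \frac{V_{*} - p}{p} + \frac{(V_{*} - p)^{2}}{2p^{2}} - \frac{(V_{*} - p)^{3}}{3p^{3}} + \cdots\right), \tag{23}$$

$$\hat{k}_3 = -m\ln(2 - V_{*}) = m\left(-\ln(2 - e^{-\alpha}) + \frac{V_{*} - p}{2 - p} + \frac{(V_{*} - p)^{2}}{2(2 - p)^{2}} + \frac{(V_{*} - p)^{3}}{3(2 - p)^{3}} + \cdots\right). \tag{24}$$

We truncate the above equations after the third term, and obtain

$$E(\hat{k}_2) = E(-m\ln V_{*}) = m\left(\alpha + \frac{E(V_{*} - p)^{2}}{2p^{2}}\right) = \frac{1}{4096}\left(s + \frac{e^{\alpha} - \alpha - 1}{2}\right), \tag{25}$$

$$E(\hat{k}_3) = E(-m\ln(2 - V_{*})) = m\left(-\ln(2 - e^{-\alpha}) + \frac{E(V_{*} - p)^{2}}{2(2 - p)^{2}}\right) = -m\ln(2 - e^{-\alpha}) + \frac{e^{-2\alpha}}{(2 - e^{-\alpha})^{2}}\cdot\frac{1}{4096}\left(\frac{e^{\alpha} - \alpha - 1}{2}\right). \tag{26}$$

Now, we investigate $2 - e^{-\alpha}$. Since $0 < e^{-\alpha} < 1$ for all $\alpha > 0$, we have $1 < 2 - e^{-\alpha} < 2$. There exists $\tau > 0$ such that

$$(e^{-\alpha})^{-\tau} = 2 - e^{-\alpha}. \tag{27}$$

If $\tau \ge 1$, then $(e^{-\alpha})^{-\tau} = e^{\tau\alpha} \ge e^{\alpha} > 2 - e^{-\alpha}$, where the last inequality holds because $e^{\alpha} > 2 - e^{-\alpha} \Leftrightarrow (e^{\alpha} - 1)^{2} > 0$. This gives $(e^{-\alpha})^{-\tau} > 2 - e^{-\alpha}$, which contradicts (27). Hence, $0 < \tau < 1$. In addition, (27) can be uniquely solved for $\tau = \frac{1}{\alpha}\ln(2 - e^{-\alpha})$.

Substituting Equation (27) into (26) yields

$$E(\hat{k}_3) = \frac{1}{4096}\left(-\tau s + e^{-2(1+\tau)\alpha}\,\frac{e^{\alpha} - \alpha - 1}{2}\right). \tag{28}$$

Adding together Equations (25) and (28), we get

$$E(\hat{k}_2 + \hat{k}_3) = \frac{(1 - \tau)s}{4096} + \frac{1 + e^{-2(1+\tau)\alpha}}{4096}\cdot\frac{e^{\alpha} - \alpha - 1}{2} = m(1 - \tau)\alpha + \frac{1 + e^{-2(1+\tau)\alpha}}{4096}\cdot\frac{e^{\alpha} - \alpha - 1}{2}. \tag{29}$$

Thus, Equation (20) becomes

$$E(V_m) \approx \left(1 - \frac{1}{m}\right)^{k} e^{-\frac{s}{4096m}}\left(2 - e^{-\frac{s}{4096m}}\right) = \left(1 - \frac{1}{m}\right)^{k} e^{-(1-\tau)\alpha}. \tag{30}$$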
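The role of $\tau$ can be checked numerically. The short Python sketch below solves (27) in closed form and evaluates the expected noise term of (29). The function names `tau` and `expected_noise` are hypothetical, and the default of 4096 rows is taken from the factor $4096m$ used in the text.

```python
import math


def tau(alpha: float) -> float:
    """Closed-form solution of (e^-alpha)^(-tau) = 2 - e^-alpha, Equation (27)."""
    return math.log(2.0 - math.exp(-alpha)) / alpha


def expected_noise(s: int, m: int, rows: int = 4096) -> float:
    """E(k2_hat + k3_hat) from Equation (29).

    s: number of distinct flows, m: bits per vector,
    rows: number of vectors per array (assumed 4096).
    """
    alpha = s / (rows * m)
    t = tau(alpha)
    second_order = (math.exp(alpha) - alpha - 1.0) / 2.0
    return m * (1.0 - t) * alpha + (1.0 + math.exp(-2.0 * (1.0 + t) * alpha)) / rows * second_order
```

Calling `tau(alpha)` for any positive `alpha` returns a value strictly between 0 and 1, in line with the argument following (27).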

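The probability for $B_i$ and $B_j$, $\forall i, j \in [0, \ldots, m-1]$, $i \ne j$, to happen simultaneously is

$$\begin{aligned}
\mathrm{Prob}\{B_jB_i\} &= \mathrm{Prob}\{C_jC_i(D_j^{*} \cup D_j)(D_i^{*} \cup D_i)\} \\
&= \mathrm{Prob}\{C_jC_i(D_j^{*}D_i^{*} \cup D_jD_i \cup D_j^{*}D_i \cup D_jD_i^{*})\} \\
&= \mathrm{Prob}\{C_jC_i\}\Big(\mathrm{Prob}\{D_j^{*}D_i^{*} \cup D_jD_i\} + \mathrm{Prob}\{D_j^{*}\overline{D_i^{*}}\,\overline{D_j}D_i \cup \overline{D_j^{*}}D_i^{*}D_j\overline{D_i}\}\Big) \\
&= \mathrm{Prob}\{C_jC_i\}\Big(2\,\mathrm{Prob}\{D_jD_i\} - \mathrm{Prob}\{D_jD_i\}^{2} + 2\,\mathrm{Prob}\{D_j^{*}\overline{D_i^{*}}\,\overline{D_j}D_i\}\Big) \\
&= \mathrm{Prob}\{C_jC_i\}\Big(2\,\mathrm{Prob}\{D_jD_i\} - \mathrm{Prob}\{D_jD_i\}^{2} + 2\,\mathrm{Prob}\{D_j - D_jD_i\}^{2}\Big) \\
&\approx \left(1 - \frac{2}{m}\right)^{k} e^{-2\alpha(1-\tau)}.
\end{aligned} \tag{31}$$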

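Let $\beta = \frac{k}{m} + (1 - \tau)\alpha$ and $q = e^{-\beta}$; then $E(V_m) \approx e^{-\beta}$. Since $V_m = U_m/m$ and $U_m = \sum_{j=0}^{m-1} 1_{B_j}$, we have

$$E(V_m^{2}) = \frac{1}{m^{2}}E\Big(\big(\sum_{j=0}^{m-1} 1_{B_j}\big)^{2}\Big) = \frac{1}{m^{2}}E\Big(\sum_{j=0}^{m-1} 1_{B_j}^{2}\Big) + \frac{2}{m^{2}}E\Big(\sum_{0 \le j < i \le m-1} 1_{B_i}1_{B_j}\Big)$$

… AND N DENOTES THE NUMBER OF SUPERPOINTS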

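Substituting Equations (29), (34), (37), (38) and (39) into (36) yields

$$\begin{aligned}
Var(\hat{k}) &= m\left(e^{\beta} - 1 - \frac{k}{m}\right) + \frac{m}{4096}\cdot\frac{(2 - 2p)^{2}}{(2 - p)^{2}}\,(e^{\alpha} - \alpha - 1) \\
&\quad + 2\left(m\beta + \frac{e^{\beta} - 1 - \frac{k}{m}}{2}\right) \times \left(m(1 - \tau)\alpha + \frac{1 + e^{-2(1+\tau)\alpha}}{4096}\cdot\frac{e^{\alpha} - \alpha - 1}{2}\right) \\
&\quad - 2m^{2}\beta(1 - \tau)\alpha - m\beta\,\frac{(2 - p)^{2} + p^{2}}{(2 - p)^{2}}\cdot\frac{e^{\alpha} - \alpha - 1}{4096} - m(1 - \tau)\alpha\left(e^{\beta} - 1 - \frac{k}{m}\right) \\
&< m\left(e^{\beta} - 1 - \frac{k}{m}\right) + \frac{m}{4096}(e^{\alpha} - \alpha - 1) + 2\left(m\beta + \frac{e^{\beta} - 1 - \frac{k}{m}}{2}\right)\left(m(1 - \tau)\alpha + \frac{e^{\alpha} - \alpha - 1}{4096}\right) \\
&\quad - 2m^{2}\beta(1 - \tau)\alpha - m\beta\,\frac{e^{\alpha} - \alpha - 1}{4096} - m(1 - \tau)\alpha\left(e^{\beta} - 1 - \frac{k}{m}\right) \\
&= m\left(e^{\beta} - 1 - \frac{k}{m}\right) + \left(m + m\beta + e^{\beta} - 1 - \frac{k}{m}\right)\frac{e^{\alpha} - \alpha - 1}{4096}.
\end{aligned} \tag{40}$$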

D. Estimation Bias and Relative Mean Error

The estimation bias is

$$E(\hat{k} - k) \approx \frac{e^{\beta} - 1 - \frac{k}{m}}{2} - \frac{1 + e^{-2(1+\tau)\alpha}}{4096}\cdot\frac{e^{\alpha} - \alpha - 1}{2}. \tag{41}$$

Table V gives an example of the bias with respect to $k$, for $s = 100000$ and $m = 256$, 512, or 1024. It is seen that the bias is relatively small compared to the true cardinality. The mean squared error is

$$E(\hat{k} - k)^{2} = Var(\hat{k}) + \big(E(\hat{k} - k)\big)^{2}. \tag{42}$$

The Relative Mean Error (RME) of $\hat{k}$ is defined as follows:

$$RME(\hat{k}) = \frac{\sqrt{E(\hat{k} - k)^{2}}}{k}. \tag{43}$$

E. Complexity Analysis of VBF

The proposed method has low storage (SRAM) complexity and allows for high-speed links.

Memory (SRAM) Consumption: Define $M_{vbf}$ bytes as the required memory size of VBF. We have

$$M_{vbf} = 5 \times 4096 \times m/8 = 2560m,$$

where $m$ is the number of bits in a vector. Thus a moderate amount of SRAM can support very high link speeds. $M_{vbf} = 1.25$ MB when $m = 512$. In [43], 72-Mbit SRAM is in production and available today.

Streaming Speed: For the proposed method, the processing time is determined by the VBF. The calculation of the hash values in VBF can be executed in parallel in hardware, so we can ignore the time needed to obtain hash values. Moreover, CPU time is much shorter than memory access time, so we use the required number of memory accesses to measure the required processing time. With parallel access, VBF requires one write to SRAM. Using SRAM with an access time of less than 5 ns [43], and given that the packet time of a minimum-length (40-byte) packet on an OC-768 link is 8 ns, the VBF can support 40 Gbps links.

V. EXPERIMENTS

A. Evaluation Metrics

In this paper, the experimental data are based on actual network traffic from three different locations on the real-world Internet: the MAWI Working Group of the WIDE Project (MAWI, MAWIL) [44], the Jiangsu provincial network border of the China Education and Research Network (CERNET) [45], and NLANR (IPKS0, IPKS1) [46]. Table VI summarizes all the traces used in the experiments. We use these traces to evaluate the proposed detection method. The False Negative Ratio (FNR) and False Positive Ratio (FPR) are adopted to evaluate the detection accuracy of the algorithms:

$$FNR = \frac{s^{-}}{s}, \qquad FPR = \frac{s^{+}}{s},$$

where $s$ is the number of actual superpoints, $s^{-}$ is the number of actual superpoints that are not identified, and $s^{+}$ is the number of non-superpoints that are incorrectly identified. The Weighted Mean Difference (WMD) is adopted as the evaluation metric for the estimated cardinality. Suppose the number of flows of superpoint $i$ is $n_i$ and the estimate of this number is $\hat{n}_i$. The value of WMD is given by

$$WMD = \frac{\sum_i |n_i - \hat{n}_i|}{\sum_i n_i}.$$


Fig. 2. Comparison of accuracy in estimating cardinalities of VBF, Sampled and Bitmap. The given threshold m ∗ is 0.025% of the total flow number. (a) MAVI. (b) IPKS0. (c) IPKS1.

Fig. 3. The measured relative error $RE(\hat{k}) = \frac{|\hat{k} - k|}{k}$ vs. the calculated relative mean error based on (43) for VBF. The given threshold $m^{*}$ is 0.05% of the total flow number and $m = 512$. (a) MAVI. (b) IPKS0. (c) IPKS1.

That is, the sums are over all actual superpoints. When a superpoint is not identified correctly, its nˆ i is 0.
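The three metrics can be computed directly from their definitions. The following Python sketch is a minimal illustration, assuming the ground truth and the detector output are both available as dictionaries keyed by host; the function name `detection_metrics` and the dictionary layout are hypothetical and not part of the original evaluation code.

```python
def detection_metrics(actual: dict, estimated: dict) -> tuple:
    """Return (FNR, FPR, WMD) as defined in this section.

    actual:    host -> true cardinality, for every actual superpoint
    estimated: host -> estimated cardinality, for every reported superpoint
    """
    s = len(actual)
    s_minus = sum(1 for host in actual if host not in estimated)  # missed superpoints
    s_plus = sum(1 for host in estimated if host not in actual)   # false alarms
    fnr = s_minus / s
    fpr = s_plus / s
    # Sums run over actual superpoints; a missed superpoint counts as estimate 0.
    abs_diff = sum(abs(n - estimated.get(host, 0.0)) for host, n in actual.items())
    wmd = abs_diff / sum(actual.values())
    return fnr, fpr, wmd
```

For instance, with four actual superpoints of which one is missed and one extra host is reported, both FNR and FPR evaluate to 0.25, while the missed host contributes its full cardinality to the numerator of WMD.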

TABLE VII: EVALUATION METRICS OF ESTIMATED RESULTS FROM PERFORMING THE THREE ALGORITHMS FOR DISTINCT TRACES. THE DEFINED THRESHOLD $m^{*}$ IS 0.1% OF THE TOTAL FLOW NUMBER

B. Comparing With Other Algorithms

In this section, the VBF is compared with the flow sampling algorithm (Sampled) in [21] and the Bitmap algorithm (Bitmap) in [23]. For Sampled, the sampling rate $p$ is set to 1/8 and the estimated cardinality is obtained by scaling the sampled cardinality by 8. For Bitmap, we set the size of the 2D array $A$ to 2 MB (512 rows × 16384 columns) and the flow sampling rate to 1/8. For VBF, we set $m$ to 512.

Fig. 2 compares the cardinalities of the superpoints estimated by the three algorithms with their actual cardinalities for the traces IPKS0, IPKS1, and MAVI, respectively, where the x-coordinate is the actual cardinality of a superpoint and the y-coordinate is its estimated cardinality. The diagonal line is a natural reference line for evaluating the estimation performance: the nearer a point is to the diagonal, the closer the estimated cardinality is to the actual value. It can be seen that most of the points generated by VBF are closer to the diagonal line. This means that VBF is much better than Sampled and Bitmap in estimating cardinality.

Fig. 3 shows the relative errors of the cardinality estimates for the superpoints identified correctly by VBF. The Relative Error (RE) is defined as $RE(\hat{k}) = \frac{|\hat{k} - k|}{k}$, where $k$ is the actual cardinality of a superpoint and $\hat{k}$ is the corresponding estimate. We compute the relative errors of the estimates by VBF for $m^{*} = 0.1\%$, 0.05%, 0.025% of the total flow number and $m = 256$, 512, 1024, respectively. The experimental results show that the relative errors of the VBF estimates are small. Due to space limitations, Fig. 3 only presents the results for $m^{*} = 0.05\%$ of the total flow number and $m = 512$ for the three different traces. We also compare the RE with the RME based on (43). Although there are some superpoints whose REs are slightly larger than the RME, most of the REs are very close to the RME line. This means that the relative error measured for VBF is close to the relative mean error obtained by theoretical analysis.

In Tables VII and VIII, the three methods are compared by computing the three evaluation metrics. For Bitmap and VBF, the bit vectors become almost full when the cardinality value is close to 3194 ($m \ln m$, $m = 512$). To be fair, in computing WMD we only consider the superpoints whose cardinalities are between the threshold and 3000. We see that almost all FNR, FPR, and WMD values of VBF are smaller than those of Bitmap and Sampled. Therefore, in terms of detection accuracy, our VBF outperforms Bitmap and Sampled.

In Table IX, the three methods are compared in terms of memory requirements on a very long trace, MAVIL. To distinguish the two algorithms of Sampled in [21], we refer to the one-level filtering algorithm and the two-level filtering algorithm as 1LF and 2LF, respectively. For 1LF and 2LF, $\delta = 0.05$, $b = 2$, and Table IX only shows the memory requirement of the hash tables, not including the memory usage of the linked lists storing flow IDs or source IPs. For Bitmap, the 2D bit array is still 512 × 16384 (2 MB) and the sampling rate is set to 1/8. The majority of its memory is used to record the source IDs sequentially. For VBF, $m$ is set to 1024. Compared with Sampled and Bitmap, VBF is more efficient since it does not store the flow IDs or source IPs.

Fig. 4. FNR, FPR, and WMD of VBF for different traces for different thresholds. The given thresholds are 0.025%, 0.05%, and 0.1% of the total flow number of the trace, respectively. The vector size $m = 512$. (a) FNR. (b) FPR. (c) WMD.

Fig. 5. FNR, FPR, and WMD of VBF for different traces for different vector sizes $m = 256$, 512, and 1024, respectively. Keep threshold $m^{*} = 0.05\%$ of the total flow number fixed. (a) FNR. (b) FPR. (c) WMD.

TABLE VIII: EVALUATION METRICS OF ESTIMATED RESULTS FROM PERFORMING THE THREE ALGORITHMS FOR DISTINCT TRACES. THE DEFINED THRESHOLD $m^{*}$ IS 0.05% OF THE TOTAL FLOW NUMBER

TABLE IX: COMPARISON OF MEMORY REQUIREMENTS FOR VBF, SAMPLED AND BITMAP. THE GIVEN THRESHOLD IS 0.05% OF THE TOTAL FLOW NUMBER FOR MAVIL

C. Testing VBF for Different Parameters

We also investigate the impact of different thresholds $m^{*}$ on the performance of the VBF. When the threshold $m^{*}$ decreases, the number of superpoints in the trace increases. Fig. 4 presents the experimental results of VBF under different thresholds, where $m^{*}$ varies over 0.025%, 0.05%, and 0.1% of the total flow number of the trace while the bit vector size $m$ is fixed (set to 512). It is seen that the FNRs and FPRs of VBF increase when $m^{*}$ decreases for most traces, while varying $m^{*}$ has almost no effect on the WMDs of VBF. Fig. 5 shows the experimental results when the vector size $m$ is 256, 512, and 1024, respectively. In the experiments,


the threshold $m^{*} = 0.05\%$ of the total flow number is fixed and the computation of WMDs only involves the superpoints whose cardinalities are less than $m \ln m$. One can see that the FPR and WMD of VBF are sharply reduced when $m$ increases. This means that the measurement accuracy of VBF can be greatly improved when more memory is available.

D. Design of Hash Functions

Table I gives the detailed specification of the first four hash functions of VBF. The remaining two hash functions $h_5$ and $f$ are required to generate uniform output; however, their construction has not yet been described. In all our experiments, we use only a few bitwise operators to implement $h_5$ and $f$. For a SIP $= s_1s_2 \cdots s_8$, where $s_i$ ($1 \le i \le 8$) is a 4-bit string, function $h_5$ is designed as follows:

$$h_5(\mathrm{SIP}) = s_1s_2s_3 \oplus s_4s_5 0 \oplus s_6s_7s_8,$$

where $\oplus$ denotes exclusive OR and $s_4s_5 0$ is the string $s_4s_5$ appended by 3 "0" bits. Now, we give the implementation of $f$ in our experiments for $m = 512$. For an arriving packet (SIP, DIP), $f$ performs the following operations:

1. SIP = htonl(SIP);
2. a = SIP ⊕ DIP;
3. b = (a & 0x0000FFFF); a = (a ≫ 16);
4. a = a ⊕ b;
5. b = a ≫ 7; a = a & 0x01FF;
6. a = a ⊕ b;

where the byte order of SIP is inverted using htonl in Line 1 and the exclusive OR operation is performed in Line 2. Line 3 yields two 16-bit strings and Line 5 obtains two 9-bit strings. Line 6 generates the output value of function $f$.

Although the $h_5$ and $f$ used in the experiments might not be truly random, VBF identifies the superpoints in the traces efficiently and accurately. Therefore, using bitwise operators to implement hash functions is a viable method. The computational complexity of such functions is very low.

VI. CONCLUSIONS

In this work, we have constructed a Vector Bloom Filter (VBF) and designed the corresponding algorithm to identify superpoints. For each arriving packet, the VBF only needs to set several bits to 1. The computational complexity of the hash functions is low because they only take some consecutive bits as the output. The theoretical analysis shows that the false positive ratio originating from phantoms is very low. The experimental results demonstrate that the VBF can precisely and efficiently detect superpoints. Treating an IP address as a bit string, designing hash functions based on bitwise operators is a viable method.

ACKNOWLEDGMENTS

The authors thank Qingbo Yin and Jun Zhang for their valuable comments and suggestions. They also thank Wanlei Zhou and the anonymous reviewers for their insightful comments and feedback.
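For readers who wish to experiment with the bitwise hash functions of Section V-D, the following Python sketch mirrors the six steps given for $f$ and the XOR construction of $h_5$. It is an illustration under explicit assumptions, not the authors' code: addresses are passed as unsigned 32-bit integers with $s_1$ as the most significant nibble, `htonl` is modeled as an unconditional byte reversal (its behavior on a little-endian host), and $s_4s_5$ is padded on the right to 12 bits so that all three operands of $h_5$ have the same width. The helper name `byte_swap32` is hypothetical.

```python
def byte_swap32(x: int) -> int:
    """Reverse the byte order of a 32-bit value (htonl on a little-endian host)."""
    return int.from_bytes((x & 0xFFFFFFFF).to_bytes(4, "little"), "big")


def h5(sip: int) -> int:
    """12-bit hash of a source IP: s1s2s3 XOR (s4s5 padded to 12 bits) XOR s6s7s8."""
    s123 = (sip >> 20) & 0xFFF  # top three nibbles
    s45 = (sip >> 12) & 0xFF    # middle two nibbles
    s678 = sip & 0xFFF          # bottom three nibbles
    return s123 ^ (s45 << 4) ^ s678  # assumption: pad s4s5 to 12 bits


def f(sip: int, dip: int) -> int:
    """9-bit column index for m = 512, following Lines 1 to 6 of the text."""
    sip = byte_swap32(sip)  # Line 1
    a = sip ^ dip           # Line 2
    b = a & 0x0000FFFF      # Line 3: low 16 bits
    a = a >> 16             # Line 3: high 16 bits
    a = a ^ b               # Line 4
    b = a >> 7              # Line 5: upper 9 bits of the 16-bit value
    a = a & 0x01FF          # Line 5: lower 9 bits
    return a ^ b            # Line 6
```

Both functions use only shifts, masks, and XORs, which is why their cost per packet is negligible compared with a memory access.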

R EFERENCES [1] K. Xu, F. Wang, and L. Gu, “Behavior analysis of Internet traffic via bipartite graphs and one-mode projections,” IEEE/ACM Trans. Netw., vol. 22, no. 3, pp. 931–942, Jun. 2014. [2] T. Li, S. Chen, and Y. Ling, “Per-flow traffic measurement through randomized counter sharing,” IEEE/ACM Trans. Netw., vol. 20, no. 5, pp. 1622–1634, Oct. 2012. [3] J. Zhang, C. Chen, Y. Xiang, W. Zhou, and Y. Xiang, “Internet traffic classification by aggregating correlated naive Bayes predictions,” IEEE Trans. Inf. Forensics Security, vol. 8, no. 1, pp. 5–15, Jan. 2013. [4] J. Zhang, X. Chen, Y. Xiang, W. Zhou, and J. Wu, “Robust network traffic classification,” IEEE/ACM Trans. Netw., vol. 23, no. 3, pp. 1257–1270, Aug. 2015. [5] C. Hu, B. Liu, S. Wang, J. Tian, Y. Cheng, and Y. Chen, “ANLS: Adaptive non-linear sampling method for accurate flow size measurement,” IEEE Trans. Commun., vol. 60, no. 3, pp. 789–798, Mar. 2012. [6] F. Khan, N. Hosein, S. Ghiasi, C.-N. Chuah, and P. Sharma, “Streaming solutions for fine-grained network traffic measurements and analysis,” IEEE/ACM Trans. Netw., vol. 22, no. 2, pp. 377–390, Apr. 2014. [7] P. Wang, X. Guan, J. Zhao, J. Tao, and T. Qin, “A new sketch method for measuring host connection degree distribution,” IEEE Trans. Inf. Forensics Security, vol. 9, no. 6, pp. 948–960, Jun. 2014. [8] M. Rezvani, V. Sekulic, A. Ignjatovic, E. Bertino, and S. Jha, “Interdependent security risk analysis of hosts and flows,” IEEE Trans. Inf. Forensics Security, vol. 10, no. 11, pp. 2325–2339, Nov. 2015. [9] W. Chen, Y. Liu, and Y. Guan, “Cardinality change-based early detection of large-scale cyber-attacks,” in Proc. IEEE INFOCOM, Apr. 2013, pp. 1788–1796. [10] C. A. Shue, A. J. Kalafut, and M. Gupta, “Abnormally malicious autonomous systems and their Internet connectivity,” IEEE/ACM Trans. Netw., vol. 20, no. 1, pp. 220–230, Feb. 2012. [11] Y. Wang, S. Wen, Y. Xiang, and W. 
Zhou, “Modeling the propagation of worms in networks: A survey,” IEEE Commun. Surveys Tuts., vol. 16, no. 2, pp. 942–960, May 2014. [12] M. Egele, T. Scholte, E. Kirda, and C. Kruegel, “A survey on automated dynamic malware-analysis techniques and tools,” ACM Comput. Surv., vol. 44, no. 2, Feb. 2012, Art. ID 6. [13] D. Moore, V. Paxson, S. Savage, C. Shannon, S. Staniford, and N. Weaver, “Inside the Slammer worm,” IEEE Security Privacy, vol. 1, no. 4, pp. 33–39, Jul./Aug. 2003. [14] D. Plonka, “FlowScan: A network traffic flow reporting and visualization tool,” in Proc. USENIX LISA, Dec. 2000, pp. 305–318. [15] N. Kamiyama, T. Mori, and R. Kawahara, “Simple and adaptive identification of superspreaders by flow sampling,” in Proc. IEEE INFOCOM, May 2007, pp. 2481–2485. [16] G. Cheng, J. Gong, W. Ding, H. Wu, and S. Qiang, “Adaptive sampling algorithm for detection of superpoints,” Sci. China F, Inf. Sci., vol. 51, no. 11, pp. 1804–1821, Nov. 2008. [17] M. Yoon, T. Li, S. Chen, and J.-K. Peir, “Fit a compact spread estimator in small high-speed memory,” IEEE/ACM Trans. Netw., vol. 19, no. 5, pp. 1253–1264, Oct. 2011. [18] P. Wang, X. Guan, W. Gong, and D. Towsley, “A new virtual indexing method for measuring host connection degrees,” in Proc. IEEE INFOCOM, Apr. 2011, pp. 156–160. [19] X. Guan, P. Wang, and T. Qin, “A new data streaming method for locating hosts with large connection degree,” in Proc. IEEE GLOBECOM, Nov./Dec. 2009, pp. 1–6. [20] W. Liu, W. Qu, G. Jian, and L. Keqiu, “A novel data streaming method detecting superpoints,” in Proc. IEEE INFOCOM WKSHPS, Apr. 2011, pp. 1042–1047. [21] S. Venkataraman, D. Song, P. B. Gibbons, and A. Blum, “New streaming algorithms for fast detection of superspreaders,” in Proc. NDSS, Feb. 2005, pp. 149–166. [22] J. Cao, Y. Jin, A. Chen, T. Bu, and Z.-L. Zhang, “Identifying high cardinality Internet hosts,” in Proc. IEEE INFOCOM, Apr. 2009, pp. 810–818. [23] Q. Zhao, A. Kumar, and J. 
Xu, “Joint data streaming and sampling techniques for detection of super sources and destinations,” in Proc. ACM SIGCOMM IMC, Oct. 2005, p. 7. [24] T. Li, S. Chen, W. Luo, M. Zhang, and Y. Qiao, “Spreader classification based on optimal dynamic bit sharing,” IEEE/ACM Trans. Netw., vol. 21, no. 3, pp. 817–830, Jun. 2013. [25] K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor, “A linear-time probabilistic counting algorithm for database applications,” ACM Trans. Database Syst., vol. 15, no. 2, pp. 208–229, Jun. 1990.

[26] C. Estan, G. Varghese, and M. Fisk, “Bitmap algorithms for counting active flows on high-speed links,” IEEE/ACM Trans. Netw., vol. 14, no. 5, pp. 925–937, Oct. 2006. [27] M. Yoon and S. Chen, “Detecting stealthy spreaders by random aging streaming filters,” IEICE Trans. Commun., vols. E94-B, no. 8, pp. 2274–2281, Aug. 2011. [28] G. Cheng and Y. Tang, “Line speed accurate superspreader identification using dynamic error compensation,” Comput. Commun., vol. 36, no. 13, pp. 1460–1470, Jul. 2013. [29] Q. Xiao, Y. Qiao, M. Zhen, and S. Chen, “Estimating the persistent spreads in high-speed networks,” in Proc. IEEE 22nd ICNP, Oct. 2014, pp. 131–142. [30] T. Wellem, G.-W. Li, and Y.-K. Lai, “Superspreader detection system on NetFPGA platform,” in Proc. ANCS, Oct. 2014, pp. 247–248. [31] R. Schweller et al., “Reversible sketches: Enabling monitoring and analysis over high-speed data streams,” IEEE/ACM Trans. Netw., vol. 15, no. 5, pp. 1059–1072, Oct. 2007. [32] P. Wang, X. Guan, T. Qin, and Q. Huang, “A data streaming method for monitoring host connection degrees of high-speed links,” IEEE Trans. Inf. Forensics Security, vol. 6, no. 3, pp. 1086–1098, Sep. 2011. [33] P. Wang, X. Guan, D. Towsley, and J. Tao, “Virtual indexing based methods for estimating node connection degrees,” Comput. Netw., vol. 56, no. 12, pp. 2773–2787, Aug. 2012. [34] T. Qin, X. Guan, W. Li, P. Wang, and M. Zhu, “A new connection degree calculation and measurement method for large scale network monitoring,” J. Netw. Comput. Appl., vol. 41, no. 1, pp. 15–26, May 2014. [35] Y. Liu, W. Chen, and Y. Guan, “Identifying high-cardinality hosts from network-wide traffic measurements,” in Proc. IEEE Conf. Commun. Netw. Secur., Oct. 2013, pp. 287–295. [36] W. Liu, W. Qu, X. He, and Z. Liu, “Detecting superpoints through a reversible counting bloom filter,” J. Supercomput., vol. 63, no. 1, pp. 218–234, Jan. 2013. [37] T. Zseby, M. Molina, N. Duffild, S. Niccolini, and F. 
Raspall, Sampling and Filtering Techniques for IP Packet Selection, document RFC 5475, Mar. 2009. [38] A. Pagh and R. Pagh, “Uniform hashing in constant time and optimal space,” SIAM J. Comput., vol. 38, no. 1, pp. 85–96, 2008. [39] K.-M. Chung, M. Mitzenmacher, and S. Vadhan, “Why simple hash functions work: Exploiting the entropy in a data stream,” Theory Comput., vol. 9, no. 30, pp. 897–945, Dec. 2013. [40] M. Molina, S. Niccolini, and N. Duffield, “A comparative experimental study of hash functions applied to packet sampling,” in Proc. ITC, Beijing, China, Aug. 2005, pp. 1–10. [41] G. Cheng, W. Zhao, and J. Gong, “XOR hashing algorithms to measured flows at the high-speed link,” in Proc. Int. Conf. Future Generat. Commun. Netw., Dec. 2008, pp. 153–155. [42] C. J. Martinez, D. K. Pandya, and W.-M. Lin, “On designing fast nonuniformly distributed IP address lookup hashing algorithms,” IEEE/ACM Trans. Netw., vol. 17, no. 6, pp. 1916–1925, Dec. 2009. [43] Cypress Semiconductor Corporation. Memory Products. [Online]. Available: http://www.cypress.com/, accessed Nov. 2015. [44] WIDE. MAWI Working Group Traffic Archive. [Online]. Available: http://tracer.csl.sony.co.jp/mawi/, accessed Nov. 2015. [45] JSLAB. IP Trace Distribution System. [Online]. Available: http:// iptas.edu.cn/src/system.php, accessed Nov. 2015. [46] NLANR. Passive Measurement and Analysis. ftp://wits.cs.waikato. ac.nz/pma/long/ipls/3/, accessed Nov. 2015.


Weijiang Liu received the Ph.D. degree in computation mathematics from Jilin University, Changchun, China, in 1998. From 2004 to 2006, he was a Postdoctoral Fellow of Postdoctoral Station for Computer Science and Technology with Southeast University, China. He is currently a Professor with the School of Information Technology, Dalian Maritime University, China. He has authored over 50 papers and his current research interests include network measurement, network performance, and network security.

Wenyu Qu received the bachelor’s and master’s degrees from the Dalian University of Technology, China, in 1994 and 1997, respectively, and the Ph.D. degree from the Japan Advanced Institute of Science and Technology, Japan, in 2006. She is currently a Professor with the School of Software, Tianjin University. She was a Professor with Dalian Maritime University, China, from 2007 to 2015. She was an Assistant Professor with the Dalian University of Technology, China, from 1997 to 2003. Her research interests include cloud computing, computer networks, and information retrieval. She has authored over 80 technical papers in international journals and conferences. She is on the committee board for a couple of international conferences.

Jian Gong received the B.S. degree from Nanjing University, and the M.S. and Ph.D. degrees from Southeast University, China, all in computer science. He is currently a Professor with the School of Computer Science and Engineering, Southeast University. He has authored over 100 papers and his current research interests include network measurement, network performance, and network security.

Keqiu Li (SM’12) received the bachelor’s and master’s degrees from the Department of Applied Mathematics, Dalian University of Technology, China, in 1994 and 1997, respectively, and the Ph.D. degree from the Graduate School of Information Science, Japan Advanced Institute of Science and Technology, in 2005. He also has two-year postdoctoral experience with the University of Tokyo, Japan. He is currently a Professor with the School of Computer Science and Technology, Dalian University of Technology. He has authored over 100 technical papers. He is an Associate Editor of the IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS and the IEEE TRANSACTIONS ON COMPUTERS. His research interests include Internet technology, data center networks, cloud computing, and wireless networks.