2014 IEEE International Conference on Big Data
A Summarization Paradigm for Big Data

Zubair Shah
University of New South Wales, Canberra, Australia
Email: [email protected]

Abdun Naser Mahmood
University of New South Wales, Canberra, Australia
Email: [email protected]
which means that its count, after removing the counts of all its descendant HHHs, exceeds φN, where φ is a user-specified parameter and N is the number of packets processed so far. In the literature, solutions to this type of problem are compared in terms of space usage and update cost, which is often bounded using the error ε ∈ [0, 1] induced by the proposed solution in its estimates, made with probability δ.
Abstract—We have developed an efficient summarization paradigm for data drawn from hierarchical domains to construct a succinct view of important large-valued regions ("heavy hitters"). It requires one pass over the data with a moderate number of updates per element, and it requires less memory than existing approaches for approximating hierarchically discounted frequency counts of heavy hitters with provable guarantees. The proposed technique is generic: it can make use of existing state-of-the-art sketch-based or count-based frequency estimation approaches, and any algorithm from either family can be plugged in as a subroutine without substantial modification. Both experimental and theoretical justifications are provided for its significance.
II. OVERVIEW OF THE METHODOLOGY
We model the φ-HHH problem over a continuous unbounded stream S = {R1, R2, R3, ...}. Each element Ri in S is characterized by a set of attributes drawn from hierarchies, where attribute i has height hi. Each Ri can be generalized along its attribute hierarchies using the ≺ relation; for instance, Ri ≺ Ri^p means Ri is generalized to Ri^p. Such generalization, when performed on IP data at the octet level, induces the mathematical lattice structure shown in Fig. 1. In the overlap semantics, where each node holds the count of its corresponding sub-lattice, a rescaling of φ is required to avoid over-counting: as Fig. 1 illustrates, a count of 1 at the base node is added at every node and grows as one moves up the lattice. The number of nodes in the lattice is η = ∏_{i=1}^{d} (1 + h_i) and the height of the lattice is λ = 1 + ∑_{i=1}^{d} h_i, where d is the number of dimensions. Formally, ε-approximate HHH is defined in [1] as follows:
Keywords-Hierarchical Heavy Hitters, Data Summarization, Big Data
I. INTRODUCTION
"Big Data" embraces both incredible promise and substantial challenges, as widely acknowledged in the scientific and business communities. An effective technique for dealing with it is summarization, such as sampling and sketches. Instead of working on large and complex raw data directly, these techniques enable various data analytics tasks to operate on carefully created summaries, which improves their scalability and efficiency. The processing of such data is challenging because it often involves dealing with variety, volume, and velocity. The data may have attributes that take values from multiple hierarchies, such as time, geographic location, and IP addresses. Analyzing such data at multiple aggregation levels simultaneously is much more challenging than analyzing flat data. For example, computing the total number of transactions per minute, per hour, per day, and so on may be easy; however, computing the total per time granularity, per geographic location, and per product/transaction category simultaneously is much more complex. The complexity increases as the number of dimensions (attributes) or the depth of the hierarchies increases. From an IP stream, computing Hierarchical Heavy Hitter (HHH) traffic at source and destination IP addresses at different aggregation levels is of particular interest to network monitoring teams for management and decision-making purposes. This is known as the φ-HHH computation,
978-1-4799-5666-1/14/$31.00 ©2014 IEEE
Definition 1 (ε-approximate HHH). Let φ be the support, N the current count of records processed so far, FH the set of all HHH, FH_l ⊆ FH the set of HHH at level l of the lattice where 0 ≤ l ≤ λ, and ε ∈ [0, φ] the error tolerance. Then:
• FH_0, the HHH at level 0, are simply the heavy hitters of S;
• FH_l, the HHH at level l of the lattice, contains each Ri^p whose discounted frequency f(Ri^p) = Σ f(Ri^c), summed over the Ri^c ∉ FH_{l-1} with Ri^c ≺ Ri^p, satisfies f(Ri^p) ≥ (φ − ε)N;
• FH = ∪_{l=0}^{λ} FH_l.
Computation of HHH is an inductive process. The count of an HHH at any node must contain the counts of all its descendant nodes except those that are heavy hitters
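As a concrete illustration of Definition 1, the following sketch computes exact HHH offline on a toy one-dimensional octet hierarchy. The helper names and the toy stream are ours, and this is the naive exact computation, not the streaming algorithm described in this paper:

```python
from collections import Counter

def parent(prefix):
    """One-step generalization: replace the last specified octet with '*'."""
    parts = prefix.split(".")
    i = max(j for j, p in enumerate(parts) if p != "*")
    parts[i] = "*"
    return ".".join(parts)

def exact_hhh(stream, phi):
    """Exact (offline) HHH per Definition 1 with eps = 0: a node is an HHH
    if its count, after discounting descendants already reported as HHH,
    is at least phi * N."""
    N = len(stream)
    level = list(Counter(stream).items())   # level-0 (leaf) counts
    hhh = set()
    for _ in range(4):                      # climb the 4-level octet hierarchy
        hhh |= {x for x, c in level if c >= phi * N}
        up = Counter()
        for x, c in level:
            if x not in hhh:                # discount children already in FH_l
                up[parent(x)] += c
        level = list(up.items())
    hhh |= {x for x, c in level if c >= phi * N}   # root level
    return hhh

# Toy stream: "10.1.1.1" is a heavy hitter on its own, and the subnet
# "10.1.*.*" becomes heavy once its non-HHH descendants are summed.
stream = ["10.1.1.1"] * 6 + ["10.1.1.2"] * 3 + ["10.1.2.1"] * 3 + ["20.2.2.2"] * 2
hhh = exact_hhh(stream, phi=0.3)
```

Here N = 14 and φN = 4.2, so "10.1.1.1" (count 6) is an HHH at level 0, and "10.1.*.*" is an HHH because its discounted count 3 + 3 = 6 still exceeds the threshold.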
form of a string, with the value of each attribute separated by "-" followed by "|" and a level vector L. The heap is compressed periodically to remove records whose frequency is below φN; these are given back to insertPacket() with the current count maintained in the heap. Thus, there are two sources that produce incoming records for the insertPacket() subroutine: the stream S and the compress() subroutine. Records from S have all their attributes in basic form, i.e., no attribute is generalized, whereas records from the compress() subroutine can be at any level of the lattice hierarchy. Clearly, any two elements of the lattice in Fig. 1 are either comparable or incomparable under the ≺ relation. If two elements are incomparable, their HHH computations are independent and need not be considered together. If they are comparable (i.e., they hold an ancestor-descendant relation), then the count of the ancestor depends on the descendant but not the other way around. Hence it is only required to adjust the counts of ancestors when a descendant becomes an HHH or when it is no longer an HHH. With this in mind, at a high level our framework can be visualized as exchanging elements between two data structures: a heap H and a frequency estimator E (freqEst). The operations performed on elements of H and E always follow reverse chronological order (descendants before ancestors) and can be summarized as the following four operations.
1) When an element exists in E and a new occurrence arrives, its count is increased in E.
2) When an element exists in E and becomes frequent, it is moved to H and its count is removed from all of its ancestors in E, as well as in H if one exists there.
3) When an element exists in H and a new occurrence arrives, its count is increased in H but not in E.
4) When an element exists in H and becomes infrequent, it is first removed from H, then moved to E, and its count is added to all of its ancestors in E, as well as in H if one exists there.
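The exchange between the two data structures can be sketched as follows. This is a simplified illustration in which a plain dictionary stands in for the pluggable frequency estimator (LC or CMS in the paper); the class and method names are ours, and details such as error tracking and lattice-aware ancestor enumeration are omitted:

```python
class HHHFramework:
    """Illustrative sketch of the H/E exchange; `ancestors` is a caller-supplied
    function enumerating the ancestors of a record in the lattice."""
    def __init__(self, phi):
        self.phi = phi
        self.H = {}   # heap: current HHH candidates
        self.E = {}   # frequency estimator (a dict here, LC/CMS in the paper)
        self.N = 0

    def insert(self, record, count, ancestors):
        self.N += 1
        if record in self.H:                 # op 3: count only in H, not in E
            self.H[record] += count
            return
        self.E[record] = self.E.get(record, 0) + count   # op 1
        if self.E[record] >= self.phi * self.N:          # op 2: promote to H
            c = self.E.pop(record)
            self.H[record] = c
            for a in ancestors(record):      # discount all ancestors
                if a in self.H:
                    self.H[a] -= c
                elif a in self.E:
                    self.E[a] -= c

    def compress(self, ancestors):
        """Op 4: demote heap entries that fell below phi*N, restoring
        their counts to all ancestors."""
        for r in [r for r, c in self.H.items() if c < self.phi * self.N]:
            c = self.H.pop(r)
            self.E[r] = self.E.get(r, 0) + c
            for a in ancestors(r):
                if a in self.H:
                    self.H[a] += c
                elif a in self.E:
                    self.E[a] += c

# Toy run with a single-level hierarchy: every record's only ancestor is "*".
anc = lambda r: ["*"] if r != "*" else []
fw = HHHFramework(phi=0.5)
for r in ["a", "a", "b"]:
    fw.insert(r, 1, anc)
```

In this toy run "a" is promoted to H on its first arrival (1 ≥ 0.5 · 1) and subsequently counted there, while "b" stays in E.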
Figure 1. Lattice induced by two dimensions (source and destination IP)
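For concreteness, the lattice of Fig. 1 can be enumerated directly. Assuming octet-level hierarchies of height 4 for both source and destination IP, the node count matches η = ∏(1 + h_i) = 25; the helper names and sample addresses below are illustrative:

```python
from itertools import product

def generalizations(ip, levels=4):
    """All octet-level generalizations of one IPv4 address,
    from fully specified down to the fully general '*.*.*.*'."""
    octets = ip.split(".")
    return [".".join(octets[:levels - k] + ["*"] * k) for k in range(levels + 1)]

def lattice_nodes(src, dst):
    """Cross product of per-dimension generalizations: one tuple per node
    of the lattice induced by the two hierarchies (Fig. 1)."""
    return list(product(generalizations(src), generalizations(dst)))

nodes = lattice_nodes("10.1.2.3", "192.168.0.7")
eta = len(nodes)          # (1 + 4) * (1 + 4) = 25 nodes
lam = 1 + 4 + 4           # lattice height: 1 + sum of hierarchy heights = 9
```

Each incoming packet thus maps to 25 lattice nodes, which is why the per-packet update cost grows with η.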
themselves. At each level it suffices to account for the immediate descendants, because the immediate descendants already carry the counts of their own immediate descendants, and so on. Computing HHH with exact frequencies would require space linear in the size of the input. Therefore, in the data stream model, with constrained resources and each record processed only once, an approximate answer with bounded error is acceptable. Any algorithm that solves the HHH problem must satisfy the following two properties.
• Accuracy: for any record Ri ∈ FH, the algorithm should provide the guaranteed precision f(Ri) ≥ (φ − ε)N.
• Coverage: for any record Ri ∉ FH, the algorithm should ensure f(Ri) < (φ − ε)N.
The Accuracy property enforces that each record output by the algorithm is an HHH, and the Coverage property enforces that no HHH is missed. Together, Accuracy and Coverage guarantee that the result produced by the algorithm is correct. Our algorithm has a subroutine insertPacket() that takes as parameters a record Ri, the count c associated with the record, a reference to the heap, and a reference to a frequency estimation algorithm freqEst, where freqEst can be any frequency estimation technique, such as a sketch-based technique like the Count-Min Sketch (CMS) [2] or a count-based technique like Lossy Counting (LC) [3]. The heap maintains the set of HHH, and freqEst estimates the frequencies of the records given to it during the processing of our algorithm. In IP network traffic, the count c is the size of the payload in bytes. The record Ri is in the
Lemma 1. For any two records Ri ∈ S and Rj ∈ S, if ≺ generates two ancestors Ri ≺ Ri^p and Rj ≺ Rj^p such that Ri^p = Rj^p, then f(Ri^p) = f(Rj^p) = f(Ri) + f(Rj). The probability that Ri^p = Rj^p increases at higher levels of generalization.

Lemma 2. Our algorithm satisfies the Accuracy and Coverage properties of HHH.

Lemma 3. Our algorithm requires O(κ/φ + (1/ε) log(ηN)) space for computing HHHs when LC is used as a subroutine.

Lemma 4. Our algorithm requires O(H + η log(ηN)) updates per packet to compute HHHs when LC is used as a subroutine.

Lemma 5. Our algorithm requires O(κ/φ + (1/ε) log(ηN/δ)) space for computing HHHs when CMS is used as a subroutine.
Lemma 6. The update cost of our algorithm per packet is O(H + η log(1/δ)) when CMS is used as a subroutine.
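A minimal Count-Min Sketch of the kind assumed as the freqEst subroutine in Lemmas 5 and 6 might look as follows; the hashing scheme and default parameters are our illustrative choices, not the paper's:

```python
import hashlib

class CountMinSketch:
    """Minimal Count-Min Sketch [2]: width ≈ e/eps columns and
    depth ≈ ln(1/delta) rows give estimates that overcount true
    frequency by at most eps*N with probability at least 1 - delta."""
    def __init__(self, width=272, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        # One independent-looking hash per row, derived by salting blake2b.
        for row in range(self.depth):
            h = hashlib.blake2b(item.encode(), salt=row.to_bytes(8, "big")).digest()
            yield int.from_bytes(h[:8], "big") % self.width

    def update(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum over rows: never undercounts, rarely overcounts much.
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
cms.update("10.1.2.3", 5)
cms.update("192.168.0.7", 2)
```

Because the estimate is the minimum over rows, it never undercounts, which is the one-sided error the Accuracy/Coverage analysis relies on.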
T1 denotes results obtained when our framework uses LC as a subroutine, and T2 when it uses CMS. The results are better than those of the corresponding existing approaches. The space usage is dominated by the space used by E (roughly more than 99%), as can be verified from Table I: varying φ barely changes the space usage, since φ only affects the heap space required to hold the HHH. This shows the effectiveness of the proposed framework, as it requires only a little more space than the underlying frequency estimation algorithm. In general, T1 is very efficient in terms of both space usage and the time required to process a single packet. The per-packet processing time may seem high, but it is acceptable: with octet-level generalization of both IP addresses, the number of lattice nodes is η = 25, which means that each packet must be checked against 25 different prefix combinations, as shown in the lattice figure.
We have omitted the proofs because of space restrictions; they will appear in the full version of this paper.

III. OUR CONTRIBUTION
1) We have proposed a generic framework that can make use of existing state-of-the-art sketch-based (e.g., Count-Min Sketch (CMS) [2]) or count-based (e.g., Lossy Counting (LC) [3]) frequency estimation approaches without any significant modification. The proposed framework requires less memory for φ-HHH computation than existing approaches [1].
2) For computing φ-HHH, the proposed framework can use LC with space O(κ/φ + (1/ε) log(ηN)), where κ/φ is the number of φ-HHH in the data and κ can be bounded by κ ≤ ∏_{i=1}^{d} (1 + h_i) / max_i (1 + h_i).
3) The framework can also use a single CMS to compute φ-HHH, requiring space O(κ/φ + (1/ε) log(ηN/δ)).
4) The update cost of the proposed framework is O(H + η log(ηN)) using LC, and O(H + η log(1/δ)) using CMS. The per-packet update cost for H is hard to calculate, as it depends on how many elements are moved to H and how frequently the compress operation is performed over the heap. However, the number of updates in H per packet can be at most O((Pκ/φ + Σf(Ri))/ηN), where Σf(Ri) is the sum of the frequencies of the Ri ∈ FH and P is the periodicity with which the compress operation is performed. Clearly, (Pκ/φ + Σf(Ri))/ηN depends on both P and the underlying data distribution, and usually (Pκ/φ + Σf(Ri))/ηN ≤ 1 for practical data sets and moderate values of P.
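As a rough numerical illustration of the space bound in contribution 2), consider two octet-level IP hierarchies (d = 2, h_i = 4). The helper function and the chosen φ, ε, and N below are our own example values, not the paper's:

```python
from math import log, prod

# Two dimensions, each an octet-level IPv4 hierarchy of height 4.
h = [4, 4]
eta = prod(1 + hi for hi in h)                           # lattice nodes: 25
kappa_max = prod(1 + hi for hi in h) // max(1 + hi for hi in h)  # kappa <= 5

def lc_counters(phi, eps, N, kappa=kappa_max):
    """Counters implied by the stated bound O(kappa/phi + (1/eps) * log(eta*N))
    when LC is the subroutine (constant factors ignored)."""
    return kappa / phi + (1 / eps) * log(eta * N)

# Example: phi = 0.1, eps = 0.001, N = 10 million packets.
counters = lc_counters(0.1, 0.001, 10**7)
```

The κ/φ term (here at most 50) is dwarfed by the (1/ε) log(ηN) term, consistent with the observation in Section IV that the estimator E dominates the space usage.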
V. CONCLUSIONS
We have proposed an efficient summarization paradigm for big data that tackles volume and velocity for domains that produce hierarchical data. This includes query estimation over OLAP in data warehouses, network traffic monitoring, and other businesses that require hierarchical aggregate views of their data at different levels of aggregation. Our proposed framework is scalable in terms of the volume and velocity of the data for moderate numbers of dimensions. It is also efficient in terms of space and comparable in terms of update cost to the best existing approaches, as shown by theoretical and experimental evaluation. In this short report we have omitted the proofs and the complete experimental results, which will appear in the full version of this work.

REFERENCES
[1] Graham Cormode, Flip Korn, S. Muthukrishnan, and Divesh Srivastava. Finding hierarchical heavy hitters in streaming data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(4):2, 2008.
IV. EVALUATION
Here we provide some results to demonstrate the effectiveness of the proposed framework. We have tested our framework on different data sets containing various hierarchies and different numbers of dimensions. However, the results reported here are for two-dimensional IP network traffic (more than 10 million packets) with source and destination IP addresses, where each IP address is generalizable along the octet-level IP hierarchy. The results are summarized in Table I (space usage) and Table II (time per packet).

φ     T1      T2
0.1   3.473   11.308
0.2   3.472   11.308
0.3   3.472   11.307
0.4   3.470   11.306

Table I. Space usage in MB for ε = 0.00001 and δ = 0.99
T1: 6.4 × 10⁻⁵
[2] Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms, 55(1):58–75, 2005.
[3] Gurmeet Singh Manku and Rajeev Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, pages 346–357. VLDB Endowment, 2002.
T2: 8.0 × 10⁻⁵
Table II. Time in seconds per packet; PC with a Core i7 3.4 GHz processor and 8 GB RAM