Computing Hierarchical Summary from Two-dimensional Big Data Streams

Zubair Shah, Abdun Naser Mahmood, Member, IEEE, Michael Barlow, Member, IEEE, Zahir Tari, Member, IEEE, Xun Yi, Member, IEEE, Albert Y. Zomaya, Fellow, IEEE

Abstract—There are many application domains where hierarchical data is inherent, but surprisingly, there are few techniques for mining patterns from such important data. Hierarchical Heavy Hitters (HHH) and multilevel and Cross-Level Association Rules (CLAR) mining are well-known hierarchical pattern mining techniques. The problem with these techniques, however, is that they focus on capturing only global patterns from the data and cannot identify local contextual patterns. Another problem is that they treat all data items in a transaction equally and do not consider the sequential nature of the relationship among items within a transaction; hence, they cannot capture the correlation semantics within the transactions of the data items. There are many applications, such as clickstream mining, healthcare data mining, network monitoring, and recommender systems, which require identifying local contextual patterns and correlation semantics. In this work, we introduce a new concept that can capture the sequential nature of the relationship between pairs of hierarchical items at multiple concept levels and can capture local contextual patterns within the context of the global patterns. We call this notion Hierarchically Correlated Heavy Hitters (HCHH). Specifically, the proposed approach finds the correlation between items corresponding to hierarchically discounted frequency counts. We provide formal definitions of the proposed concept and develop algorithmic approaches for computing HCHH in data streams efficiently. The proposed HCHH algorithm has deterministic error guarantees and space bounds. It requires O(η/(ε_p ε_s)) memory, where η is a small constant and ε_p ∈ [0,1], ε_s ∈ [0,1] are user-defined upper bounds on the estimation error. We have compared the proposed HCHH concept with existing hierarchical pattern mining approaches both theoretically and experimentally.

Index Terms—Hierarchical Patterns, Association Rules, Frequent Patterns


1 INTRODUCTION

In today's world of Big Data, there is a huge demand for efficient techniques to mine interesting patterns from hierarchical data [1]. There are two main challenges in analyzing hierarchical data: providing a concept hierarchy for the data, and designing efficient algorithms for mining it. In many application domains, the concept hierarchy is either already known (such as for IP addresses and product hierarchies), provided by experts, or computable using various algorithms [2]. Hence, in this work we assume that the concept hierarchy is available. Many domains contain hierarchical data, but surprisingly, there are limited techniques for mining such data. Examples of well-known hierarchical pattern mining techniques are hierarchical heavy hitters [1] and multilevel and cross-level association rules mining [2]. However, these approaches often cannot capture the correlation semantics when data is monitored in the form of pairs of hierarchical items. Mining correlations at multiple concept levels has many applications with significant business value, including clickstream mining, network monitoring, healthcare data analysis and recommender systems.

This work introduces a new concept, which is useful for capturing the sequential nature of the relationship between pairs of hierarchical items at multiple concept levels (called HCHH), and which differs from existing hierarchical mining notions. To illustrate this concept, let us refer to the two items of a pair as the primary and secondary items. For example, Fig. 1 shows the concept hierarchy of a pair of items (location and product) in a clickstream from an ecommerce website, where location is the primary item and product is the secondary item. The aim of this study is to find a summary consisting of a set of hierarchical pairs with significant sequential correlation between the items of a pair at multiple concept levels of location and product. This concept is formally defined in Section 2, but first, a few motivating scenarios are presented to show the utility of the proposed concept.

Zubair Shah is with the Centre for Health Informatics at the Australian Institute of Health Innovation, Macquarie University, Australia. Michael Barlow is with the School of Engineering and IT, University of New South Wales Canberra, Australia. A. N. Mahmood is with the Department of Computer Science, School of Engineering and Mathematical Sciences, La Trobe University, Melbourne, Australia. Zahir Tari and Xun Yi are with the School of Computer Science and Information Technology, Royal Melbourne Institute of Technology, Melbourne, Australia. A. Y. Zomaya is with the Centre for Distributed and High Performance Computing, School of Information Technologies, The University of Sydney, Australia.

1.1 Clickstream Analysis

Web sites and online advertisers spend a large sum of money on market research about potential online customers, including their demographics, their interest in specific products, and their frequency of visits and purchases. This information can be obtained through clickstream analysis: the process of analysing users' clickstreams (information about the different links a user visits within a web page), which typically contain the user id (if logged in), timestamp, visitor's IP address (raw and geocoded), visited URL, and the visitor's navigation within the website.

[Fig. 1 here: a Location Hierarchy (primary item: USA → NY, CA; NY → New York, Utica, Albany; CA → Los Angeles, San Diego) and a Product Hierarchy (secondary item: Apparels → Shoes, Clothes; Shoes → Dress shoes, Casual shoes, Sandals; Clothes → Sheath dress, Wrap dress), with visit counts on the nodes and arrows marking the hierarchical correlation between the two hierarchies.]
Fig. 1: An example of hierarchically correlated heavy hitters, where location is the primary item and product is the secondary item.

To analyze such clickstreams, we extract pairs of items, such as (location, product), from the clickstream. The correlation between these pairs at various concept levels can reveal many interesting patterns. An interesting query on such clickstream data (see Fig. 1) could be to find a list of highly visited products (or categories such as shoes) for all cities (states or countries) that constitute a major proportion of the overall website visitors. For instance, in this example, for a 25% threshold (on the primary item), the major website visitors' locations are New York (city), NY (state) and CA (state), represented by the nodes marked grey. For a 20% threshold (on secondary items with respect to primary items), the highly visited products among the visitors from CA are casual shoes (Adidas), shoes (all shoes) and clothes, represented by the nodes marked orange. For a 25% threshold on items in the primary dimension and a 20% threshold on items in the secondary dimension with respect to primary items, a few rules that can be generated by the HCHH approach are:

1. ⟨CA ⇒ casual shoes (sup. 37%, hier. corr. 30%)⟩
2. ⟨CA ⇒ shoes (sup. 37%, hier. corr. 26%)⟩
3. ⟨CA ⇒ clothes (sup. 37%, hier. corr. 42%)⟩

For instance, in the first rule, (sup. 37%, hier. corr. 30%) means that CA has 37% support with respect to the whole dataset, and among all the products visited by users from CA, casual shoes have 30% hierarchical correlation with CA. Note that the proposed definition of HCHH uses a discounting strategy to avoid the generation of redundant rules: if a pair of items has a significant correlation (above the user-defined threshold) at a certain concept level, then the count of that pair is not passed to its ancestors. For example, see the second rule: although the total count of shoes (with respect to CA) is 43, its discounted count is only 20 (after deducting the count 23 of its descendant node "casual shoes"); therefore, the hierarchical correlation of shoes to CA after discounting is just 26%. This type of summary describes the correlation of products with users from various demographic backgrounds. It can provide important insight into such hierarchical data, which can be used for dynamic customer segmentation to help marketing and the understanding of customers' preferences.
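To make the discounting arithmetic concrete, the short Python sketch below recomputes the three rules above from the counts shown in Fig. 1. The counts and node names come from the figure; the processing order and helper variables are our illustration, not the paper's algorithm (which is developed in Section 3).

# Recomputing the three CA rules from Fig. 1 with the discounting
# strategy: a child's count, once it qualifies, is deducted from its
# ancestor before the ancestor's correlation is tested.

N = 200        # total weight of the stream (root of the location hierarchy)
f_CA = 75      # frequency of the primary item CA (75/200 = 37% support)

counts = {"casual shoes": 23, "shoes": 43, "clothes": 32}
children = {"shoes": ["casual shoes"], "clothes": []}
phi_s = 0.20   # threshold on the secondary dimension with respect to CA

qualified = set()
for node in ["casual shoes", "shoes", "clothes"]:   # leaves before ancestors
    c = counts[node] - sum(counts[ch] for ch in children.get(node, [])
                           if ch in qualified)      # discounted count
    if c >= phi_s * f_CA:
        qualified.add(node)
        # int() truncation matches the percentages quoted in the text
        print(f"CA => {node} (sup. {int(100 * f_CA / N)}%, "
              f"hier. corr. {int(100 * c / f_CA)}%)")

Running this prints the three rules with 37% support and hierarchical correlations of 30%, 26% and 42%: shoes retains only 43 − 23 = 20 of its count, because its qualifying descendant "casual shoes" has already claimed 23.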

1.2 Network Monitoring

A large volume of network traffic is generated every day, and network monitoring requires analyzing different aspects of streaming traffic in real time [3]. Consider that for each packet the destination address is the primary item and the source address is the secondary item. Then, HCHH can find those sources that co-occur most frequently with their corresponding destinations (where a source or a destination can be a single IP address or a network address). For instance, at a higher level of the IP hierarchy, HCHH can reveal interesting correlations between sources (e.g., universities, research labs, and large organizations) and their corresponding popular destinations (e.g., search engines, publishers, social networks, and video providers). HCHH can report correlations useful for traffic planning and for identifying interesting trends and shifts in the popularity of sources or destinations, especially when several IP sources are seen to share the same primary item (i.e., destination IP address). For example, it can also be useful for a service provider interested in finding out which of its services receive the most visitors, and the locations of the visitors that are highly correlated with the heavily visited servers. This information may be used to improve the visitor experience by creating, relocating or adding redundant servers at a closer geo-location, or by customizing the offered services to the needs of frequent visitors.

1.3 Contributions

The contributions of this paper are as follows: (1) This work introduces the concept of HCHH, its formal definition and its semantic properties. It also compares the proposed concept, at the definition level, with existing hierarchical notions such as hierarchical heavy hitters and multilevel and cross-level association rules. The paper also highlights the differences and provides several motivating scenarios to show where the proposed concept can be useful. (2) The paper presents a memory-optimal correlated heavy hitter algorithm that works on flat (non-hierarchical) data, which is later used to develop a hierarchy-aware algorithm. The flat algorithm is designed using two Space Saving [4] sketches, SS_{p,s} and SS_p. The first sketch SS_{p,s} processes the pairs from the stream directly, while the second sketch SS_p processes the infrequent pairs that are deleted from the first sketch SS_{p,s}.


This algorithm assumes pairs of items in a stream without any hierarchy. (3) Next, a hierarchy-aware algorithm is designed, which exploits a lattice-like structure constructed from the hierarchical pairs of elements in the stream. The hierarchy-aware algorithm generalizes each incoming pair of items to various nodes of the lattice, where, in each of those nodes, it uses an instance of the flat algorithm to process the generalized pairs. The proposed hierarchy-aware algorithm provides deterministic estimation accuracy using O(η/(ε_p ε_s)) memory, where η is a small constant derived from the data, and ε_p ∈ [0,1], ε_s ∈ [0,1] are user-defined parameters. (4) The paper compares the HCHH concept experimentally with existing similar hierarchical notions using a clickstream dataset from an ecommerce website. The evaluation demonstrates that the results produced by the HCHH concept are on average 92% dissimilar from hierarchical heavy hitters and multilevel and cross-level association rules. The paper also presents results for the hierarchy-aware algorithm using other real-world Internet traffic trace datasets; the results show that the proposed hierarchy-aware algorithm can compute HCHH efficiently concerning memory usage as well as output quality. The proposed hierarchy-aware algorithm uses about 73% less memory when compared to a benchmark algorithm.

1.4 Roadmap

The remainder of this paper is organized as follows. Section 2 formulates the HCHH problem, and Section 3 describes the proposed algorithms and provides theoretical bounds on space and accuracy guarantees. The implementation details and experimental results are given in Section 5. Section 6 discusses existing work in the area of data streams, specifically on summarization, and the paper concludes in Section 7.

2 PROBLEM DEFINITION

Let the set of transactions T = {(τ_1, w_1), (τ_2, w_2), (τ_3, w_3), ...} be a continuous stream, where each transaction τ_i is a tuple (p_i, s_i) such that p_i is a primary hierarchical item and s_i is a secondary hierarchical item, w_i is the weight associated with τ_i, and N is the sum of the weights of the transactions processed so far. Details of the symbols used in this paper are given in Table 1.

Definition 1. (Frequencies) The frequency of p, f_p, is defined as:

$$f_p = \sum_{\forall i : p_i = p} w_i$$

The frequency of a (p, s) pair, f_{p,s}, is defined as:

$$f_{p,s} = \sum_{\forall i : p_i = p \,\wedge\, s_i = s} w_i$$

To develop the concept of HCHH, we first define the concept of flat Correlated Heavy Hitters (CHH) using the notation for correlated frequencies of items described above [5], [6].

Definition 2. (Correlated Heavy Hitters) Given user-defined parameters φ_p ∈ [0, 1] and φ_s ∈ [0, 1], the CHH problem is to compute a set F_c such that for all pairs (p, s) ∈ F_c, (f_p ≥ φ_p N) ∧ (f_{p,s} ≥ φ_s f_p).
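As a point of reference before the streaming versions, Definition 2 can be evaluated exactly on a finite flat stream with plain counters; below is a minimal sketch under that assumption (the function name and the toy stream are ours).

from collections import Counter

def exact_chh(stream, phi_p, phi_s):
    """Exact CHH per Definition 2 over a finite stream of ((p, s), w)
    weighted transactions; no approximation involved."""
    f_p, f_ps, N = Counter(), Counter(), 0
    for (p, s), w in stream:
        f_p[p] += w
        f_ps[(p, s)] += w
        N += w
    return {(p, s): c for (p, s), c in f_ps.items()
            if f_p[p] >= phi_p * N and c >= phi_s * f_p[p]}

# 'a' appears in 3 of N = 4 transactions, and 'x' accompanies 'a' twice:
stream = [(("a", "x"), 1), (("a", "x"), 1), (("a", "y"), 1), (("b", "x"), 1)]
print(exact_chh(stream, phi_p=0.5, phi_s=0.5))   # {('a', 'x'): 2}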

TABLE 1: Frequently used notations

p, s: primary and secondary items in the transaction
p_a, s_a: ancestors of primary item p and secondary item s
N: sum of the weights of the transactions processed so far
f_p, f̂_p: true and estimated frequency of primary item p
f_{p,s}, f̂_{p,s}: true and estimated frequency of secondary item s given primary item p
ε_p: user-defined error parameter on the primary dimension
ε_s: user-defined error parameter on the secondary dimension
φ_p: user-defined threshold on the primary dimension
φ_s: user-defined threshold on the secondary dimension
h_p: height of the hierarchy of primary items
h_s: height of the hierarchy of secondary items
η: the number of nodes in the lattice built using h_p and h_s
λ: the height of the lattice built using h_p and h_s
HCHH: abbr. of hierarchically correlated heavy hitters
CHH: abbr. of correlated heavy hitters
HHH: abbr. of hierarchical heavy hitters
CLAR: abbr. of multilevel and cross-level association rules
HCHHA: abbr. of hierarchically correlated heavy hitters algorithm
OCHHA: abbr. of optimal correlated heavy hitters algorithm

Since the exact set of CHH cannot be found for streaming data (i.e., with limited memory and one pass over the data), the concept of online (approximate) identification of CHH is introduced next.

Definition 3. (Online CHH identification problem) Given user-defined parameters ε_p, ε_s, φ_p ∈ [ε_p, 1] and φ_s ∈ [ε_s, 1], the online CHH identification problem is to compute a set F_c so that the following conditions are met:
1. Accuracy: f̂_p ≤ f_p + ε_p N, and f̂_{p,s} ≤ f_{p,s} + ε_s f_p
2. Coverage: ∀(p, s) ∉ F_c, (f_p < φ_p N) ∨ (f_{p,s} < φ_s f_p)

Lemma 1. Under the formulation of CHH in Definition 3, false positives are possible, but false negatives are not.

Lemma 1 says that under Definition 3 no correlated heavy hitters will be missed; however, due to the limited memory for storing the unbounded stream, some uncorrelated heavy hitters may be reported. This error can be controlled using the parameters ε_p ∈ [0, 1] and ε_s ∈ [0, 1], known as relative estimation error parameters, which also define how much memory the algorithm requires. The memory required is an inverse function of the estimation error: the lower the error set by the user, the higher the memory required by the algorithm. The parameters φ_p ∈ [ε_p, 1] and φ_s ∈ [ε_s, 1] are known as query (or threshold) parameters.

2.1 Generalizing item hierarchies

Here we illustrate the concept of generalization using the primary item hierarchy h_p and the secondary item hierarchy h_s. As an example, we consider a transaction from network IP data with a destination IP address as p and a source IP address as s. Consider an example transaction (1.2.3.4/32, 5.6.7.8/32), where 1.2.3.4 and 5.6.7.8 are two IP addresses used for illustration only. The notation x.x.x.x/b is often used to represent IP addresses, where b indicates the number of bits important to distinguish a given IP address. For example, if b = 32, then all 4 × 8 = 32 bits are significant. If b = 24, then only the first 24 bits (the 24 MSB bits) are significant and the remaining 8 bits are disregarded. The height of the hierarchy of both the destination and the source IP is the same, h_p = 4 and h_s = 4, when a byte-wise hierarchy is considered, as shown in Fig. 2.

Each transaction τ can be generalized using the item hierarchies. We refer to the generalized transaction τ_a as an ancestor of the transaction τ from which it has been generalized. The generalization can be performed using the function GeneralizeTo(.). For instance, GeneralizeTo(τ, L) is a generalization step of transaction τ to label L. We represent each label of the hierarchy by L, a vector of two integers, where the entries L_p ∈ L and L_s ∈ L represent the levels of the hierarchies of items p and s respectively. For example, L = (4, 4) denotes the transaction where neither item is generalized. Similarly, L = (3, 3) represents an ancestor transaction τ_a where both p and s are generalized to level 3. Thus, GeneralizeTo((1.2.3.4/32, 5.6.7.8/32), (3, 2)) returns a transaction τ_a = (1.2.3.0/24, 5.6.0.0/16), where item p is at level 3 and item s is at level 2 of their hierarchies. It is important to note that a generalization such as GeneralizeTo((1.2.3.0/24, 5.6.0.0/16), (4, 2)) cannot be computed, because the transaction (1.2.3.0/24, 5.6.0.0/16) is currently at label (3, 2), and this call would require rolling down (a concept used in OLAP [7]) the hierarchy. Rolling down is not possible because a transaction at a higher level of generalization could be rolled down to a large number of possible entries (i.e., there is no single descendant).

By enumerating each possible combination of L_p and L_s using the GeneralizeTo(.) function, we can build a lattice-like structure as shown in Fig. 2. In general, the number of nodes η in the lattice built from h_p and h_s is η = h_p × h_s, and the height λ of the lattice is λ = h_p + h_s − 1. The levels of the lattice (not to be confused with the levels of p and s) are numbered 1 (top-most node of the lattice) to λ (bottom-most node of the lattice), and are represented by an integer l.
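The following Python fragment sketches a byte-wise GeneralizeTo(.) for IPv4 items together with the lattice bookkeeping (η and λ) described above; the tuple representation of items and labels is our own choice for illustration.

# A sketch of GeneralizeTo(.) for byte-wise IPv4 hierarchies (hp = hs = 4).
# An item is a pair (octets, level): the first `level` octets are significant.

def generalize_item(octets, level):
    """Roll an item up to `level` by zeroing its insignificant octets."""
    return (octets[:level] + (0,) * (4 - level), level)

def generalize_to(tau, L):
    """GeneralizeTo(tau, L) for tau = ((p, lp), (s, ls)); rolling down
    (requesting a level above the current one) is rejected, since a
    generalized transaction has no single descendant."""
    (p, lp), (s, ls) = tau
    Lp, Ls = L
    if Lp > lp or Ls > ls:
        raise ValueError("cannot roll down a generalized transaction")
    return (generalize_item(p, Lp), generalize_item(s, Ls))

tau = (((1, 2, 3, 4), 4), ((5, 6, 7, 8), 4))   # (1.2.3.4/32, 5.6.7.8/32)
print(generalize_to(tau, (3, 2)))              # (1.2.3.0/24, 5.6.0.0/16)

# Lattice bookkeeping from the text:
hp, hs = 4, 4
eta = hp * hs          # 16 nodes
lam = hp + hs - 1      # 7 levels

def lattice_level(L):
    """Level of label L in the lattice: 1 = top (1,1) ... lam = bottom."""
    return L[0] + L[1] - 1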

2.2 Hierarchically Correlated Heavy Hitters

In this paper, we propose HCHH as a hierarchical correlation concept to capture sequential correlation at various hierarchical concept levels. We formulate HCHH as follows. Let p_a and s_a be the ancestors of p and s at level L respectively. For convenience, let the notation ≺ represent the descendant-ancestor relationship; for instance, p ≺ p_a denotes that p_a is a generalization of p. Also, let p ⪯ p_a denote (p ≺ p_a) ∨ (p = p_a), let f_{p_a} denote the frequency of p_a, and let f_{p_a,s_a} denote the frequency of the pair (p_a, s_a). HCHH can then be defined as follows.

Definition 4. (Hierarchically Correlated Heavy Hitters) Let φ_p ∈ [0, 1] and φ_s ∈ [0, 1] be user-defined parameters, let F_m be the set of all HCHH, and let F_m^l ⊆ F_m be the set of HCHH at level l of the hierarchy, where 1 ≤ l ≤ λ. Then:

1) F_m^λ is the set of HCHH at the bottom level, computed by simply computing the CHH of T (using Definition 3) at level λ.
2) F_m^l is the set of HCHH at level l of the hierarchy, computed as follows: ∀(p_a, s_a) ∈ F_m^l, (f_{p_a} ≥ φ_p N) ∧ (f_{p_a,s_a} ≥ φ_s f_{p_a}), where

$$f_{p_a} = \sum_{(p \preceq p_a) \wedge (p \notin F_m^{l+1})} f_p$$

$$f_{p_a,s_a} = \sum_{(p \preceq p_a) \wedge (s \preceq s_a) \wedge ((p,s) \notin F_m^{l+1})} f_{p,s}$$

3) $F_m = \bigcup_l F_m^l$
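To illustrate the summations in Definition 4 on a toy example, the sketch below gives one reading of them, under the assumptions that items are tuples whose prefixes are their ancestors and that the sums range over fully specified (leaf) items whose counts have not already been claimed by an HCHH reported one level below; the function names and data are ours.

# One reading of Definition 4's discounted frequency, for items
# represented as tuples (ancestors are prefixes of the tuple).

def covered_by(desc, anc):
    """desc ⪯ anc in the paper's notation: anc is a (possibly equal)
    prefix of desc."""
    return desc[:len(anc)] == anc

def discounted_freq(leaf_counts, p_a, reported_below):
    """f_{p_a}: sum the leaf counts of descendants of p_a, skipping
    leaves whose counts were already absorbed by an HCHH in F_m^{l+1}."""
    return sum(c for p, c in leaf_counts.items()
               if covered_by(p, p_a)
               and not any(covered_by(p, q) for q in reported_below))

leaf_counts = {("USA", "NY", "Utica"): 20,
               ("USA", "NY", "Albany"): 41,
               ("USA", "CA", "San Diego"): 35}
# If ("USA", "NY", "Albany") is already an HCHH one level down, its 41
# is not passed up to ("USA", "NY"):
print(discounted_freq(leaf_counts, ("USA", "NY"),
                      reported_below={("USA", "NY", "Albany")}))   # 20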

There are several differences between this definition of HCHH and existing hierarchical summarization approaches, including Hierarchical Heavy Hitters (HHH) [1], [8] and multilevel and Cross-Level Association Rules (CLAR) mining [2], [9]. First, existing hierarchical pattern mining algorithms cannot capture certain semantics that are highly useful for many applications, such as clickstream and user web-navigational pattern mining, healthcare data mining, network monitoring, and recommender systems. Existing approaches for mining patterns from hierarchical data can only capture global patterns and often miss interesting local contextual patterns. Identifying local contextual patterns within the context of the global patterns often reveals many interesting underlying phenomena of business trends. For instance, for ecommerce business applications, the correlation of products with users from various demographic backgrounds can be used for dynamic customer segmentation to help marketing and the understanding of customers' preferences. Such correlation can be found by identifying a list of highly visited products (or categories such as shoes) for all cities (states or countries) that constitute a significant proportion of ecommerce website visitors. Here, the global patterns are the cities (states or countries) that represent a major portion of the site visitors, and their corresponding local contextual patterns are the associated lists of highly visited products (or categories such as shoes). Second, existing hierarchical pattern mining algorithms treat the elements of a pair equally and do not take into account the sequential nature of the items' relationship within a pair. Thus, they cannot capture correlation at various concept levels when data is monitored in the form of hierarchically correlated pairs. Finally, the experimental evaluation also shows that the results produced by HCHH are on average 92% dissimilar to both CLAR and HHH, demonstrating that these notions describe very distinct phenomena.

Definition 4 requires space linearly proportional to the size of the input for computing HCHH with exact frequencies. In the data stream model, with constraints on resources, an online version of this problem is defined as follows.

Definition 5. (Online HCHH identification problem) Given a data stream of transactions τ_i ∈ T and user-defined parameters ε_p ∈ [0, 1], ε_s ∈ [0, 1], φ_p ∈ [ε_p, 1] and φ_s ∈ [ε_s, 1], the HCHH identification problem is to output a set F_m of correlated pairs of prefixes (p_a, s_a) : (p ⪯ p_a) ∧ (s ⪯ s_a), and approximate bounds on the frequency of each item (p_a, s_a) ∈ F_m, such that the following conditions are satisfied:
1. Accuracy: f̂_{p_a} ≤ f_{p_a} + ε_p N, and f̂_{p_a,s_a} ≤ f_{p_a,s_a} + ε_s f_{p_a}
2. Coverage: ∀(p_a, s_a) ∉ F_m, (f_{p_a} < φ_p N) ∨ (f_{p_a,s_a} < φ_s f_{p_a})

[Fig. 2 here: (a) the byte-wise hierarchies of a destination IP address 1.2.3.4/32 and a source IP address 5.6.7.8/32, each of height h_p = h_s = 4; (b) the lattice of generalization labels from (1,1) to (4,4) built by combining the two hierarchies.]

Fig. 2: A lattice-like structure built from two-dimensional IP addresses (e.g., source IP address and destination IP address).

3 THE PROPOSED ALGORITHMS

This section describes the proposed algorithms for flat (i.e., CHH) as well as hierarchically correlated heavy hitters (i.e., HCHH). First, we propose an algorithm for flat datasets (i.e., when p and s are not hierarchical items), which we call the Optimal CHH Algorithm (OCHHA). Next, we propose a hierarchy-aware algorithm, which is based on OCHHA and which we call the Hierarchically CHH Algorithm (HCHHA).

3.1 The Optimal CHH Algorithm (OCHHA)

This section describes OCHHA in detail and develops theoretical bounds on memory usage with deterministic guarantees on answer quality. OCHHA creates two Space Saving [4] sketches, SS_{p,s} and SS_p. We use Space Saving because it is one of the optimal algorithms for computing simple heavy hitters and is widely used in many advanced applications. The sketch SS_{p,s} contains a set of at most k_s = 1/(ε_p ε_s) tuples of the form [(p, s), g_{p,s}, Δ_{p,s}], where the pair (p, s) is the key of the tuple, g_{p,s} is the approximate frequency of the pair (p, s), and Δ_{p,s} is the error in the approximation of the pair (p, s), which is always less than ε_p ε_s N. Similarly, the sketch SS_p contains a set of at most k_p = 1/ε_p tuples of the form [p, g_p, Δ_p], where the primary item p is the key of the tuple, g_p is the approximate frequency of item p, and Δ_p is the error in the approximation of item p, which is always less than ε_p N.

High-level concept of OCHHA: we process each pair (p, s) from the stream S using sketch SS_{p,s}, which maintains a summary of size k_s. The main idea behind OCHHA is that any infrequent pair deleted from the sketch SS_{p,s} is processed by the second sketch SS_p, which maintains a summary of size k_p. The difference between sketch SS_{p,s} and sketch SS_p is that SS_p only processes the infrequent primary items p that are deleted from sketch SS_{p,s}; hence, sketch SS_p contains the primary items p that belong to the tail (infrequent pairs) of the stream S.

Now we describe OCHHA in more detail. For each incoming tuple (p_i, s_i, w_i), OCHHA performs one of the following actions (see Algorithm 1 in the attached Appendix 8.2): 1) if the pair (p_i, s_i) is present in SS_{p,s}, OCHHA increments the corresponding counter using g_{p_i,s_i} += w_i; 2) if the pair (p_i, s_i) is not present in SS_{p,s}, then the pair with the smallest counter in SS_{p,s} (say (p, s), with counter g_{p,s} and error Δ_{p,s}) is removed from SS_{p,s} and replaced by the pair (p_i, s_i); the counter of the pair (p_i, s_i) is set to g_{p_i,s_i} = g_{p,s} + w_i; and finally, the error of the pair (p_i, s_i) is set to Δ_{p_i,s_i} = g_{p,s}. Next, the deleted pair (p, s) from SS_{p,s} is processed by the second sketch SS_p as follows: the primary item p is taken out of the deleted pair (p, s); the count c of the pair (p, s) is calculated from sketch SS_{p,s} using c = g_{p,s} − Δ_{p,s}; and then the primary item p with count c is given to sketch SS_p (see lines 10-11 of Algorithm 1 in the attached Appendix 8.2), where item p is processed by SS_p in the same manner as the pair (p_i, s_i) is processed by sketch SS_{p,s}. To find the approximate frequency of primary item p at query time, we calculate the estimated frequency using both sketches (see lines 25-26 of Algorithm 1 in the attached Appendix 8.2):

$$\hat{f}_p = \sum_{\forall (p_i, s_i) \in SS_{p,s} \,:\, p_i = p} (g_{p_i,s_i} - \Delta_{p_i,s_i}) + g_p$$

where g_p = 0 if p is not present in SS_p. The approximate frequency of the pair (p, s) can be found directly from sketch SS_{p,s} using f̂_{p,s} = g_{p,s}. Next, in order to provide bounds on the accuracy of OCHHA, we proceed as follows: let N^{del} represent the sum of all items processed by SS_p (i.e., the sum of the counts of the pairs deleted from sketch SS_{p,s}); then OCHHA has the following estimation guarantees.


Lemma 2. [Accuracy] OCHHA provides the following accuracy guarantee:

$$\hat{f}_p \le f_p + \epsilon_p N^{del}, \qquad \hat{f}_{p,s} \le f_{p,s} + \epsilon_s f_p$$

For the proof see the online Appendix 8.1. The overall memory required by OCHHA is 1/ε_p + 1/(ε_p ε_s) = O(1/(ε_p ε_s)), which is optimal. For arbitrary positive counter updates, using a suitable min-heap based implementation of Space Saving, OCHHA updates take O(log(1/(ε_p ε_s))) time, and lookups require O(1) time. When all updates are unitary (of the form w_i = 1), both insertions and lookups can be processed in O(1) time using the Stream Summary data structure [4]. Based on the information available in the two sketches SS_{p,s} and SS_p, OCHHA outputs any pair (p, s) whose frequency satisfies the following threshold (see line 28 of Algorithm 1 in the attached Appendix 8.2):

$$(F_p \ge \phi_p N) \wedge (g_{p,s} \ge \phi_s (F_p - \epsilon_p N^{del}))$$

where F_p = f̂_p and g_{p,s} = f̂_{p,s}. The following lemma provides a useful observation on the properties of the CHH pairs output by OCHHA.

Lemma 3. [Coverage] OCHHA satisfies the coverage property of Definition 3.

For the proof see the online Appendix 8.1.

Theorem 1. OCHHA satisfies all the requirements of Definition 3 using O(1/(ε_p ε_s)) memory.

Theorem 1 follows directly from Lemmas 2 and 3.
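The sketch below is our reading of OCHHA as described above and in Algorithm 1. The class names are ours, the Space Saving sketch is the plain dictionary version rather than the O(1) Stream Summary structure of [4], and the query method applies the output condition with F_p = f̂_p.

import math

class SpaceSaving:
    """A minimal Space Saving sketch [4]: at most k counters; inserting a
    new key into a full sketch evicts the minimum counter and returns it."""
    def __init__(self, k):
        self.k = k
        self.counters = {}                       # key -> [count g, error d]

    def update(self, key, w=1):
        """Process (key, w); return the evicted (key, g, d), or None."""
        if key in self.counters:
            self.counters[key][0] += w
            return None
        if len(self.counters) < self.k:
            self.counters[key] = [w, 0]
            return None
        victim = min(self.counters, key=lambda x: self.counters[x][0])
        g_min, d_min = self.counters.pop(victim)
        self.counters[key] = [g_min + w, g_min]  # overestimate by g_min
        return victim, g_min, d_min


class OCHHA:
    """Two Space Saving sketches: SS_ps over pairs, SS_p over the primary
    items of the pairs evicted from SS_ps (the tail of the stream)."""
    def __init__(self, eps_p, eps_s):
        self.eps_p = eps_p
        self.ss_ps = SpaceSaving(math.ceil(1.0 / (eps_p * eps_s)))
        self.ss_p = SpaceSaving(math.ceil(1.0 / eps_p))
        self.N = 0
        self.n_del = 0                           # N^del: weight passed to SS_p

    def update(self, p, s, w=1):
        self.N += w
        evicted = self.ss_ps.update((p, s), w)
        if evicted is not None:
            (ep, _), g, d = evicted
            c = g - d                            # guaranteed count of the pair
            self.n_del += c
            self.ss_p.update(ep, c)

    def est_fp(self, p):
        """f̂_p: residuals of p's surviving pairs plus p's tail counter."""
        est = sum(g - d for (pi, _), (g, d) in self.ss_ps.counters.items()
                  if pi == p)
        if p in self.ss_p.counters:
            est += self.ss_p.counters[p][0]
        return est

    def query(self, phi_p, phi_s):
        """Output condition: (F_p >= phi_p N) and
        (g_ps >= phi_s (F_p - eps_p N^del))."""
        out = []
        for (p, s), (g, _) in self.ss_ps.counters.items():
            Fp = self.est_fp(p)
            if Fp >= phi_p * self.N and g >= phi_s * (Fp - self.eps_p * self.n_del):
                out.append(((p, s), g))
        return out

With unitary updates, the dictionary-based eviction above costs a linear min scan per deletion; the Stream Summary structure of [4] brings this to O(1), as noted in the text.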

3.2 The Hierarchical Algorithms

A naïve way of computing HCHH using OCHHA would be to find CHH over all the prefixes of all the pairs in the data stream and then to discard extraneous pairs in a post-processing step. We argue that this naïve strategy can be improved substantially in practice (concerning space usage and answer quality) by designing the hierarchy-aware algorithms that we describe next. In Section 3.2.1 we present the naïve algorithm to compute HCHH, and in Section 3.2.2 we develop an efficient hierarchy-aware algorithm (i.e., HCHHA) with deterministic error guarantees. We consider the cash-register stream computational model, where new transactions only arrive and there are no deletions of previously seen transactions.

3.2.1 Naïve Algorithm

We first describe a naïve algorithm that calls OCHHA as a subroutine. We will use the naïve algorithm as a baseline against which to compare various results. This algorithm maintains information for every label (node) in the lattice by maintaining η (the number of nodes in the lattice) independent instances of OCHHA, where each one computes (approximately) the CHH for its node in the lattice (see Fig. 2 for an example of the lattice). The naïve algorithm works as follows. For every update τ, we compute all generalizations of τ and insert each one separately into the respective OCHHA instance in the lattice. The output of the naïve algorithm is the union of the outputs of all instances of OCHHA, which is a superset of the approximate HCHH (see Definition 5) and satisfies the accuracy and coverage requirements for HCHH. Nevertheless, it is a costly strategy concerning space usage as well as update cost. Since we use η independent instances of OCHHA and insert each update into each of these η instances, the naïve algorithm has an overall space bound of O(η/(ε_p ε_s)). Moreover, since OCHHA processes each update in constant time, the naïve algorithm requires O(η) update operations to process each τ.

3.2.2 Hierarchically CHH Algorithm (HCHHA)

The main idea behind HCHHA is to create a two-dimensional lattice (similar to Fig. 2b) with η nodes and, in each node, maintain a list of CHH using a frequency discounting strategy (more details to follow). For a given HCHH query, the algorithm outputs HCHH by starting from the lowest level of the lattice and progressing towards the top. At each level, it outputs all the CHH pairs after properly discounting the frequencies of the CHH maintained at lower levels of the lattice, using Definition 5. HCHHA maintains a sketch SS^L_{p,s} (as described for the OCHHA algorithm) at each node of the lattice, where L represents the label of the node in the lattice. Notice that, unlike the naïve algorithm, HCHHA does not maintain SS^L_p sketches at each node of the lattice, except at the nodes on the left-most diagonal of the lattice (such as nodes (4,4), (3,4), (2,4) and (1,4) in Fig. 2b). This is because, in the naïve algorithm, the sketch SS^L_p maintains the same information at each node of a diagonal of the lattice. Since we keep two sketches (similar to OCHHA) at various nodes in the lattice, we first explain the operations on the sketches SS^L_{p,s} in the lattice and then the operations on the sketches SS^L_p.

Each incoming pair from the stream is inserted into sketch SS^{(4,4)}_{p,s} at the lowest level of the lattice (e.g., node (4,4) in Fig. 2b). Whenever a pair becomes infrequent and is deleted from sketch SS^{(4,4)}_{p,s} at the lower level, its frequency is added to its ancestors in sketches SS^{(3,4)}_{p,s} and SS^{(4,3)}_{p,s} at the immediately higher level; this process is known as generalization, as explained in Section 2.1. Deletions can happen at higher levels too, when a pair becomes infrequent there, and this continues upwards towards the root node. The deleted pair has to pass its count to two immediate ancestors (or to at most n ancestors for n-dimensional data), which introduces the "over-counting" problem, well known in multidimensional HHH computation [1], [8]. The problem occurs when the count of a descendant node traverses up through multiple paths of the lattice and ends up in a single ancestor node. This means that the same child node's frequency is counted multiple times at an upper-level node, which is clearly an error. This over-counting can be controlled using the inclusion-exclusion strategy defined in [1], [8], which inserts the count of each deleted pair into its immediate parents and subtracts its count from the common grandparent. This is done for each deleted pair while moving up the lattice hierarchy, and it effectively solves the over-counting problem.

More specifically, for each incoming τ_i = (p_i, s_i, w_i), HCHHA inserts (p_i, s_i) into SS^{(4,4)}_{p,s} at the lowest node of the lattice; if SS^{(4,4)}_{p,s} returns no deleted pair, then HCHHA continues to process the next pair from the stream.

[Fig. 3 here: the lattice sketches SS^{(4,4)}_{p,s}, SS^{(3,4)}_{p,s}, SS^{(3,3)}_{p,s}, SS^{(2,4)}_{p,s}, SS^{(2,3)}_{p,s}, SS^{(3,1)}_{p,s} and the left-diagonal sketches SS^{(4,4)}_p, SS^{(3,4)}_p, SS^{(2,4)}_p, each annotated with an HCHH prefix pair, e.g., (a.b.c.*, w.*), and its full count f and discounted count F.]

Fig. 3: An illustration of HCHHA processing a two-dimensional stream of IP address pairs {(p, s)}, where p is the primary item corresponding to the destination IP address and s is the secondary item corresponding to the source IP address, and outputting the HCHH of the stream. Both p and s are hierarchical items at octet-based granularity. The exact HCHH set contains ordered pairs of destination-source IP address prefixes: those sources which co-occur most frequently with their corresponding destinations at various levels of their hierarchies. The terms f_{p_a} and F_{p_a} represent the full and discounted counts of the primary item, and f_{p_a,s_a} and F_{p_a,s_a} represent the full and discounted counts of the pair (p_a, s_a). The stream contains 20 repetitions of the pair (a.b.c.d, w.x.y.z), followed by the pairs (a.b.c.i, w.x.y.i), (a.b.i.d, w.x.y.i), and (a.b.c.i, w.i.y.z), where i = 1, 2, ..., 20. The letters a, b, w, y, etc. represent different integers between 10 and 255. To demonstrate the main concept, we have skipped the insertion steps and only highlighted the HCHH pairs found for this stream (where N = 80) by HCHHA at various nodes of the lattice, for φ_p = 0.25 and φ_s = 0.5.

Otherwise, HCHHA performs the following actions (see Algorithm 2 in the attached Appendix 8.2). Let SS^{(4,4)}_{p,s} return a deleted pair (p, s) having count c (where c is defined as in OCHHA); then: (1) The deleted pair (p, s) from SS^{(4,4)}_{p,s} is generalized to the parent labels L = (3, 4) and L = (4, 3) and the grandparent label L = (3, 3). The generalized pairs (p_a, s) and (p, s_a) with count c are given to sketches SS^{(3,4)}_{p,s} and SS^{(4,3)}_{p,s} respectively at the parent nodes of the lattice, and the generalized pair (p_a, s_a) with count −c is given to sketch SS^{(3,3)}_{p,s} at the grandparent node of the lattice. The insertion of a pair into SS^L_{p,s} at any ancestor label L may cause SS^L_{p,s} to return another deleted pair, which is generalized to its parent and grandparent labels in the same manner as the deleted pair (p, s) above. (2) Also, the primary item p is taken out of the pair (p, s) deleted from SS^{(4,4)}_{p,s}, and the primary item p with count c is given to sketch SS^{(4,4)}_p. Note that this operation is specific to node (4,4) only; at no other node does this operation happen. However, when infrequent items p are deleted from sketch SS^{(4,4)}_p, they are generalized and inserted into sketch SS^{(3,4)}_p. Similarly, infrequent items p deleted from sketch SS^{(3,4)}_p are generalized and inserted into

sketch SS^{(2,4)}_p, and so on.

The output procedure of HCHHA (see Algorithm 3 in the attached Appendix 8.2) is a more involved subroutine that outputs HCHH pairs after properly discounting the counts of the HCHH at lower levels of the lattice (see Fig. 3). It iterates through all the levels of the lattice, starting at the bottom level and working towards the top, using the inclusion-exclusion strategy mentioned above. At each level it iterates through all the labels (nodes) belonging to that level, and at each label it iterates through all the items maintained by the sketches SS^L_{p,s} and SS^L_p within that label of the lattice. The subroutine outputs any prefix pair whose estimated count exceeds the thresholds φ_p and φ_s. For any record (p_a, s_a) that belongs to any level and any label at that level, the subroutine computes F_{p_a} and F_{p_a,s_a}, which are the sums of the counts of all descendants p ⪯ p_a with p ∉ F_m and (p, s) ⪯ (p_a, s_a) with (p, s) ∉ F_m respectively. In the case of computing F_{p_a,s_a}, the subroutine checks whether any pair has contributed twice to an ancestor pair; if so, the subroutine removes the count of the repeated pair from the ancestor pair (see the loop at lines 17-22 of the pseudo-code given in Algorithm 3 in the attached Appendix 8.2). Notice that this is not required for computing F_{p_a}, since an item p forms a tree instead of a lattice, and the sketches SS^L_p for storing items p are maintained only at the nodes on the left-most diagonal of the lattice (such as nodes (4,4), (3,4), (2,4) and (1,4) in Fig. 2b).
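A sketch of HCHHA's insertion path, under our simplifications, is given below. It reuses the SpaceSaving class from the OCHHA sketch and the generalize_item helper from the GeneralizeTo sketch, keeps only the SS^L_{p,s} sketches (the SS^L_p sketches on the left diagonal and the output procedure of Algorithm 3 are omitted), and assumes the negative inclusion-exclusion updates land on existing counters.

import math  # SpaceSaving and generalize_item are from the sketches above

class HCHHASketch:
    """Lattice of SS^L_{p,s} sketches with the inclusion-exclusion
    generalization cascade of [1], [8] (insertion path only)."""
    def __init__(self, eps_p, eps_s, hp=4, hs=4):
        self.hp, self.hs = hp, hs
        k = math.ceil(1.0 / (eps_p * eps_s))
        self.ss = {(lp, ls): SpaceSaving(k)
                   for lp in range(1, hp + 1) for ls in range(1, hs + 1)}

    def _insert(self, label, pair, w):
        evicted = self.ss[label].update(pair, w)
        if evicted is None:
            return
        (p, s), g, d = evicted
        c = g - d                        # count the evicted pair passes up
        lp, ls = label
        pa = generalize_item(p[0], lp - 1) if lp > 1 else None
        sa = generalize_item(s[0], ls - 1) if ls > 1 else None
        # Inclusion-exclusion: add c to both parents, subtract it once
        # from the common grandparent so the two lattice paths do not
        # double-count the same descendant.
        if pa is not None:
            self._insert((lp - 1, ls), (pa, s), c)
        if sa is not None:
            self._insert((lp, ls - 1), (p, sa), c)
        if pa is not None and sa is not None:
            self._insert((lp - 1, ls - 1), (pa, sa), -c)

    def update(self, p_octets, s_octets, w=1):
        """Every incoming pair enters at the bottom node (hp, hs)."""
        pair = ((p_octets, self.hp), (s_octets, self.hs))
        self._insert((self.hp, self.hs), pair, w)

The recursion terminates because each call strictly moves up the lattice (l_p + l_s decreases), ending at the root label (1, 1).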


Next, we provide formal aspects of the proposed algorithm HCHHA for computing the discounted counts of approximate HCHH items.

Lemma 4. If F_{p_a} is the count of items p or their descendants, such that F_{p_a} = Σ g_p [...]

[...] when h_p > h_s compared to when h_p < h_s, as shown in Fig. 7 (Left).

5.3 Comparisons of HCHHA and the Naïve Algorithm

This section presents a comparison of the HCHH algorithms, i.e., HCHHA and the naïve algorithm, concerning output quality, memory usage and update cost, using the Trace1 and Trace2 streams described in Table 2.

5.3.1 Output Quality

One of the most important issues with approximate algorithms, such as HCHHA, is that they are not guaranteed to find the exact set of HCHH items: the output may miss some HCHH and may include some extra (i.e., non-HCHH) items. Thus, to evaluate the proposed algorithms, we have compared them using precision-recall analysis (using both flat measures, such as the Dice coefficient, and hierarchical measures, such as the Optimistic Genealogy Measure (OGM) [15] and the Balanced Genealogy Measure (BGM) [15]), taking the exact output as the frame of reference.

In the first set of experiments, all three measures are plotted against the stream length to observe the performance of the algorithms for varying values of N. We have performed experiments using both the Trace1 and Trace2 streams. The results for output quality are the same for both streams, but the results for memory usage and update performance differ between Trace1 and Trace2. Hence, we report the output quality of the algorithms using the Trace1 stream, and the memory usage and update performance using both streams.

Figure 8 compares the output quality of HCHHA against the naïve algorithm for various parameter settings. It can be seen that HCHHA outperforms the naïve algorithm on all three measures. In general, the Dice scores of the algorithms are lower than the OGM and BGM scores. This is because the Dice score reflects only exact matches between the true answer and the approximate answer, whereas the other two measures, in addition to exact matches, also take into account the ancestor-descendant relationships between the correct answer and the approximate answer. In the case of approximate HCHH answers, if an algorithm does not output an exact HCHH pair, it is highly likely to output its ancestor HCHH pair, because deleted descendant HCHH items are generalized and combined at the ancestor nodes in the lattice.
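For reference, the flat Dice coefficient used here is the standard set-overlap score, computable in a few lines (the hierarchical OGM and BGM measures of [15] additionally credit ancestor-descendant near-misses and are not reproduced in this sketch):

def dice(approx, exact):
    """Flat Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    a, b = set(approx), set(exact)
    return 1.0 if not (a or b) else 2 * len(a & b) / (len(a) + len(b))

# One exact match out of two reported and two true pairs -> 0.5:
print(dice({("CA", "shoes"), ("CA", "clothes")},
           {("CA", "shoes"), ("NY", "shoes")}))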

[Fig. 7 here: bar charts of the cardinality of the HCHH set. Left: varying the hierarchy heights (h_p, h_s). Right: varying the thresholds (φ_p, φ_s).]

Fig. 7: An illustration of the number of HCHH items found using the proposed approach. The figure on the left shows the number of HCHH items found when varying the heights of the hierarchies of the primary and secondary items, i.e., h_p and h_s. The figure on the right shows the number of HCHH items found when varying the thresholds on the primary and secondary dimensions, i.e., φ_p and φ_s.

Therefore, overall, both the OGM and BGM scores of the algorithms are higher than their corresponding Dice scores. For all parameter values of ε and φ, HCHHA performs consistently better than the naïve algorithm on all three measures. Figure 8 shows that, on average, for higher values of φ_p and φ_s (i.e., 0.02) the naïve algorithm produces 55% exact answers, and for lower values of φ_p and φ_s (i.e., 0.002) it produces 50% exact answers. Conversely, on average, for higher values of φ_p and φ_s (i.e., 0.02) HCHHA produces 95% exact answers, and for lower values (i.e., 0.002) it produces 78% exact answers. These values indicate that, concerning exact matching of true and approximate HCHH answers, both algorithms produce good results for higher values of φ_p and φ_s, but the results provided by HCHHA are much better than those produced by the naïve algorithm. In a similar manner, the OGM and BGM scores of HCHHA are also better than those of the naïve algorithm. We conclude that, in terms of output quality, HCHHA produces better results than the naïve algorithm. The evaluation metrics clearly demonstrate that the naïve algorithm outputs a significant number of false hierarchically correlated pairs, while HCHHA outputs virtually none.

5.3.2 Update Efficiency

Table 5 compares the update performance of HCHHA and the naïve algorithm for ε_p = 0.01, ε_s = 0.01 and for ε_p = 0.001, ε_s = 0.001. It can be seen that for higher values of ε_p and ε_s (e.g., 0.01) both algorithms perform more update operations than for lower values (e.g., 0.001). This is because, for higher values of ε, the algorithms have less memory (memory is an inverse function of ε) and therefore need to delete records more frequently. In all of our experiments, the number of update operations performed by the naïve algorithm is significantly higher than that of HCHHA.

TABLE 5: Speed comparison of HCHHA and the naïve algorithm using the Trace1 and Trace2 streams, for two parameter settings.

(a) ε_p = 0.01, ε_s = 0.01

Algorithm | Trace1 (8.14 m): # Updates (m) | Trace1: Time (s) | Trace2 (48.18 m): # Updates (m) | Trace2: Time (s)
naïve     | 174.40 | 265 | 1023.83 | 1557
HCHHA     | 88.53  | 135 | 518.42  | 789

(b) ε_p = 0.001, ε_s = 0.001

Algorithm | Trace1 (8.14 m): # Updates (m) | Trace1: Time (s) | Trace2 (48.18 m): # Updates (m) | Trace2: Time (s)
naïve     | 169.52 | 258 | 975.65 | 1484
HCHHA     | 82.02  | 125 | 470.24 | 715

The number of updates performed by the naïve algorithm is about two times the number performed by HCHHA. For the parameter values ε_p = 0.01, ε_s = 0.01, the naïve algorithm performed about 1.97 times more updates than HCHHA, and for the lower values ε_p = 0.001, ε_s = 0.001, it performed about 2.1 times more updates than HCHHA, for both the Trace1 and Trace2 streams. The reason for the higher update requirements of the naïve algorithm is that it generates η ancestors for each incoming transaction of the data stream and inserts them into the corresponding η data structures at various nodes of the lattice. This clearly shows that the naïve strategy for finding HCHH pairs from data streams is computationally expensive compared to the hierarchy-aware strategy employed by HCHHA.

Table 5 also provides a comparison of HCHHA and the naïve algorithm in terms of processing time. The number of updates gives a good indication of their relative efficiency, but the actual run times provide a clearer picture of their algorithmic complexities. Concerning processing time, the naïve algorithm is the slower of the two, and the difference between the two algorithms follows the same trend as that of the updates. For instance, for the parameter values ε_p = 0.01, ε_s = 0.01, the naïve algorithm took about 1.96 times more processing time than HCHHA, and for the lower values ε_p = 0.001, ε_s = 0.001, it took about 2 times more processing time than HCHHA for both the Trace1 and Trace2 streams.


[Fig. 8 here: six panels (a)-(f) plotting the Dice, BGM and OGM scores of the naïve algorithm and HCHHA against the number of records N, for the settings ε_p = ε_s = 0.01, φ_p = φ_s = 0.02 and ε_p = ε_s = 0.001, φ_p = φ_s = 0.002.]

Fig. 8: Comparison of HCHHA and the naïve algorithm using the Dice, OGM and BGM scores.

This further clarifies the complexity difference between the naïve and hierarchy-aware strategies. It can also be seen from Table 5 that HCHHA and the naïve algorithm are slightly more efficient at processing skewed data than non-skewed data (the Trace1 stream is somewhat sparse, whereas the Trace2 stream is skewed). For instance, the naïve algorithm took around 32.5 seconds to process 1 million transactions of the Trace1 stream, but only around 32 seconds to process 1 million records of the Trace2 stream. Similarly, HCHHA took about 16.7 seconds to process 1 million transactions from Trace1, but 16 seconds to process 1 million records from Trace2. The reason for this phenomenon is that, in the case of skewed data, the incoming transactions often match the stored records and do not require any search for, or deletion of, stored records. To summarize, HCHHA is about two times faster than the naïve algorithm, confirming that the

naïve strategy for computing HCHH pairs from the data stream is computationally expensive compared to the hierarchy-aware strategy.

5.3.3 Memory Usage

Table 6 provides a comparison of the memory usage of HCHHA and the naïve algorithm for ε_p = 0.01, ε_s = 0.01 and for ε_p = 0.001, ε_s = 0.001. It reports memory usage both in terms of the number of records maintained by the algorithms in their data structures and in terms of run-time memory usage, i.e., the actual bytes consumed. It can be seen that the memory usage of HCHHA is significantly less than that of the naïve algorithm. The naïve algorithm requires the most memory, as it stores every ancestor of each incoming record, while HCHHA does not. From Table 6 we find that, on average, the naïve algorithm requires about 1.6 times more memory than HCHHA.


TABLE 6: Memory comparison of HCHHA and the naïve algorithm using the Trace1 and Trace2 streams, for two parameter settings.

(a) ε_p = 0.01, ε_s = 0.01

Algorithm | Trace1 (8.14 m): Mem (#R) | Trace1: Mem (KB) | Trace2 (48.18 m): Mem (#R) | Trace2: Mem (KB)
naïve     | 3079 | 419403.95 | 3018 | 411027.38
HCHHA     | 1980 | 255847.56 | 1884 | 241413.24

(b) ε_p = 0.001, ε_s = 0.001

Algorithm | Trace1 (8.14 m): Mem (#R) | Trace1: Mem (KB) | Trace2 (48.18 m): Mem (#R) | Trace2: Mem (KB)
naïve     | 12675 | 1726234.67 | 12425 | 1692186.62
HCHHA     | 7968  | 1085287.54 | 7793  | 1061453.77

[Fig. 9 here: maximum memory usage (KBytes) of OCHHA, HCHHA and the naïve algorithm for (ε_p, ε_s) settings ranging from (0.01, 0.01) down to (0.002, 0.002).]

Fig. 9: Memory comparison of HCHHA, OCHHA, and the naïve algorithm for various parameter settings.

It can also be seen from Table 6 that the memory usage of the algorithms on the two streams is almost the same. The memory usage of all the algorithms on the Trace1 stream is slightly higher than on the Trace2 stream. The reason is that the Trace2 stream is slightly skewed compared to the Trace1 stream. For skewed streams, the incoming records often match the stored records and do not require the creation of additional counters. Moreover, in a skewed stream many records have low counts, which are deleted by the algorithms during the compression step, and the memory consumed by those records is reclaimed.

Figure 9 compares the memory usage of HCHHA and the naïve algorithm using a range of parameter settings. It clearly demonstrates that HCHHA needs less memory than the naïve algorithm across these settings. In general, for lower values of ε_p and ε_s, the memory requirements of the approximate algorithms increase; however, this increase is worse for the naïve algorithm than for HCHHA. Figure 9 also provides the memory requirements of OCHHA, which considers no hierarchy in the data, as a frame of reference. We can see that HCHHA has a significant memory advantage over the naïve approach and requires only a little more memory than OCHHA, even with the more complex internal design needed to accommodate the hierarchical nature of the data. HCHHA uses about 73% less memory than the benchmark naïve algorithm. Thus, we emphasize that the naïve strategy can be improved substantially in practice, concerning memory requirements, by designing hierarchy-aware algorithms like HCHHA. We conclude that HCHHA outperforms the naïve algorithm on all evaluation metrics, i.e., output quality, memory requirements and update cost. Therefore, we recommend advanced hierarchy-aware algorithms for computing HCHH pairs from streaming data.

6 RELATED WORK

6.1 Data Stream Mining

Several studies on computing heavy hitters have evolved from the simple tracking of items and their frequencies to more complex notions, such as hierarchical heavy hitters [1], [16]; computing correlated aggregates, e.g., min, max and average [17]; time-decayed correlated aggregates [18]; correlated and conditional heavy hitters [6], [19]; and frequent itemsets and association rules [2]. Previous studies on correlated aggregates considered queries of a different form than CHH queries. For instance, the query first applies a selection predicate along the primary dimension p, such as p ≥ c or p < c, where c is provided at query time; next, the query performs a selection on the secondary dimension s. The difference between such formulations and the CHH concept is that, in CHH, the selection predicate along the primary dimension involves the frequencies f_p and extracts heavy hitters, rather than performing a simple comparison. For example, Gehrke et al. [20] proposed a technique based on adaptive histograms to find correlated aggregates, where the aggregate along the primary dimension is min, max or average, and the aggregate along the secondary dimension is sum or count. The drawback of this technique is that it does not provide any guarantees. Ananthakrishna et al. [21] proposed an algorithm based on quantile summaries to find correlated sums or counts with error bounds; however, the approach is not suitable for finding heavy hitters. A more interesting case has been considered by Cormode et al. [18], who focused on time-decayed correlated aggregates, where items in the stream are weighted based on their arrival time. Although their work considered sum aggregates, it cannot directly be applied to find heavy hitters or CHH.

Studies on distributed stream processing focus on various issues, including benchmarking IoT streams [22], monitoring of distributed quantiles [23], dynamic continuous queries [24], real-time queries [25], [26] and distributed remote invocations [27]. Nevertheless, these approaches do not focus on hierarchical streams [28]; hence, they do not solve the problems that are specific to hierarchical data streams. There are techniques that summarize hierarchical data, such as HHH [1], [16] and multilevel and CLAR [2]. However, these notions treat the two elements of the pair equally, not taking into account the sequential nature of the items' relationship within the pairs, and thus cannot discover the hierarchical correlation semantics between items at multiple concept levels. Hence, we introduce the concept of HCHH, which captures the correlation between pairs of items at multiple concept levels.


6.2 State-of-the-Art Technologies

Data streams possess at least two core properties of big data: volume and velocity. Streams may also exhibit variety in certain cases, such as streams from social networking sites or Twitter streams that contain heterogeneous data. A wide range of approaches have been proposed to deal with these big data issues [29], [30], [31]. Recently, parallel and distributed architectures such as Apache Hadoop and Google MapReduce have become increasingly popular due to their ability to handle the volume and velocity issues of big data [32], [33], [34]. State-of-the-art techniques have been proposed to extend these technologies to meet certain application-specific demands. For instance, the authors in [29] proposed a sophisticated architecture for time-critical big data applications by extending Apache Spark (http://spark.apache.org/) to support time-critical map-reduce. Currently, the majority of the available state-of-the-art stream processing technologies, such as IBM InfoSphere (https://www-01.ibm.com/software/data/infosphere/), Spark, Flink (http://flink.apache.org) and Storm (http://storm.apache.org/), serve as general-purpose middleware providing support for designing analytic applications on top of them. There are many important data stream mining problems, such as finding frequent items, association rules, quantiles, inner products, self-joins and frequency statistics, which have seen numerous research efforts [1], [8], [17], [20], [21]. It is important to understand that the algorithms proposed to solve these problems do not compete with state-of-the-art stream processing technologies (IBM InfoSphere, Spark, Flink and Storm); rather, these algorithms can run on top of these technologies to improve their performance. Similarly, the algorithms proposed in this work solve certain data stream mining problems and can also benefit from these technologies, but they do not replace the functions of state-of-the-art stream processing technologies.

7 CONCLUSION

Finding Hierarchically Correlated Heavy Hitters is useful for many online applications that require capturing semantics in the form of hierarchically correlated pairs. It can be helpful in network monitoring, anomaly detection and data analysis for business intelligence and planning. This paper described the new notion of HCHH and proposed algorithms to solve the HCHH problem in two-dimensional streams, where items are drawn from different hierarchies, such as time, locations, products, clickstreams and IP addresses. We compared the HCHH concept with existing hierarchical pattern mining approaches, including HHH and CLAR mining, both conceptually and experimentally. Our experiments show that the results produced by HCHH are on average 92% dissimilar to both CLAR and HHH, demonstrating that these notions conceptually describe very different phenomena. We also described a naïve algorithm and compared it against the proposed HCHHA in a streaming environment. We found that HCHHA outperforms the benchmark algorithm

on every evaluation metric: on average, HCHHA outputs 34% more accurate hierarchically correlated items, while using 73% less run-time memory and 49.5% less processing time than the naïve algorithm. We therefore recommend hierarchy-aware algorithms such as HCHHA for computing HCHH pairs from streaming data.

REFERENCES

[1] Z. Shah, A. N. Mahmood, and M. Barlow, "Computing discounted multidimensional hierarchical aggregates using modified Misra Gries algorithm," Performance Evaluation, vol. 91, pp. 170–186, 2015.
[2] J. Han and Y. Fu, "Mining multiple-level association rules in large databases," IEEE Transactions on Knowledge and Data Engineering, vol. 11, no. 5, pp. 798–805, 1999.
[3] L. Fegaras, "Incremental query processing on big data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 11, pp. 2998–3012, 2016.
[4] A. Metwally, D. Agrawal, and A. El Abbadi, "Efficient computation of frequent and top-k elements in data streams," in Database Theory – ICDT 2005. Springer, 2005, pp. 398–412.
[5] B. Lahiri and S. Tirthapura, "Finding correlated heavy-hitters over data streams," in Proc. IEEE International Performance Computing and Communications Conference (IPCCC), 2009, pp. 307–314.
[6] B. Lahiri, A. Mukherjee, and S. Tirthapura, "Identifying correlated heavy-hitters in a two-dimensional data stream," Data Mining and Knowledge Discovery, pp. 1–22, 2015.
[7] E. Malinowski and E. Zimányi, "OLAP hierarchies: A conceptual perspective," in Advanced Information Systems Engineering. Springer, 2004, pp. 477–491.
[8] G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava, "Finding hierarchical heavy hitters in streaming data," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 1, no. 4, p. 2, 2008.
[9] J. Han and Y. Fu, "Discovery of multiple-level association rules from large databases," in Proc. VLDB, vol. 95, 1995, pp. 420–431.
[10] M. Charikar, K. Chen, and M. Farach-Colton, "Finding frequent items in data streams," in Automata, Languages and Programming. Springer, 2002, pp. 693–703.
[11] R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss, "Space-optimal heavy hitters with strong error bounds," ACM Transactions on Database Systems (TODS), vol. 35, no. 4, p. 26, 2010.
[12] J. Micheel, I. Graham, and N. Brownlee, "The Auckland data set: an access link observed," in Proc. of Access Networks and Systems, 2001, pp. 19–30.
[13] W. Lin, S. A. Alvarez, and C. Ruiz, "Efficient adaptive-support association rule mining for recommender systems," Data Mining and Knowledge Discovery, vol. 6, no. 1, pp. 83–105, 2002.
[14] C. W.-k. Leung, S. C.-f. Chan, and F.-l. Chung, "An empirical study of a cross-level association rule mining approach to cold-start recommendations," Knowledge-Based Systems, vol. 21, no. 7, pp. 515–529, 2008.
[15] P. Ganesan, H. Garcia-Molina, and J. Widom, "Exploiting hierarchical domain structure to compute similarity," ACM Transactions on Information Systems (TOIS), vol. 21, no. 1, pp. 64–93, 2003.
[16] Z. Shah, A. N. Mahmood, and M. Barlow, "Computing hierarchical summary of the data streams," in Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 2016, pp. 168–179.
[17] S. Tirthapura and D. P. Woodruff, "A general method for estimating correlated aggregates over a data stream," in Proc. IEEE International Conference on Data Engineering (ICDE), 2012, pp. 162–173.
[18] G. Cormode, S. Tirthapura, and B. Xu, "Time-decayed correlated aggregates over data streams," Statistical Analysis and Data Mining: The ASA Data Science Journal, vol. 2, no. 5-6, pp. 294–310, 2009.
[19] K. Mirylenka, G. Cormode, T. Palpanas, and D. Srivastava, "Conditional heavy hitters: detecting interesting correlations in data streams," The VLDB Journal, vol. 24, no. 3, pp. 395–414, 2015.
[20] J. Gehrke, F. Korn, and D. Srivastava, "On computing correlated aggregates over continual data streams," in ACM SIGMOD Record, vol. 30, no. 2. ACM, 2001, pp. 13–24.
[21] R. Ananthakrishna, A. Das, J. Gehrke, F. Korn, S. Muthukrishnan, and D. Srivastava, "Efficient approximation of correlated sums on data streams," IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 3, pp. 569–572, 2003.
[22] A. Shukla, S. Chaturvedi, and Y. Simmhan, "RIoTBench: A real-time IoT benchmark for distributed stream processing platforms," arXiv preprint arXiv:1701.08530, 2017.
[23] Z. Shah, A. Mahmood, Z. Tari, and A. Y. Zomaya, "A technique for efficient query estimation over distributed data streams," IEEE Transactions on Parallel and Distributed Systems, 2017.
[24] Z. Deng, X. Wu, L. Wang, X. Chen, R. Ranjan, A. Zomaya, and D. Chen, "Parallel processing of dynamic continuous queries over streaming data flows," IEEE Transactions on Parallel and Distributed Systems, vol. 26, no. 3, pp. 834–846, 2015.
[25] P. Basanta-Val, N. Fernández-García, A. J. Wellings, and N. C. Audsley, "Improving the predictability of distributed stream processors," Future Generation Computer Systems, vol. 52, pp. 22–36, 2015.
[26] P. Basanta-Val, N. Fernández-García, L. Sánchez-Fernández, and J. A. Fisteus, "Patterns for distributed real-time stream processing," IEEE Transactions on Parallel and Distributed Systems, 2017.
[27] P. Basanta-Val, N. Fernández-García, and L. Sánchez-Fernández, "Predictable remote invocations for distributed stream processing," Future Generation Computer Systems, 2017.
[28] Z. Shah, A. Anwar, A. N. Mahmood, Z. Tari, and A. Y. Zomaya, "A spatio-temporal data summarization paradigm for real-time operation of smart grid," IEEE Transactions on Big Data, 2017.
[29] P. Basanta-Val, N. C. Audsley, A. J. Wellings, I. Gray, and N. Fernández-García, "Architecting time-critical big-data systems," IEEE Transactions on Big Data, vol. 2, no. 4, pp. 310–324, 2016.
[30] P. Li, S. Guo, T. Miyazaki, X. Liao, H. Jin, A. Zomaya, and K. Wang, "Traffic-aware geo-distributed big data analytics with predictable job completion time," IEEE Transactions on Parallel and Distributed Systems, 2016.
[31] N. Marz and J. Warren, Big Data: Principles and Best Practices of Scalable Realtime Data Systems. Manning, 2015.
[32] H. Song, P. Basanta-Val, A. Steed, M. Jo et al., "Next-generation big data analytics: State of the art, challenges, and future research topics," IEEE Transactions on Industrial Informatics, 2017.
[33] J. Ekanayake, S. Pallickara, and G. Fox, "MapReduce for data intensive scientific analyses," in Proc. IEEE Fourth International Conference on eScience, 2008, pp. 277–284.
[34] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop distributed file system," in Proc. IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), 2010, pp. 1–10.

Zubair Shah received a B.S. degree in computer science from the University of Peshawar (Pakistan), an M.S. degree in computer system engineering from Politecnico di Milano (Italy), and a PhD degree from the University of New South Wales (Canberra, Australia). He was a lecturer from 2012 to 2013 at the City University of Science & Technology, Peshawar, Pakistan. He is currently a Postdoctoral Research Fellow at the Australian Institute of Health Innovation, Macquarie University, Australia. His research interests include data mining and summarization, big data analytics, and machine learning. He has published his work in various IEEE Transactions and A-tier international journals and conferences.

Abdun Naser Mahmood received the PhD degree from the University of Melbourne, Australia, in 2008, and the MSc (Research) degree in computer science and the B.Sc. degree in applied physics and electronics from the University of Dhaka, Bangladesh, in 1999 and 1997, respectively. He has been working as an academic in computer science since 1999: he was a Lecturer from 2000 and an Assistant Professor from 2003 at the University of Dhaka, and he received a Postdoctoral Research Fellowship at the Royal Melbourne Institute of Technology (2008-2011). In 2011, he joined the School of Engineering and IT at UNSW as a Lecturer in Computer Science, where he remained until joining La Trobe University, Melbourne, in 2017; he is currently working there as a Senior Lecturer in Cyber Security. His research interests include data mining techniques for scalable network traffic analysis, anomaly detection, and industrial SCADA security. He has published his work in various IEEE Transactions and A-tier international journals and conferences.


Michael Barlow is a computer scientist. He received his Ph.D. from UNSW Australia in 1991, after which he worked at the University of Queensland (Australia) and Nippon Telegraph & Telephone's Human Communication Laboratories in Japan. His research interests straddle the areas of simulation, virtual environments, machine learning, serious games, and human-computer interaction. He has produced more than 100 papers in quality international journals and conferences.

Zahir Tari is a full professor in Distributed Systems at RMIT University (Australia). He received a bachelor degree in Mathematics from the University of Algiers (USTHB, Algeria) in 1984, an MSc in Operational Research from the University of Grenoble (France) in 1985, and a PhD degree in Computer Science from the University of Grenoble (France) in 1989. Prof. Tari's expertise is in the areas of system performance (e.g., Web servers, P2P, Cloud) as well as system security (e.g., SCADA systems, Cloud). He is the co-author of six books (John Wiley, Springer) and has edited over 25 conference proceedings. Prof. Tari is also a recipient of over 9M$ in funding from the ARC (Australian Research Council) and was lately part of a successful 7th Framework AU2EU (Australia to European) bid on Authorisation and Authentication for Entrusted Unions. Finally, Zahir was an associate editor of the IEEE Transactions on Computers (TC), the IEEE Transactions on Parallel and Distributed Systems (TPDS) and the IEEE Magazine on Cloud Computing.

Xun Yi is a professor with the School of Computer Science and IT, RMIT University, Australia. His research interests include applied cryptography, computer and network security, mobile and wireless communication security, and privacy-preserving data mining. He has published more than 150 research papers in international journals, such as IEEE Transactions on Knowledge and Data Engineering, IEEE Transactions on Wireless Communications and IEEE Transactions on Dependable and Secure Computing, and in conference proceedings. Recently, he has led a few Australian Research Council (ARC) Discovery Projects.

Albert Y. Zomaya is the Chair Professor of High Performance Computing & Networking in the School of Information Technologies, University of Sydney, and he also serves as the Director of the Centre for Distributed and High Performance Computing. Professor Zomaya has published more than 600 scientific papers and articles and is author, co-author or editor of more than 20 books. He is the Founding Editor-in-Chief of the IEEE Transactions on Sustainable Computing and serves as an associate editor for more than 20 leading journals. Professor Zomaya served as Editor-in-Chief of the IEEE Transactions on Computers (2011-2014). He is the recipient of the IEEE Technical Committee on Parallel Processing Outstanding Service Award (2011), the IEEE Technical Committee on Scalable Computing Medal for Excellence in Scalable Computing (2011), and the IEEE Computer Society Technical Achievement Award (2014). He is a Chartered Engineer and a Fellow of AAAS, IEEE, and IET. Professor Zomaya's research interests are in the areas of parallel and distributed computing and complex systems.

8 APPENDIX

8.1 Proofs

Lemma 2. [Accuracy] The OCHHA algorithm provides the following accuracy guarantee:

$\hat{f}_p \leq f_p + \epsilon_p N^{del}, \qquad \hat{f}_{p,s} \leq f_{p,s} + \epsilon_s f_p$

Proof. Observe that the only error in the estimate of a primary item $p$ is due to items that are deleted from the sketch $SS_p$; thus, the error in the estimated frequency of $p$ is relative to $N^{del}$. Since the sketch $SS_p$ uses $O(1/\epsilon_p)$ memory, this error is bounded by $\epsilon_p N^{del}$, i.e., $\hat{f}_p \leq f_p + \epsilon_p N^{del}$. Similarly, the sketch $SS_{p,s}$ guarantees $\hat{f}_{p,s} \leq f_{p,s} + \epsilon_p \epsilon_s N$, due to the Space Saving algorithm [4]. Since $\epsilon_s f_p \geq \epsilon_s \epsilon_p N$, the sketch $SS_{p,s}$ therefore guarantees $\hat{f}_{p,s} \leq f_{p,s} + \epsilon_s f_p$.

Since

$\forall p \in F_m, \; g_p \geq \phi_p N \;\Rightarrow\; \forall p' \notin F_m, \; g_{p'} < \phi_p N,$

which means that neither $p$ nor any ancestor of $p$ belongs to the set $F_m$, that is,

$\sum_{(p \notin F_m) \wedge (p \not\preceq p_a \in F_m)} F_{p_a} = g_p.$

Therefore,

$\sum F_{p_a} = \sum_{(g_p \ldots)} g_p$
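To make the role of the Space Saving algorithm [4] in the bound above concrete, the following minimal Python sketch implements the standard Space Saving counter with $k = \lceil 1/\epsilon \rceil$ counters, for which every monitored item's estimate exceeds its true frequency by at most $\epsilon N$. The variable names and the toy stream are illustrative only; the sketch does not reproduce the $SS_p$ / $SS_{p,s}$ bookkeeping or the $N^{del}$ accounting of OCHHA.

```python
# Minimal Space Saving sketch: for any item kept among the k counters,
# f <= f_hat <= f + N/k <= f + eps * N, which is the overestimation term
# that Lemma 2 propagates through the SS_p and SS_{p,s} sketches.
import math

class SpaceSaving:
    def __init__(self, eps: float):
        self.k = math.ceil(1.0 / eps)   # number of monitored counters
        self.counts = {}                # item -> estimated frequency

    def update(self, item) -> None:
        if item in self.counts:
            self.counts[item] += 1
        elif len(self.counts) < self.k:
            self.counts[item] = 1
        else:
            # evict the minimum-count item; the newcomer inherits that
            # count + 1, which is the sole source of overestimation
            victim = min(self.counts, key=self.counts.get)
            self.counts[item] = self.counts.pop(victim) + 1

    def estimate(self, item) -> int:
        return self.counts.get(item, 0)

sketch = SpaceSaving(eps=0.25)          # k = 4 counters
for x in "abacabdaeaf":                 # N = 11, so error is at most 2.75
    sketch.update(x)
print(sketch.counts)                    # 'a' (true count 5) is estimated exactly
```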
