Mining Top-K Patterns from Binary Datasets in Presence of Noise
Claudio Lucchese∗, Salvatore Orlando†, Raffaele Perego‡
Abstract
The discovery of patterns in binary datasets has many applications, e.g. in electronic commerce, TCP/IP networking, Web usage logging, etc. Still, this is a very challenging task in many respects: overlapping vs. non-overlapping patterns, presence of noise, extraction of the most important patterns only. In this paper we formalize the problem of discovering the Top-K patterns from binary datasets in presence of noise as the minimization of a novel cost function. According to the Minimum Description Length principle, the proposed cost function favors succinct pattern sets that may approximately describe the input data. We propose a greedy algorithm for the discovery of Patterns in Noisy Datasets, named PaNDa, and show that it outperforms related techniques on both synthetic and real-world data.
1 Introduction
Many popular applications – for example, in the electronic commerce, TCP/IP networking, Web usage logging, or mobile communications domains – generate daily huge amounts of transactions. Such huge datasets can be stored as binary matrices, where each row is a transaction, and the 0-1 bits correspond to the presence/absence of a given binary attribute (item) in each transaction. The binary attributes are in many cases not independent, as the transactions to which they belong generally model real-life phenomena and events. Patterns, i.e., collections of items whose occurrences are somehow related to each other [15], can be mined from these binary datasets, in order to understand common behavior and derive interesting knowledge about the people/phenomena/events that generated them. Finding significant patterns may be very complex, and algorithms for efficiently and effectively solving this problem have attracted a lot of attention in the Data Mining community. In particular, the problem of detecting the most "important" patterns embedded in data, often known as Top-K patterns, is a particularly challenging task. In Figure 1 we show a toy binary dataset made of 50 items (columns) and 1000 transactions (rows). Thanks to a suitable re-ordering of rows and columns, at first glance it is easy to observe the presence of three patterns (corresponding to the three dark rectangles).
∗ I.S.T.I.-C.N.R., Pisa, Italy. † Dept. of Computer Science, Univ. of Venice, Italy. ‡ I.S.T.I.-C.N.R., Pisa, Italy.
Figure 1: A synthetic binary dataset: rows correspond to transactions, columns correspond to items.
These Top-3 patterns consist of three overlapping sets of items occurring together in three overlapping sets of transactions. Nevertheless, the translation of our simple intuition into an actual pattern mining algorithm is very challenging. The overlapping of patterns and the presence of noise in real-world data (which is simulated by random flips in Figure 1) may hide these patterns. Indeed, none of the algorithms proposed in the literature is able to discover that the dataset in Figure 1 simply contains three patterns plus some noise. Frequent itemset mining algorithms are not able to find those patterns due to the presence of noise: not all the items of a pattern may always occur simultaneously in the set of supporting transactions/observations. A way to solve this issue is to relax the definition of support, as for Error Tolerant Itemsets (ETIs) [3, 16, 17]. But ETIs also have a number of drawbacks: they are too many, and they include a lot of spurious itemsets containing only noisy occurrences. Among the dataset tiling techniques, a recent proposal is Hyper+ [19], which tries to cover with at most K patterns every item occurring in the dataset. Since the algorithm assumes that no noise is present, it tries to cover every noisy bit as well, and thus fails in finding the Top-3 patterns in Figure 1. Other interesting approaches for solving the problem of detecting the Top-K patterns in a noisy binary dataset are based on matrix decomposition.
In this regard, it is worth discussing the Discrete Basis Problem [13], which has partially inspired the novel framework proposed in this paper. The algorithm proposed to solve that problem, called Asso [13], yields a set of K basis vectors, i.e., sets of correlated attributes, and also determines how these vectors must be combined to express each row/transaction of the binary dataset. The output of the algorithm can also be interpreted as a set of K patterns, i.e. itemsets and related transactions, which approximately cover the input data. The parameter K forces Asso to introduce some errors, i.e. bits incorrectly covered or not covered by the mined patterns. Asso greedily discovers a set of K basis vectors that minimizes such error, but unfortunately fails in discovering the Top-3 patterns of Figure 1, as discussed in Section 4.
In this paper we thus introduce a new Top-K Pattern Discovery Problem, whose formulation is based on the minimization of a cost function that takes into account both the error and the complexity of the patterns being discovered. We claim that previous proposals, like Asso, may generate over-fitted models: Asso minimizes only the error, and this leads to the extraction of several patterns that try to cover noisy occurrences.
Our work contains several original contributions. First, our problem formulation exploits a novel cost model used to evaluate the goodness of the set of patterns extracted. By minimizing our cost function we are able to detect the Top-K patterns, i.e. itemsets and their supporting transactions, according to the Minimum Description Length (MDL) principle [14]. Second, we propose PaNDa¹, a new efficient mining algorithm for the discovery of Patterns in Noisy Datasets. In particular, PaNDa's robustness is due to its ability to effectively deal with both false positives, i.e. item/transaction pairs present in the cover but not in the data, and false negatives, i.e. item/transaction pairs present in the data but not in the cover. Finally, we show that PaNDa outperforms Asso in terms of the quality of the extracted patterns. Several tests were conducted, exploiting both synthetic and real-world datasets, with the aim of validating the effectiveness of the proposed algorithm.
The paper is structured as follows. Section 2 introduces our cost function and formalizes the Top-K Pattern Discovery Problem in noisy binary datasets. Section 3 presents PaNDa, a robust and efficient algorithm for solving our problem. The experimental settings and the results of the tests conducted are analyzed in Section 4. Finally, Section 5 discusses the related works, while Section 6 draws some conclusions.
¹ The code is available at http://hpc.isti.cnr.it/~claudio
2 Problem Statement
Definition 1. (Transactional dataset D) Let D ∈ {0,1}^{N×M} be the binary representation of a transactional dataset, composed of N transactions {t_1, ..., t_N}, t_i ⊆ I, where I = {a_1, ..., a_M} is a set of M items. We have that D(i,j) = 1 iff a_j ∈ t_i, and D(i,j) = 0 otherwise.
We argue that any observed data D can be explained by the occurrence of a set of patterns Π, plus some noise. A pattern P ∈ Π is defined as a set of items occurring in a set of transactions, and can be represented by a pair P = ⟨P_I, P_T⟩, where P_I ∈ {0,1}^M and P_T ∈ {0,1}^N. The observed dataset D is indeed obtained as the or-ing of the various patterns in Π, plus some noise. If this is the case, we say that Π is a pattern model of D.
Definition 2. (Pattern Model) Let D be the binary representation of the input data. A set of patterns Π is called a pattern model of D if:

    D = ( ⋁_{P∈Π} P_T · P_I^T ) ⊕ N

where ⋁ is the logical or operation, P_T · P_I^T is the outer product of the two binary vectors (thus (P_T · P_I^T) ∈ {0,1}^{N×M}), ⊕ is the element-wise xor operation, and N ∈ {0,1}^{N×M} models the noise which acted on the data by flipping some uncorrelated bits.
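As an illustration of Definition 2 (our sketch, not part of the original formulation; sizes, patterns and the noise level are arbitrary), a dataset obeying the pattern model can be generated with NumPy as follows:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 6  # toy numbers of transactions and items

# Two patterns P = <P_I, P_T>: P_I marks the items, P_T the supporting transactions.
P1_I = np.array([1, 1, 1, 0, 0, 0], dtype=bool)
P1_T = np.array([1, 1, 1, 1, 0, 0, 0, 0], dtype=bool)
P2_I = np.array([0, 0, 1, 1, 1, 0], dtype=bool)
P2_T = np.array([0, 0, 0, 1, 1, 1, 1, 0], dtype=bool)

# Or-ing of the outer products P_T · P_I^T, as in Definition 2.
covered = np.outer(P1_T, P1_I) | np.outer(P2_T, P2_I)

# Noise N flips uncorrelated bits; here each bit is flipped with probability 5%.
noise = rng.random((N, M)) < 0.05
D = covered ^ noise  # element-wise xor
```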
The above model has many similarities with other proposals in related fields such as probabilistic latent semantic analysis, factor analysis, and topic models, where the keyword pattern is replaced by aspect, unobserved class variable, factor, latent topic, etc. In Section 5 we illustrate in detail differences and commonalities with our model. Also in our model the set of patterns Π is latent: it is unobserved by a user (and even obfuscated due to the presence of noise), but it is crucial for understanding the data. Our goal is thus to discover the set of patterns Π that best explains a dataset D. Unfortunately, there are several different pattern sets and noise matrices that can explain the data according to Definition 2. We choose to pick the best pattern set among all the possible candidates according to the Minimum Description Length (MDL) principle [14]. When choosing among different models, which are somehow encoded, the MDL principle suggests to adopt the model with the minimum code length. To this end, we define two
encoding cost functions γ_P : {0,1}^N × {0,1}^M → ℝ and γ_N : {0,1}^{N×M} → ℝ, for the patterns P ∈ Π and for the noise N respectively, and require the selected pattern set Π and the resulting noise matrix to have minimum cost. Finally, we force Π to have cardinality at most K. In the following we also assume the existence of a (hidden) true pattern set Ω that actually generated the data, and we argue that the pattern set Π selected according to the MDL principle closely approximates Ω. We can finally introduce the Top-K Pattern Discovery Problem.
Problem 1. (Top-K Pattern Discovery Problem) Given a binary dataset D ∈ {0,1}^{N×M} and an integer K, find the pattern set Π, |Π| ≤ K, that defines a pattern model of D and minimizes the following cost:

    γ(Π, D) = Σ_{P∈Π} γ_P(P_T, P_I) + γ_N(N)

where γ_P(x, y) = ||x||_F + ||y||_F and γ_N(z) = ||z||_F. The Frobenius norm ||·||_F, when applied to either a binary vector or a binary matrix, simply counts the number of 1 bits it contains.
The rationale of the two cost functions γ_P and γ_N is to penalize items that are not clustered together. Suppose that a large "rectangle" of 1's is present in D. The mining algorithm has to decide whether to consider this rectangle noise or an actual pattern. In the first case, the algorithm incurs a cost equal to the number of its bits, i.e. the area of the rectangle. In the second case, the rectangle is described just using its items and transactions, and the cost is thus equal to its semi-perimeter (plus 1). Since the second option has a smaller encoding cost, the rectangle is promoted to a pattern. From an information-theoretic point of view, patterns in data are useful in generating a compressed description; the remainder is supposed to be noise.
The pattern model allows both false positives and false negatives to occur in the extracted pattern set Π. A false positive occurs when an item which is not present in D is covered by a pattern in Π, while a false negative occurs when an item present in D is not covered by Π. More formally, we call expected ground truth how the dataset would look if noise was not present. The expected ground truth deriving from the discovered pattern set Π is:

    Π^Δ = ⋁_{P∈Π} (P_T · P_I^T)

Note that, according to Definition 2, the noise matrix N can be computed in terms of Π as N = D ⊕ Π^Δ, and accounts for both false positives and false negatives:

    False positives:  D(i,j) = 0 and Π^Δ(i,j) = 1
    False negatives:  D(i,j) = 1 and Π^Δ(i,j) = 0

Therefore, in both the false positive and false negative cases, we have that N(i,j) = 1.
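For concreteness, the cost of Problem 1 can be evaluated directly from a candidate pattern set; the following sketch is ours (the function names are not from the paper) and assumes boolean NumPy arrays:

```python
import numpy as np

def gamma(patterns, D):
    """Cost of Problem 1: sum of pattern semi-perimeters plus the noise size.
    patterns is a list of (P_I, P_T) boolean vectors, D a boolean N x M matrix."""
    N, M = D.shape
    ground_truth = np.zeros((N, M), dtype=bool)        # the expected ground truth Π^Δ
    for P_I, P_T in patterns:
        ground_truth |= np.outer(P_T, P_I)
    noise = D ^ ground_truth                            # false positives and negatives
    pattern_cost = sum(int(P_I.sum() + P_T.sum()) for P_I, P_T in patterns)
    return pattern_cost + int(noise.sum())              # γ_P terms + γ_N(N)
```

With this cost, a large all-ones rectangle is cheaper to keep as a pattern (its semi-perimeter) than to charge as noise (its area), which is exactly the rationale described above.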
2.1 Boolean matrix decomposition. The problem of discovering a suitable pattern model Π according to Definition 2 can also be seen as a Boolean matrix decomposition problem. If |Π| = K, let P_T ∈ {0,1}^{N×K} and P_I ∈ {0,1}^{K×M} be two Boolean binary matrices, where P_T is obtained by juxtaposing the K column vectors P_T, whereas P_I is obtained by juxtaposing the K row vectors P_I^T. More specifically, for each P = ⟨P_I, P_T⟩ ∈ Π, if P_I^T appears as the h-th row of P_I, then P_T has to appear as the h-th column of P_T. Then the following equation holds:
    D = ( ⋁_{P∈Π} P_T · P_I^T ) ⊕ N = (P_T ∘ P_I) ⊕ N
where ∘ denotes the Boolean product of two matrices, i.e. a matrix product with addition and multiplication replaced by the logical or and and operations respectively. The above equation holds because, according to Definition 2, each element (i,j) of matrix Π^Δ = ⋁_{P∈Π} (P_T · P_I^T) is obtained by or-ing the corresponding binary entries (i,j) of all the matrices (P_T · P_I^T) ∈ {0,1}^{N×M}. Therefore, we can write:

    Π^Δ(i,j) = ⋁_{P∈Π} P_T(i) · P_I(j)
Note that the above equation exactly corresponds to the Boolean dot-product that involves the i-th row of P_T and the j-th column of P_I, and thus produces the entry (i,j) of the Boolean matrix product P_T ∘ P_I.
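The equivalence between the or of outer products and the Boolean matrix product is easy to verify numerically; the following is our sketch with randomly generated matrices:

```python
import numpy as np

def bool_product(A, B):
    # Boolean matrix product: (A ∘ B)(i, j) = OR_k ( A(i, k) AND B(k, j) )
    return (A.astype(int) @ B.astype(int)) > 0

rng = np.random.default_rng(1)
N, M, K = 10, 7, 3
P_T = rng.random((N, K)) < 0.3   # K juxtaposed column vectors P_T
P_I = rng.random((K, M)) < 0.3   # K juxtaposed row vectors P_I^T

or_of_outer = np.zeros((N, M), dtype=bool)
for k in range(K):
    or_of_outer |= np.outer(P_T[:, k], P_I[k, :])

assert np.array_equal(or_of_outer, bool_product(P_T, P_I))  # Π^Δ = P_T ∘ P_I
```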
The problem of finding suitable matrices P_T and P_I is also discussed in [13], where the Discrete Basis Problem is introduced. The rows of P_I are called basis vectors, since every transaction in D can be obtained by or-ing a subset of those basis vectors. Each row of matrix P_T encodes which basis vectors are needed for any given transaction. Indeed, the same pattern model described in Definition 2 is adopted, but with a different cost function. The authors aim at minimizing the noise N only, without taking into account any cost associated with the extracted pattern set Π. Their formulation can thus be reduced to Problem 1 with γ_P(·) = 0. Note that the decision version of the Discrete Basis Problem was proven to be NP-Complete. The authors propose a greedy algorithm, called Asso, for approximating the best solution of the Discrete Basis Problem. First, a set of candidate basis vectors is created by using global statistics: the i-th candidate basis vector is composed of item a_i plus any other item a_j having correlation with a_i greater than a user-defined threshold τ. Then, K basis vectors are iteratively selected by greedily minimizing the cost function.
In the present work, we generalize the framework proposed in [13] with a novel cost function, which takes into account the complexity of the discovered pattern set. We claim that minimizing N is not a sufficient criterion, since trying to cover every bit in D means covering also occurrences introduced by noise, and is thus likely to lead to over-fitting issues. We argue that the exploitation of the MDL principle may solve such issues, thus supporting the extraction of an improved pattern set.
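For reference, the correlation-based candidate generation used by Asso can be approximated as below; this is our simplified reading of the construction described in [13], not the authors' code, and the conditional co-occurrence frequency is used here as the correlation estimate:

```python
import numpy as np

def candidate_bases(D, tau):
    """For each item a_i, build a candidate basis vector containing every item a_j
    whose association with a_i is at least tau. Association is estimated here as
    the conditional co-occurrence frequency P(a_j | a_i) -- our simplification."""
    Df = D.astype(float)
    item_support = Df.sum(axis=0)                       # support of each item
    co_occurrence = Df.T @ Df                           # pairwise co-occurrence counts
    with np.errstate(divide="ignore", invalid="ignore"):
        assoc = np.where(item_support[:, None] > 0,
                         co_occurrence / item_support[:, None], 0.0)
    return assoc >= tau        # row i is the candidate basis generated by item a_i
```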
3 Algorithm PaNDa
The solution space of Problem 1 is extremely large: there are 2^M × 2^N candidate patterns – where M is the number of items in I, and N is the number of transactions – out of which only a few must be selected. To tackle such a large search space, we propose a greedy algorithm named PaNDa. It adopts two heuristics. First, the problem of discovering a pattern is decomposed into two simpler problems: discovering a core pattern [4] and extending it to form a good approximate pattern. Second, rather than considering all the 2^M possible combinations of items, these are sorted and processed one by one without backtracking.
Similarly to AC-Close [4], we assume that, even in presence of noise, a true pattern P ∈ Ω occurs in D with a smaller core pattern. That is, given a pattern P = ⟨P_I, P_T⟩, there exists C = ⟨C_I, C_T⟩ such that C_T(i) = 1 ⇒ P_T(i) = 1, C_I(j) = 1 ⇒ P_I(j) = 1 and C_T(i) = 1 ∧ C_I(j) = 1 ⇒ D(i,j) = 1. Then, given C, PaNDa adds items and transactions to C, until the largest extension of C that minimizes the overall cost γ is found.

Algorithm 1 PaNDa algorithm.
 1: Π ← ∅                                  ▷ the current collection of patterns
 2: D_R ← D                                ▷ the residual data yet to be explained
 3: for iter ← 1, ..., K do
 4:     C, E ← Find-Core(D_R, Π, D)
 5:     C⁺ ← Extend-Core(C, E, Π, D)
 6:     if γ(Π, D) < γ(Π ∪ C⁺, D) then
 7:         break                          ▷ cost cannot be improved any more
 8:     end if
 9:     Π ← Π ∪ C⁺
10:     D_R(i,j) ← 0  ∀ i,j s.t. C⁺_T(i) = 1 ∧ C⁺_I(j) = 1
11: end for

Algorithm 1 gives an overview of the PaNDa algorithm. It iterates two main steps at most K times, where K, provided by the user, is the maximum number of patterns to be extracted. During each iteration, first a core pattern C is extracted (line 4), and then it is extended to form a new pattern C⁺ (line 5). If C⁺ reduces the cost of the model, it is added to the current pattern set Π. These two steps are described in detail in the following sections. Regardless of the user-provided parameter K, PaNDa stops generating new patterns if they do not improve the cost of the model (line 6). The algorithm uses a particular view D_R of the dataset D for the discovery of a new core pattern (line 10). This view is called residual dataset, and it is computed by setting D(i,j) = 0 for any i,j such that P_T(i) = 1 and P_I(j) = 1 for some previously discovered pattern P. The rationale is that we want to discover new patterns that explain a portion of the database that was not already covered by any previous pattern.
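A compact sketch (ours) of the outer loop of Algorithm 1; find_core, extend_core and gamma stand for the procedures of Sections 3.1-3.2 and for the cost of Problem 1, and D is a boolean NumPy matrix:

```python
import numpy as np

def panda(D, K, gamma, find_core, extend_core):
    """Greedy outer loop of Algorithm 1 (sketch): extract a core, extend it,
    and keep it only if it lowers the overall cost γ."""
    patterns = []                       # the current collection Π
    residual = D.copy()                 # the residual dataset D_R
    for _ in range(K):
        core, extension = find_core(residual, patterns, D)
        candidate = extend_core(core, extension, patterns, D)
        if gamma(patterns, D) < gamma(patterns + [candidate], D):
            break                       # the cost cannot be improved any more
        patterns.append(candidate)
        P_I, P_T = candidate
        residual[np.ix_(P_T, P_I)] = False   # zero out the cells covered by C+
    return patterns
```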
3.1 Extraction of dense cores
The procedure Find-Core extracts a core pattern C from the residual dataset D_R, and returns an ordered list E of items that are later used to extend C. The procedure is described in Algorithm 2. The extension list E is initialized to the empty set. In order to reduce the search space, items in D_R are sorted (line 3); we discuss a number of possible orderings in Section 3.3. The core pattern C is initialized with the first item s_1 in the ordered list S, and its supporting transactions in D_R. The remaining items in S are processed one by one. The current item s_h is used to create a new candidate pattern C* (lines 8-10). Since in this phase we do not allow noise, when we add an item to a pattern (line 9) we may have, as a consequence, a reduction in the number of supporting transactions (line 10). If C* reduces the cost of the pattern set with respect to C, then C* is promoted to be the new candidate (line 12) and it is used in the subsequent iteration. Otherwise, s_h is appended to the extension list E (line 14), and it can be used later to extend the extracted dense core. Eventually, every item occurring in D_R has been processed exactly once, and either it was added to the dense core C, or it was appended to the extension list E.

Algorithm 2 Core Pattern discovery.
 1: function Find-Core(D_R, Π, D)
 2:     E ← ∅                                  ▷ extension list
 3:     S = {s_1, ..., s_M} ← sort-items-in-db(D_R)
 4:     C ← ⟨C_T = 0_N, C_I = 0_M⟩
 5:     C_I(s_1) ← 1
 6:     C_T(i) ← 1  ∀ i s.t. D_R(i, s_1) = 1
 7:     for all h ← 2, ..., M do
 8:         C* ← C                             ▷ create a new candidate
 9:         C*_I(s_h) ← 1
10:         C*_T(i) ← 0  ∀ i s.t. D_R(i, s_h) = 0
11:         if γ(Π ∪ C*, D) ≤ γ(Π ∪ C, D) then
12:             C ← C*
13:         else
14:             E.append(s_h)
15:         end if
16:     end for
17:     return C, E
18: end function

Algorithm 3 Extension of a Core Pattern.
 1: function Extend-Core(C, E, Π, D)
 2:     while E ≠ ∅ do                         ▷ add a new item
 3:         e ← E.pop()
 4:         C* ← C
 5:         C*_I(e) ← 1
 6:         if γ(Π ∪ C*, D) ≤ γ(Π ∪ C, D) then
 7:             C ← C*
 8:         end if
 9:         for i ∈ {1, ..., N} s.t. C*_T(i) = 0 do   ▷ add new transactions
10:             C* ← C
11:             C*_T(i) ← 1
12:             if γ(Π ∪ C*, D) ≤ γ(Π ∪ C, D) then
13:                 C ← C*
14:             end if
15:         end for
16:     end while
17:     return C
18: end function
The procedure returns C and E for further processing. Being greedy, the procedure may not find the best core pattern, i.e. the one minimizing the cost. However, since the procedure is invoked K times, this risk is amortized, and the probability of getting a good set of core patterns out of K runs will be sufficiently large. In addition, in Section 3.4 we will consider a randomization strategy to avoid local minima. This makes it possible to successfully exploit a simple greedy strategy such as the one we just described.
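A simplified rendering (ours) of Algorithm 2 over a dense boolean matrix, using the plain frequency ordering; PaNDa's actual implementation relies on the prefix-tree described next, and gamma is the cost function of Problem 1:

```python
import numpy as np

def find_core(residual, patterns, D, gamma):
    """Greedy core extraction (sketch of Algorithm 2)."""
    N, M = residual.shape
    extension = []                                     # the extension list E
    order = np.argsort(-residual.sum(axis=0))          # items by descending frequency
    C_I = np.zeros(M, dtype=bool)
    C_I[order[0]] = True
    C_T = residual[:, order[0]].copy()                 # transactions supporting s_1
    for item in order[1:]:
        cand_I, cand_T = C_I.copy(), C_T.copy()
        cand_I[item] = True                            # add the item ...
        cand_T &= residual[:, item]                    # ... the support can only shrink
        if gamma(patterns + [(cand_I, cand_T)], D) <= gamma(patterns + [(C_I, C_T)], D):
            C_I, C_T = cand_I, cand_T                  # keep the cheaper candidate
        else:
            extension.append(item)                     # kept for Extend-Core
    return (C_I, C_T), extension
```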
A prefix-tree based data structure is used to store the transactions of D_R, which are sorted according to descending item frequency. A prefix-tree makes it easy to test the goodness of a new candidate, i.e. when a new item is added, by processing only a subset of its branches.
3.2 Extension of dense cores
Given a dense core C = ⟨C_I, C_T⟩, the procedure Extend-Core tries to add items to C_I and transactions to C_T, possibly introducing some noise, as long as the overall representation cost is reduced. The procedure is described in Algorithm 3. An extension list E ⊂ I is given. The first item e is removed from the list E and added to C_I, thus forming a new pattern C* (line 5). If this pattern does not improve the cost of representing Π and N (line 6), then item e is disregarded. The set C_T is extended analogously. A transaction t_i which has not yet been considered in C_T is used to create a new candidate pattern C* (line 11). If the new pattern carries any improvement over the previous representation (line 12), then C* is promoted and considered for further extensions in place of the old C. The above two steps are repeated as long as there is a new item in E.
There are some interesting subtleties in the evaluation of a new pattern C*. First, C* may introduce some noise: if the item e is added to C*_I, then e may not occur in every transaction of C*_T, and similarly when extending C_T. The noise cost thus increases by the number of missing items, i.e. false positives. At the same time, the extension may cover a number of items that were already considered as noise by some previous patterns, and these noisy bits are not accounted again in the cost function γ_N. If the balance between items covered and noise introduced justifies the increased representation cost of C*, then the new pattern C* is accepted. Second, the convenience of C* depends on the amount of data it covers that was not already covered by other patterns in Π. For example, suppose that C* does not introduce any noise, and that all the items covered thanks to the extension were already covered by another pattern in Π. Then, the only variation in the cost function occurs because of the cost of representing Π, which increases, thus making the extended pattern C* not convenient.
The algorithm stores a vertical representation of the dataset, where a tid-list is stored for each item, i.e. the list of the transactions containing a given item. This allows us to quickly check the cost incurred when adding a new item or a new transaction to C*. This data
structure is updated after a new pattern is discovered, in order to distinguish items that were already covered by the current pattern set Π.
3.3 Sorting items
The ordering of the items chosen in Alg. 2 at line 3 affects the whole algorithm. Indeed, items are associated with a score, and then ordered by descending score. Such ordering is a greedy choice that may hurt the quality of the extracted patterns. We consider four scoring strategies:
• Frequency. The score of an item is given by its frequency in D_R.
• Couples frequency. As in [20], items are sorted according to their contribution to the frequency of the 2-itemsets in D_R. The score of item a_j is given by the sum of the supports of every itemset {a_j ∪ a_k} occurring in D_R.
• Correlation. The score is given by the correlation with the current candidate pattern C.
• Prefix child. The itemset C is identified by a set of paths on the prefix-tree starting from its root. Given the deepest nodes in those paths, the support of their children is cumulated to compute a score.
The rationale of the frequency strategy is clear: the most frequent item has the highest probability of being part of a core pattern. This is also the least expensive ordering to compute. The other three strategies aim at grouping together items with high correlation. The couples frequency based order is slightly more expensive, since the whole prefix-tree must be processed. The correlation based order must be re-computed every time the itemset C is updated (i.e. after line 10 of Alg. 2), and a processing is required of those portions of the prefix-tree supporting C. Finally, the prefix child based strategy approximates the correlation one by reducing the number of nodes of the prefix-tree visited to re-calculate a new ordering. In the last two strategies, when C = ∅ the ordering is given by the items' frequency.
3.4 Randomizing PaNDa
To reduce the risks of a greedy strategy we introduce some randomization. Lines 4 and 5 of Alg. 1 are repeated R times, where R is the number of randomization rounds. During each round the ordering of items is perturbed as follows. Each item is associated with a randomized score, computed by drawing a random number between 0 and its original score raised to the power of τ. Then, items are re-ordered according to their randomized score. The parameter τ works as a computational temperature, controlling the amount of randomization allowed: the larger τ, the larger the probability of having the largest score for an item with high support. The parameter τ is decreased exponentially at each randomization round. After the R candidate patterns have been found, only the best one is evaluated for its inclusion in the pattern set Π.
4 Experiments
4.1 Experiments on synthetic data
Evaluating the goodness of the patterns discovered by a mining process is a very difficult task. The patterns present in a given dataset are unknown, and therefore there is no ground truth that we can use as a benchmark. However, Gupta et al. [8] proposed a collection of synthetic datasets, along with a set of evaluation measures, to evaluate the goodness of the patterns extracted in presence of noise. The idea is to synthesize a dataset by embedding a given set of patterns – i.e., the ground truth Ω – and then introduce noise by randomly flipping a given percentage of bits in the corresponding binary representation. An example of these synthetic datasets is shown in Figure 1. Since the embedded true patterns occurring in the dataset are known, it is possible to compare them with the patterns extracted by any mining algorithm. Due to space constraints, we only use three out of the eight datasets proposed, namely the datasets identified by numbers 5, 6, and 8, in which two, three, and four patterns respectively are embedded. Moreover, there are overlaps between the embedded patterns. All the datasets have 50 items and 1000 transactions. These datasets allow for repeatability of experiments, and they can be used to detect and understand the weaknesses of an algorithm.
The authors of [8] propose four evaluation measures that only consider the items of a discovered pattern, i.e. only P_I for each discovered pattern P ∈ Π. We are instead interested in an improved quality measure, which takes into account both the sets P_I and P_T. Given the true patterns Ω and the discovered ones Π, we measure the goodness of Π on the basis of how well Π matches Ω. We say that an occurrence (i,j) is correctly retrieved iff Π^Δ(i,j) = 1 and Ω^Δ(i,j) = 1. Note that if this holds, we cannot derive that D(i,j) = 1, since this bit might be flipped due to the noise. We adapt the usual precision and recall measures to this setting, and compute the pattern-based F-measure
170
Copyright © by SIAM. Unauthorized reproduction of this article is prohibited.
Figure 3: Patterns extracted by PaNDa from data6.
Figure 2: Model cost of the different variants of the PaNDa algorithm. On top of each bar, the number of discovered patterns is reported.
F_p as follows:

    Prec_p(Π) = ( Σ_{i,j} Π^Δ(i,j) · Ω^Δ(i,j) ) / ||Π^Δ||_F

    Rec_p(Π)  = ( Σ_{i,j} Π^Δ(i,j) · Ω^Δ(i,j) ) / ||Ω^Δ||_F

    F_p(Π)    = 2 · Prec_p(Π) · Rec_p(Π) / ( Prec_p(Π) + Rec_p(Π) )
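These measures translate directly into code; a small sketch (ours), where Pi_delta and Omega_delta are the boolean N×M coverage matrices Π^Δ and Ω^Δ built as in Section 2:

```python
import numpy as np

def pattern_based_scores(Pi_delta, Omega_delta):
    """Pattern-based precision, recall and F-measure between the discovered
    coverage Π^Δ and the true coverage Ω^Δ (assumed non-empty)."""
    retrieved = np.logical_and(Pi_delta, Omega_delta).sum()  # correctly retrieved cells
    prec = retrieved / Pi_delta.sum()
    rec = retrieved / Omega_delta.sum()
    f = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0
    return prec, rec, f
```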
In the following we use the cost function γ(Π, D) and the pattern-based F-measure F_p(Π) to evaluate the variants of the PaNDa algorithm, and to compare the best of them with the Asso algorithm.
In Figure 2 we report the normalized model costs, i.e. γ(Π, D)/||D||_F, for the four variants of the PaNDa algorithm resulting from the four item ordering strategies, and for their randomized versions. The test was run on the dataset data5 with a noise level of 4%, i.e. each bit in D was randomly flipped with probability 4%. The parameter K was set to infinity and the number of randomization rounds R to 20. The first observation is that the introduction of randomization brings a significant benefit: on data5 the cost of the model is reduced from about 0.8 to 0.6. The second observation regards the number of discovered patterns, shown on top of each bar. Generally this number increases with the randomized approach. We deduce that randomization helps in finding smaller dense regions, which can be added to the extracted model Π while still decreasing the total cost.
We conclude that the most promising variants of PaNDa are (1) the randomized frequency and (2) the randomized correlation. The former has a very simple and slightly faster implementation, and the introduction of randomization does not increase the number of patterns. The latter performs better in terms of cost of the model, but it always generates the largest number of patterns. Therefore, in the remainder of the paper we focus on those two variants only.
Next, we compare PaNDa against the Asso algorithm [13]. Even if we know exactly how many patterns have been embedded in each synthetic dataset, this information is not available in real data analysis tasks. So we asked the algorithms to extract the best K = 15 patterns from each dataset. The rationale is to avoid any bias due to information that is usually not known in advance.
In Figures 4.a-c we evaluate the ability to minimize the cost of the model representation as defined in Problem 1, as a function of the amount of random noise (percentage of flipped bits) embedded in the datasets. The three algorithms have very similar behavior, with no significant differences. The only exception regards dataset data5, where Asso produces a representation with a much larger cost for small noise levels.
The above comparison may seem unfair, since Asso optimizes a different cost function, namely the amount of noise ||N||_F. For this reason, in Figures 4.d-f we report the normalized model error ||N||_F/||D||_F produced by the three algorithms on the same datasets. In this case, the Asso algorithm outperforms PaNDa, and the differences appear more significant than in the previous set of plots. Again, Asso is not able to find a good pattern set for data5 when no error is present.
Recall that our primary goal is to approximate the true pattern set Ω hidden in the datasets. Figures 4.g-i show the pattern-based F-measure F_p as a function of the noise embedded in the datasets. In every experiment, the two variants of the PaNDa algorithm
show consistently better performance than Asso. Indeed, PaNDa reaches a value of F_p about 10% larger in every experiment. We draw the following conclusion: Asso is able to find a good covering of the given dataset, i.e. to minimize N; PaNDa, on the other hand, can discover the true patterns embedded in the data.
In Figure 3 we show the actual patterns extracted by PaNDa from data6 after 6% random flips. The dataset is the same shown in Figure 1. The PaNDa algorithm was able to discover the three embedded patterns and their supporting transactions. Only a few spurious transactions were added, because they contain a large portion of the patterns' items due to noise.
4.2 Experiments on real data
We conducted a set of experiments on a collection of real-world datasets taken from the UCI Machine Learning Repository [1]. This repository contains labelled binary and non-binary data, which are mainly used to validate classification and clustering tasks. Indeed, a pattern P identifies a cluster of correlated transactions P_T, and therefore we evaluated the pattern set Π as if it were the result of a clustering task. Hereinafter, we use the term clusters for the transaction sets corresponding to the extracted patterns, and the term classes for the partitioning of transactions in the benchmark dataset.
Unfortunately, the patterns discovered by both PaNDa and Asso may overlap, i.e. a transaction may belong to different patterns. Conversely, in the test datasets every transaction exclusively belongs to a single class. In order to allow a sound comparison between the discovered clusters and the actual classes present in the data, we post-processed the output of the two algorithms so as to associate each transaction with a single best matching pattern. For each transaction t_i, all patterns P ∈ Π are ranked according to the overlap score Σ_j P_I(j) · D(i,j), i.e. the cardinality of the set-intersection between the itemset identified by P_I and transaction t_i. In case of ties, the pattern P with the largest ||P_I||_F is preferred. Finally, transaction t_i is retained by the first ranked pattern, and removed from the others. Eventually, every transaction is associated with one pattern only, thus obtaining a partitional clustering as required.
The evaluation of the updated pattern set Π* was carried out by means of two external indices to measure the goodness of clustering [18]. Let F(i,j) be the usual F-measure of the i-th extracted cluster with respect to the j-th actual class embedded in the dataset. The clustering F-measure F_c is defined as follows:

    F_c(Π*) = Σ_j (N_j / N) · max_i F(i,j)

where N_j is the number of transactions in the j-th class. The function F_c averages the F-measures of the best extracted clusters for each actual class j. The second measure is the Jaccard index J(Π*):

    J(Π*) = J_11 / (J_11 + J_10 + J_01)

where J_11 is the number of transaction pairs (t_i, t_j) appearing together in the same cluster as well as in the same class, J_10 is the number of pairs appearing together in a cluster but belonging to different classes, and J_01 is the number of pairs belonging to different clusters but to the same class.
We measured the quality of the clusters extracted from 22 different datasets, which have been discretized using the LUCS-KDD DN software [5]. The class label was removed in advance. Also in this case we asked each algorithm for the top 15 patterns. Table 1 reports the average length (||P_I||_F) and support (||P_T||_F) of the patterns in the extracted set Π before the post-processing, and the values of the two quality indices F_c and J. Asso performs best only on four datasets, while the randomized correlation variant of PaNDa performs best on average: it provides an F_c value 18% better than Asso, and the improvement is 65% for the J index. We also report the geometric mean, since this is less affected by outliers: the gains with respect to Asso are 17% and 83% respectively for the F_c and J measures. An interesting observation is that the PaNDa variants extract much smaller patterns, with less than half of the items. This is because the size of the patterns' description is a cost to be minimized. In fact, overlaps are penalized since they cover the same data, thus producing a less redundant pattern set. It is worth underlining that shorter patterns are more human understandable, and possibly have a higher prediction power. We do not report the running time of the algorithms since, even if PaNDa is slightly slower, the running time is below one second on average.
In conclusion, we have shown that on both synthetic and real-world data our pattern model may help in finding the true patterns hidden in the data.
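For reference, the post-processing that turns the overlapping patterns into a partitional clustering, as described at the beginning of this subsection, can be sketched as follows (our rendering; the tie-breaking trick is ours):

```python
import numpy as np

def assign_transactions(patterns, D):
    """Associate each transaction with its single best-matching pattern: rank
    patterns by the overlap score sum_j P_I(j)*D(i,j), breaking ties in favor
    of the pattern with the larger ||P_I||_F."""
    N, _ = D.shape
    scores = np.zeros((N, len(patterns)))
    for k, (P_I, _) in enumerate(patterns):
        scores[:, k] = (D & P_I).sum(axis=1)           # overlap with each transaction
    lengths = np.array([float(P_I.sum()) for P_I, _ in patterns])
    # add a tiny length-proportional bonus so that exact ties go to longer patterns
    return (scores + 1e-6 * lengths).argmax(axis=1)    # cluster label per transaction
```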
Figure 4: Comparison of PaNDa and Asso on different synthetic datasets, on varying the noise level. Panels (a)-(c) report the normalized model cost, panels (d)-(f) the normalized model error, and panels (g)-(i) the pattern-based F-measure F_p, on data5, data6, and data8, as a function of the noise percentage, for ASSO, RND + Frequency, and RND + Correlation.
Table 1: Evaluation of PaNDa and Asso on 22 real-world datasets: average pattern length (len = ||P_I||_F), average support (supp = ||P_T||_F), clustering F-measure F_c and Jaccard index J.

dataset      | PaNDa RND + Frequency         | PaNDa RND + Correlation       | Asso
             | len    supp      Fc    J      | len    supp      Fc    J      | len    supp      Fc    J
adult        | 3.73   6992.47   0.75  0.64   | 3.93   7291.53   0.71  0.54   | 11.67  10950.60  0.50  0.19
anneal       | 4.93   133.13    0.69  0.55   | 3.27   136.13    0.69  0.60   | 11.93  175.47    0.43  0.14
breast       | 2.89   136.00    0.70  0.55   | 2.89   136.00    0.70  0.55   | 9.00   350.25    0.65  0.38
connect4     | 4.93   11442.40  0.64  0.50   | 4.67   12144.40  0.64  0.50   | 37.40  20210.00  0.34  0.13
cylBands     | 7.93   111.33    0.66  0.50   | 6.67   107.40    0.67  0.50   | 28.80  135.60    0.41  0.17
dematology   | 4.00   52.67     0.34  0.19   | 3.60   55.87     0.34  0.20   | 10.13  79.60     0.51  0.22
ecoli        | 3.38   62.50     0.46  0.30   | 2.62   42.92     0.45  0.29   | 5.87   41.73     0.69  0.52
flare        | 4.47   221.67    0.68  0.45   | 3.40   244.73    0.76  0.64   | 8.07   273.00    0.46  0.18
glass        | 3.93   31.07     0.57  0.34   | 3.73   29.40     0.54  0.32   | 6.73   30.60     0.50  0.19
heart        | 3.87   50.53     0.49  0.35   | 3.07   51.20     0.53  0.40   | 11.47  61.07     0.38  0.15
hepatitis    | 6.75   38.38     0.76  0.64   | 4.87   25.07     0.76  0.65   | 17.07  27.33     0.53  0.23
horseColic   | 5.20   69.13     0.59  0.38   | 3.73   66.80     0.63  0.43   | 9.80   81.00     0.46  0.16
ionosphere   | 13.93  58.93     0.72  0.58   | 10.27  65.07     0.73  0.60   | 24.33  53.00     0.58  0.25
iris         | 3.30   17.30     0.83  0.61   | 2.29   16.79     0.94  0.79   | 2.90   24.00     0.70  0.52
mushroom     | 6.20   1347.73   0.65  0.45   | 5.80   1443.93   0.63  0.46   | 16.53  1245.47   0.43  0.18
pageBlocks   | 4.57   790.00    0.86  0.81   | 4.22   616.00    0.86  0.81   | 9.67   398.93    0.89  0.84
penDigits    | 6.87   1660.00   0.48  0.23   | 5.53   1718.93   0.46  0.22   | 5.67   1858.40   0.46  0.21
pima         | 3.58   72.58     0.69  0.54   | 2.80   59.07     0.69  0.55   | 7.87   83.40     0.66  0.36
ticTacToe    | 2.87   159.5     0.42  0.15   | 2.00   153.27    0.30  0.12   | 2.07   246.33    0.36  0.13
waveform     | 5.00   894.13    0.63  0.42   | 4.53   945.40    0.63  0.42   | 16.20  999.80    0.36  0.15
wine         | 5.07   24.50     0.87  0.60   | 4.13   26.80     0.89  0.65   | 7.07   28.47     0.67  0.33
zoo          | 7.17   35.33     0.41  0.25   | 5.44   25.33     0.43  0.25   | 14.93  20.00     0.85  0.73
arit. mean   | 5.21   1109.15   0.63  0.46   | 4.25   1154.6    0.64  0.48   | 12.51  1698.82   0.54  0.29
             |                  +16%  +58%   |                  +18%  +65%   |
geom. mean   | 4.83   166.98    0.61  0.42   | 3.96   158.53    0.61  0.43   | 10.18  186.95    0.52  0.24
             |                  +17%  +75%   |                  +17%  +83%   |
5 Related Work
We classify related works in three large categories: matrix decomposition based, database tiling, and frequent itemsets based. All of them have many similarities with the Top-K Pattern Discovery Problem. However, there are four features that are considered altogether only by our framework: (a) our model is specifically tailored for binary datasets; (b) our model allows for overlapping patterns; (c) our model minimizes both the error with respect to the original dataset and the encoding cost of the discovered pattern set; (d) frequent itemsets need not be extracted from the original dataset. In the following, we illustrate in more detail the algorithms falling in the aforementioned categories.
Matrix decomposition based. The methods in this class aim at finding a product of matrices that describes the input data with the smallest possible amount of error. Probabilistic latent semantic indexing (PLSI) [10] is a well known technique that solves the above decomposition problem. PLSI was initially devised to model co-occurrence of terms in a corpus of documents D. The core of PLSI is a generative model called aspect model, according to which the occurrence of a word can be associated with an unobserved class variable Z ∈ 𝒵 called aspect, topic, or latent variable. The model generates each transaction t_i as follows. First, a topic Z ∈ 𝒵 is picked with probability Z_T(i), and then an item j is picked with probability Z_I(j). The whole database results from the contribution of every class variable, and thus we can write:

    D = Σ_{Z∈𝒵} Z_T · Z_I^T + N = Z_T Z_I + N

where the matrices Z_T and Z_I result from the juxtaposition of the vectors Z_T and Z_I, for every Z ∈ 𝒵. An Expectation-Maximization algorithm can be designed to find Z_T and Z_I such that N is minimized. This formulation is very similar to Definition 2. However, PLSI was not formulated for binary inputs: D, Z_T, Z_I and N are assumed to be real valued. Recall that the model was first designed to describe occurrences of terms in a document, and thus multiple occurrences of the same term are allowed. This is not true for a binary dataset, where an item can either occur or not. In other words, in one model pattern occurrences are summed, in the other they are or-ed. Other similar approaches for non-binary inputs have been studied, such as Latent Dirichlet Allocation (LDA) [2], Independent Component Analysis (ICA) [11], Non-negative Matrix Factorization, etc. However, evaluation studies suggest that these models cannot be trivially adapted to binary datasets [9].
An improvement over these models has been proposed in [15] and [13]. In [15] a different generative model is proposed: an item occurs if it is generated by at least one class variable. This makes the model close to an or-ing of class variables. The proposed algorithm, called Lift, is able to discover patterns when dominant items are present, i.e. items that are generated with high probability by one class variable only. We already discussed the Discrete Basis Problem [13] and the proposed greedy algorithm Asso, which makes use of items' correlation statistics. This happens to be a limitation in many cases, since global statistics may be too general to catch local correlations. Both [15] and [13] aim at minimizing the noise matrix N only.
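The difference between summed and or-ed occurrences is easy to see on a tiny example (ours):

```python
import numpy as np

Z_T = np.array([[1, 1],
                [1, 0],
                [0, 1]])                 # 3 transactions x 2 latent components
Z_I = np.array([[1, 1, 0],
                [0, 1, 1]])              # 2 latent components x 3 items

summed = Z_T @ Z_I                       # additive combination, as in PLSI
ored = (Z_T @ Z_I) > 0                   # Boolean combination, as in Definition 2

# In the first transaction the middle item is generated by both components:
# it counts 2 in the additive model, while it is simply present in the Boolean one.
```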
Database tiling. Database tiling algorithms are tailored to the discovery of large patterns in binary datasets. They differ in the notion of pattern adopted. The maximum K-tiling problem introduced in [6] requires finding the set of K tiles, possibly overlapping, having the largest coverage of the given database D. In this work, a tile is a pattern ⟨P_I, P_T⟩ such that if P_T(i) = 1 ∧ P_I(j) = 1 then D(i,j) = 1. Therefore, this approach is not able to handle the noise present in the database.
Co-clustering [12] is a borderline approach between tiling and matrix decomposition. It is formulated as a matrix decomposition problem, and therefore its objective is to approximate the input data by minimizing N. However, it does not allow for overlapping patterns. In this regard, its output is similar to a matrix block diagonalization.
According to [7], tiles can be hierarchical. A basic tile is a hyper-rectangle with density p. A tile might contain several non-overlapping sub-tiles, i.e., exceptional regions with larger or smaller density. Differently from our approach, low-density regions are considered as important as high-density ones, and inclusion of tiles is preferred to overlapping.
Finally, the authors of [19] propose a model similar to the one described in Definition 2, but they assume that no noise is present in the dataset; therefore, γ_N is always 0. The problem of finding Π is mapped to the problem of finding an optimal set covering of the input dataset. First, frequent itemsets F_α are extracted from D with some minimum support threshold α. Each frequent itemset F ∈ F_α identifies a candidate pattern C = ⟨C_T, C_I⟩ such that C_I(j) = 1 ⇔ a_j ∈ F and C_T(i) = 1 ⇔ t_i ⊇ F. The algorithm Hyper+ is a greedy algorithm that first selects a number of candidates that cover the whole of D with minimum cost, and then recursively merges the two candidates that introduce the smallest error, until only K patterns are left. Differently from our approach, Hyper+ allows only the presence of false positives: false negatives are not allowed, since the input data must be entirely covered. Also, since only the extracted patterns contribute to the cost function, this is not affected by the number of false positives. For this reason Hyper+ tends to create a large number of false positives, and therefore patterns that poorly describe the input data. Consider the case K = 1: rather than finding the best pattern, Hyper+ will extract a single pattern that covers the whole dataset, i.e. P = ⟨1_M, 1_N⟩.
Frequent itemsets based. The frequent itemset mining problem requires discovering those itemsets supported at least σ times in the database D. Formally, a pattern P = ⟨P_I, P_T⟩ is frequent if ||P_T||_F ≥ σ and:

    (5.1)    P_T(i) = 1 ∧ P_I(j) = 1 ⇒ D(i,j) = 1
This notion of support does not allow missing items. Generalizations of frequent itemsets for dealing with noisy databases are typically achieved by relaxing the notion of support. Weak error-tolerant itemsets (ETIs) [3] are the first example of such a generalization. The pattern P = ⟨P_I, P_T⟩ is said to be a weak ETI iff ||P_T||_F > σ and the implication in Eq. 5.1 is violated at most ε·||P_I||_F·||P_T||_F times, where ε is an error tolerance threshold. In extreme cases, a transaction may not contain any item of the pattern, and still be included in the supporting set. To avoid the possibility of such spurious transactions, strong ETIs are introduced, such that an error tolerance threshold ε_r is enforced on every transaction/row of the database: t_i may support P only if it contains at least ε_r·||P_I||_F items of P. Unfortunately, the enumeration of all weak/strong ETIs requires exploring the full itemset space without any pruning.
To overcome this limitation of ETI mining algorithms, in [16] the notion of dense itemset is introduced. An itemset is said to be dense if it is a weak ETI and all of its non-empty subsets are weak ETIs. This sort of downward closure allows for an Apriori-like algorithm which can find all the dense itemsets in a database.
Approximate Frequent Itemsets (AFIs) [17] are an extension of strong ETIs, where both a row-wise and a column-wise tolerance threshold are enforced. The pattern P = ⟨P_I, P_T⟩ is valid if it is a strong ETI and every item i of P is supported by at least ε_c·||P_T||_F transactions. In this case, a relaxed anti-monotone property can be exploited, and thus the solution set can be extracted without exploring the full itemset space.
In [4] the notion of closed approximate frequent itemsets (AC-AFIs) is introduced. A pattern is an AC-AFI if (a) it is an AFI, (b) there is no superset supported by the same transactions, and (c) it encloses a core pattern. Given a pattern P = ⟨P_I, P_T⟩, a core pattern C = ⟨C_I, C_T⟩ is such that C_T(i) = 1 ⇒ P_T(i) = 1, C_I(j) = 1 ⇒ P_I(j) = 1 and C_T(i) = 1 ∧ C_I(j) = 1 ⇒ D(i,j) = 1. Since core patterns are also frequent patterns, they can be quickly extracted and then extended by adding items and transactions as long as the resulting pattern is still an AC-AFI. We adopted the notion of core pattern in the formulation of the PaNDa algorithm in Sec. 3.
Frequent itemset based algorithms require a demanding data processing and, more importantly, they tend to generate a large and very redundant collection
of patterns. The cost model embedded in our framework implicitly limits the pattern explosion, since a large number of patterns does not minimize the representation cost. The cost model may also be used to rank patterns according to the cost reduction they contribute. In [8] all of the above frequent itemset based algorithms, plus some of their extensions, are evaluated over a collection of synthetic datasets. This collection, which we also used to evaluate our algorithm, constitutes the first benchmark for approximate frequent itemset mining algorithms.
6 Conclusion
In this paper we have discussed a new problem formulation for discovering the Top-K patterns from binary datasets. We have shown that this problem aims to decompose the data matrix in terms of patterns plus noise. The noise models not only false negatives (1 bits that are not covered by any pattern) but also false positives (0 bits covered by a pattern). Since a noisy input matrix can be decomposed in terms of several pattern sets and noise matrices, we propose an encoding function that allows a cost to be associated with the patterns and the noise detected. This cost is used for evaluating the goodness of the set of patterns extracted or, equivalently, of the matrix decomposition. Finally, according to the MDL principle, we argue that the best pattern model is the one that minimizes the overall cost function characterizing the matrix decomposition.
Solving our problem is complex, so we have proposed a new greedy algorithm, called PaNDa, which extracts the pattern model from a binary dataset by minimizing the aforementioned cost model. We compared PaNDa with Asso, a greedy algorithm specifically tailored for binary datasets and inspired by a similar framework. Asso aims at finding a Boolean decomposition of a binary matrix by minimizing a cost function that, unlike PaNDa's, only accounts for noise. We argue that this approach may generate over-fitted models: minimizing only the noise cost can lead to patterns that try to cover the noisy 1 bits present in the data.
The experiments conducted on both synthetic and real-world datasets have shown that PaNDa outperforms Asso. PaNDa can better deal with noise in the data, thanks to its cost function that also accounts for the complexity of the patterns extracted. In conclusion, our results show that the proposed cost model, which balances the amount of noise estimated in a dataset with the length of the description of the discovered patterns, is able to get close to the set of true patterns actually embedded in both synthetic and real-world datasets.
References
[1] A. Asuncion and D.J. Newman. UCI Machine Learning Repository, 2007.
[2] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
[3] C. Cheng, U. Fayyad, and P.S. Bradley. Efficient discovery of error-tolerant frequent itemsets in high dimensions. In Proc. of KDD, pages 194–203, 2001.
[4] H. Cheng, P.S. Yu, and J. Han. AC-Close: Efficiently mining approximate closed itemsets by core pattern recovery. In Proc. of ICDM, pages 839–844. IEEE Computer Society, 2006.
[5] F. Coenen. The LUCS-KDD discretised/normalised ARM and CARM data library, 2003.
[6] F. Geerts, B. Goethals, and T. Mielikäinen. Tiling databases. In Discovery Science, pages 278–289, 2004.
[7] A. Gionis, H. Mannila, and J.K. Seppänen. Geometric and combinatorial tiles in 0-1 data. In Proc. of PKDD, pages 173–184, 2004.
[8] R. Gupta, G. Fang, B. Field, M. Steinbach, and V. Kumar. Quantitative evaluation of approximate frequent pattern mining algorithms. In Proc. of KDD, pages 301–309. ACM, 2008.
[9] H. Hiisilä and E. Bingham. Dependencies between transcription factor binding sites: Comparison between ICA, NMF, PLSA and frequent sets. In Proc. of ICDM, pages 114–121, 2004.
[10] T. Hofmann. Probabilistic latent semantic indexing. In Proc. of SIGIR, pages 50–57. ACM, 1999.
[11] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 2001.
[12] T. Li. A general model for clustering binary data. In Proc. of KDD, pages 188–197. ACM, 2005.
[13] P. Miettinen, T. Mielikäinen, A. Gionis, G. Das, and H. Mannila. The discrete basis problem. IEEE TKDE, 20(10):1348–1362, Oct. 2008.
[14] J. Rissanen. Stochastic Complexity in Statistical Inquiry Theory. World Scientific Publishing Co., River Edge, NJ, USA, 1989.
[15] J.K. Seppänen, E. Bingham, and H. Mannila. A simple algorithm for topic identification in 0-1 data. In Proc. of PKDD, pages 423–434, September 2003.
[16] J.K. Seppänen and H. Mannila. Dense itemsets. In Proc. of KDD, pages 683–688, 2004.
[17] M. Steinbach, P.N. Tan, and V. Kumar. Support envelopes: a technique for exploring the structure of association patterns. In Proc. of KDD, 2004.
[18] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley, 2006.
[19] Y. Xiang, R. Jin, D. Fuhry, and F.F. Dragan. Succinct summarization of transactional databases: an overlapped hyperrectangle scheme. In Proc. of KDD, pages 758–766. ACM, 2008.
[20] M.J. Zaki and C.-J. Hsiao. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE TKDE, 17(4):462–478, April 2005.