Finding Interesting Associations without Support Pruning

Edith Cohen*   Mayur Datar†   Shinji Fujiwara‡   Aristides Gionis§
Piotr Indyk¶   Rajeev Motwani‖   Jeffrey D. Ullman**   Cheng Yang††

October 19, 1999

* AT&T Shannon Lab, Florham Park, NJ.
† Department of Computer Science, Stanford University. Supported by a School of Engineering Fellowship and NSF Grant IIS-9811904.
‡ Hitachi Limited, Central Research Laboratory. This work was done while the author was on leave at the Department of Computer Science, Stanford University.
§ Department of Computer Science, Stanford University. Supported by NSF Grant IIS-9811904.
¶ Department of Computer Science, Stanford University. Supported by a Stanford Graduate Fellowship and NSF Grant IIS-9811904.
‖ Department of Computer Science, Stanford University. Supported in part by NSF Grant IIS-9811904.
** Department of Computer Science, Stanford University. Supported in part by NSF Grant IIS-9811904.
†† Department of Computer Science, Stanford University. Supported by a Leonard J. Shustek Fellowship, part of the Stanford Graduate Fellowship program, and NSF Grant IIS-9811904.
Abstract
Association-rule mining has heretofore relied on the condition of high support to do its work efficiently. In particular, the well-known a-priori algorithm is only effective when the only rules of interest are relationships that occur very frequently. However, there are a number of applications, such as data mining, identification of similar web documents, clustering, and collaborative filtering, where the rules of interest have comparatively few instances in the data. In these cases, we must look for highly correlated items, or possibly even causal relationships between infrequent items. We develop a family of algorithms for solving this problem, employing a combination of random sampling and hashing techniques. We provide analysis of the algorithms developed, and conduct experiments on real and synthetic data to obtain a comparative performance analysis.
1 Introduction

A prevalent problem in large-scale data mining is that of association-rule mining, first introduced by Agrawal, Imielinski, and Swami [1]. This challenge is sometimes referred to as the market-basket problem due to its origins in the study of consumer purchasing patterns in retail stores, but the applications extend far beyond this specific setting. Suppose we have a relation R containing n tuples over a set of boolean attributes A_1, A_2, ..., A_m. Let I = {A_{i1}, A_{i2}, ..., A_{ik}} and J = {A_{j1}, A_{j2}, ..., A_{jl}} be two sets of attributes. We say that I ⇒ J is an association rule if the following two conditions are satisfied: support, i.e., the set I ∪ J appears in at least an s-fraction of the tuples; and confidence, i.e., amongst the tuples in which I appears, at least a c-fraction also have J appearing in them. The goal is to identify all valid association rules for a given relation.
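As a concrete illustration of the two conditions, the following minimal sketch (the toy relation and all names are ours, not from the paper) computes the support and confidence of a candidate rule I ⇒ J over a small boolean relation represented as a list of attribute sets.

    def support_and_confidence(tuples, I, J):
        """tuples: list of sets of attribute names; I, J: sets of attributes."""
        n = len(tuples)
        has_I = sum(1 for t in tuples if I <= t)          # tuples containing all of I
        has_IJ = sum(1 for t in tuples if (I | J) <= t)   # tuples containing all of I and J
        support = has_IJ / n
        confidence = has_IJ / has_I if has_I else 0.0
        return support, confidence

    # Toy relation: each tuple is the set of attributes that are 1.
    R = [{"beer", "diapers"}, {"beer", "diapers", "chips"}, {"beer"}, {"milk"}]
    print(support_and_confidence(R, {"beer"}, {"diapers"}))  # (0.5, 0.666...)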
To some extent, the relative popularity of this problem can be attributed to its paradigmatic nature, the simplicity of the problem statement, and its wide applicability in identifying hidden patterns in data from applications far more general than the original market-basket motivation. Arguably, though, this success has as much to do with the availability of a surprisingly efficient algorithm, the lack of which has stymied other models of pattern discovery in data mining. The algorithmic efficiency derives from an idea due to Agrawal et al. [1, 2], called a-priori, which exploits the support requirement for association rules. The key observation is that if a set of attributes S appears in a fraction s of the tuples, then any subset of S also appears in at least a fraction s of the tuples. This principle enables the following approach based on pruning: to determine a list L_k of all k-sets of attributes with high support, first compute a list L_{k-1} of all (k-1)-sets of attributes of high support, and consider as candidates for L_k only those k-sets that have all their (k-1)-subsets in L_{k-1}. Variants and enhancements of this approach underlie essentially all known efficient algorithms for computing association rules or their variants. Note that in the worst case, the problem of computing association rules requires time exponential in m, but the a-priori algorithm avoids this pathology on real data sets. Observe also that the confidence requirement plays no role in the algorithm, and indeed is completely ignored until the end-game, when the high-support sets are screened for high confidence.

Our work is motivated by the long-standing open question of devising an efficient algorithm for finding rules that have very high confidence, but for which there is no (or extremely weak) support. For example, in market-basket data the standard association-rule algorithms may be useful for commonly purchased (i.e., high-support) items such as "beer and diapers," but are essentially useless for discovering rules such as that "Beluga caviar and Ketel vodka" are almost always bought together, because few people purchase either of the two items. We develop a body of techniques that rely on the confidence requirement alone to obtain efficient algorithms. One motivation for seeking such associations, with high confidence but without any support requirement, is that most high-support rules are obvious and well known, and it is the low-support rules that provide interesting new insights.

Not only are support-free associations a natural class of patterns for data mining in their own right, they also arise in a variety of applications, such as: copy detection, identifying identical or similar documents and web pages [4, 13]; clustering, identifying similar vectors in high-dimensional spaces for the purposes of clustering data [6, 9]; and collaborative filtering, tracking user behavior and making recommendations to individuals based on the similarity of their preferences to those of other users [8, 16]. Note that each of these applications can be formulated in terms of a table whose columns tend to be sparse, and the goal is to identify column pairs that appear to be similar, without any support requirement. There are also other forms of data mining, e.g., detecting causality [15], where it is important to discover associated columns, but there is no natural notion of support.
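Returning to the a-priori pruning step described above, the following minimal sketch (function and variable names are ours, not from [1, 2]) generates candidate k-sets from the frequent (k-1)-sets and prunes any candidate that has an infrequent (k-1)-subset.

    from itertools import combinations

    def apriori_candidates(prev_frequent, k):
        """prev_frequent: collection of frozensets of size k-1 with high support.
        Returns candidate k-sets all of whose (k-1)-subsets are in prev_frequent."""
        prev = set(prev_frequent)
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(frozenset(sub) in prev
                                           for sub in combinations(union, k - 1)):
                    candidates.add(frozenset(union))
        return candidates

    # From these frequent 2-sets, only {beer, chips, diapers} survives the pruning.
    L2 = {frozenset(p) for p in [("beer", "diapers"), ("beer", "chips"), ("chips", "diapers")]}
    print(apriori_candidates(L2, 3))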
The notion of confidence is asymmetric or uni-directional, and it will be convenient for our purpose to work with a symmetric or bi-directional measure of interest. At a conceptual level, we view the data as a 0/1 matrix M with n rows and m columns. Typically, the matrix is fairly sparse, and we assume that the average number of 1s per row is r and that r ≪ m.

Theorem 1 Let ε, δ > 0, and k ≥ 2 ε^{-2} (c*)^{-1} log δ^{-1}. Then, for all pairs of columns ci and cj, we have the following two properties:

  (a) if S(ci, cj) ≥ s* ≥ c*, then Ŝ(ci, cj) ≥ (1 - ε)s* with probability at least 1 - δ;
  (b) if S(ci, cj) ≤ c*, then Ŝ(ci, cj) ≤ (1 + ε)c* with probability at least 1 - δ.

We sketch the proof of the first part of the theorem; the proof of the second part is quite similar and is omitted. Fix any two columns ci and cj having similarity S(ci, cj) ≥ s*. Let X_l be a random variable that takes on value 1 if h_l(ci) = h_l(cj), and value 0 otherwise; define X = X_1 + ... + X_k. By Proposition 1, E[X_l] = S(ci, cj) ≥ s*; therefore, E[X] ≥ ks*. Applying the Chernoff bound [12] to the random variable X, we obtain that

  Prob[X < (1 - ε)ks*] ≤ Prob[X < (1 - ε)E[X]] ≤ e^{-ε²E[X]/2} ≤ e^{-ε²ks*/2} ≤ e^{-ε²kc*/2} < δ.

To establish the first part of the theorem, simply notice that Ŝ(ci, cj) = X/k.

Theorem 1 establishes that for sufficiently large k, if two columns have high similarity (at least s*) in M, then they agree on a correspondingly large fraction of the Min-Hash values in M̂; conversely, if their similarity is low (at most c*) in M, then they agree on a correspondingly small fraction of the Min-Hash values in M̂. Since M̂ can be computed in a single pass over the data using O(km) space, we obtain the desired implementation of the first phase (signature computation). We now turn to the task of devising a suitable implementation of the second phase (candidate generation).
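The signature-computation phase can be made concrete with the following minimal sketch; the representation of columns as sets of row indices and the salted hash functions standing in for the random Min-Hash family are our assumptions, not the paper's exact construction.

    import random

    def minhash_signatures(columns, k, seed=0):
        """columns: list of sets of row indices where the column has a 1.
        Returns, for each column, its k Min-Hash values (one per hash function)."""
        rng = random.Random(seed)
        salts = [rng.getrandbits(32) for _ in range(k)]  # k salted hash functions
        return [[min(hash((salt, row)) for row in col) for salt in salts] for col in columns]

    def estimated_similarity(sig_i, sig_j):
        """S_hat(ci, cj): fraction of the k Min-Hash values on which the columns agree."""
        return sum(a == b for a, b in zip(sig_i, sig_j)) / len(sig_i)

    # Two overlapping columns and one disjoint column; with k = 200 the first
    # estimate is close to the pair's Jaccard similarity 4/6.
    cols = [{1, 2, 3, 5, 8}, {1, 2, 3, 5, 9}, {10, 11}]
    sigs = minhash_signatures(cols, k=200)
    print(estimated_similarity(sigs[0], sigs[1]), estimated_similarity(sigs[0], sigs[2]))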
3.1 Candidate Generation from Min-Hash Values
Having computed the signatures in the first phase as discussed in the previous section, we now wish to generate the candidate column-pairs in the second phase. At this point, we have a k × m matrix M̂ containing k Min-Hash values for each column.

Since k ≤ |Cj|, we have E[|Sig_ij|] ≤ E[|Sig_ji|]. We assume that Prob[|Sig_ij| > |Sig_ji|] ≈ 0, or Σ_{y=x}^{k} Prob[|Sig_ji| = y | |Sig_ij| = x] ≈ 1. Then, the above equation becomes

  E[|Sig_i ∩ Sig_j|] = Σ_{x=0}^{k} Σ_{y=x}^{k} Prob[|Sig_ij| = x] Prob[|Sig_ji| = y | |Sig_ij| = x] x
                     = Σ_{x=0}^{k} Prob[|Sig_ij| = x] x Σ_{y=x}^{k} Prob[|Sig_ji| = y | |Sig_ij| = x]
                     ≈ Σ_{x=0}^{k} Prob[|Sig_ij| = x] x
                     = E[|Sig_ij|].

Thus, we obtain the estimator E[|Sig_i ∩ Sig_j|] ≈ k|C_ij|/|C_i|. We use this estimate to calculate |C_ij|, and use that in turn to estimate the similarity, since we know |C_i| and |C_j|. We compute |Sig_i ∩ Sig_j| using the hash table technique described earlier in Section 3.1. The time required to compute the hash values is O(|M| + mk log n log k), as described earlier, and the time for computing |Sig_i ∩ Sig_j| is O(kSm²).
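The hash-table technique of Section 3.1 is referred to above but not reproduced in this excerpt. As a hedged sketch of one natural way to obtain |Sig_i ∩ Sig_j| for many pairs at once (not necessarily the paper's exact technique), the signatures can be indexed by value, so that only columns sharing at least one Min-Hash value are ever paired.

    from collections import defaultdict
    from itertools import combinations

    def signature_intersections(signatures):
        """signatures: list of per-column collections of Min-Hash values.
        Returns |Sig_i ∩ Sig_j| for every pair sharing at least one value,
        using an inverted index on values instead of comparing all pairs."""
        index = defaultdict(list)   # value -> columns whose signature contains it
        for col, sig in enumerate(signatures):
            for v in set(sig):
                index[v].append(col)
        counts = defaultdict(int)   # (i, j) -> size of the signature intersection
        for cols_with_v in index.values():
            for i, j in combinations(sorted(cols_with_v), 2):
                counts[(i, j)] += 1
        return counts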
4 Locality-Sensitive Hashing Schemes

In this section we show how to obtain a significant improvement in the running time over the previous algorithms by resorting to the Locality-Sensitive Hashing (LSH) technique introduced by Indyk and Motwani [11] for designing main-memory algorithms for nearest-neighbor search in high-dimensional Euclidean spaces; it has subsequently been improved and tested in [7]. We apply the LSH framework to the Min-Hash functions described in the earlier section, obtaining an algorithm for finding similar column-pairs. Our problem differs from nearest-neighbor search in that the data is known in advance. We exploit this property by showing how to optimize the running time of the algorithm given constraints on the quality of the output. Our optimization is input-sensitive, i.e., it takes into account the characteristics of the input data set.

The key idea in LSH is to hash columns so as to ensure that, for each hash function, the probability of collision is much higher for similar columns than for dissimilar ones. Subsequently, the hash table is scanned and column-pairs hashed to the same bucket are reported as similar. Since the process is probabilistic, both false positives and false negatives can occur. To reduce the former, LSH amplifies the difference in collision probabilities between similar and dissimilar pairs. To reduce false negatives, the process is repeated a few times, and the union of the pairs found over all iterations is reported. The fractions of false positives and false negatives can be controlled analytically through the parameters of the algorithm.

Although it is not the main focus of this paper, we mention that the LSH algorithm can be adapted to the on-line framework of [10]. In particular, it follows from our analysis that each iteration of our algorithm reduces the number of false negatives by a fixed factor; it can also add new false positives, but these can be removed at a small additional cost. Thus, the user can monitor the progress of the algorithm and interrupt the process at any time if satisfied with the results produced so far. Moreover, the higher the similarity of a pair, the earlier it is likely to be discovered. Therefore, the user can terminate the process when the output being produced appears less and less interesting.
4.1 The Min-LSH Scheme
We now present the Min-LSH (M-LSH) scheme for finding similar column-pairs from the matrix M̂ of Min-Hash values. The M-LSH algorithm splits the matrix M̂ into l sub-matrices of dimension r × m. Recall that M̂ has dimension k × m, and here we assume that k = lr. Then, for each of the l sub-matrices, we repeat the following. Each column, represented by the r Min-Hash values in the current sub-matrix, is hashed into a table using as hashing key the concatenation of all r values. If two columns are similar, there is a high probability that they agree in all r Min-Hash values and so hash into the same bucket. At the end of the phase we scan the hash table and produce the pairs of columns that have been hashed to the same bucket. To amplify the probability that similar columns will hash to the same bucket, we repeat the process l times. Let P_{r,l}(ci, cj) be the probability that columns ci and cj hash to the same bucket at least once; since this probability depends only on s = S(ci, cj), we simplify notation by writing P(s).
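A minimal sketch of the bucketing just described follows; the data layout (a list of per-column signatures of length k = lr) is our assumption. Each of the l sub-matrices contributes one hash table keyed by the concatenation of r Min-Hash values, and co-bucketed columns become candidate pairs. With the signatures from the earlier sketch (k = 200), one could take, e.g., r = 10 and l = 20.

    from collections import defaultdict
    from itertools import combinations

    def mlsh_candidates(signatures, r, l):
        """Report column pairs hashed to the same bucket in at least one of the
        l sub-matrices; each bucket key is the tuple of r consecutive Min-Hash values."""
        candidates = set()
        for band in range(l):
            buckets = defaultdict(list)
            for col, sig in enumerate(signatures):
                buckets[tuple(sig[band * r:(band + 1) * r])].append(col)
            for cols_in_bucket in buckets.values():
                candidates.update(combinations(cols_in_bucket, 2))
        return candidates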
Lemma 2 Assume that columns ci and cj have similarity s, and let s* be the similarity threshold. For any 0 < ε, δ < 1, we can choose the parameters r and l such that:

  for any s ≥ (1 + ε)s*, P_{r,l}(ci, cj) ≥ 1 - δ;
  for any s ≤ (1 - ε)s*, P_{r,l}(ci, cj) ≤ δ.
Proof: By Proposition 1, the probability that columns ci and cj agree on a single Min-Hash value is exactly s, and the probability that they agree on a group of r values is s^r. If we repeat the hashing process l times, the probability that they hash at least once to the same bucket is

  P_{r,l}(ci, cj) = 1 - (1 - s^r)^l.

The lemma follows from the properties of the function P.

Lemma 2 states that for large values of r and l, the function P approximates a unit step function translated to the point s*, which can be used to filter out all and only the pairs with similarity at most s*. On the other hand, the time/space requirements of the algorithm are proportional to k = lr, so increasing r and l is subject to a quality-efficiency trade-off. In practice, if we are willing to allow a certain number of false negatives (n-) and false positives (n+), we can determine the optimal values of r and l that achieve this quality. Specifically, assume that we are given (an estimate of) the similarity distribution of the data, defined by letting d(si) be the number of pairs having similarity si. This is not an unreasonable assumption, since we can approximate this distribution by sampling a small fraction of the columns and estimating all pairwise similarities. The expected number of false negatives would be Σ_{si ≥ s*} d(si)(1 - P(si)), and the expected number of false positives would be Σ_{si < s*} d(si) P(si).
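The trade-off in Lemma 2 can be explored numerically. The sketch below (the threshold, the toy similarity distribution, and the candidate (r, l) values are ours, chosen only for illustration) evaluates P_{r,l}(s) = 1 - (1 - s^r)^l and the two expected error counts for several parameter choices.

    def collision_prob(s, r, l):
        """P_{r,l}(s) = 1 - (1 - s^r)^l."""
        return 1.0 - (1.0 - s ** r) ** l

    def expected_errors(dist, s_star, r, l):
        """dist: similarity s_i -> number of pairs d(s_i).
        Returns (expected false negatives, expected false positives) at threshold s_star."""
        fn = sum(d * (1.0 - collision_prob(s, r, l)) for s, d in dist.items() if s >= s_star)
        fp = sum(d * collision_prob(s, r, l) for s, d in dist.items() if s < s_star)
        return fn, fp

    # Toy distribution: mostly dissimilar pairs, a handful of similar ones; s* = 0.5.
    dist = {0.05: 100000, 0.2: 5000, 0.6: 40, 0.8: 10}
    for r, l in [(5, 10), (10, 30), (15, 100)]:
        print((r, l), expected_errors(dist, 0.5, r, l))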