On Minimal Infrequent Itemset Mining

David J. Haglin and Anna M. Manning

Abstract—A new algorithm for minimal infrequent itemset mining is presented. Potential applications of finding infrequent itemsets include statistical disclosure risk assessment, bioinformatics, and fraud detection. This is the first algorithm designed specifically for finding these rare itemsets. Many itemset properties used implicitly in the algorithm are proved. The problem is shown to be NP-complete. Experimental results are then presented.

David J. Haglin is with the Department of Computer and Information Sciences, Minnesota State University, Mankato, MN 56001, USA ([email protected]), fax: +1 507-389-6376. Anna M. Manning is with the School of Computer Science, University of Manchester, Oxford Rd., Manchester, M13 9PL, UK ([email protected]), fax: +44 161-275-6204.

I. INTRODUCTION

Because of its importance to finding association rules, much attention has been given to the problem of finding itemsets that appear frequently in a dataset (cf. [1]–[4]). There have been hundreds of papers, as well as workshops at conferences (e.g., FIMI'03 and FIMI'04 at IEEE ICDM'03 and IEEE ICDM'04), devoted to this subject. The definition of frequent often includes the notion of an integer threshold parameter, τ, delineating those itemset patterns considered frequent (appearing at least τ times in the dataset) from those patterns considered infrequent. While it is possible to consider very small values of τ, the focus is most often placed on finding very frequent patterns.

Relatively less attention has been paid to infrequent itemsets. Yet they have many potential applications, including: 1) statistical disclosure risk assessment, where rare patterns in anonymized census data can lead to statistical disclosure; 2) bioinformatics, where rare patterns in microarray data may suggest genetic disorders; and 3) fraud detection, where rare patterns in financial or tax data may suggest unusual activity associated with fraudulent behavior.

In this paper we present a new algorithm for finding minimal infrequent patterns. This is the first algorithm designed specifically for finding minimal infrequent itemsets. It is based upon the SUDA2 algorithm developed for finding minimal unique itemsets (itemsets with no unique proper subsets) [8], [9]. We then show that the minimal infrequent itemset problem is NP-complete. Finally, experimental results are presented.

II. PROBLEM SPECIFICATION

Let I = {i1, i2, ..., iL} be a set of items. An itemset is a subset I ⊆ I. The cardinality of I, denoted by |I|, is the number of items in the itemset. As a shorthand, we will write c-itemset to mean an itemset of cardinality c. A dataset, D = {t1, t2, ..., tR}, is a collection of R transactions (sometimes called records) of the form ti = (i, Ti), where i is the transaction identifier (TID) and Ti ⊆ I. We denote by |D| the number of transactions in the dataset. Given an itemset I, a transaction T is said to contain I if I ⊆ T. The support set of an itemset I with respect to the dataset D is D(I) = {ti ∈ D : I ⊆ Ti}. The support of an itemset I in dataset D is the cardinality of its support set; that is, SuppD(I) = |D(I)|. The relative support of an itemset, defined as SuppD(I)/|D|, is a number between 0 and 1 inclusive.

Given a dataset D and an integer threshold τ, we say an itemset I is:
- τ-occurrent if |D(I)| = τ;
- τ-frequent if |D(I)| ≥ τ;
- τ-infrequent if |D(I)| < τ.

To describe an itemset as unique, we can either say it is 1-occurrent or 2-infrequent. In addition, we say an itemset is:
- minimal τ-occurrent if it is τ-occurrent and all of its proper subsets are (τ + 1)-frequent;
- minimal τ-infrequent if it is τ-infrequent and all of its proper subsets are τ-frequent;
- maximal τ-occurrent if it is τ-occurrent and all of its proper supersets are τ-infrequent; and
- maximal τ-frequent if it is τ-frequent and all of its proper supersets are τ-infrequent.

Since there are datasets known to produce exponentially many τ-frequent itemsets [7], many strategies to "compress" the output have been considered. One obvious strategy is to find only maximal τ-frequent itemsets. Similarly, for τ-infrequent itemsets, it may be enough to find only minimal τ-infrequent itemsets.

We assume that the input is given in binary matrix form, where the number of rows is R, the number of columns is L, and the entry at (x, y) is 1 if and only if the transaction with TID = x contains item y.

III. ALGORITHM MINIT

We recently introduced an algorithm called SUDA2 [8], [9], which finds minimal unique itemsets (MUIs) in datasets with different properties than those defined above. With certain parameter settings, both SUDA2 and our new algorithm should find MUIs. However, the input datasets differ enough to render comparing running times between these two algorithms meaningless.
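Before turning to the algorithm, the definitions of Section II can be made concrete with a small sketch (our own illustrative code, not part of the paper): support sets, support, and the three threshold predicates over a dataset of transactions represented as sets of items.

```python
# Sketch (our own notation): support and the tau-occurrent / tau-frequent /
# tau-infrequent predicates of Section II.

def support_set(dataset, itemset):
    """D(I): the transactions of the dataset that contain every item of I."""
    itemset = set(itemset)
    return [t for t in dataset if itemset <= t]

def support(dataset, itemset):
    """Supp_D(I) = |D(I)|."""
    return len(support_set(dataset, itemset))

def is_occurrent(dataset, itemset, tau):
    return support(dataset, itemset) == tau

def is_frequent(dataset, itemset, tau):
    return support(dataset, itemset) >= tau

def is_infrequent(dataset, itemset, tau):
    return support(dataset, itemset) < tau

# A unique itemset is 1-occurrent, or equivalently 2-infrequent.
D = [{1, 2}, {1, 3}, {2, 3}, {1, 2, 3}]
assert support(D, {1, 2}) == 2
assert is_occurrent(D, {1, 2, 3}, 1) and is_infrequent(D, {1, 2, 3}, 2)
```

The binary-matrix form assumed in the paper maps directly onto this representation: row x of the matrix corresponds to the set of items y with a 1 at (x, y).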

A. Dataset differences between MINIT and SUDA2

The easiest way to describe the differences in dataset properties is to consider the matrix form. For traditional itemset mining, the matrix consists of binary entries, but for SUDA2 the matrix entries can contain any integer. We can transform a SUDA2-type matrix into a binary matrix by enumerating all of the (column, value) pairs; for each such pair, a column is created in the transformed binary matrix. For every value in a column of the SUDA2-type input matrix, the corresponding location in the transformed binary matrix is given a one. For example, if the first column of the SUDA2-type matrix contains integers in the range 0 to 2, then the transformed matrix will have three "first columns," with a 1 in the specific column indicating the integer value in the SUDA2-type matrix. The essential difference between the SUDA2-type matrix and traditional binary matrix datasets is the added constraint that, among such collections of columns (e.g., the three columns corresponding to the first SUDA2-type column of 0-to-2 values), there is exactly one '1' value in every row and the rest are '0' values.

B. The MINIT algorithm

Our new algorithm adapts SUDA2 to handle the more traditional dataset definition and to find minimal τ-infrequent itemsets (MIIs). We call this adaptation MINIT, for MINimal Infrequent iTemsets. Initially, a ranking of the items is prepared by computing the support of each item and then creating a list of items in ascending order of support. Minimal τ-infrequent itemsets are discovered by considering each item ij in rank order, recursively calling MINIT on the support set of the dataset with respect to ij (considering only those items with higher rank than ij), and then checking each candidate MII against the original dataset.
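The transformation of Section III-A can be sketched in a few lines (a minimal illustration; the function name and list-of-lists representation are ours, not from the paper):

```python
# Sketch of the SUDA2-type to binary transformation: every (column, value)
# pair becomes one binary column, so each row has exactly one '1' among the
# columns derived from any given original column.

def to_binary(matrix):
    columns = list(zip(*matrix))
    # Enumerate the (column index, value) pairs in a fixed order.
    pairs = [(j, v) for j, col in enumerate(columns) for v in sorted(set(col))]
    binary = [[1 if row[j] == v else 0 for (j, v) in pairs] for row in matrix]
    return binary, pairs

suda2 = [
    [0, 7],
    [2, 7],
    [1, 9],
]
binary, pairs = to_binary(suda2)
# The first column takes values {0, 1, 2} -> three "first columns" in the output.
assert pairs == [(0, 0), (0, 1), (0, 2), (1, 7), (1, 9)]
assert binary == [[1, 0, 0, 1, 0],
                  [0, 0, 1, 1, 0],
                  [0, 1, 0, 0, 1]]
# Exactly one '1' per row within each original column's group of new columns.
assert all(sum(row[0:3]) == 1 and sum(row[3:5]) == 1 for row in binary)
```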
One mechanism that can be used to consider only higher-ranking items in the recursion is to maintain a "liveness" vector indicating which items remain viable at each level of the recursion. The initial call to the recursive algorithm presented in Algorithm 1 is MINIT(D, V[1:L], maxc). The liveness vector, V[1:L], is initialized to all true values and must be passed by value to lower levels of the recursion, as a unique copy of this vector is required at every node of the recursion tree. For those inputs that require prohibitively large running times, supplying a limit for maxc may still yield enough useful information within a reasonable amount of time; for "easier" datasets, setting maxc to L will find all MIIs.

A significant part of MINIT's computational effort is searching the dataset for the transactions that hold a specific item. To make this search fast, we pre-process the dataset by building a linked list of TIDs for each item; essentially, we pre-compute D({ij}) for 1 ≤ j ≤ L by creating linked lists of pointers to the transactions. We also arrange the items in ascending order of support, which is required for MINIT to work correctly. Note that some of the items may be discarded (considered non-viable) in this pre-processing phase, for reasons such as having support equal to |D| (an item appearing in every transaction cannot be part of any MII).

Algorithm 1 MINIT(D, V[1:L], maxc)
 1: Input: D = input dataset with N rows and L columns of binary numbers
 2: Input: V[1:L] = a boolean vector indicating the viability of each item
 3: Input: maxc = upper bound on the cardinality of MIIs to find in the search
 4: Returns: a listing of all MIIs for dataset D with cardinality at most maxc
 5: compute R ← list of all items in D in ascending order of support
 6: if maxc == 1 then /* stopping condition */
 7:     Return all items of R that appear less than τ times in D
 8: else
 9:     M ← ∅
10:     for each item ij ∈ R do
11:         Dj ← D({ij})
12:         V[ij] ← false
13:         Cj ← recursive call to MINIT(Dj, V, maxc − 1)
14:         for each candidate itemset I ∈ Cj do
15:             if I ∪ {ij} is an MII in D then
16:                 M ← M ∪ {I ∪ {ij}}
17:             end if
18:         end for
19:     end for
20:     Return M
21: end if

Whenever MINIT descends one level of the recursion, a new "sub-dataset" is built to represent the support set D({ij}). In the interest of memory efficiency, MINIT maintains only one copy of the dataset; a sub-dataset is represented by a linked list of the TIDs of its transactions. To support the recursion, the new sub-dataset must be pre-processed, which requires three tasks: 1) computing the support of each item, needed to produce a rank-ordering of the viable items by support within D({ij}); 2) determining the viability of each item (i.e., pruning some of the items from consideration); and 3) computing the support set of each viable item, giving a memory-efficient representation of the lists of TIDs for each support set D({ij, ik}), 1 ≤ k ≤ L, j ≠ k. Observe that the support set of {ik} within D({ij}) is the same as D({ij, ik}). In practice, MINIT can perform the first and third tasks concurrently to avoid two passes over the data: D({ij, ik}) can be computed and the size of that list is then used as

the support of ik in D({ij}). The second task can then be performed, and the data structures built for the support sets of the pruned items are discarded. When most of the items remain viable, this technique of computing all support sets and deriving supports from them is very effective; however, if many of the items are discarded as non-viable, building the support-set lists for these discarded items is wasted effort.

The statement at line 7 of Algorithm 1 can be modified to return all items of R that appear exactly τ times in D, which transforms MINIT into an algorithm that finds all minimal τ-occurrent rather than all minimal τ-infrequent itemsets.

IV. MINIMAL ITEMSET PROPERTIES

We present properties of MIIs that MINIT relies upon in order to correctly find all MIIs. These properties are adapted from those presented in [9] for the SUDA2 algorithm and its dataset characteristics. Consider a minimal τ-infrequent c-itemset I. By definition, I must have the following property.

Property 1 (Rareness Property): If itemset I is an MII, then SuppD(I) < τ.

We note that if an itemset is minimal τ-infrequent, then it must be minimal δ-occurrent for some δ < τ. However, a minimal δ-occurrent c-itemset I is not necessarily minimal τ-infrequent for all τ > δ; there may be some size-(c − 1) subset of I with support ε, for δ < ε < τ.

The following theorem shows that certain itemsets must exist within the dataset D in order for an MII to exist. We call the transactions holding those itemsets support rows.

Theorem 2 (Support Row Property): Given a minimal τ-infrequent c-itemset I = {i1, i2, ..., ic} with SuppD(I) = δ, δ < τ, for each 1 ≤ j ≤ c there must exist τ − δ support rows in D containing the itemset I − {ij} but not the item ij.

Proof: Suppose for some j there exist fewer than τ − δ rows containing I′ = I − {ij} and not ij. Then I′, which also appears in every row of D(I), has support SuppD(I′) < δ + (τ − δ) = τ, and therefore I is not minimal τ-infrequent.
Observe that a support row works for only one item in I; thus there are at least c(τ − δ) support rows. The existence of c(τ − δ) support rows in D is both a necessary and a sufficient condition for I to be minimal, as seen by the previous theorem and the following lemma.

Lemma 3: Given an itemset I = {i1, ..., ic} that is τ-infrequent in D with SuppD(I) = δ, δ < τ, and D has c(τ − δ) support rows containing the c subsets of I of size c − 1, then I is a minimal τ-infrequent itemset.

Proof: Suppose I is not minimal. Then there exists J ⊂ I that is also τ-infrequent within D. Clearly, from Theorem 2, |J| < c − 1. However, it must be true that J ⊂ I′, where I′ is one of the c subsets of I of size c − 1. Since J ⊂ I, J appears in the δ rows holding I. Moreover, since J ⊂ I′, Theorem 2 states that J must appear in the τ − δ support rows for I′. Thus, J appears in at least τ rows of D and so is not τ-infrequent.

This leads to the following observation on the minimum support required of a single item in order for it to appear in a minimal τ-infrequent c-itemset.

Theorem 4 (Minimum Support Property): Given a fixed τ and itemset cardinality c, an item i must have support SuppD({i}) ≥ c + τ − 2 in order for i to be part of a minimal τ-infrequent c-itemset I.

Proof: Let I be a minimal τ-infrequent itemset with |I| = c and SuppD(I) = δ. If i ∈ I, then i must appear in at least the δ reference rows of I and the (c − 1)(τ − δ) support rows for the c − 1 subsets of I that contain i and have cardinality c − 1. Thus,

SuppD({i}) ≥ δ + (c − 1)(τ − δ) = c(τ − δ) + (2δ − τ) = cτ − τ + 2δ − cδ.

Let δ = τ − r, for some integer r > 0. Then

SuppD({i}) ≥ cτ − τ + 2(τ − r) − c(τ − r) = τ + r(c − 2).

For fixed c > 2 and τ, this is minimized when r = 1 (i.e., δ = τ − 1), giving SuppD({i}) ≥ τ + c − 2.

The Minimum Support Property can give an efficient way to prune significant areas of the search space when several items have low support counts.

Corollary 5 (Uniform Support Property): Given a dataset D and an item i contained in every transaction of D, i cannot be contained in any minimal τ-infrequent itemset I.

Proof: If I is a minimal τ-infrequent itemset in D containing item i, then the Support Row Property ensures the existence of a row in D containing I − {i} and not i. Since i appears in every transaction of D, we have a contradiction.
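The Minimum Support Property can be checked by brute force on a small dataset (an illustrative sketch of ours, using the seven-transaction example dataset that appears later in Section VI; the helper names are not from the paper):

```python
# Brute-force check of the Minimum Support Property (Theorem 4): every item
# of a minimal tau-infrequent c-itemset has support at least c + tau - 2.
from itertools import combinations

def support(dataset, itemset):
    return sum(1 for t in dataset if set(itemset) <= t)

def minimal_infrequent(dataset, tau, items):
    """All itemsets with support < tau whose proper subsets all have support >= tau.
    Checking only the (c-1)-subsets suffices: support is anti-monotone."""
    result = []
    items = list(items)
    for c in range(1, len(items) + 1):
        for cand in combinations(items, c):
            if support(dataset, cand) < tau and all(
                support(dataset, sub) >= tau for sub in combinations(cand, c - 1)
            ):
                result.append(set(cand))
    return result

D = [{1, 2, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6},
     {1, 3, 4, 5, 6}, {1, 3, 6}, {1, 2, 5}]
tau = 3
for mii in minimal_infrequent(D, tau, range(1, 7)):
    c = len(mii)
    # Theorem 4: each member item has support >= c + tau - 2.
    assert all(support(D, {i}) >= c + tau - 2 for i in mii)
```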

V. RECURSIVE ITEMSET PROPERTIES

Given a dataset D and some item a (called an anchor item), we consider the itemset properties of D and D({a}) in light of the recursive Algorithm 1.

Lemma 6: Given I = {i1, ..., ic}, a minimal τ-infrequent itemset in D, for each anchor a = ij, 1 ≤ j ≤ c, the itemset Ia = I − {ij} is a minimal τ-infrequent itemset in D({a}). Moreover, |Ia| = |I| − 1.

Proof: Without loss of generality, fix j in the range 1 to c inclusive, and let a = ij. Since every row in D({a}) contains item a, the only way Ia could appear at least τ times in D({a}) is if I appeared in at least τ rows of D. Therefore, Ia is τ-infrequent in D({a}). Similarly, if Ia is not minimal τ-infrequent in D({a}), then there exists I′a ⊂ Ia that is also τ-infrequent in D({a}). But I′a ∪ {a} would then also be τ-infrequent in D and is a proper subset of I. Hence, Ia must be minimal τ-infrequent in D({a}).

Unfortunately, it is not the case that all minimal τ-infrequent itemsets in D({a}) lead to minimal τ-infrequent itemsets in the original dataset D. However, the following theorem provides a method for finding those that do.

Theorem 7: Given a dataset D, an item a, and Ia a minimal τ-infrequent itemset in D({a}) with SuppD({a})(Ia) = δ, the itemset I = Ia ∪ {a} is a minimal τ-infrequent itemset in D if and only if there exist τ − δ rows in D containing Ia but not containing item a.

Proof: If I is a minimal τ-infrequent itemset in D with SuppD(I) = δ, then the Support Row Property ensures the existence of τ − δ rows in D containing Ia but not containing item a. For the other direction, assume τ − δ such rows exist in D and that Ia is a minimal τ-infrequent itemset in D({a}). Note that SuppD(I) = δ. It remains to show that I has the requisite c(τ − δ) support rows. Each of the (c − 1)(τ − δ) support rows in D({a}), augmented with item a, forms a support row in D for the itemset I. As all of these (c − 1)(τ − δ) support rows contain item a, the only other support rows needed to ensure I is a minimal τ-infrequent itemset in D are the τ − δ rows stated in the theorem. Since I is τ-infrequent and c(τ − δ) support rows exist in D, I must be a minimal τ-infrequent itemset in D.

This recursive property of MIIs helps define the boundaries of the search space by providing a clear indication of the maximum cardinality of candidate MIIs.

VI. EXAMPLE

To help understand the recursive algorithm and its pruning techniques, we present the following example. Consider the input dataset shown in Table I(a). We will follow the discovery of the 2-occurrent itemset I = {2, 4, 5}, whose items have ranks 1, 2, and 4.

TABLE I
EXAMPLE DATASET

(a) Dataset
TID  Transaction
1    1, 2, 4, 5, 6
2    2, 3, 4, 5, 6
3    1, 2, 3, 4, 6
4    1, 2, 3, 5, 6
5    1, 3, 4, 5, 6
6    1, 3, 6
7    1, 2, 5

(b) Ranked items
Rank  Item  Support
1     4     4
2     2     5
3     3     5
4     5     5
5     1     6
6     6     6
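The ranking in Table I(b) can be reproduced directly from Table I(a); a quick sketch (our own code, with ties broken in ascending item order as in the table):

```python
# Reproduce Table I(b): item supports and the ascending rank order.
D = [{1, 2, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6},
     {1, 3, 4, 5, 6}, {1, 3, 6}, {1, 2, 5}]

supports = {i: sum(1 for t in D if i in t) for i in range(1, 7)}
assert supports == {1: 6, 2: 5, 3: 5, 4: 4, 5: 5, 6: 6}

# Ascending order of support; sorted() is stable, so ties among items
# 2, 3, 5 (support 5) keep ascending item order, matching the table.
ranked = sorted(supports, key=lambda i: supports[i])
assert ranked == [4, 2, 3, 5, 1, 6]   # item 4 has rank 1
```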

Algorithm 1 finds itemset I by first computing the rank order of items, as shown in Table I(b). Lemma 6 indicates that any itemset I that is τ-occurrent in D and has smallest item rank 1 (i.e., contains item 4) will have I − {4} as a τ-occurrent itemset in D({4}). So we compute D({4}), as shown in Table II(a). Note that we can ignore item 6 in D({4}) by Corollary 5. This brings us down one level in the recursion tree as we explore D({4}).

TABLE II
DATASET D({4})

(a) Dataset
TID  Transaction
1    1, 2, 4, 5, 6
2    2, 3, 4, 5, 6
3    1, 2, 3, 4, 6
5    1, 3, 4, 5, 6

(b) Ranked items
Rank  Item  Support
1     1     3
2     2     3
3     3     3
4     5     3
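This first level of the recursion can be replayed in a few lines (illustrative only): D({4}) contains TIDs 1, 2, 3, and 5, the viable items each have support 3 as in Table II(b), and item 6 is pruned by Corollary 5 because it appears in every transaction of D({4}).

```python
# Descend one level with anchor item 4 (rank 1) and check Table II.
D = {1: {1, 2, 4, 5, 6}, 2: {2, 3, 4, 5, 6}, 3: {1, 2, 3, 4, 6},
     4: {1, 2, 3, 5, 6}, 5: {1, 3, 4, 5, 6}, 6: {1, 3, 6}, 7: {1, 2, 5}}

D4 = {tid: t for tid, t in D.items() if 4 in t}   # the sub-dataset D({4})
assert sorted(D4) == [1, 2, 3, 5]

supports = {i: sum(1 for t in D4.values() if i in t) for i in (1, 2, 3, 5, 6)}
assert supports == {1: 3, 2: 3, 3: 3, 5: 3, 6: 4}
assert supports[6] == len(D4)   # uniform support: non-viable by Corollary 5
```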

As we enter Algorithm 1 recursively, we first construct a new ranked item list for the dataset at this recursion level (see Table II(b)). For the second iteration of the loop at line 10 of Algorithm 1, the anchor item will be 2. Descending the recursion tree for this anchor produces the dataset in Table III(a).

TABLE III
DATASET D({2, 4})

(a) Dataset
TID  Transaction
1    1, 2, 4, 5, 6
2    2, 3, 4, 5, 6
3    1, 2, 3, 4, 6

(b) Viable items
TID  Transaction
1    1, 5
2    3, 5
3    1, 3

Note that each of the viable items (1, 3, and 5) has support at or below our threshold τ = 2, so this recursion-tree node returns the itemset list {{1}, {3}, {5}} to the next higher recursion node. To determine whether {2, 5} is a 2-occurrent itemset in D({4}), we need only find sufficient support rows, as described in Theorem 7. Observe that TID 5 in Table II(a) contains item 5 but does not contain item 2. This one support row, together with the support of 2 for item 5 in D({2, 4}), is enough to conclude that {2, 5} is indeed a 2-occurrent itemset in D({4}). It will therefore be included in the collection of itemsets passed up to the parent node of the recursion tree.

At the root node of the recursion tree, the candidate itemset {2, 5} is merged with item {4}. This candidate, {2, 4, 5}, is then checked for qualification as a 2-occurrent itemset in D using Theorem 7. Since {2, 5} has support 2 in D({4}), one support row in D is sufficient; that support row is TID 4. Since this is the top level of the recursion, we conclude that {2, 4, 5} is a 2-occurrent itemset in D.

We note that for Algorithm 1 to find a minimal τ-infrequent c-itemset I, it must explore c levels of the recursion tree. Each level of the tree corresponds to one of the items in I. The item associated with the bottom level of the tree has support, in that bottom-level dataset, equal to the support of I in the original dataset D; in fact, this support remains the same at each level of the recursion tree.
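The outcome of this walkthrough can be verified by brute force (a check of the result against the definitions of Section II, not the authors' implementation):

```python
# Confirm that {2, 4, 5} is a minimal 2-occurrent itemset of the example
# dataset: its support is exactly 2, and every proper subset is 3-frequent.
from itertools import combinations

D = [{1, 2, 4, 5, 6}, {2, 3, 4, 5, 6}, {1, 2, 3, 4, 6}, {1, 2, 3, 5, 6},
     {1, 3, 4, 5, 6}, {1, 3, 6}, {1, 2, 5}]

def supp(itemset):
    return sum(1 for t in D if set(itemset) <= t)

def minimal_occurrent(tau):
    """Itemsets with support exactly tau whose proper subsets all have
    support >= tau + 1 (the (c-1)-subsets suffice by anti-monotonicity)."""
    found = []
    for c in range(1, 7):
        for cand in combinations(range(1, 7), c):
            if supp(cand) == tau and all(
                supp(sub) >= tau + 1 for sub in combinations(cand, c - 1)
            ):
                found.append(set(cand))
    return found

assert {2, 4, 5} in minimal_occurrent(2)
assert supp({2, 4, 5}) == 2          # contained in TIDs 1 and 2 only
```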

VII. COMPLEXITY OF MINIMAL τ-OCCURRENT ITEMSET MINING

A. Variations of the problem

For a problem such as finding minimal τ-occurrent itemsets, there are variations that have important implications for the complexity of the problem. We consider the following problem variations in increasing order of computational difficulty:
1) The simplest form is a decision problem, where the objective is, for a given input dataset, to determine whether there exists any minimal τ-occurrent itemset and merely answer "yes" or "no".
2) The next harder problem is a search problem, where the objective is, for a given input dataset, to find one (any) minimal τ-occurrent itemset and print out the solution. There are two sub-variations of the search version: (i) find any solution for a specific record in the input dataset; and (ii) find any solution in any record in the input dataset. The computationally easier of the two is the less restrictive sub-variation (ii).
3) The objective of the counting version is, for a given input dataset, to determine the number of minimal τ-occurrent itemsets and print out that number. Even though there may be an exponential number of minimal τ-occurrent itemsets in a given dataset, it may be possible to "count" them in polynomial time and print out a polynomial-size representation of the exponential count (e.g., in binary).
4) The most computationally challenging variation is, for a given input dataset, to find and print out all of the minimal τ-occurrent itemsets.

Some attention has recently been given to finding rare patterns in datasets [6], [8], [9]. From this perspective, it makes sense to search for minimal τ-occurrent itemsets. A special case of this problem is to set τ = 1, meaning only minimal unique itemsets are sought. Yang [7] provides a nice complexity analysis of the four variations of the maximal τ-occurrent problem.
The counting variation of the maximal τ-occurrent problem is #P-complete [7], whereas searching for a single solution is possible in polynomial time. We show that by seeking minimal rather than maximal τ-occurrent itemsets, even the simplest variation, the decision version, is NP-complete.

B. Computational complexity of minimal τ-occurrent itemsets

Our proof is based on a proof presented by Daishin Nakamura in [10], which addresses only the variation of searching a specific record for a 1-occurrent itemset (i.e., a minimal unique itemset). The proof is by reduction from the Hitting Set problem (see [11] for NP-completeness proof techniques). An instance of the Hitting Set problem, H = (p, C, k), is defined as follows: given a collection C = {C1, ..., Cq} of subsets of a finite set S = {1, ..., p} and a positive integer k ≤ p, determine whether there exists a subset S′ ⊂ S with |S′| ≤ k such that S′ contains at least one element from each set of C.

Theorem 8: Given a dataset and a fixed constant τ ≥ 1, determining whether there exists any minimal τ-occurrent itemset in the dataset is NP-complete.

Proof: Given an instance of the Hitting Set problem H = (p, C, k), construct the q × p matrix M = (xi,j), where xi,j = 1 if j ∈ Ci and xi,j = 0 otherwise. Observe that every subset of the columns of M corresponds to a subset S′ ⊂ S of the Hitting Set instance. Moreover, S′ is a hitting set in H if and only if the subset of columns whose index numbers are in S′ induces a matrix projection with at least one 1-entry in every row; this can be seen as each row i of M corresponds to Ci in H.

Now denote by Z a (τ × p)-matrix of zeroes, and construct a dataset matrix D by stacking Z on top of τ + 1 copies of M:

D = [Z; M; M; . . . ; M].

Now find a minimal τ-occurrent itemset I in D. If I contained any 1-entries, then a record holding I would have to come from the M portion of D; however, since each row of M appears τ + 1 times in D, such an itemset could not be τ-occurrent. So I consists of only zeroes and "appears" in the first τ rows of zeroes in D. Let S′ be the set of columns associated with I. Since I is τ-occurrent, each row of M must contain a 1-entry in at least one of the columns of S′. This corresponds directly to a solution of H. Therefore, any algorithm for finding minimal τ-occurrent itemsets in a dataset, for any τ ≥ 1, can be used to solve the Hitting Set problem. (Membership in NP is immediate: a candidate c-itemset can be verified minimal τ-occurrent in polynomial time by checking its support and the supports of its c subsets of size c − 1.)

VIII. EXPERIMENTAL RESULTS

All of the experiments were run on Dual Core AMD Opteron Processor 270s running at 2 GHz with 8 GB of RAM. The datasets we use come from http://fimi.cs.helsinki.fi/data/. All of the datasets are in the proper format for MINIT.

A. Mushroom Data

The mushroom dataset contains 8124 transactions and 119 items in its inventory. This particular dataset does not challenge MINIT, as it can run within 5 seconds for any

Fig. 1. Mushroom Dataset (number of MIIs vs. threshold; delta-infrequent and delta-occurrent curves).

threshold 1 ≤ τ ≤ 8124, with no restriction on itemset cardinality. What we can see from this dataset is the number of minimal τ-infrequent and minimal τ-occurrent itemsets for varying values of the threshold τ (Figure 1). We expect the number of τ-occurrent itemsets to be less than the number of τ-infrequent itemsets; Figure 1 shows just how drastically different they are. Since the τ-infrequent itemsets include all of the τ-occurrent ones, the remainder of this section focuses only on computing minimal τ-infrequent itemsets.

B. Chess Data

The chess dataset contains 3196 transactions and 75 items in its inventory. Although this dataset is much smaller than the mushroom dataset, it presents significantly more of a computational challenge to MINIT.

We imposed limits on the cardinality of the τ-infrequent itemsets. For each maxc of 6, 7, and 8, we ran trials varying the threshold τ. The running time for maxc == 8 is shown in Figure 2. To see the growth pattern, the numbers of minimal τ-infrequent itemsets are shown for maxc of 6, 7, and 8 in Figure 3.

Fig. 2. Chess Dataset Computation Time (seconds vs. threshold, maxc == 8).

Fig. 3. Chess Dataset MII Counts (number of MIIs vs. threshold, for maxc == 6, 7, and 8).

C. T10I4D100K Data

The T10I4D100K dataset, generated by the software described in [12], contains 100000 transactions and 870 items in its inventory. It has an average of 10 items per transaction and an average support of 4 for each item.

We ran trials with no limit on the cardinality of the MIIs, varying the threshold from 1 to 100. It is interesting to see that the computation time (Figure 4) drops faster than the number of MIIs (Figure 5).

Fig. 4. T10I4D100K Dataset Computation Time (seconds vs. threshold).

D. Connect4 Data

The Connect-4 dataset contains all legal 8-ply positions in the game of Connect-4 in which neither player has won yet and in which the next move is not forced. There are 67557 transactions and 43 columns (one for each of the 42 Connect-4 squares, together with an outcome column: win, draw, or lose). Once this dataset is transformed into binary format it contains 129 items, since each cell in the original dataset holds one of three possible values. This dataset presents the most computational challenge to MINIT. We imposed a cardinality limit of maxc == 6.

Fig. 5. T10I4D100K Dataset MII Counts (number of MIIs vs. threshold).

The number of MIIs (Figure 7) declines substantially more slowly than the counts for the previous datasets. The running time (Figure 6) declines more slowly still, even relative to the MII counts.

Fig. 6. Connect4 Dataset Computation Time (seconds vs. threshold, maxc == 6).

Fig. 7. Connect4 Dataset MII Counts (number of MIIs vs. threshold, maxc == 6).

IX. CONCLUSIONS

We have presented a new algorithm, MINIT, for finding minimal τ-infrequent or minimal τ-occurrent itemsets. The computation times required on the four datasets presented suggest a correlation between the number of MIIs and the amount of computation time required. It would be interesting to see how well MINIT could run in a parallel or grid environment. It would also be useful to find other pruning strategies to improve the running time.

ACKNOWLEDGMENT

This work was supported in part by the National Science Foundation under grant CTS-0619641.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proceedings of the 1993 International Conference on Management of Data (SIGMOD '93), May 1993, pp. 207–216.
[2] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo, "Fast discovery of association rules," in Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. Menlo Park: The AAAI Press, 1996, pp. 307–328.
[3] S. Brin, R. Motwani, J. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data. New York: ACM Press, 1997, pp. 255–264.
[4] M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, "New algorithms for fast discovery of association rules," in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1997, pp. 283–286.
[5] E. Boros, V. Gurvich, L. Khachiyan, and K. Makino, "On the complexity of generating maximal frequent and minimal infrequent sets," in Symposium on Theoretical Aspects of Computer Science (STACS), 2002, pp. 133–141.
[6] D. Gunopulos, R. Khardon, H. Mannila, S. Saluja, H. Toivonen, and R. S. Sharma, "Discovering all most specific sentences," ACM Transactions on Database Systems, vol. 28, no. 2, pp. 140–174, 2003.
[7] G. Yang, "Computational aspects of mining maximal frequent patterns," Theoretical Computer Science, vol. 362, pp. 63–85, 2006.
[8] A. M. Manning and D. J. Haglin, "A new algorithm for finding minimal sample uniques for use in statistical disclosure assessment," in IEEE International Conference on Data Mining (ICDM '05), Nov. 2005, pp. 290–297.
[9] A. M. Manning, D. J. Haglin, and J. A. Keane, "A recursive search algorithm for statistical disclosure assessment," Data Mining and Knowledge Discovery, 2007, conditionally accepted.
[10] A. Takemura, "Minimum unsafe and maximum safe sets of variables for disclosure risk assessment of individual records in a microdata set," Journal of the Japan Statistical Society, vol. 32, no. 1, pp. 107–117, 2002.
[11] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Co., 1979.
[12] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules in large databases," in VLDB '94: Proceedings of the 20th International Conference on Very Large Data Bases. San Francisco: Morgan Kaufmann, 1994, pp. 487–499.