To Appear in Fundamenta Informaticae, a special issue on Statistical and Relational Learning in Bioinformatics (Pre-publication version) (2011?) 1–35


Efficient search methods for statistical dependency rules

Wilhelmiina Hämäläinen∗
Department of Biosciences
University of Eastern Finland
Kuopio, Finland
[email protected]

Abstract. Dependency analysis is one of the central problems in bioinformatics and all empirical science. In genetics, for example, an important problem is to find which gene alleles are mutually dependent or which alleles and diseases are dependent. In ecology, a similar problem is to find dependencies between different species or groups of species. In both cases a classical solution is to consider all pairwise dependencies between single attributes and evaluate the relationships with some statistical measure like the χ²-measure. It is known that the actual dependency structures can involve more attributes, but the existing computational methods are too inefficient for such an exhaustive search. In this paper, we introduce efficient search methods for positive dependencies of the form X → A with typical statistical measures. The efficiency is achieved by a special kind of branch-and-bound search which also prunes out redundant rules. Redundant attributes are especially harmful in dependency analysis, because they can blur the actual dependencies and even lead to erroneous conclusions. We consider two alternative definitions of redundancy: the classical one and a stricter one. We improve our previous algorithm for searching for the best strictly non-redundant dependency rules and introduce a totally new algorithm for searching for the best classically non-redundant rules. According to our experiments, both algorithms can prune the search space very efficiently, and in practice no minimum frequency thresholds are needed. This is an important benefit, because biological data sets are typically dense, and the alternative search methods would require too large minimum frequency thresholds for any practical purpose.

Address for correspondence: Wilhelmiina Hämäläinen, Department of Biosciences, University of Eastern Finland, Finland, [email protected]

∗ The Emil Aaltonen Fund (Emil Aaltosen säätiö) has supported this research with a personal grant.

Keywords: statistical dependence, redundancy, χ²-measure, z-score, search algorithms

1. Introduction

Dependency analysis is a central task in all empirical sciences. Medical scientists want to find factors which predispose to or prevent diseases; researchers in genetics are interested in which genes or gene groups are correlated; botanists search for plant associations and communities; environmental scientists try to analyze how human actions affect climate change; and so on.

Often the biological observations are represented in the form of occurrence data, lists of factors which occur in the observed sites or other objects. In botany, for example, the field observations are mostly lists of plant, animal, or fungus species which occur in the investigated sites. Each site can be represented by a binary vector, where value 1 in position i indicates that the ith species was observed, and value 0 that it was not observed. In addition, each site can be characterized by other attributes which tell whether the soil is acid or alkaline, wet or dry, etc. These can also be binarized by creating a new binary attribute for each feature. The soil moisture level, for example, could be represented by three binary attributes: wet soil, moist soil, and dry soil.

In this kind of data it is possible to search for dependencies of the form "if species (characteristics) A, B, and C occur in a site, then species (characteristic) D is more likely to occur than otherwise". This can be expressed as a dependency rule ABC → D. For example, in data on nemoral forests, we can find a dependency rule ash, oak, glossy buckthorn → apple rose, which tells that there is a positive dependency between the occurrence of the apple rose and the occurrences of ash, oak, and glossy buckthorn. However, the dependence does not yet mean that there is a causal relationship; it is possible that all of the mentioned species simply have the same requirements for the soil. On the other hand, it is known that ecosystems have complex relationships, where species can create suitable habitat conditions for other species (e.g. the litter and shade of trees), compete for the same resources (light, water, nutrients), or interact indirectly (e.g. through a mycorrhiza) [16]. The reasons for the observed dependencies are important research topics for botanists and ecologists, but the existence of dependencies is already valuable information in itself. The dependency rules can, for example, guide decisions on land restoration, protected sites, drainage, or fellings.

Genetics is another area where data is often in a binary form (presence or absence of gene alleles and diseases). In this context, an important question is which alleles or allele groups have either a positive or a negative dependency with inherited diseases. By analyzing such dependencies, several single alleles which are responsible for single-gene diseases have already been located. Maybe the best known example is the HBB gene, whose alleles give rise to a diversity of diseases, like sickle-cell anemia, but at the same time protect from malaria [14]. However, many serious and common diseases, like heart disease, diabetes, and cancers, are caused by complex mechanisms involving multiple genes and environmental factors [19, 18]. Analyzing these dependencies is a hot topic, but first one should develop efficient computational methods to handle data sets with dozens or even hundreds of thousands of attributes.
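To make the occurrence-data format concrete, the following minimal sketch shows one way such field observations could be binarized. The species names and soil categories are made-up illustrations, not taken from any data set used in this paper.

    # Illustrative only: encode one site's observations as a 0/1 attribute vector.
    SPECIES = ["ash", "oak", "glossy buckthorn", "apple rose"]
    MOISTURE = ["wet soil", "moist soil", "dry soil"]   # one binary attribute per moisture level
    ATTRIBUTES = SPECIES + MOISTURE

    def binarize(observed_species, moisture_level):
        """Return a 0/1 vector over ATTRIBUTES for one site."""
        row = [1 if s in observed_species else 0 for s in SPECIES]
        row += [1 if m == moisture_level else 0 for m in MOISTURE]
        return row

    # A site with ash and oak on moist soil:
    print(binarize({"ash", "oak"}, "moist soil"))   # [1, 1, 0, 0, 0, 1, 0]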
In both problems, the traditional solution is to test all suspected dependencies with a statistical test, like the χ²-test, and then select the dependencies which are sufficiently significant. The problem is that the number of all possible dependencies is so large that only some of them can be tested. In practice, one usually concentrates on the simplest dependencies, between just two attributes. However, this approach is far from satisfactory, because statistical dependence is not an antimonotonic property, meaning that two attribute sets can be mutually independent even if their supersets are dependent. For example, rules A → D, B → D, and C → D can all express independence (or even negative dependence), while ABC → D expresses strong positive dependence. According to current knowledge, this is common in biological data, like plant community data [8] and gene-disease data [19].

When the number of attributes is relatively small, it is possible to implement a brute-force search which tests all possible dependency rules. However, the problem soon becomes intractable when the number of attributes increases. For example, if we have 9 binary attributes, there are about a million possible dependencies to check. If the number of attributes is 15, there are already 15 · 10⁹ possible dependencies (see Appendix A). This so-called curse of dimensionality is especially painful in biological data sets, where the number of attributes can be in the thousands. In addition, in biological systems the attributes are often highly correlated, and human interpreters cannot check all discoveries. Fortunately, a large number of dependencies are redundant (adding no new information to already known dependencies) and can be pruned out. The only problem is to invent an efficient method for discovering all significant, non-redundant dependencies.

In this paper, we introduce two solutions for searching for the most significant non-redundant dependency rules of the form X → A. The condition part X can consist of any number of positive (1-valued) attributes and the consequence A is a single positive-valued attribute. This also means that the dependency between X and A is positive (and between X and ¬A negative). We consider two kinds of redundancy: a classical definition, where a rule is considered redundant if its condition part contains redundant attributes, and a strict definition, where rule X → A is considered redundant if there are more general but better dependencies in the subsets of XA. The search algorithms themselves apply to several statistical measures, but we concentrate on just two commonly used measures, the z-score and the χ²-measure.

In previous research [12, 11, 10], we have already introduced efficient search methods for non-redundant dependency rules in the strict sense using the z-score. In this paper, we introduce an important new pruning property, called the Lapis Philosophorum principle, which enables an efficient search for the classically non-redundant rules. The same idea can be applied to our previous algorithms, too, to improve their efficiency. Further improvement is achieved by searching for only the best dependency rules, instead of all rules with some minimum threshold for the measure function. We assume that in practice the user is interested in only the most significant dependencies, but the number of searched dependency rules is not restricted. Of course, the search is more efficient when only the 100 or 1000 best rules are searched for, instead of all 100 000 or a million rules which have a sufficiently large measure value. Apart from our own previous research, the problem is little studied.
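The failure of antimonotonicity mentioned above is easy to reproduce on synthetic data. The following sketch (not from the paper; a standard parity construction) builds a data set in which D is independent of A, B, and C separately but completely determined by the three together, so that only the rule ABC → D reveals the dependency.

    from itertools import product

    # Illustrative parity data: D = A XOR B XOR C, all eight ABC combinations equally common.
    rows = [(a, b, c, a ^ b ^ c) for a, b, c in product((0, 1), repeat=3)] * 100
    n = len(rows)

    def p(cond):
        """Empirical probability of a condition over the rows."""
        return sum(1 for r in rows if cond(r)) / n

    # Single attributes look independent of D: P(AD) = 0.25 = P(A)P(D) ...
    print(p(lambda r: r[0] == 1 and r[3] == 1), p(lambda r: r[0] == 1) * p(lambda r: r[3] == 1))
    # ... but ABC -> D is a strong positive dependency: P(ABCD) = 0.125 > P(ABC)P(D) = 0.0625.
    print(p(lambda r: r[:3] == (1, 1, 1) and r[3] == 1), p(lambda r: r[:3] == (1, 1, 1)) * p(lambda r: r[3] == 1))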
Most of the related previous research has concentrated on searching for association rules [2], where each rule X → A has to be sufficiently common, i.e. the rule frequency P(XA) should exceed some minimum frequency threshold min_fr. In addition, it is often required that the rule should be sufficiently strong, i.e. the rule confidence P(A|X) should also exceed some minimum confidence threshold min_cf. However, neither of these requirements guarantees that there is a positive dependency between X and A. Often X and A are totally independent, or there may even be a negative dependency, which the user is still likely to interpret as a positive dependency, with serious consequences. On the other hand, even if there is a positive dependency between X and A, the rule may still be spurious, i.e. the observed dependency is just due to chance and unlikely to hold in future data (see e.g. [24, 11]). Therefore, the discovered association rules should always be pruned in a post-processing phase, using a statistical goodness measure. Still, it is possible that the best dependency rules are totally missed, because the search requires relatively large minimum frequency thresholds. This is especially problematic in dense data sets like gene data (see e.g. [5]), where the required minimum frequency thresholds are always large.

A better solution is to directly search for rules which express dependencies and/or have a sufficiently good value for some statistical measure. Morishita et al. [20, 21] as well as Nijssen and Kok [23] introduced algorithms for searching for classification rules X → C for a fixed attribute C using the χ²-measure. Since no redundancy reduction was used, these approaches scaled poorly to dense data sets. Nijssen et al. [22] developed a more efficient solution for searching for the best classification rule with a closed set in the rule antecedent. The problem of this approach is that closed sets can contain redundant attributes and the rule can be sub-optimal in future data. Li [15] introduced an algorithm for searching for only non-redundant classification rules (in the classical sense) with different measures, including leverage and lift, which measure the strength of the dependency. We note that all of these approaches suit only the case where the consequent attribute is fixed. Webb's MagnumOpus software [26] is able to search, among other things, for general dependency rules X → A with different goodness measures, including leverage and lift. In addition, it is possible to test that the rules are significantly productive, i.e. that the improvement in the confidence of the rule with respect to more general rules is significant. However, the productivity is tested only against the immediate generalizations of the rule, and even redundant rules can pass the test. Antonie and Zaïane [4] introduced an algorithm for searching for both positive and negative dependency rules with Pearson's correlation coefficient, together with a minimum frequency threshold. Koh and Pears [13] suggested a heuristic algorithm, where a minimum frequency threshold was derived from Fisher's exact test. Unfortunately, the algorithm was based on the faulty assumption that statistical dependence would be an anti-monotonic property. However, the algorithm should be able to find positive and negative dependency rules between two attributes correctly.

Compared to the previous solutions, our new algorithms have the benefit that no minimum frequency thresholds or other restrictions are necessarily needed, even with large and dense data sets. However, the current implementations do not yet optimize the space requirement, and the algorithms could be implemented more compactly.

The rest of the paper is organized as follows: In Section 2, we define statistical dependency rules and consider suitable statistical measures. In Sections 3 and 4, we consider the notions of redundancy and minimality. Search strategies, including algorithms and important pruning properties, are described in Section 5. In the last two sections, we report some experimental results and draw the final conclusions.

2. Statistical dependency rules

In this section, we formalize the idea of dependency rules and consider two types of statistical measures which can be used to estimate the statistical significance of dependency rules. We do not go into the details of statistical significance testing (deciding whether the discovered rules are actually significant and at which level), because the focus is only on how to search for dependency rules with statistical measures efficiently.

Dependency rules are defined as follows:

Definition 2.1. (Dependency rule) Let R be a set of binary attributes. Rule X = x → A = a, where |X| = l, x ∈ {0,1}^l and a ∈ {0,1}, is a dependency rule, if P(X = x, A = a) ≠ P(X = x)P(A = a). The dependency is positive, if P(X = x, A = a) > P(X = x)P(A = a), and negative, if P(X = x, A = a) < P(X = x)P(A = a). Otherwise, the rule is called an independence rule.

In this paper we concentrate on rules where all attributes in X are positive (1-valued) and also a = 1. Therefore, the rule can be expressed simply by listing the attributes, X → A. The attribute-value assignment A = 0 is notated by ¬A and X = 0 (i.e. when A_i = 0 for some A_i ∈ X) by ¬X.

To decide how good (significant) the dependency is, we need a goodness measure. Here, we consider statistical goodness measures which are functions of the data size n and the frequencies P(XA), P(X), and P(A). From these parameters, we can also derive P(X¬A), P(¬XA), and P(¬X¬A), if needed. When the data size n is fixed, we will use the notation m(XA) = nP(XA) for the absolute frequency of XA. A general goodness measure is defined as follows:

Definition 2.2. (Goodness measure) Let f: ℝ³ × ℕ → ℝ be a measure function whose parameters are the rule frequency fr = P(XA), the frequency of the condition part X, fr_X = P(X), the frequency of the consequence A, fr_A = P(A), and the size of the data set, n. If high values of f(fr, fr_X, fr_A, n) indicate good dependency rules, then we say that f is increasing by goodness, and if low values of f(fr, fr_X, fr_A, n) indicate good rules, it is called decreasing by goodness. Goodness measure M is a function which maps the rule to its goodness value: M(X → A) = f(P(XA), P(X), P(A), n). If f is increasing (decreasing) by goodness, then M is also increasing (decreasing) by goodness.

Generally, the statistical goodness measures for dependencies try to estimate – either directly or indirectly – the probability that the observed dependency would have occurred by chance, if X and A were actually independent. Usually, this happens by comparing the observed frequencies P(X = i, A = j), i, j ∈ {0,1}, to the expected frequencies P(X = i)P(A = j) under the assumption of independence. The measures can be divided into two categories, depending on whether only the relation between P(XA) and P(X)P(A) is considered or whether also the relations between P(X¬A) and P(X)P(¬A), P(¬XA) and P(¬X)P(A), and P(¬X¬A) and P(¬X)P(¬A) are taken into account. The first category of measures is used in the value-based interpretation of dependency, while the second category is used in the variable-based interpretation of dependency [6]. The z-score is an example of a value-based measure, while the χ²-measure is a variable-based measure.

The z-score is defined as

    z(X, A) = \frac{m(XA) − nP(X)P(A)}{\sqrt{nP(X)P(A)(1 − P(X)P(A))}} = \frac{\sqrt{m(XA)}\,(γ(X, A) − 1)}{\sqrt{γ(X, A) − P(XA)}},        (1)

where

    γ(X, A) = \frac{P(XA)}{P(X)P(A)}

is the lift (or interest) of the rule. The z-score measures how many standard deviations the observed absolute frequency, m(XA), deviates from its expected value, nP(X)P(A), under the assumption of independence. The z-score can be derived from the binomial probability using the normal approximation, and therefore we can also approximate the probability of observing frequency m(XA) or higher in a data set of size n, if X and A were actually independent (see e.g. [11]). However, when the expected frequency nP(X)P(A) is low (as a rule of thumb, less than 5 or 10), the binomial distribution is positively skewed. This means that the z-score overestimates the significance. Therefore, we will not use the normal approximation to estimate the p-values, but the z-score is used only as a goodness measure (ranking function).

The χ²-measure is defined as

    χ²(X, A) = \frac{n(P(XA) − P(X)P(A))²}{P(X)P(A)} + \frac{n(P(X¬A) − P(X)P(¬A))²}{P(X)P(¬A)} + \frac{n(P(¬XA) − P(¬X)P(A))²}{P(¬X)P(A)} + \frac{n(P(¬X¬A) − P(¬X)P(¬A))²}{P(¬X)P(¬A)}
             = z²(X, A) + z²(X, ¬A) + z²(¬X, A) + z²(¬X, ¬A).

This can be represented in a simpler form

    χ²(X, A) = \frac{n(P(XA) − P(X)P(A))²}{P(X)P(¬X)P(A)P(¬A)} = \frac{nP(X)P(A)(γ(X, A) − 1)²}{(1 − P(X))(1 − P(A))} = n(γ(X, A) − 1)(γ(¬X, ¬A) − 1).        (2)
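As a concrete reading of Equations 1 and 2, the following minimal sketch computes the lift, the z-score, and the χ²-measure from the three frequencies and the data size. It is only an illustration of the formulas above, not the implementation used in the paper.

    from math import sqrt

    def lift(p_xa, p_x, p_a):
        return p_xa / (p_x * p_a)

    def z_score(p_xa, p_x, p_a, n):
        # Equation 1: deviation of m(XA) from its expected value, in standard deviations.
        return (n * p_xa - n * p_x * p_a) / sqrt(n * p_x * p_a * (1.0 - p_x * p_a))

    def chi2(p_xa, p_x, p_a, n):
        # Equation 2: the simpler (variable-based) form of the chi^2-measure.
        return n * (p_xa - p_x * p_a) ** 2 / (p_x * (1 - p_x) * p_a * (1 - p_a))

    # Example: P(X)=0.2, P(A)=0.2, P(XA)=0.2 in n=100 rows (a confidence-1 rule).
    print(lift(0.2, 0.2, 0.2), z_score(0.2, 0.2, 0.2, 100), chi2(0.2, 0.2, 0.2, 100))
    # lift 5.0, z about 8.2, chi2 100.0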

The χ²-measure can also be used to estimate the probability that the observed or a stronger dependency occurs between X and A in a data set of size n, assuming that X and A were independent. The approximation is quite accurate when all the expected absolute frequencies nP(X = i)P(A = j), i, j ∈ {0,1}, are sufficiently high (once again, as a rule of thumb, at least 5). When the expected counts are smaller, Fisher's exact test should be used instead.

Since both the z-score and the χ²-measure assume sufficiently high frequencies, it is possible that the estimates for the significance of rules with low frequencies are incorrect. In dense data sets this happens rarely, but if the frequencies are small, it is recommended to test the discoveries afterwards with exact tests. Another solution is to search for dependency rules directly with the binomial probability or Fisher's p-value, but both measures are quite impractical for massive search. In the future, our aim is to design techniques which begin the search with either the z-score or the χ²-measure, but then change to exact tests when the frequencies become too low.

For search purposes, we will need a certain kind of upper bound for the goodness measure M. The problem is that, given an attribute set X, we should estimate the maximal possible M-value for any rule X \ {A}Q → A, when A ∈ X, or rule XQ \ {A} → A, when A ∉ X, for any attribute set Q ⊆ R \ X. We know that the frequency m(XQ) ≤ m(X), and P(A) is known. So, ideally, the upper bound should be a monotonic function of P(X) (or P(X \ {A})) and P(A). Let us first consider the z-score.

Theorem 2.1. Let R be as before, A ∈ R, P(A) > 0, and X ⊆ R. Then for any Q ⊆ R \ X we have an upper bound

    z((XQ) \ {A} → A) ≤ \frac{\sqrt{m(X)}\,(1 − P(A))}{\sqrt{P(A)(1 − P(A)P(X))}}.

Proof: Let us denote the lift by γ = γ((XQ) \ {A} → A). The maximal value for the lift is P(A)⁻¹, because for any γ(Y, A) it holds that

    γ(Y, A) ≤ min{\frac{1}{P(Y)}, \frac{1}{P(A)}} ≤ \frac{1}{P(A)}.

If A ∈ X, then

    γ(X \ {A}Q, A) ≤ min{\frac{1}{P(X \ {A}Q)}, \frac{1}{P(A)}} ≤ \frac{1}{P(A)},

and if A ∉ X, then

    γ(XQ \ {A}, A) ≤ min{\frac{1}{P(XQ \ {A})}, \frac{1}{P(A)}} ≤ \frac{1}{P(A)}.

In addition, the frequency of both rules is P(XQ) ≤ P(X). From Equation 1 we see that the z-score is an increasing function of frequency and lift (to verify this, one can simply differentiate the function with respect to frequency and lift). Therefore, an upper bound for the z-score is

    \frac{\sqrt{m(X)}\,(UB(γ) − 1)}{\sqrt{UB(γ) − P(X)}} = \frac{\sqrt{m(X)}\,(1 − P(A))}{\sqrt{P(A)(1 − P(A)P(X))}}.                    □

For the χ²-measure, Morishita and Sese [21] have proved the following upper bound for the case A ∈ X (assuming that P(X) < 1, 0 < P(A) < 1, and P(Y¬A) < 1):

    χ²(X \ {A}Q → A) ≤ max{\frac{m(X)(1 − P(A))}{(1 − P(X))P(A)}, \frac{m(Y¬A)P(A)}{(1 − P(Y¬A))(1 − P(A))}},

where Y = X \ {A}. When we search for only positive dependencies, the upper bound is

    χ²(X \ {A}Q → A) ≤ \frac{m(X)(1 − P(A))}{(1 − P(X))P(A)}.

We can easily show that the same upper bound holds also for rules XQ \ {A} → A (i.e. when A ∉ X).

Theorem 2.2. Let R be as before, A ∈ R \ X, P(A) > 0, and X ⊆ R, P(X) < 1. Then for any Q ⊆ R \ X it holds that

    χ²(XQ \ {A} → A) ≤ \frac{m(X)(1 − P(A))}{(1 − P(X))P(A)}.

Proof: From Equation 2 we see that the χ²-measure is an increasing function of frequency and lift, when positive dependencies are considered (γ > 1). Because γ ≤ P(A)⁻¹ and P(XQ \ {A}) ≤ P(X), we have

    χ²(XQ \ {A} → A) = \frac{nP(XQ \ {A})P(A)(γ − 1)²}{(1 − P(XQ \ {A}))(1 − P(A))} ≤ \frac{m(XQ \ {A})(1 − P(A))}{(1 − P(XQ \ {A}))P(A)} ≤ \frac{m(X)(1 − P(A))}{(1 − P(X))P(A)}.                    □

We notice that in both cases, A ∈ X and A ∉ X, the upper bound is obtained at the same point, where P(XQ) = P(X) and γ = P(A)⁻¹, i.e. when both the frequency and the lift are the maximal possible. In the following, we will notate this upper bound by UB(M((XQ) \ {A} → A)) = f(m(X), m(X), m(A), n), where measure M is either z or χ². When the measure is given, the upper bound function f(m(X), m(X), m(A), n) is notated by UB(z) or UB(χ²).

Sometimes we will also need an upper bound for the best rule which can be derived from set XQ, for a given set X and any Q ⊆ R \ X. Since both upper bounds, UB(z) and UB(χ²), are decreasing functions of P(A), A ∈ XQ, the maximal UB is achieved when the consequent attribute has the lowest frequency. If we expand all sets X by adding new attributes in increasing order of their frequency, then the largest UB is achieved by the consequence A_min = arg min{P(A_i) | A_i ∈ X}. In the following, A_min is called the minimal attribute of set X, denoted by mina(X).
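The following small sketch restates the upper bounds of Theorems 2.1 and 2.2 and the choice of the minimal attribute in code form. It is a reading aid under the paper's notation, not the authors' implementation.

    from math import sqrt

    def ub_chi2(m_x, p_x, p_a):
        """Upper bound for chi^2 of any rule with condition in XQ and consequence A (Theorem 2.2)."""
        return m_x * (1.0 - p_a) / ((1.0 - p_x) * p_a)

    def ub_z(m_x, p_x, p_a):
        """Upper bound for the z-score of the same rules (Theorem 2.1)."""
        return sqrt(m_x) * (1.0 - p_a) / sqrt(p_a * (1.0 - p_a * p_x))

    def mina(attr_freqs):
        """Minimal attribute of a set: the member attribute with the lowest frequency."""
        return min(attr_freqs, key=attr_freqs.get)

    # Example with the data of Section 5.5: X = {A5, A3, A2}, m(X) = 15, P(X) = 0.15, P(A5) = 0.2.
    print(round(ub_chi2(15, 0.15, 0.2), 1))              # 70.6, as used in the worked example
    print(mina({"A5": 0.20, "A3": 0.45, "A2": 0.55}))    # A5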

3. Redundancy

An important task in all rule discovery is to identify redundant rules, which add no new information to the remaining rules. When the objective is to find statistical dependencies, independent attributes do not add any new information on the dependency. In fact, they can rather blur the interpretation of dependencies. For example, if people suffering from disease A are more likely to have a gene allele B than healthy people, then there is a positive dependency between A and B. In addition, there are a lot of gene alleles which have no effect on disease A or the occurrence of B. For any such independent allele C we can construct a dependency rule BC → A, which is equally strong as the original rule B → A, but the conclusions could be quite different. Now one could assume that the alleles B and C together cause the disease, and consider only people with both alleles B and C as a potential risk group. An even more serious error could happen if C actually prevented the disease (A and C were negatively dependent), but then the dependency would also be weaker than in the original rule.

Generally, adding new attributes to the antecedent of a dependency rule can 1) have no effect, 2) weaken the dependency, or 3) strengthen the dependency. When the task is to search for only positive dependencies, the first two cases can only add redundancy. The third case is more difficult to judge, because now the extra attributes make the dependency stronger, but at the same time the rule usually becomes less frequent. Therefore, we need some goodness measure M to judge whether the extra attributes made the rule better or worse.

Another kind of redundancy occurs when essentially the same information is represented by several rules. For example, let us suppose that A depends on BC, but all single attributes A, B, and C are mutually independent. Now also B depends on AC and C depends on AB, and all rules BC → A, AC → B, and AB → C have both equal lift and equal frequency. The same holds for any set of attributes X whose subsets contain no dependencies. If |X| = l, we get l dependency rules of the form X \ {A_i} → A_i, A_i ∈ X, which are all equally good in terms of frequency and lift. Since all of them describe the same underlying dependency structure, it is hardly interesting to list all of them. The situation is different when the subsets of X already contain dependencies. Now the lift values of rules X \ {A_i}A → A_i are generally unequal, but still all of them increase compared to the previous-level rules X \ {A_i} → A_i, if A is positively dependent on X. Since the number of rules is typically very large (even millions of rules), it is often sufficient to report only the strongest dependency in each attribute set. This can also reduce the computational effort remarkably, and in practice we can discover rules which would otherwise remain unknown.

Based on these considerations, we propose two definitions of redundancy. We assume that the goodness measure M is increasing, but if it happens to be decreasing, we can always construct a new increasing measure, e.g. by taking a reverse function. In the following definitions, a proper subset (Y ⊆ X and Y ≠ X) is notated by Y ⊊ X.

Definition 3.1. (Redundant and non-redundant rules (classical)) Let M be an increasing goodness measure. Rule X → A is redundant, if there exists a rule Y → A such that Y ⊊ X and M(Y → A) ≥ M(X → A). Otherwise, rule X → A is non-redundant.

Definition 3.2. (Redundant and non-redundant rules (strict)) Let M be an increasing goodness measure. Rule X → A is redundant, if there exists a rule Y → B such that (i) Y ∪ {B} ⊊ X ∪ {A} and M(Y → B) ≥ M(X → A), or (ii) Y ∪ {B} = X ∪ {A} and M(Y → B) > M(X → A). Otherwise, rule X → A is non-redundant.

The first definition is called "classical", because it is often used in previous research (e.g. [1]). The second definition, introduced in [12], is called "strict", because it states stricter conditions for non-redundancy. For example, rule ABC → D can be redundant in the classical sense with respect to AB → D and C → D. However, it cannot be redundant with respect to ACD → B. On the other hand, ABC → D can be redundant in the stricter sense with respect to all of these rules, as well as B → A and AD → C. Generally, all rules which are redundant in the classical sense are also redundant in the strict sense, and the non-redundant rules in the strict sense are a subset of the non-redundant rules in the classical sense. The reason is that rule X → A can be non-redundant in the strict sense only if the rule itself and all of its permutation rules X \ {A_i}A → A_i are non-redundant in the classical sense. So, intuitively, the strict definition searches for attribute sets all of whose rules are good, while the classical definition searches for individual rules which are good.

When the strict definition is used, just one rule – the best one – can be non-redundant in any set X. However, the classical definition allows that all permutation rules X \ {A_i} → A_i can be non-redundant. If the goodness measure is based on just frequency and lift (and possibly the data size n), then both rules A → B and B → A will be selected, even if they add no information to each other. The same happens also in larger sets X, |X| > 2, if X's subsets do not contain any dependencies.
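A minimal sketch of how Definitions 3.1 and 3.2 could be applied as a post-filter over already-scored rules (antecedents as frozensets, larger M better). It only illustrates the two definitions; it is not the pruning mechanism used inside the search algorithms.

    def classically_redundant(rule, rules):
        """Definition 3.1: some Y -> A with Y a proper subset of X scores at least as high."""
        (x, a), m = rule
        return any(y < x and b == a and m2 >= m for (y, b), m2 in rules.items())

    def strictly_redundant(rule, rules):
        """Definition 3.2: compare against all rules drawn from subsets of X union {A}."""
        (x, a), m = rule
        xa = x | {a}
        for (y, b), m2 in rules.items():
            yb = y | {b}
            if (yb < xa and m2 >= m) or (yb == xa and m2 > m):
                return True
        return False

    # Tiny hypothetical rule collection: {(antecedent, consequent): M-value}
    rules = {(frozenset({"B"}), "A"): 9.0,
             (frozenset({"B", "C"}), "A"): 9.0,
             (frozenset({"A", "C"}), "B"): 8.5}
    r = ((frozenset({"B", "C"}), "A"), 9.0)
    print(classically_redundant(r, rules), strictly_redundant(r, rules))   # True True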


4. Minimality

Non-redundant rules can be further classified as minimal or non-minimal:

Definition 4.1. (Minimal rule (classical)) Non-redundant rule X → A is minimal, if for all rules Z → A such that X ⊊ Z, M(X → A) ≥ M(Z → A). Otherwise it is non-minimal.

Definition 4.2. (Minimal rule (strict)) Non-redundant rule X → A is minimal, if for all rules Z → C such that X ∪ {A} ⊊ Z ∪ {C}, M(X → A) ≥ M(Z → C). Otherwise it is non-minimal.

In other words, a minimal rule is better than any of its specializations ("child rules") or (being non-redundant) generalizations ("parent rules"). At the algorithmic level this means that we can stop the search without checking any more specific rules, once we have ensured that the rule is minimal. The following theorems give sufficient conditions for minimality in both cases, when the measure function is either the z-score or the χ²-measure.

Theorem 4.1. Let X → A be a rule having P(A|X) = 1 and let the measure function be either z or χ². Then X → A is minimal in the classical sense.

Proof: It is enough to show that for any Q ⊆ R \ (X ∪ {A}), M(XQ → A) ≤ M(X → A). The proof follows directly from the upper bounds:

(i) When M = z:

    z(X → A) = \frac{\sqrt{m(X)}\,(1 − P(A))}{\sqrt{P(A)(1 − P(X)P(A))}} ≥ \frac{\sqrt{m(XQ)}\,(1 − P(A))}{\sqrt{P(A)(1 − P(XQ)P(A))}} ≥ z(XQ → A).

(ii) When M = χ²:

    χ²(X → A) = \frac{m(X)(1 − P(A))}{(1 − P(X))P(A)} ≥ \frac{m(XQ)(1 − P(A))}{(1 − P(XQ))P(A)} ≥ χ²(XQ → A).                    □

Theorem 4.2. Let X → A be a rule having P(A|X) = 1 and let the measure function be either z or χ². Then X → A is minimal in the strict sense, if A = mina(XQ) for any Q ⊆ R \ X.

Proof: The previous proof already showed that for all Q ⊆ R \ (X ∪ {A}), M(XQ → A) ≤ M(X → A). Therefore, it is enough to consider rules of the form 1) X \ {B}QA → B (B ∈ X) and 2) XQ \ {B}A → B (B ∉ X). For both of them we can use the same upper bounds, and therefore it is enough to consider just one of them.

(i) When M = z:

    z(X \ {B}QA → B) ≤ \frac{\sqrt{m(XQA)}\,(1 − P(B))}{\sqrt{P(B)(1 − P(XQA)P(B))}} ≤ \frac{\sqrt{m(X)}\,(1 − P(A))}{\sqrt{P(A)(1 − P(X)P(A))}},

because m(XQA) ≤ m(X) and P(B) ≥ P(A).

(ii) When M = χ²:

    χ²(X \ {B}QA → B) ≤ \frac{m(XQA)(1 − P(B))}{(1 − P(XQA))P(B)} ≤ \frac{m(X)(1 − P(A))}{(1 − P(X))P(A)},

because m(XQA) ≤ m(X) and P(B) ≥ P(A).                    □


We note that the theorems give only sufficient, but not necessary, conditions for minimality. It is possible that a rule is minimal even if its confidence is less than 1 (and the lift is not the maximal possible). This can occur when the remaining attributes R \ (X ∪ {A}) are either negatively dependent on XA or independent of it. The second note concerns how a minimal rule affects the search. When the rules are searched for with the strict definition, we can safely prune out all supersets of XA, given that rule X → A was minimal in the strict sense. However, when we search for the rules with the classical definition, we can merely rule out some of the possible consequences, namely A itself and all B ∉ X.

5. Search strategies

In this section, we consider the general search strategies and introduce algorithms for searching for non-redundant dependency rules either in the classical or the strict sense. We introduce a new efficient pruning property, which complements the common branch-and-bound search.

5.1. The basic strategy

The whole search space for dependency rules on attributes R can be represented by a complete enumeration tree. A complete enumeration tree lists all possible attribute sets in P(R). From each set X ∈ P(R), we can generate rules X \ {A_i} → A_i, A_i ∈ X, and, therefore, a complete enumeration tree also implicitly represents the set of all possible dependency rules. Figure 1 shows two examples, when R = {A, ..., E}. If some attribute sets do not occur in the data, the corresponding paths are also missing from the enumeration tree.

Figure 1. Two alternative enumeration trees on attributes A, ..., E. Arrows show the parent nodes of set ACD.

In practice, the enumeration tree can be generated – either level by level or branch by branch – as the search proceeds. However, it is not necessary to generate the enumeration tree explicitly (as a data structure where the attribute sets are stored), but the search usually still proceeds in the same manner. For example, in the classical breadth-first search (like the Apriori algorithm for frequent sets [3, 17]) one first checks all sets containing a single attribute (1-sets), then all sets containing two attributes (2-sets), etc. This corresponds to traversing an enumeration tree level by level, from top to bottom. In our algorithms, the enumeration tree is actually generated, because we need a storage structure anyway to keep track of the possible consequences and the M-values of the best rules in the previous-level sets. When the attributes are ordered, the parent nodes of any node (see the example in Figure 1) are easily located, e.g. with binary search.

Since the size of the search space (the complete enumeration tree) is in the worst case exponential, the main problem is how to traverse as small an enumeration tree as possible without losing interesting dependency rules. The main strategy is an application of the branch-and-bound method, where new subtrees are generated only if they can contain sufficiently good, non-redundant rules with respect to the already discovered rules. For this purpose we need an upper bound for the goodness measure M, which defines the maximal possible M-value for any rule (XQ) \ {A} → A (i.e. X \ {A}Q → A or XQ \ {A} → A), given the information which is available in set X.

If we search for non-redundant dependency rules in the strict sense, then it is enough to estimate just one upper bound, ub = UB(M(BestRule(XQ))), where BestRule(XQ) is the best rule which can be derived from XQ. If ub < min_M, for the current minimum threshold min_M, then no rule in XQ can be interesting. On the other hand, if ub ≤ max{M(BestRule(Y)) | Y ⊊ X}, then all rules in XQ would be redundant in the strict sense. In both cases, set XQ can be pruned out without checking.

When non-redundant rules are searched for in the classical sense, we have to estimate the upper bounds for all possible consequences. Let now ub = UB(M((XQ) \ {A} → A)). If ub < min_M, then attribute A is not a possible consequence for any significant rule in X or its supersets. On the other hand, if ub ≤ max{M(Y \ {A} → A) | Y ⊊ X}, then the rule would be redundant. In both cases, we can mark attribute A as an impossible consequence, but the node itself cannot be pruned.
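The two branch-and-bound tests described above can be summarised in a few lines of code. The sketch below is illustrative only; min_m and the best-value bookkeeping are simplified stand-ins for the node structures described in Section 5.2.

    def consequence_possible(ub, min_m, best_m_for_a):
        """A remains a possible consequence only if its upper bound can beat both the
        current minimum threshold and the best rule Y -> A found in subsets of X."""
        return ub >= min_m and ub > best_m_for_a

    def node_prunable(cons_possible):
        """A node can be deleted when no attribute remains a possible consequence."""
        return not any(cons_possible.values())

    # Example: upper bound 12.3 for consequence A, threshold 18.0, best known M(Y -> A) = 0.0.
    print(consequence_possible(12.3, 18.0, 0.0))      # False: A is marked impossible
    print(node_prunable({"A": False, "B": False}))    # True: the node can be removed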

5.2. Algorithms

Branch-and-bound search can be implemented either in a breadth-first or a depth-first manner. The main algorithm for the breadth-first search is given in Algorithm 1, and for the depth-first search in Algorithms 2 (the main program) and 3 (the recursive function). For clarity, we use the notation s.set to refer to the set which contains all attributes from the root of the enumeration tree to node s. In practice, it is enough to save just the last added attribute into s.

In both search strategies, the first task is to create an empty enumeration tree t and add all single attributes as its children. In this phase, it is already possible to prune out some attributes, if they occur too infrequently (e.g. fewer than five times in the whole data) or on all rows of the data. After that, the breadth-first search begins to expand the tree level by level, while the depth-first search expands it branch by branch. We do not yet go into the details of how a set X can be expanded, but the idea is that all attribute sets should occur at most once in the enumeration tree. This is checked by condition Possible(XA_i). If we already know that all rules in X's supersets will be redundant or insignificant (have M < min_M), then Possible(XA_i) = 0 for all A_i. In practice, the expansion can be coded in a more elegant way. For example, if the order of attributes is fixed, then we can add an attribute A_i to set X only if A_i comes after all attributes of X in the canonical order. We note that in the breadth-first search, condition (num ≥ l + 1) merely guarantees that we have at least l + 1 l-sets before any (l + 1)-sets are generated. The reason is that each (l + 1)-set has l + 1 parent sets, and each parent set exists in the tree only if its supersets can produce non-redundant, significant rules.

For clarity, we assume that all discovered rules are stored in a separate collection and, therefore, all nodes containing only minimal rules or having too low upper bounds for any significant rules can be deleted. An alternative is to also keep the discovered rules in the tree until the search ends. In this case, we have to mark the leaf nodes which cannot have any child nodes.

Alg. 1 BreadthFirst

    create root t
    for all A_i ∈ R
        create child node s_i for set {A_i}
    l = 1; num = |R|
    while (num ≥ l + 1)
        for all nodes s at level l
            X = s.set
            for all A_i ∈ R \ X
                if (Possible(XA_i))
                    create child node p for set XA_i
                    if (process(p) == 0)
                        delete node p
        l = l + 1; num = number of created sets
    output discovered rules

Alg. 2 DepthFirst

    create root t
    for all A_i ∈ R
        create child node s_i for set {A_i}
    for all children s_i
        dfsearch(s_i)
    output discovered rules

Alg. 3 dfsearch(Node s)

    if (process(s) == 0)        // message to the call level that s can be deleted
        return 0
    X = s.set
    for all A_i ∈ R \ X
        if (Possible(XA_i))
            create child node p for set XA_i
            if (dfsearch(p) == 0)
                delete node p
    return 1


All rules which can be derived from set X = s.set or its supersets are evaluated by process(s). Function process checks all of s's parent nodes, calculates X's frequency, estimates upper bounds, and decides whether any non-redundant, significant rules can be derived from X or its supersets. If this is not possible, node s can be removed. Otherwise, the goodness of rules X \ {A_i} → A_i is estimated by measure M, and the redundancy or minimality of the rules is decided. The implementation of function process depends on whether we want to find non-redundant rules in the classical or the strict sense.

Let us first consider the classical definition. In each node, representing set X, we need two tables. Table cons defines for all A_i ∈ R whether rules (XQ) \ {A_i} → A_i can be non-redundant and significant. The second table, best, keeps track of the best M-values achieved in X or its subsets: for any A_i ∈ X, best[i] = max{M(Y → A_i) | Y ⊆ X \ {A_i}}. Only the attributes A_i ∈ X need an entry in best and, in practice, the table can be implemented by a more compact structure. However, for clarity, we will assume here that both tables are indexed from 1 to k = |R|. In the root node, cons[i] = 1 (all consequent attributes are possible) and best[i] = 0.0 (no rules yet) for all i = 1, ..., k.

Function process1, which is used with the classical definitions, is given in Algorithm 4. Given node s, the first task is to search for s's parent nodes in the tree. If a parent is missing, then we know that neither X nor any of its supersets can produce any non-redundant, significant rules, and s can be pruned out. Otherwise, we combine the information in the parents' cons and best tables. If consequence A_j was defined as impossible in any parent node, then it is impossible in node s. For the possible consequences A_j, we select the best M-value which was achieved for rules of A_j in s's parents. If A_j does not occur in s.set, then the best value remains zero. At this point it is already possible that all consequences become impossible, and then nothing else is done; the function just returns 0 to the call level, so that node s can be removed. If there are possible consequences left, the next task is to calculate new upper bounds. For this we also need the frequency m(X). If the upper bound for a rule with consequence A_i is too low (the rule would be insignificant or redundant), the consequence can be marked as impossible. If the upper bound is sufficiently large and A_i belongs to set X, then we can calculate the exact M-value for rule X \ {A_i} → A_i. If the rule is significant and non-redundant, it is stored in the collection of discovered rules. If the rule is also minimal, then all its specializations X \ {A_i}Q → A_i, as well as rules XQ \ {A_j} → A_j, A_j ∉ X, will be redundant, and the consequences can be marked as impossible.

Let us now consider the search with the strict definitions of redundancy and minimality. Now the node structure can be simpler, because we always consider just the best rule which can be derived from the corresponding set. Therefore, it is enough to have a single value best, which keeps track of the best M-value achieved in set X or its subsets. Function process2, which is used with the strict definitions, is given in Algorithm 5. The function begins like the previous one, by searching for s's parent nodes. If any of the parent nodes is missing, node s can be pruned. Otherwise, the parents' best-values are checked and their maximum is stored in s.

Next, X's frequency is calculated and the maximal upper bound for any rule (XQ) \ {A_i} → A_i is estimated. If the measure function depends only on the frequency and the lift, the maximal upper bound can be estimated directly, using frequency m(X) and an upper bound for the lift, P(A_min)⁻¹, where A_min = mina(XQ) for any possible Q ⊆ R \ X. If the upper bound is too low, node s can be removed. Otherwise, all of X's rules are checked and the best is selected. If the best rule is significant and non-redundant, it is stored. If the rule is also minimal (the sufficient conditions are now stricter), then all rules in X's supersets will be redundant, and node s can be removed.
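As a reading aid for the two node layouts just described, the following sketch shows one possible in-memory representation (the field names are ours, not from the paper): a node for the classical search carries the cons and best tables, while a node for the strict search only needs a single best value.

    class ClassicalNode:
        """Node for the classical search: per-attribute consequence flags and best M-values."""
        def __init__(self, last_attr, k):
            self.last_attr = last_attr        # only the last added attribute is stored
            self.cons = [1] * k               # 1 = attribute may still be a consequence
            self.best = [0.0] * k             # best M(Y -> A_i) seen in this set or its subsets

        def combine_parent(self, parent):
            # A consequence impossible in any parent is impossible here; best values are maxima.
            for j in range(len(self.cons)):
                if parent.cons[j] == 0:
                    self.cons[j] = 0
                elif parent.best[j] > self.best[j]:
                    self.best[j] = parent.best[j]

    class StrictNode:
        """Node for the strict search: a single best M-value suffices."""
        def __init__(self, last_attr):
            self.last_attr = last_attr
            self.best = 0.0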


Alg. 4 process1(Node s)

    X = s.set
    for i = 1 to k                               // initialization
        s.cons[i] = 1                            // all consequences possible
        s.best[i] = 0.0                          // minimal M-value
    // check all parents
    for all A_i ∈ X
        search parent p_i where p_i.set == X \ {A_i}
        if (p_i is missing)
            return 0                             // the node can be removed
        for j = 1 to k
            if (p_i.cons[j] == 0)                // if consequence A_j is impossible in any parent, it is impossible in s
                s.cons[j] = 0
            else if (p_i.best[j] > s.best[j])    // select the best value in parents
                s.best[j] = p_i.best[j]
    // check upper bounds and rules of X
    calculate frequency m(X)
    for i = 1 to k
        if ((s.cons[i] == 1) and (X ≠ A_i))      // if X == A_i, A_i remains possible
            ub = UB(M((XQ) \ {A_i} → A_i))
            if ((ub < min_M) or (ub ≤ s.best[i]))        // the rule would be non-significant or redundant
                s.cons[i] = 0
            else if ((A_i ∈ X) and (|X| ≥ 2))    // if A_i possible and in X, check the rule
                v = M(X \ {A_i} → A_i)
                if ((v ≥ min_M) and (v > s.best[i]))     // is the rule significant and non-redundant?
                    s.best[i] = v
                    store rule X \ {A_i} → A_i
                if (P(A_i | X \ {A_i}) == 1)     // is the rule minimal? sufficient condition cf == 1
                    s.cons[i] = 0                // mark redundant rules
                    for all A_j ∉ X
                        s.cons[j] = 0
    if (∀A_i ∈ R: s.cons[i] == 0)                // no consequences possible in s's children
        return 0
    return 1


Alg. 5 process2(Node s)

    X = s.set
    s.best = 0.0                                 // minimal M-value
    // check all parents
    for all A_i ∈ X
        search for parent p_i where p_i.set == X \ {A_i}
        if (p_i is missing)
            return 0                             // the node can be removed
        if (p_i.best > s.best)                   // select the maximum of parents' best values
            s.best = p_i.best
    calculate frequency m(X)
    // upper bound for the best rule which can be derived from any XQ
    // (only for proper rules with X ≠ A_i)
    ub = max_{X ≠ A_i} { UB(M((XQ) \ {A_i} → A_i)) }
    if ((ub < min_M) or (ub ≤ s.best))           // all rules would be non-significant or redundant
        return 0
    // check X's rules
    if (|X| ≥ 2)
        v = max{ M(X \ {A_i} → A_i) | A_i ∈ X }
        if ((v ≥ min_M) and (v > s.best))
            s.best = v
            store rule X \ {A_i} → A_i
        // is the rule minimal? sufficient condition: cf = 1 and A_i is the minimal attribute
        // which can be added to X
        if ((P(A_i | X \ {A_i}) == 1) and (A_i == mina(XQ) for any Q which can be added to X))
            return 0
    return 1


In both cases (with the classical and the strict definitions), it is possible to search for only the K best rules. Now the collection of discovered rules contains at most K rules, and min_M is updated when the Kth best rule changes. If one really wants to get all rules exceeding some threshold min_M, then one should consider where to store them. In the worst case, the number of non-redundant rules can be tens or hundreds of thousands. One solution is to store them in the enumeration tree. When the classical definitions are used, this means that we should save parent pointers, consequent attributes, and possibly also their M-values for all rules stored in a node. When the strict definition is used, it is enough to save just one parent pointer, the consequent attribute, and the M-value. In addition, we need a flag to mark a minimal rule, so that no child nodes are created for the node.

Finally, we note that table cons used in Algorithm 4 is space-consuming, even if it is implemented as a bit vector. In the worst case, the data set can contain tens of thousands of attributes, and in the beginning all attributes can be possible consequences. As the search proceeds, table cons contains fewer and fewer 1-values, which could be stored more efficiently. On the other hand, in table best we need only l values when the lth level is processed. In practice they can be stored in the same order as the attributes of the current set X occur on the path from the root. However, this table can also be too space-consuming, because the measure values are real numbers (type float). Therefore, it is important to remove both structures when they are no longer needed (i.e. when all children have been created).
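One simple way to realise the top-K collection described above is a bounded min-heap: min_M is always the score of the Kth best rule found so far. The sketch below is only an illustration of that idea, not the data structure used in StatDep or Chitwo.

    import heapq

    class TopK:
        """Keep the K best (M-value, rule) pairs; min_m rises as better rules are found."""
        def __init__(self, k, initial_min_m=0.0):
            self.k = k
            self.heap = []                      # min-heap of (M-value, rule)
            self.initial_min_m = initial_min_m

        @property
        def min_m(self):
            # Pruning threshold: the Kth best value once the collection is full.
            return self.heap[0][0] if len(self.heap) == self.k else self.initial_min_m

        def offer(self, m_value, rule):
            if m_value < self.min_m:
                return
            if len(self.heap) < self.k:
                heapq.heappush(self.heap, (m_value, rule))
            else:
                heapq.heapreplace(self.heap, (m_value, rule))

    top = TopK(k=2, initial_min_m=18.0)
    for m, r in [(100.0, "A5->A4"), (20.5, "A5->A2"), (24.1, "A3A2->A5")]:
        top.offer(m, r)
    print(sorted(top.heap, reverse=True))   # [(100.0, 'A5->A4'), (24.1, 'A3A2->A5')]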

5.3. Search order

For efficiency, an important concern is the order in which the enumeration tree should be traversed. Should we proceed in a breadth-first manner (level by level, from top to bottom) or in a depth-first manner (branch by branch, from left to right)? Should the attributes be ordered in ascending or descending order by frequency? Assuming that the search proceeds from left to right and from top to bottom, there can be four different kinds of enumeration trees:

1. a tree of type 1 in Figure 1, when attributes are in ascending order, i.e. P(A) ≤ P(B) ≤ ... ≤ P(E);

2. a tree of type 1 in Figure 1, when attributes are in descending order, i.e. P(A) ≥ P(B) ≥ ... ≥ P(E);

3. a tree of type 2 in Figure 1, when attributes are in ascending order, i.e. P(A) ≤ P(B) ≤ ... ≤ P(E); and

4. a tree of type 2 in Figure 1, when attributes are in descending order, i.e. P(A) ≥ P(B) ≥ ... ≥ P(E).

All of them can be traversed in a breadth-first manner, but only the first two are applicable to the depth-first search: if we proceeded branch by branch from left to right in a tree of type 2, we would process a node before all of its parents have been processed.

Let us now analyze how well these four types of trees allow efficient pruning. In all approaches, the most important concern is to keep the largest subtree (which can contain half of the nodes) from growing. When the largest subtree has the minimal attribute in its root, it is also likely to stay small. Therefore, we prefer trees 2 and 3 by default. Trees 1 and 4 also have other problems when the non-redundant rules are searched for in the strict sense. In this case, the minimal attribute which can be added to a subtree defines the upper bound. In trees 1 and 4 the minimal attribute is the same for all nodes (unless otherwise proved), and therefore all nodes also have the same high value for the maximal lift. In practice this means that we cannot use the maximal lift to decrease the upper bounds. The last problem concerns only the breadth-first search. In trees 1 and 4, the minimal rules (having the minimal attribute in the consequence) cannot be found until we are in the leaves. It is likely that no minimal rules are found on the first levels, and no pruning based on minimality is possible.

Both trees 2 and 3 avoid these problems. In the second type of tree, the minimal attribute E can occur only in the last branch. This means that the maximal lift, P(E)⁻¹, can be achieved only in the last branch. So, when we proceed level by level from left to right, the maximal lift values increase, but at the same time the frequencies are likely to decrease. This is quite a good compromise for avoiding over-optimistic upper bounds. One special problem with both trees 2 and 3 is that one parent set has a smaller maximal lift than the set itself. For example, the upper bound for the lift in set DCA is P(D)⁻¹, but in CA it is P(C)⁻¹. If set CA has a too low upper bound with its own frequency and maximal lift, then we know that none of its supersets in the same branch (e.g. CBA) can produce a significant rule. Similarly, if rule A → C was minimal, then all its supersets in the same branch would be redundant. Still, set CA or its supersets can be needed as parent sets for rules in other branches. In the StatApriori algorithm [12, 11] we have used two solutions to this problem. First, we can prune out a set which is either minimal or has a too low upper bound with its own minimal attribute. Now we just have to allow one missing parent (the one in the different branch) when parents are checked. In addition, we have to calculate the missing parent's frequency when the rules are evaluated. The second solution is to keep the set in the tree, if it gets a sufficiently high upper bound with the absolute minimal attribute (in the example, attribute E). However, the set can be deleted later, if we do not find enough other parent sets in the next branches. For example, if CA could be only a special parent set and all of the next branches (the D- or E-branch) contain at most one non-minimal, potentially significant 2-set, then CA or any of its supersets in branch C cannot be special parents. The formal proof is given in [11].

Trees 2 and 3 differ from each other in only one aspect. If we proceed along a level from left to right, then tree 2 first checks the sets of the most common attributes, while tree 3 begins with the sets of the rarest attributes. The latter can be useful if we search for only the K best rules. If the measure M favours rules with large lift but smaller frequency over more frequent rules with smaller lift, then it is possible to find larger M-values and update min_M at the beginning of the level. Tree 3 is also needed for the extra pruning which is explained in the next section. Based on these considerations, we have selected tree 3 for the current implementation of the algorithms.
The newest programs, StatDep (for strictly non-redundant rules) and Chitwo (for classically non-redundant rules), are both breadth-first searches. In previous research, we have also introduced a depth-first-based algorithm, DeepClue [10], but according to our experiments it is less efficient than the newest versions of the breadth-first algorithms. It is still possible that a depth-first search might suit some problems better, like searching for both positive and negative dependencies or for general dependency rules containing also negations.

5.4. Extra pruning

The upper bounds are quite tight for rules X → A where P(A) ≥ P(X). However, when P(A) < P(X), they are too large for any efficient pruning. In particular, the upper bound for χ²(X → A) is always n (the maximal possible value) when P(X) > P(A). In addition, it can often happen that set X and its supersets XQ can occur only as special parents, but we should still create them, unless we know that they are not needed. Therefore, we need a condition for deciding which of the attributes A ∉ X with P(A) < P(X) are not possible consequences for X or its supersets XQ. The following principle expresses such a condition. It is called the Lapis Philosophorum principle, because it can make miracles like the legendary Philosopher's Stone. In fact, this simple principle was the key idea which enabled an efficient search for classically non-redundant rules. A related idea was already presented in [11] for the strictly non-redundant rules, but only now have we realized the whole idea in its full potential.

Figure 2. The Lapis Philosophorum principle: node p at level l represents set X and node s at level l + 1 represents set XA. Dashed lines represent paths (Q).

Principle 1. (Lapis Philosophorum) Let p be a node at the lth level of an enumeration tree and X the set of attributes represented by p. Attribute A ∉ X is not a possible consequence for any condition set XQ, Q ⊆ R \ (X ∪ {A}), unless there is a node s at level l + 1 representing set XA such that (i) UB(M(XQ → A)) ≥ min_M, (ii) UB(M(XQ → A)) > max{M(Y → A) | Y ⊊ X}, and (iii) X → A is not minimal.


If the first condition does not hold, then rule XQ → A is not sufficiently good. If the second condition does not hold, it is redundant with respect to rule Y → A, and if the third condition does not hold, it is redundant with respect to rule X → A. The idea is represented in Figure 2. We note that the search tree should now be tree 3. In practice this principle means that when we have processed node s (set XA) and decided to keep it in the enumeration tree, we can mark in node p that consequence A is actually possible. Otherwise, A can be marked as impossible for X and all of its children sets. When P(A) < P(A_min) for A_min = mina(X) (otherwise it is not possible that P(A) < P(X) ≤ P(A_min)), the sets XA are processed before any children are created for X.

When the Lapis Philosophorum principle is applied with the strict definition of redundancy, it is enough to keep track of the minimal attribute which can occur in the consequence of any rule XQ → A. In practice this means that when we have decided to keep s in the tree, we set A = mina(XA) as the minimal possible attribute of node p, unless p's minimal attribute was already set at this level.
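Read as code, the principle is a per-consequence test made once node s (set XA) has been processed. The sketch below is a hypothetical helper illustrating the three conditions, not code from the Chitwo implementation.

    def keep_consequence(ub_xq_to_a, min_m, best_m_y_to_a, rule_x_to_a_minimal):
        """Lapis Philosophorum: A stays a possible consequence for X and its supersets
        only if all three conditions of Principle 1 hold."""
        return (ub_xq_to_a >= min_m                # (i) some rule XQ -> A could still be significant
                and ub_xq_to_a > best_m_y_to_a     # (ii) and non-redundant w.r.t. rules Y -> A, Y a proper subset of X
                and not rule_x_to_a_minimal)       # (iii) X -> A itself is not already minimal

    # Example call with made-up numbers; if no node s for XA survives at all
    # (as for A6 in the simulation of Section 5.5), A is marked impossible by default.
    print(keep_consequence(ub_xq_to_a=25.0, min_m=18.0, best_m_y_to_a=0.0,
                           rule_x_to_a_minimal=False))   # True: A remains possible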

5.5. Example

Let us now simulate the algorithm on the data set given in Table 1, when the goodness measure is the χ²-measure and the task is to search for all classically non-redundant dependency rules for which χ² ≥ 18.0. In addition, we require that all rules should occur on at least five rows of the data. The node corresponding to set X in the enumeration tree is notated by Node(X).

Table 1. Example data for the simulation: the sets which occur in the data and their absolute frequencies. The total number of rows is n = 100.

    Set X                      m(X)
    A1 A2 A3 A4 A5 A6             1
    A1 A2 A3 A4 A5 ¬A6            9
    ¬A1 A2 A3 A4 A5 ¬A6           5
    A1 A2 A3 ¬A4 ¬A5 ¬A6         15
    A1 ¬A2 A3 ¬A4 ¬A5 ¬A6        15
    A1 ¬A2 ¬A3 ¬A4 ¬A5 ¬A6       20
    ¬A1 A2 ¬A3 A4 A5 ¬A6          5
    ¬A1 A2 ¬A3 ¬A4 ¬A5 ¬A6       20
    ¬A1 ¬A2 ¬A3 ¬A4 ¬A5 A6       10
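The χ²-values quoted in the walkthrough below can be reproduced directly from Table 1; the following short sketch (ours, written only to check the example) computes the frequencies and Equation 2 for a few of the rules discussed.

    # Each row: (its positive attributes, its absolute frequency) from Table 1.
    rows = [("A1A2A3A4A5A6", 1), ("A1A2A3A4A5", 9), ("A2A3A4A5", 5), ("A1A2A3", 15),
            ("A1A3", 15), ("A1", 20), ("A2A4A5", 5), ("A2", 20), ("A6", 10)]
    n = 100

    def freq(attrs):
        """Absolute frequency of the rows containing all attributes in attrs."""
        return sum(m for positives, m in rows if all(a in positives for a in attrs))

    def chi2(x_attrs, a):
        p_x, p_a, p_xa = freq(x_attrs) / n, freq([a]) / n, freq(x_attrs + [a]) / n
        return n * (p_xa - p_x * p_a) ** 2 / (p_x * (1 - p_x) * p_a * (1 - p_a))

    print(round(chi2(["A5"], "A4"), 1))          # 100.0
    print(round(chi2(["A5"], "A2"), 1))          # 20.5
    print(round(chi2(["A3", "A2"], "A5"), 1))    # 24.1
    print(round(chi2(["A3", "A1"], "A5"), 1))    # 1.0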

On the first level, all attributes are added to the enumeration tree in an ascending order by their frequencies, i.e. A6, A5, A4, A3, A2, A1. The possible consequences are determined for each node by upperbound U B(χ2 ) (Theorem 2.2). The resulting enumeration tree is represented in Figure 3. In each node, the bitvector represents the cons table. The second level begins by creating child nodes under N ode(A6). Possible child sets are {A6, A5}, {A6, A4}, {A6, A3}, {A6, A2}, and {A6, A1}. All of them behave in the same way. When the parents’ cons tables have been combined, only A6, A5 and A4 remain as possible consequences. How-

[Figure 3. The simulation of the algorithm on the example data (enumeration tree after levels 1–3). Bitvectors represent possible (bit 1) and impossible (bit 0) consequences.]

However, all these sets occur on just one row of data, and they can be pruned out as infrequent. By the Lapis Philosophorum principle, consequence A6 is marked as impossible in the parent nodes Node(A5), Node(A4), Node(A3), Node(A2), and Node(A1).

Next, children are created under Node(A5). The first child is Node(A5, A4). When the parents' cons tables have been combined, the possible consequences are A5, A4, A3, and A2. First, upper bounds are calculated for rules A4Q → A5, A5Q → A4, A4A5Q → A3, and A4A5Q → A2. All of them are sufficiently good, and no consequences are pruned out. Next, rule A5 → A4 (equivalently, A4 → A5) is evaluated. Because m(A5A4) = m(A5) = m(A4) = 20, both A5 → A4 and A4 → A5 are minimal. Therefore, all consequences Ai with Ai ≠ A5 and Ai ≠ A4 are marked as impossible. In addition, A5 and A4 are marked as impossible consequences (minimal rules were already found for them). Therefore, the node can be deleted. Rule A5 → A4 is stored in the collection of the best rules, because it has a sufficiently good χ2-value (the best possible value, χ2 = 100). In addition, A4 is marked as impossible in Node(A5) and A5 in Node(A4), by the Lapis Philosophorum principle.

The second child of Node(A5) is Node(A5, A3). After combining the parents' cons tables, the possible consequences are A5, A3, and A2. Only rules A3Q → A5 and A5Q → A3 have sufficiently large upper bounds, and consequences A5 and A3 remain possible. However, rule A5 → A3 (or, equivalently, A3 → A5) is not sufficiently good (χ2 = 9.1) and is therefore not stored.

The third child of Node(A5) is Node(A5, A2). In the beginning, the possible consequences are A5, A3, and A2. First, upper bounds are calculated for rules A2Q → A5, A5A2Q → A3, and A5Q → A2. All of them are sufficiently good and no consequences are pruned out. Since m(A5A2) = m(A5) = 20, rule A5 → A2 is minimal. Therefore, consequence A3 is marked as impossible. In addition, A2 is marked as impossible (the minimal rule was already found). However, consequence A5 cannot be pruned out by the minimality condition. Consequence A2 is marked as impossible in Node(A5) by the Lapis Philosophorum principle. Since rule A5 → A2 (equivalently, A2 → A5) is sufficiently good (χ2 = 20.5), the rule is stored in the collection.

The rest of the second level is processed similarly. The resulting enumeration tree at the end of the second level is represented in Figure 3.

On the third level, redundancy must also be checked, because sufficiently good rules were already found on the second level. The third level begins by generating child node Node(A5, A3, A2) under Node(A5, A3). Only consequence A5 is possible. Rule A3A2Q → A5 has an upper bound of 70.6, which is larger than χ2(A2 → A5) = χ2(A5 → A2) = 20.5, and therefore the rule could be non-redundant. The χ2-value of rule A3A2 → A5 is 24.1, which means that the rule is sufficiently good and non-redundant, and it is stored in the collection. Consequence A5 still remains possible, because a better rule could be found among its specializations. The second child of Node(A5, A3) is Node(A5, A3, A1). Only consequence A5 is possible. The upper bound for rule A3A1Q → A5 is sufficiently large (44.4) and the consequence remains possible. However, rule A3A1 → A5 is not sufficiently good (χ2 = 1.0) and is not stored. The rest of the third level proceeds similarly. The resulting enumeration tree after the third level is represented in Figure 3.

On the fourth level, there are only two sets to be checked, {A5, A3, A2, A1} and {A4, A3, A2, A1}. However, when the parents' cons tables have been combined, no consequences remain possible, and the nodes are deleted immediately. The discovered rules with their χ2-values are given in Table 2.
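As an added check (not part of the original text), the χ2-values quoted in the simulation can be reproduced from the standard 2×2 contingency table of X and A. The function below is only a sketch and is not taken from the StatDep/Chitwo implementation, but it yields exactly the values above when applied to the frequencies of Table 1.

    def chi2(n, m_x, m_a, m_xa):
        # n = number of rows, m_x = m(X), m_a = m(A), m_xa = m(XA)
        num = n * (n * m_xa - m_x * m_a) ** 2
        den = m_x * m_a * (n - m_x) * (n - m_a)
        return num / den

    # Frequencies taken from Table 1 (n = 100):
    print(round(chi2(100, 20, 20, 20), 1))  # A5 -> A4: 100.0
    print(round(chi2(100, 20, 55, 20), 1))  # A5 -> A2: 20.5
    print(round(chi2(100, 30, 20, 15), 1))  # A3A2 -> A5: 24.1
    print(round(chi2(100, 20, 45, 15), 1))  # A5 -> A3: 9.1 (below the threshold 18.0)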

6. Experiments

In the experiments, our main interest was to estimate the efficiency of different search approaches when the χ2-measure or the z-score was used as the goodness measure. Based on our previous experience [12, 11, 10], these two measures produce good results when the task is to find genuine statistical dependencies. Therefore, we did not make any new comparisons with other measures. In the first experiment, the performance of four search tasks (search for classically and strictly non-redundant rules, using either the χ2-measure or the z-score) was evaluated.

Table 2. Discovered rules and their χ2-values.

Rule          χ2
A5 → A4      100
A1A2 → A3     40.7
A3 → A1       28.5
A2A3 → A5     24.1
A2A3 → A4     24.1
A5 → A2       20.5
A4 → A2       20.5

In the second experiment, we concentrated on two real world data sets and analyzed the results in more detail.

6.1. Performance comparison

In the first experiment, we evaluated the performance of the new algorithms on five biological or medical data sets. The experiment consisted of three parts, where we compared the overall performance of different search strategies, tested the effect of the new pruning principles, and estimated the scalability of the new algorithms.

6.1.1. Test setting

In the performance evaluation, we compared two implementations of the algorithm, StatDep (for strictly non-redundant rules) and Chitwo (for classically non-redundant rules), using both the χ2-measure and the z-score with both programs. The program implementations are available at http://www.cs.helsinki.fi/u/whamalai/sourcecode.html. For comparison, we also made some experiments with an implementation of the traditional Apriori algorithm (for frequent association rules) by C. Borgelt [7]. All experiments were executed on an Intel Core Duo T5500 1.66 GHz processor with 1 GB RAM, 2 MB cache, and the Linux operating system.

The tested data sets were either biological or medical data. All of them except Biodiv were from the Machine Learning Repository [9]. Biodiv contains biodiversity data (plant species in different habitats), which we are collecting (see http://www.cs.helsinki.fi/u/whamalai/bddataproject.html). The descriptions of the data sets are given in Table 3. For each data set, we give the number of rows, the number of all attributes, the number of attributes occurring on at least five rows, and statistics on the transaction length (the number of 1-valued attributes on a row). The average and maximal transaction lengths are commonly used to estimate how dense the data is. Based on these statistics, the densest data sets are Heart, Mushroom, and Plants, while Breastcancer and Biodiv are the sparsest.

On each data set, we searched for the 100 best rules with the selected measure. No minimum frequency thresholds were used, except the common requirement that each rule should occur on at least five rows. (As a classical rule of thumb, all estimated parameters, like the relative frequency, should be based on at least 5–10 rows of data.)
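As a small added illustration (not from the original paper), the transaction-length statistics reported in Table 3 (avgl, minl, maxl) can be computed with a few lines of code; the data is assumed to be a 0/1 matrix with one row per transaction.

    import numpy as np

    def transaction_length_stats(data, min_count=5):
        # data: 0/1 matrix, rows = transactions, columns = attributes
        data = np.asarray(data)
        frequent = data.sum(axis=0) >= min_count     # drop rare attributes with m(Ai) < 5
        lengths = data[:, frequent].sum(axis=1)      # number of 1-valued attributes per row
        return lengths.mean(), lengths.min(), lengths.max()   # avgl, minl, maxl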

Table 3. Description of the data sets used in the first experiment. n = number of rows, k = number of all attributes, and k2 = number of attributes which occur on at least 5 rows. avgl, minl and maxl give the average, minimal, and maximal transaction lengths, when rare attributes (m(Ai) < 5) have been removed.

set             n      k     k2    avgl   minl   maxl
Breastcancer    286    25    25     7.3     6      8
Heart           267    46    46    23.0    23     23
Biodiv          699  3581   358     6.4     1     47
Mushroom       8124   119   116    22.0    20     22
Plants        22632    70    69    12.5     2     69

No minimum confidence thresholds or restrictions on the rule length were used either. In the beginning of each search, we gave some initial minimum value minM for the measure M, to speed up the first levels. After that, the minM value was updated automatically when new good rules were found. We note that the parameter minM is by no means necessary for searching for the best K rules and does not affect the discovered rules. With Apriori, we also used the absolute minimum frequency threshold five. Since Apriori finds all frequent rules, an efficient post-processing phase would be needed to select only the best non-redundant rules from the results. However, our objective was only to test whether the same rules could be found, even in principle, with the traditional frequency-based pruning.

6.1.2. Results of the performance comparison

The results of the performance comparison with StatDep and Chitwo are given in Table 4. The size of the enumeration tree has the strongest impact on efficiency. Therefore, we give the dimensions of the generated enumeration trees: the depth of the tree (the last level which was created), the widest level, and its size (the number of sets which remain in the tree). In addition, the execution time (CPU time) is given. For Apriori, the results are given in Table 5. The program reports only the last level, the number of discovered rules, and the execution time. With Heart, Mushroom, and Plants, the program spent all memory and either crashed or got stuck and was halted after 30 minutes.

Breastcancer, Biodiv and Mushroom were computationally the easiest data sets. The first two of them were also easy to handle with the traditional approach, because they are quite sparse. However, Mushroom was surprisingly demanding for Apriori, and the computer crashed on level 7 after spending all memory. The main reason for the success of StatDep and Chitwo on Mushroom is the large number of strong dependencies (with cf = 1.0), which enabled the pruning by minimality. In fact, all of the 100 best rules had confidence 1.0, independent of the measure or the definition of redundancy. In practice, this means that the algorithms were able to prune out large areas of the search space as redundant.

Heart was also a relatively easy set, although now the searches took some seconds. Heart contains only 46 attributes, but each row contains exactly 23 attributes, which makes the data very dense. This kind of data is a pathological case for the traditional frequency-based pruning, and Apriori got stuck at level 12. When only non-redundant dependency rules were searched for, the search proceeded to the 14–15th level. StatDep with the χ2-measure succeeded best. The reason was the χ2-measure, which favoured simple rules and therefore allowed the program to prune out many complex rules as redundant.

Table 4. Results of the performance comparison with StatDep and Chitwo: the last level, the widest level, its size, and the execution time in seconds.

Search task      last level   widest level   its size   time (s)
Breastcancer
  StatDep χ2          6             4             774         0
  Chitwo χ2           6             4             749         0
  StatDep z           6             3             737         0
  Chitwo z            6             3             743         0
Heart
  StatDep χ2         14             8          213756         7
  Chitwo χ2          15             9          268960        11
  StatDep z          15             9          268903         9
  Chitwo z           15             9          268912        11
Biodiv
  StatDep χ2          4             2             399         0
  Chitwo χ2           4             2             732         0
  StatDep z           4             2             381         0
  Chitwo z            3             2             732         0
Mushroom
  StatDep χ2         10             5           21093         1
  Chitwo χ2           9             5           13562         1
  StatDep z           7             3            8001         0
  Chitwo z            4             3            5382         0
Plants
  StatDep χ2         14             7          387178        35
  Chitwo χ2          16             8          247282        34
  StatDep z          17             9         1925063       240
  Chitwo z           18            10         2340803       385

Table 5. Results of the performance comparison with Apriori: the last level, the number of discovered rules, and the execution time in seconds.

Data set       last level   number of rules   time (s)
Breastcancer        7              9218           0
Heart             ≥12                 –           –
Biodiv             14            163116           0
Mushroom           ≥7                 –           –
Plants             ≥6                 –           –

Chitwo also produced the same simple rules, but due to the weaker definition of redundancy, it also had a weaker pruning ability. In this set, none of the best rules had confidence 1.0, and the pruning by minimality could not be used.

Plants was computationally the most demanding set, especially for the z-score. Like Heart, it is quite a dense set which contains many strong dependencies, but none of the best rules has confidence 1.0. Apriori got stuck already on level 6. StatDep and Chitwo proceeded to levels 14–18 and the widest level was 7–10. Once again, the χ2-measure managed to prune the search space more efficiently, because it favoured simpler rules. With the χ2-measure, no new rules were found after level 4 (StatDep) or 5 (Chitwo), and with the z-score, the last rules were found on level 6. Still, the search with the z-score produced a large enumeration tree with about two million sets in the widest levels. The problem is that the z-score begins to overestimate the significance when the expected counts become low.

The most important observation in this experiment was that the search with the strict definition of redundancy is not necessarily faster than the search with the classical definition. Here we should make two remarks. First, the search with the classical definition requires more information to be stored in each node, and therefore the enumeration tree is also more space-consuming. However, the extra information also enables a better-targeted search and pruning, because it is known exactly which rules can still improve. The second remark concerns the resulting rule sets. When we select the 100 best rules with the classical definition, they nearly always contain some rules (like permutation rules AB → C and BC → A or their specializations) which are pruned with the strict definition. So, the 100 best rules with the strict definition in fact summarize a larger rule set than the 100 best rules with the classical definition. To find all those rules, StatDep has to use smaller minM values and search a larger space than Chitwo does.

In this experiment, StatDep and Chitwo produced exactly the same rules only for Mushroom and Biodiv. With the z-score, Chitwo produced on average 6.4% strictly redundant rules, and with the χ2-measure, on average 11.4% strictly redundant rules. An important question is whether these extra rules contain any new information which cannot be derived from the other rules. In the second experiment (Section 6.2), we tried to analyze this with two real world data sets familiar to the author.

6.1.3. The effect of pruning principles

Table 6 represents the results of the performance comparison when the pruning by minimality or the Lapis Philosophorum principle was disabled. In Breastcancer, the effect of the pruning principles was only marginal. In Heart, the pruning by minimality turned out to be crucial, and without it the search continued very deep. The Lapis Philosophorum principle did not affect the depth of the search, but it generated a clearly smaller enumeration tree. In Biodiv, the tendency was similar, but the Lapis Philosophorum principle had only a marginal effect. In Mushroom, the pruning by minimality pruned the deepest levels of the tree, while the Lapis Philosophorum principle pruned the width of the tree. The only exception was StatDep with the χ2-measure, where the pruning by minimality had a stronger effect on both the depth and the width of the tree.

The most interesting data set was Plants, where the best rules could not be searched for at all without the Lapis Philosophorum principle. Both programs, StatDep and Chitwo, got stuck (after spending all memory) with both measures on the 6th level. If only the pruning by minimality was disabled, then the search could be performed with the χ2-measure. However, if the z-score was used, then Chitwo got stuck at level 10. StatDep could finish the search, but it took over 11 minutes of CPU time and the widest level of the enumeration tree contained over three million nodes.

We conclude that both pruning principles are especially crucial in dense data sets. The pruning by minimality plays a more important role when the data contains strong dependency rules with confidence cf = 1. The Lapis Philosophorum principle is more applicable when the dependencies are weaker.

6.1.4. Scalability

In the worst case, the complexity of searching for the most significant dependency rules is exponential in the number of attributes, k. A pathological case happens when the most significant (or only) dependency rules involve all k attributes. However, in practice the dependencies are usually relatively simple, especially in sparse data sets. An interesting question is how well the proposed algorithm scales up when the number of attributes increases. This, of course, depends on the data set and its distribution; in a dense data set, the upper bounds tend to be high and the search continues deep. On the other hand, if the data contains many simple, minimal rules, the deeper-level rules can be pruned out as redundant.

In this experiment, we tested how well the two versions of the algorithm (StatDep and Chitwo) scaled up when the 100 best rules were searched for from data set Plants using either the z-score or the χ2-measure. Plants was selected because it was the only data set where the execution time was sufficiently long. In an exact evaluation, one should check the execution time in all possible projections of the data set containing i attributes, i = 2, . . . , k, and calculate the average execution time for each i. However, this would result in 2^k − k − 1 tests, which is prohibitive. Therefore, we have estimated the scalability with the following strategy: First, five attributes were selected randomly from all attributes and the corresponding data projection was created. The execution time for this data set was recorded. Then, the data set was expanded by adding five new attributes randomly, and the execution time was tested with the resulting data set. The operation was repeated for all i = 5, 10, . . . , k. Since different attributes can have quite a different effect on the execution time, the whole process was repeated five times and the results were averaged. The results are represented in Figure 4.
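The following sketch (an added illustration, not the authors' test script) shows one way to implement the incremental random-projection protocol described above. Here rows are assumed to be sets of attribute names, and run_search stands for whichever search program is being timed.

    import random, time

    def scalability_run(attributes, data, run_search, step=5):
        # One run: grow a random projection by 'step' attributes at a time and
        # record the execution time for each projection size i.
        order = random.sample(list(attributes), len(attributes))
        times = {}
        for i in range(step, len(order) + 1, step):
            projection = [row & set(order[:i]) for row in data]   # rows as attribute sets
            start = time.perf_counter()
            run_search(projection)                                # e.g. StatDep or Chitwo
            times[i] = time.perf_counter() - start
        return times

    def average_runs(attributes, data, run_search, repeats=5):
        # Repeat the whole process and average the times for each i, as in the paper.
        runs = [scalability_run(attributes, data, run_search) for _ in range(repeats)]
        return {i: sum(r[i] for r in runs) / repeats for i in runs[0]}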

Table 6. Results of the performance comparison when the pruning by minimality (Min) or the Lapis Philosophorum principle (LP) was disabled: the last level, the widest level, its size, and the execution time in seconds.

                    Without Min                               Without LP
Search task     last   widest   its size   time (s)     last   widest   its size    time (s)
Breastcancer
  StatDep χ2      6      4          774       0           6      4          784        0
  Chitwo χ2       6      4          885       0           6      4          769        0
  StatDep z       6      3          740       0           6      4          789        0
  Chitwo z        6      4          885       0           6      4          789        0
Heart
  StatDep χ2     18     10       421953      21          14      7       558861       17
  Chitwo χ2      22     11      1350345     125          15      8       597911       25
  StatDep z      22     11      1352078      97          15      8       642059       21
  Chitwo z       22     11      1352078     131          15      8       647114       31
Biodiv
  StatDep χ2      9      2          551       0           5      2          732        0
  Chitwo χ2      13      7         3562       0           5      2          431        0
  StatDep z       9      2          626       0           5      2          732        0
  Chitwo z       13      7         3525       1           3      2          732        0
Mushroom
  StatDep χ2     18      8       337182      23          10      5       137736        5
  Chitwo χ2      15      7        35357       4          10      5       130225        6
  StatDep z      14      3        10722       1          10      5       119003        4
  Chitwo z       14      7         7539       1          10      5       119003        5
Plants
  StatDep χ2     15      7       406731      40          ≥6     ≥6     ≥4469291        –
  Chitwo χ2      17      9       337173      51          ≥6     ≥6     ≥4419040        –
  StatDep z      19     10      3073565     673          ≥6     ≥6     ≥5349426        –
  Chitwo z      ≥10     ≥9     ≥3501538       –          ≥6     ≥6     ≥5316685        –

We observe that the goodness measure has a stronger effect on the scalability than the type of redundancy. With both definitions of redundancy, the χ2-measure scaled up better than the z-score. The reason is that the z-score tends to favour rules with an infrequent consequent attribute, and therefore the search continues deep. On the other hand, the χ2-measure ranks all rules X → A with m(XA) = m(X) = m(A) as equally good, independent of the frequency P(A).

[Figure 4. Execution time (t) in seconds in set Plants, when the number of attributes (k) increases. Curves for Chitwo z, Chitwo χ2, StatDep z, and StatDep χ2.]

We note that the number of rows, n, has only a marginal effect on the scalability. However, it is harder to test, because the selected rows affect the attribute frequencies and thus the data distribution. It would be impossible to select subsets of rows while keeping the data distribution (and the number of attributes) constant.

6.2. Experiments on real world data

In the second experiment, the goal was to compare the results (two definitions of redundancy and two goodness measures) on two real world data sets, which contained plant community data from Finland.

6.2.1. Test setting

The first data set, Bogs, contained descriptions of 315 bogs in Middle Finland. For each bog, there was a list of plants as well as other characteristics like the bog type (the main type and a subtype), the nutrient level (three classes), whether the bog had been drained or not (two classes), and how much it had changed after the draining (three classes). The second data set, Forests, contained plant lists from 246 nemoral forests in Southern Finland. Both data sets are available at http://www.cs.helsinki.fi/u/whamalai/datasets.html.

The data characteristics are given in Table 7. Both data sets contain relatively many attributes compared to the number of rows. However, when we remove too rare attributes (which occur fewer than five times in the data), the numbers of attributes become substantially smaller. In Forests, there are still 180 attributes, which reflects the species richness of nemoral forests. In Bogs, the average transaction length is relatively small, which hints that it is a computationally easy set for any method. Forests is just the opposite. Both the average and maximal transaction lengths are large, which hints that there are several dependencies and the search space can be huge.

Table 7. Description of the two plant community data sets: n = number of rows, k = number of all attributes, and k2 = number of attributes which occur on at least 5 rows. avgl, minl and maxl give the average, minimal, and maximal transaction lengths, when rare attributes (m(Ai) < 5) have been removed.

set        n     k     k2    avgl   minl   maxl
Bogs      377   315   131    12.9     3     35
Forests   246   206   180    43.7     7    134

For analysis, we searched for the 100 best dependency rules from both data sets using the χ2-measure and the z-score, and both definitions of redundancy (programs StatDep and Chitwo). No minimum frequency thresholds (except the absolute frequency five) or restrictions on the rule length were used. For comparison, we tried to search for the same rules by traditional means, by first searching for the frequent rules with the absolute minimum frequency five and then selecting the 100 best rules with either the χ2-measure or the z-score. In Bogs, this was possible (the search produced about 500 000 rules, from which the best ones could be selected), but Forests turned out to be computationally too difficult. Finally, the search succeeded with a minimum frequency threshold minfr = 0.15 (absolute minimum frequency 37), but the resulting rule set contained over 17 million rules. None of them belonged to the best 100 rules with either goodness measure.

StatDep and Chitwo did not meet any computational problems, although some levels of the enumeration tree contained over a million sets and the search proceeded to level 16. However, the previous versions of StatApriori [12, 11] as well as DeepClue [10] could not handle Forests without any restrictions. The significant improvement was partly due to the new pruning principle (Lapis Philosophorum), which decreased the number of sets on each level (i.e. the search space), and partly due to the decision to store the best rules in a separate collection instead of the tree. Also the more compact implementation of the nodes saved space and therefore improved the efficiency in practice. We note that only the number of sets on each level affects the processing time, but in practice the memory usage can have an equally strong impact. If the data structures do not fit into the main memory, the processor spends its time swapping, and the search does not proceed.
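The decision to store the best rules in a separate bounded collection, mentioned above, also explains how the minM threshold of Section 6.1.1 can rise automatically during the search. The following class is an added illustration of this idea (not the actual StatDep/Chitwo code); the names are hypothetical.

    import heapq, itertools

    class TopKRules:
        # Bounded collection of the K best rules; min_m mirrors the minM threshold.
        def __init__(self, k, min_m=0.0):
            self.k = k
            self.min_m = min_m
            self._heap = []                   # min-heap of (value, tiebreak, rule)
            self._tie = itertools.count()     # tie-breaker so rules need not be comparable

        def offer(self, value, rule):
            if value < self.min_m:
                return                        # cannot enter the top K
            entry = (value, next(self._tie), rule)
            if len(self._heap) < self.k:
                heapq.heappush(self._heap, entry)
            elif value > self._heap[0][0]:
                heapq.heapreplace(self._heap, entry)
            if len(self._heap) == self.k:
                # the K-th best value found so far is the new pruning threshold
                self.min_m = max(self.min_m, self._heap[0][0])

        def rules(self):
            return sorted(self._heap, reverse=True)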

Table 8. Description of the 100 best rules in Bogs using two measures, the χ2-measure and the z-score, and three search methods, StatDep, Chitwo and the traditional Apriori with postprocessing. avgl = average rule length, χ2 = average χ2-value, z = average z-value, γ = average lift, cf = average confidence, fr = average frequency.

Method             avgl    χ2      z      γ      cf      fr
χ2
  StatDep           4.0   171      –    25.4   0.68   0.027
  Chitwo            4.1   176      –    24.8   0.67   0.047
  Trad. Apriori     5.4   200      –    27.6   0.82   0.034
z
  StatDep           4.3     –   12.6    27.0   0.67   0.012
  Chitwo            4.2     –   12.7    26.7   0.68   0.015
  Trad. Apriori     4.9     –   13.6    30.8   0.82   0.021

Table 9. Description of the 100 best rules in Forests using two measures, the χ2-measure and the z-score, and two search methods, StatDep and Chitwo. avgl = average rule length, χ2 = average χ2-value, z = average z-value, γ = average lift, cf = average confidence, fr = average frequency.

Method       avgl    χ2      z      γ      cf      fr
χ2
  StatDep     6.5   208      –    40.8   0.86   0.021
  Chitwo      6.5   208      –    40.8   0.86   0.021
z
  StatDep     6.5     –   14.1    40.8   0.86   0.021
  Chitwo      6.5     –   14.1    40.8   0.86   0.021

6.2.2. Results

Some statistics on the discovered rules are given in Tables 8 and 9. We recall that the best rules discovered by StatDep belong to the best rules discovered by Chitwo, and the best rules by Chitwo belong to the best rules by Apriori. Therefore, the best rules with Apriori also have the best χ2-measure and z-score values. Because both χ2 and z increase with the frequency and lift, the rules by Apriori also have the largest frequency and/or lift values.

In Bogs, all approaches produced slightly different results. When the χ2-measure was used, Chitwo produced nine rules which were strictly redundant and pruned by StatDep. With the z-score, the results were nearly identical, and only two rules by Chitwo were strictly redundant. Apriori produced quite different results with both measures, because there were a lot of redundant rules (in the classical sense). For example, for the rule "hare's-tail cottongrass, just drained, oligotrophic → type=ollknroj", the rule set with the χ2-measure contained 10 more specific rules which had an equal or a smaller χ2-value. To get the 100 best non-redundant rules in the classical sense using the χ2-measure, we should take the 324 best rules by Apriori. This set contains 224 redundant rules, more than two thirds. When the z-score was used, the number of redundant rules was even larger. To get the 100 best non-redundant rules, we should take the 345 best rules by Apriori.

When the same definition of redundancy was used, both goodness measures produced similar results, although the χ2-measure and the z-score ranked the rules in different orders. With Chitwo (the classical definition of redundancy), 94% of the best rules were common to both measures. With StatDep (the strict definition), the proportion of common rules was 87%. Generally, the z-score prefers rules with a high lift γ(X → A) to those with a high frequency. Since the least frequent consequent attributes can gain the largest lift values, they occur often in the best rules, even if the rules have small frequency. The χ2-measure also takes into account γ(¬X → ¬A), which is small for infrequent consequent attributes. For example, if P(A) < 0.5, then γ(¬X → ¬A) < 2, and if P(A) < 0.10, then γ(¬X → ¬A) < 1.12. In practice, the χ2-measure also seems to favour larger frequencies than the z-score. These observations explain the differences between the rules by the χ2-measure and the z-score. Rules with high lift typically occur in both rule sets, but the results by the χ2-measure can also contain rules with low lift.

The most interesting difference in the rules from the Bogs data was the third best rule with the χ2-measure, which did not occur at all among the 100 best rules with the z-score. The rule was "birch → fir" with γ = 2.4, cf = 0.88, and fr = 0.32. The lift γ(¬X → ¬A) = 1.5 is small, but still larger than many other rules have. However, the frequency is exceptionally high, which is probably the main reason for the good rank. The rule itself is quite trivial, because the bogs are usually forested (after draining) by planting both birches and firs.

Rules from Forests were easy to analyze, because both measures, χ2 and z, and both programs produced exactly identical results. There were not even permutation rules, because all rules were relatively complex (5–7 attributes). The first 95 rules had the apple rose in the consequence and the remaining five rules had the eggbract sedge in their consequence. The rules were mostly variations of each other, and all rules for the apple rose contained ash, oak, and glossy buckthorn in their condition part. The apple rose rules were quite infrequent (all of them occurred on just five rows), but they had high lift values (γ = 49.2 and cf = 1.0 for the first 11 rules, and γ = 41.0 and cf = 0.834 for the rest). The eggbract sedge rules had a slightly higher frequency (they occurred on 10 rows) but smaller lift (γ = 18.9, cf = 1.0). None of these rules could have been found with the traditional approach, which required a minimum frequency threshold of 0.15.
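As an added remark (not part of the original text), the bounds on γ(¬X → ¬A) quoted in the comparison of the two measures above follow from a one-line derivation:

    γ(¬X → ¬A) = P(¬A | ¬X) / P(¬A) ≤ 1 / P(¬A) = 1 / (1 − P(A)),

so P(A) < 0.5 gives γ(¬X → ¬A) < 2, and P(A) < 0.10 gives γ(¬X → ¬A) < 1/0.9 < 1.12.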

7. Conclusions

In this paper, we have tackled the problem of searching for non-redundant statistical dependency rules efficiently from large and dense data sets. We have concentrated on dependency rules of the form X → A, where all attributes are positive and the dependency between the attribute set X and attribute A is also positive. As goodness measures, we have used the z-score and the χ2-measure, but the search algorithms work with many other statistical measures as well. The only requirement for the measure M is that an upper bound can be given for M((XQ) \ {A} → A), Q ⊆ R \ X, when only the frequencies m(X) and m(A) are known.

Pruning redundant rules is especially crucial for the correct interpretation of statistical dependency rules. However, it is even more important for efficient search. In this paper, we have considered two definitions of redundancy, the classical one and a stricter definition. In practice, both definitions produce nearly identical results. We have introduced a new search algorithm for the classically non-redundant rules and also generalized our previous algorithms for the strictly non-redundant rules. In both algorithms, we apply a new important pruning principle, called Lapis Philosophorum. It enables efficient pruning in cases where the common branch-and-bound technique is prohibitive.

According to our experiments, the search methods restrict the search space quite efficiently, but the space requirement can still be a problem. Another practical problem is the behaviour of the z-score when the frequencies become too low. In future research, our first aim is to design a more compact data structure for the enumeration tree. The second plan is to develop methods for using exact p-values (the binomial probability and Fisher's exact test) to handle the search with low frequencies. A third interesting idea is to apply the concept of productivity [25] instead of redundancy and to search only for rules which improve the measure value significantly compared to more general rules.

References

[1] Aggarwal, C., Yu, P.: A New Framework For Itemset Generation, Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), ACM Press, 1998.
[2] Agrawal, R., Imielinski, T., Swami, A.: Mining Association Rules between Sets of Items in Large Databases, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, ACM Press, 1993.
[3] Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), Morgan Kaufmann, 1994.
[4] Antonie, M.-L., Zaïane, O. R.: Mining Positive and Negative Association Rules: An Approach for Confined Rules, Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'04), Springer-Verlag, 2004.
[5] Becquet, C., Blachon, S., Jeudy, B., Boulicaut, J.-F., Gandrillon, O.: Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data, Genome Biology, 3, 2002, 1–16.
[6] Blanchard, J., Guillet, F., Gras, R., Briand, H.: Using Information-Theoretic Measures to Assess Association Rule Interestingness, Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE Computer Society, 2005.
[7] Borgelt, C., Kruse, R.: Induction of Association Rules: Apriori Implementation, Proceedings of the 15th Conference on Computational Statistics (COMPSTAT 2002), Physica Verlag, Heidelberg, Germany, 2002.
[8] Dale, M., Blundon, D., MacIsaac, D., Thomas, A.: Multiple species effects and spatial autocorrelation in detecting species associations, Journal of Vegetation Science, 2, 1991, 635–642.
[9] Frank, A., Asuncion, A.: UCI Machine Learning Repository, 2010, http://archive.ics.uci.edu/ml.
[10] Hämäläinen, W.: Lift-based search for significant dependencies in dense data sets, Proceedings of the Workshop on Statistical and Relational Learning in Bioinformatics (StReBio'09), in the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'09), ACM Press, 2009.
[11] Hämäläinen, W.: StatApriori: an efficient algorithm for searching statistically significant association rules, Knowledge and Information Systems: An International Journal (KAIS), 23(3), June 2010, 373–399.
[12] Hämäläinen, W., Nykänen, M.: Efficient discovery of statistically significant association rules, Proceedings of the 8th IEEE International Conference on Data Mining (ICDM'08), 2008.
[13] Koh, Y., Pears, R.: Efficiently Finding Negative Association Rules Without Support Threshold, Advances in Artificial Intelligence, Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI 2007), 4830, Springer, 2007.
[14] Kwiatkowski, D.: How Malaria Has Affected the Human Genome and What Human Genetics Can Teach Us about Malaria, The American Journal of Human Genetics, 77, 2005, 171–192.
[15] Li, J.: On Optimal Rule Discovery, IEEE Transactions on Knowledge and Data Engineering, 18(4), 2006, 460–471.
[16] Liu, F., Wang, W., Zhang, M., Zheng, J., Wang, Z., Zhang, S., Yang, W., An, S.: Species association in tropical montane rain forest at two successional stages in Diaoluo Mountain, Hainan, Frontiers of Forestry in China, 3, 2008, 308–314.
[17] Mannila, H., Toivonen, H., Verkamo, A.: Efficient algorithms for discovering association rules, Papers from the AAAI Workshop on Knowledge Discovery in Databases (KDD'94), AAAI Press, 1994.
[18] Mei, H., Cuccaro, M., Martin, E.: Multifactor Dimensionality Reduction-Phenomics: A Novel Method to Capture Genetic Heterogeneity with Use of Phenotypic Variables, American Journal of Human Genetics, 81, 2007, 1251–1261.
[19] Moore, J., Asselbergs, F., Williams, S. M.: Bioinformatics challenges for genome-wide association studies, Bioinformatics, 26(4), 2010, 445–455.
[20] Morishita, S., Nakaya, A.: Parallel branch-and-bound graph search for correlated association rules, Large-Scale Parallel Data Mining, Revised Papers from the Workshop on Large-Scale Parallel KDD Systems, in the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'00), 1759, Springer-Verlag, 2000.
[21] Morishita, S., Sese, J.: Transversing itemset lattices with statistical metric pruning, Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'00), ACM Press, 2000.
[22] Nijssen, S., Guns, T., Raedt, L. D.: Correlated itemset mining in ROC space: a constraint programming approach, Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'09), ACM Press, 2009.
[23] Nijssen, S., Kok, J.: Multi-class Correlated Pattern Mining, Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases, 3933, Springer, 2006.
[24] Webb, G.: Discovering significant rules, Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), ACM Press, 2006.
[25] Webb, G.: Discovering Significant Patterns, Machine Learning, 68(1), 2007, 1–33.
[26] Webb, G., Zhang, S.: K-Optimal Rule Discovery, Data Mining and Knowledge Discovery, 10(1), 2005, 39–79.

Appendix

When the number of attributes k is odd, the number of all possible dependencies is

    \sum_{i=2}^{k} \binom{k}{i} 2^{i} \sum_{j=1}^{\lfloor i/2 \rfloor} \binom{i}{j}.

The first sum lists all sets of 2–k attributes and all their possible value combinations. The second sum expresses in how many ways we can divide i attributes into two parts. With some algebraic manipulation, the equation becomes 5^k/2 − 3^k + 1/2. (For even values of k, the derivation is more complex.)
