Department of Computer Science Series of Publications A Report A-2010-2

Efficient search for statistically significant dependency rules in binary data

Wilhelmiina Hämäläinen

To be presented, with the permission of the Faculty of Science of the University of Helsinki, for public criticism in Auditorium CK112, Exactum, on October 1st, 2010, at 12 o’clock noon.

University of Helsinki Finland

Contact information

Postal address:
Department of Computer Science
P.O. Box 68 (Gustaf Hällströmin katu 2b)
FI-00014 University of Helsinki
Finland

Email address: [email protected] (Internet)
URL: http://www.cs.Helsinki.FI/
Telephone: +358 9 1911
Telefax: +358 9 191 51120

Copyright © 2010 Wilhelmiina Hämäläinen
ISSN 1238-8645
ISBN 978-952-10-6447-0 (paperback)
ISBN 978-952-10-6448-7 (PDF)
Computing Reviews (1998) Classification: H.2.8, G.1.6, G.2.1, G.3, I.2.6
Helsinki 2010
Helsinki University Print

Efficient search for statistically significant dependency rules in binary data

Wilhelmiina Hämäläinen
Department of Computer Science
P.O. Box 68, FI-00014 University of Helsinki, Finland
[email protected]
http://www.cs.Helsinki.FI/Wilhelmiina.Hamalainen/

PhD Thesis, Series of Publications A, Report A-2010-2
Helsinki, September 2010, 163 pages
ISSN 1238-8645
ISBN 978-952-10-6447-0 (paperback)
ISBN 978-952-10-6448-7 (PDF)

Abstract

Analyzing statistical dependencies is a fundamental problem in all empirical science. Dependencies help us understand causes and effects, create new scientific theories, and find cures for problems. Nowadays, large amounts of data are available, but efficient computational tools for analyzing the data are missing. In this research, we rise to the challenge and develop efficient algorithms for a commonly occurring search problem – searching for the statistically most significant dependency rules in binary data.

We consider dependency rules of the form X → A or X → ¬A, where X is a set of positive-valued attributes and A is a single attribute. Such rules describe which factors either increase or decrease the probability of the consequence A. A classical example is genetic and environmental factors, which can either cause or prevent a disease.

The emphasis in this research is that the discovered dependencies should be genuine – i.e. they should also hold in future data. This is an important distinction from traditional association rules, which – in spite of their name and a similar appearance to dependency rules – do not necessarily represent statistical dependencies at all or represent only spurious connections, which occur by chance. Therefore, the principal objective is to search for the rules with statistical significance measures.

Another important objective is to search for only non-redundant rules, which express the real causes of the dependence, without any occasional extra factors. The extra factors do not add any new information on the dependence, but can only blur it and make it less accurate in future data.

The problem is computationally very demanding, because the number of all possible rules increases exponentially with the number of attributes. In addition, neither statistical dependence nor statistical significance is a monotonic property, which means that the traditional pruning techniques do not work. As a solution, we first derive the mathematical basis for pruning the search space with any well-behaving statistical significance measures. The mathematical theory is complemented by a new algorithmic invention, which enables an efficient search without any heuristic restrictions. The resulting algorithm can be used to search for both positive and negative dependencies with any commonly used statistical measures, like Fisher's exact test, the χ2-measure, mutual information, and z-scores. According to our experiments, the algorithm is well-scalable, especially with Fisher's exact test. It can easily handle even the densest data sets with 10000–20000 attributes. Still, the results are globally optimal, which is a remarkable improvement over the existing solutions. In practice, this means that the user does not have to worry whether the dependencies hold in future data or whether the data still contains better, but undiscovered, dependencies.

Computing Reviews (1998) Categories and Subject Descriptors:
H.2.8 Database Applications — data mining
G.1.6 Optimization — global optimization
G.2.1 Combinatorics — Combinatorial algorithms
G.3 Probability and Statistics — contingency table analysis, nonparametric statistics
I.2.6 Learning — Knowledge acquisition

General Terms: algorithms, theory, experimentation

Additional Key Words and Phrases: dependency rule, statistical significance, rule discovery

Acknowledgements

I am very grateful to Professor Matti Nykänen for taking the invidious position as my supervisor. The task was not easy, because I had selected a challenging mission myself and he could only watch when I walked my own unknown paths. Still, he succeeded better than any imaginable supervisor could do in the most important task of the supervisor – the psychological support. Without him, I would have given up the whole work after every setback. During my last year in the Helsinki Graduate School in Computer Science and Engineering (Hecse), I have also been privileged to enjoy the luxury of two extra supporters, Professor Jyrki Kivinen and Jouni Seppänen, Ph.D. Both of them have done excellent work in mentally supporting me and answering my questions. I would also like to thank my pre-examiners, Professor Geoff Webb and Siegfried Nijssen, Ph.D., as well as Jiuyong Li, Ph.D. and Professor Hannu Toivonen, for their useful comments and fruitful discussions. Marina Kurtén, M.A. has done me a valuable favour by helping to improve the language of my thesis.

I thank my half-feline family, Kimmo and Sofia, for their love and patience (and especially for doing all the housework without too much whining). They believed in me and what I was doing, unquestioningly, or at least they managed to pretend it well. My parents Terttu and Yrjö have always encouraged me to follow my own paths and they have also offered good provisions for that.

This research has been supported with personal grants from the Finnish Concordia Fund (Suomalainen Konkordia-Liitto) and the Emil Aaltonen Fund (Emil Aaltosen Säätiö). Conference travels have been supported by grants from the following institutions: the Department of Computer Science at the University of Helsinki, the Department of Computer Science at the University of Kuopio, the Helsinki Graduate School in Computer Science and Engineering (Hecse), and the IEEE International Conference on Data Mining (ICDM). All their support is gratefully acknowledged.


To the memory of my wise Grandmother

The stone that the builders rejected has become the cornerstone. Matt. 21:42


Contents

List of Figures
List of Tables
List of Algorithms
List of Notations

1 Introduction
  1.1 Dependency rules
  1.2 The curse of redundancy
  1.3 The curse of dimensionality
  1.4 Results and contribution
  1.5 Organization

2 Statistically significant dependency rules
  2.1 Statistical dependency rules
  2.2 Statistical significance testing
  2.3 Measures for statistical significance
      2.3.1 Variable-based measures
      2.3.2 Value-based measures
  2.4 Redundancy and improvement

3 Pruning insignificant and redundant rules
  3.1 Basic concepts
  3.2 Upper and lower bounds for well-behaving measures
      3.2.1 Well-behaving measures
      3.2.2 Possible frequency values
      3.2.3 Bounds for well-behaving measures
  3.3 Lower bounds for Fisher's pF
  3.4 Pruning redundant rules

4 Search algorithms for non-redundant dependency rules
  4.1 Basic branch-and-bound strategy
  4.2 The Lapis Philosophorum principle
  4.3 Algorithm
  4.4 Complexity

5 Experiments
  5.1 Test setting
      5.1.1 Data sets
      5.1.2 Evaluating results
      5.1.3 Search methods
      5.1.4 Test environment
  5.2 Results of the quality evaluation
      5.2.1 Mushroom
      5.2.2 Chess
      5.2.3 T10I4D100K
      5.2.4 T40I10D100K
      5.2.5 Accidents
      5.2.6 Pumsb
      5.2.7 Retail
      5.2.8 Summary
  5.3 Efficiency evaluation

6 Conclusions

A Useful mathematical results

B Auxiliary results
  B.1 Numbers of all possible rules
  B.2 Problems of closed and free sets
      B.2.1 Generating dependency rules from closed or free sets
      B.2.2 The problem of closed classification rules
  B.3 Proofs for good behaviour
  B.4 Non-convexity and non-concavity of the z-score

C Implementation details
  C.1 Implementing the enumeration tree
  C.2 Efficient frequency counting
  C.3 Efficient implementation of Fisher's exact test
      C.3.1 Calculating Fisher's pF
      C.3.2 Upper bounds for pF
      C.3.3 Evaluation

References

Index

List of Figures

1.1 Actually and apparently significant and non-redundant rules
1.2 Rules discovered by search algorithms
2.1 An example of p-values in different variable-based models
2.2 An example of p-values in different value-based models
3.1 Two-dimensional space of absolute frequencies NX and NXA
3.2 Point (m(X), m(XA = a)) corresponding to rule X → A = a
3.3 Possible frequency values for a positive dependency
3.4 Possible frequency values for a negative dependency
3.5 Axioms (ii) and (iii) for well-behaving measures
3.6 Axiom (iv) for well-behaving measures
3.7 Proof idea (Theorem 3.8)
4.1 A complete enumeration tree of type 1
4.2 A complete enumeration tree of type 2
4.3 Lapis Philosophorum principle
4.4 An example of the worst case tree development
C.1 Exact pF and three upper bounds as functions of m(XA)
C.2 A magnified area from the previous figure

List of Tables

2.1 A contingency table
2.2 Comparison of p-values and asymptotic measures for example rules X → A and Y → A
3.1 Upper bound ub1 = f(m(A = a), m(A = a), m(A = a), n) for common measures
3.2 Upper bound ub2 = f(m(X), m(X), m(A = a), n) for common measures, when m(X) < m(A = a)
3.3 Upper bound ub3 = f(m(XA = a), m(XA = a), m(A = a), n) for common measures
3.4 Lower bounds lb1, lb2, and lb3 for Fisher's pF
5.1 Description of data sets
5.2 Results of cross validation in Mushroom (average counts)
5.3 Results of cross validation in Mushroom (average statistics)
5.4 Results of cross validation in Chess (average counts)
5.5 Results of cross validation in Chess (average statistics)
5.6 Results of cross validation in T10I4D100K (average counts)
5.7 Results of cross validation in T10I4D100K (average statistics)
5.8 Results of cross validation in T40I10D100K (average counts)
5.9 Results of cross validation in T40I10D100K (average statistics)
5.10 Results of cross validation in Accidents (average counts)
5.11 Results of cross validation in Accidents (average statistics)
5.12 Results of cross validation in Pumsb (average counts)
5.13 Results of cross validation in Pumsb (average statistics)
5.14 Results of cross validation in Retail (average counts)
5.15 Results of cross validation in Retail (average statistics)
5.16 Summarized results of the quality evaluation
5.17 Parameters of the efficiency comparison
5.18 Efficiency comparison of Kingfisher with pF
5.19 Efficiency comparison of Kingfisher with χ2 and Chitwo with z
B.1 An example distribution, where free sets contain no statistical dependencies
C.1 Comparison of the exact pF value, three upper bounds, and the χ2-based approximation, when n = 1000
C.2 Comparison of the exact pF value, three upper bounds, and the χ2-based approximation, when n = 10000

List of Algorithms

1 BreadthFirst
2 DepthFirst
3 dfsearch(v)
4 Kingfisher(R, r, minM, K)
5 check1sets(R, r, minM, minfr)
6 bfs(st, l, len)
7 checknode(vX)
8 Auxiliary functions

List of Notations

N = Z+ ∪ {0}                                  natural numbers
A, B, C, . . .                                binary attributes
a, b, c, . . . ∈ {0, 1}                       attribute values
X, Y, Z                                       attribute sets
R, S, T, U                                    rule sets
R = {A1, . . . , Ak}                          set of all attributes
|R| = k                                       number of attributes
Dom(R) = {0, 1}^k                             attribute space
Dom(X) = {0, 1}^l                             domain of X, |X| = l
x = (a1, . . . , al)                          vector of attribute values
A1 = a1, A2 = a2                              value assignment, where A1 = a1 and A2 = a2
(X = x) = (A1 = a1), . . . , (Al = al)        event, X = {A1, . . . , Al}
X = (A1 = 1), . . . , (Al = 1)                shorthand notation for a positive-valued event
t = (id, (A1 = a1), . . . , (Ak = ak))        row (transaction), id ∈ N
r = {t1, . . . , tn}                          data set
|r| = n                                       number of rows in r
m(X = x)                                      absolute frequency of an event, the number of rows where X = x
fr(X = x) = P(X = x) = m(X = x)/|r|           relative frequency of event X = x in set r
Pr(X = x)                                     real probability of event X = x
cf(X = x → A = a) = P(A = a | X = x)          confidence of rule X = x → A = a
γ(X = x, A = a) = P(X = x, A = a)/(P(X = x)P(A = a))      lift of the rule
δ(X = x, A = a) = P(X = x, A = a) − P(X = x)P(A = a)      leverage of the rule
NX, NA, NXA, N                                random variables in N

Chapter 1

Introduction

Data miners are often more interested in understandability than accuracy or predictability per se.
– C. Glymour, D. Madigan et al.

The core of all empirical sciences is to infer general laws and regularities from individual observations. Often the first step is to analyze statistical dependencies among different factors and then try to understand the underlying causal relationships. Computers have made systematic search possible, but still the task is computationally very demanding. In this thesis, we develop new efficient search algorithms for certain kinds of statistical dependencies. The objective is to find the statistically most significant, non-redundant dependency rules in binary data, using only statistical significance measures for pruning and ranking. In this chapter, we introduce the main problem, summarize the previous solutions, explicate the new contribution, and describe the organization of the thesis.

1.1 Dependency rules

The concept of statistical dependence is based on statistical independence – the idea that two factors or groups of factors do not affect the occurrence of each other. If the real probabilities Pr of two events, A = a and B = b, were known, the definition would be simple. Absolute independence states that Pr(A = a, B = b) = Pr(A = a)Pr(B = b). For example – according to current knowledge – drinking coffee does not affect the probability of getting cancer. Similarly, the events would be considered statistically dependent, if Pr(A = a, B = b) ≠ Pr(A = a)Pr(B = b). The larger the deviation
of Pr (A = a, B = b) and Pr (A = a)Pr (B = b) is, the stronger is the dependency. If Pr (A = a, B = b) < Pr (A = a)Pr (B = b), the dependency is negative (i.e. B = b is less likely to occur, if A = a had occurred than if A = a had not occurred, and vice versa), and if Pr (A = a, B = b) > Pr (A = a)Pr (B = b), the dependency is positive (i.e. B = b is more likely to occur, if A = a had occurred than if A = a had not occurred, and vice versa). For example, (according to current knowledge) smoking increases the probability of getting lung cancer, but it (as well as drinking coffee) decreases the probability of getting Alzheimer’s disease. The problem is that empirical science is always based on finite sets of observations and the real probabilities Pr are not known. Instead, we have to use (typically, maximum likelihood) estimates P based on the observed relative frequencies in the data. This leads to inaccuracies, especially in small data sets. Therefore, it is quite possible to observe spurious dependencies, where P (A = a, B = b) deviates substantially from P (A = a)P (B = b), even if A = a and B = b were actually independent. For example, the earliest small-scale studies indicated that pipe smoking would decrease the risk of lung cancer compared to non-smokers, but according to current research, pipe smoking has no effect or may, in fact, raise the risk of lung cancer. To solve this problem, statisticians have developed several measures of significance to prune out spurious dependencies and find the most genuine dependencies, which are likely to hold also in future data. The general idea in all statistical significance measures is to estimate the probability that the observed or a more extreme deviation between P (A = a, B = b) and P (A = a)P (B = b) had occurred by chance in the given data set, if A = a and B = b were actually independent. Given a statistical significance measure, M , we can now search only the most significant (best) dependencies or all dependencies, whose M value is inside certain limits. For example, in medical science, a focal problem is to find statistically significant dependencies between genes or gene groups and diseases. The discovered dependencies can be expressed as dependency rules of the form A1 = a1 , . . . , Al = al → D = d, where Ai ’s are gene alleles, which either occur (ai = 1) or do not occur (ai = 0) in the gene sequence, and D is a disorder, which was either observed (d = 1) or unobserved (d = 0). The rule does not yet say that a certain combination of alleles causes or prevents the disease, but if the rule was statistically significant, it is likely to hold also in future data. Even if the causal mechanisms were never discovered (and they seldom are), these dependency rules help to target further research and identify potential risk patients. In this research, we concentrate on a certain, commonly occurring type

of dependency rules among binary attributes. The rule can be of the form X → A = a, where the antecedent contains only true-valued attributes X = A1, . . . , Al (i.e. Ai = 1 for all i) and the consequence is a single attribute, which can be either true (a = 1) or false (a = 0). Since a has only two values, the resulting rules can be expressed simply as X → A and X → ¬A. Because a negative dependency between X and A is the same as a positive dependency between X and ¬A, it is enough to search for only positive dependencies. This type of dependency rule is sufficient in many domains. For example, in ecology, an interesting problem is to find dependencies between the occurrences of plant, fungus, or animal species. Dependency rules reveal which species groups indicate the presence or absence of other species. This knowledge on ecosystems is important when, for example, one tries to locate areas for protection, understand how to prevent erosion, or restore vegetation on already damaged sites. In addition to species information, other factors can be included by binarizing the attributes. For example, soil moistness levels can be expressed by three binary attributes corresponding to dry, moist, and wet soil; the acidity of soil can be expressed by binary attributes corresponding to certain pH levels, and so on. In gene data, one can simply create a new attribute for each allele. Environmental factors like drugs used, the subject's life style, or habits can be included with a simple binarization.
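To make the binarization step concrete, the following minimal Python sketch shows one way to map mixed-type field observations to the kind of binary attributes described above. It is illustrative only: the attribute names and the pH threshold are invented for this example, not taken from the thesis or from any of its data sets.

```python
# Illustrative sketch: mapping mixed-type observations to binary attributes.
# The attribute names and the pH threshold are invented for this example.
records = [
    {"pine": 1, "moss": 1, "soil_moisture": "dry",   "soil_pH": 4.2},
    {"pine": 0, "moss": 1, "soil_moisture": "moist", "soil_pH": 6.8},
    {"pine": 1, "moss": 0, "soil_moisture": "wet",   "soil_pH": 5.1},
]

def binarize(rec):
    """Return the set of true-valued binary attributes for one observation."""
    attrs = {name for name in ("pine", "moss") if rec[name] == 1}
    attrs.add("soil_" + rec["soil_moisture"])   # dry/moist/wet -> three binary attributes
    if rec["soil_pH"] < 5.5:                    # acidity expressed as a binary attribute
        attrs.add("acidic_soil")
    return attrs

rows = [binarize(r) for r in records]
print(rows)   # e.g. [{'pine', 'moss', 'soil_dry', 'acidic_soil'}, ...]
```

After this step, every row is simply a set of true-valued attributes, which is the data representation assumed throughout the thesis.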

1.2 The curse of redundancy

For the correct interpretation of dependency rules it is important that the rules are non-redundant. This means that the condition contains only those attributes, which contribute positively to the dependency. Generally, adding new attributes to the antecedent of a dependency rule, can 1) have no effect, 2) weaken the dependency, or 3) strengthen the dependency. When the task is to search for only positive dependencies, the first two cases can only add redundancy. The third case is more difficult to judge, because now the extra attributes make the dependency stronger, but at the same time the rule usually becomes less frequent. In an extreme case, the rule is so rare that it has no statistical validity. Therefore, we need a statistical significance measure to decide whether the new attributes made the rule better or worse. Example 1.1 A common task in medical science is to search factors which are either positively or negatively dependent with a certain disease. If all

possible dependency rules are generated, a large proportion of them is likely to be redundant. Let us now consider all three types of redundant rules, given a non-redundant, significant rule X → ¬A, where X stands for a lowfat diet and physical exercise and A for liver cirrhosis. The interpretation of the rule is that a low-fat diet and physical exercise are likely to prevent liver cirrhosis. We assume that P (¬A|X) < 1, which means that in principle more specific rules could express a stronger dependency. Now it is possible to find several rules of the form XB → ¬A, which express equally strong dependency to X → ¬A, but where B has no effect of A. For example, B could stand for listening Schubert, reading Russian classics, or playing Mahjong. Since P (¬A|XB) = P (¬A|X), also P (¬A|X¬B) = P (¬A|X), P (A|XB) = P (A|X), and (P A|X¬B) = P (A|X). In the worst case, where X → ¬A was the only dependency in data, we could generate an exponential number of its redundant specializations, none of which would produce any new information. In fact, the specializations could do just the opposite, if the doctor assumed that Dostoyevsky is needed to prevent liver cirrhosis, in addition to healthy diet and exercise. An even more serious misinterpretation could occur, if B actually weakened the dependency. For example, if B means regular alcohol consumption, there is a well-known positive dependency B → A (alcohol causes liver cirrhosis). Now B weakens rule X → ¬A, but XB → ¬A can still express a positive dependency. However, it would be quite fatal to conclude that a low-fat diet, physical exercise, and regular alcohol consumption help to prevent the cirrhosis. In fact, if we analyzed all rules (including rules where the antecedent contains any number of negations), we could find that P (¬A|X¬B) > P (¬A|X) and P (A|¬XB) > P (A|B), i.e. that avoiding alcohol strengthens the good effect of the diet and exercises, while the diet and exercises can weaken the bad effect of alcohol. However, as we allow only restricted occurrences of negations, a misinterpretation could easily occur, unless redundant rules were pruned. An example of the third kind of redundancy could be rule XB → ¬A, where B means that the patient goes in for Fengshui. If the data set contains only a few patients who go in for Fengshui, then it can easily happen that P (¬A|XB) = 1. The rule is the strongest possible and we could assume that Fengshui together with a healthy diet and exercises can totally prevent the cirrhosis. However, the dependency is so rare that it could just be due to chance. In addition, there are several patients who follow a healthy diet and go in for sports, but not for Fengshui. Therefore, the counterpart of rule XB → ¬A, ¬(XB) → A, is probably weaker than

¬X → A. This time, the redundancy is not discovered by comparing the confidences P(¬A|X) and P(¬A|XB), but instead we should compare the statistical significance of the rules, before drawing any conclusions on Fengshui. In this thesis, we generalize the classical definition of redundancy (e.g. [1]), and call a rule X → A = a redundant, if there exists a more general but at least equally good rule Y → A = a, where Y ⊊ X. The goodness can be measured by any statistical significance function. Typically, all statistical significance measures require [64] that the more specific (and less frequent) rule X → A = a should express a stronger dependency between X and A = a, ¬X and A ≠ a, or both, to achieve a better value than the more general rule. However, the notion of redundancy does not yet guarantee that the improvement in the strength of the rule is statistically significant. Therefore, some authors (e.g. [75]) have suggested that instead of redundancy we should test the productivity of the rule with respect to more general rules. Rule X → A = a is called productive with respect to Y → A = a, Y ⊊ X, if P(A = a|X) > P(A = a|Y). The improvement is considered significant if the probability that it had occurred by chance is less than some pre-defined threshold maxp. The problem of this approach is that it compares only P(A = a|X) and P(A = a|Y), but not their counterparts P(A ≠ a|¬X) and P(A ≠ a|¬Y). Therefore, it is possible that a rare but strong rule passes the test, even if it is less significant than its generalizations. Thus, significant productivity alone does not guarantee non-redundancy. This problem will be further analyzed in Section 2.4.
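The following sketch is a naive illustration of the productivity condition just defined: a rule X → A is compared against its immediate generalizations Y → A with Y = X \ {B}. The data and attribute names are invented, and no significance test is applied here; this is not the pruning machinery developed later in the thesis, only a restatement of the definition in code.

```python
# Illustrative only: productivity of a rule w.r.t. its immediate generalizations.
# Each row is the set of true-valued attributes on that row (toy data).
rows = [
    {"X1", "X2", "A"}, {"X1", "X2", "A"}, {"X1", "A"}, {"X2"},
    {"X1", "X2", "B", "A"}, {"X1", "B"}, {"A"}, set(),
]

def confidence(antecedent, consequent, rows):
    """Estimate P(consequent | antecedent) from the rows."""
    covered = [r for r in rows if antecedent <= r]
    if not covered:
        return 0.0
    return sum(consequent in r for r in covered) / len(covered)

def is_productive(antecedent, consequent, rows):
    """True if every immediate generalization (one attribute dropped)
    has strictly lower confidence than the full rule."""
    full = confidence(antecedent, consequent, rows)
    for dropped in antecedent:
        if confidence(antecedent - {dropped}, consequent, rows) >= full:
            return False
    return True

print(is_productive(frozenset({"X1", "X2"}), "A", rows))   # True for this toy data
```

As the text above points out, passing this check alone does not guarantee non-redundancy, because the counterpart confidences P(A ≠ a|¬X) and P(A ≠ a|¬Y) are not examined at all.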

1.3 The curse of dimensionality

The search problem for the most significant, non-redundant dependency rules can occur in two forms: In the enumeration problem, the task is to search for all non-redundant rules whose goodness value M does not exceed some pre-defined threshold α. In the optimization problem, the task is to search for the K best non-redundant dependency rules, given the desired number of rules K. Typically, the enumeration problem produces a larger number of rules, but with suitable parameter settings the results are identical. In the following, we will concentrate on the optimization problem, but – if not otherwise stated – the results are applicable to the enumeration problem, as well. Searching for the best non-redundant dependency rules is computationally a highly demanding problem, and no polynomial time algorithms are

known. This is quite expected, because even the simpler problem – searching for the best classification rule X → C with a fixed consequence C – is known to be NP-hard with common goodness measures [57]. Given a set of attributes, R, the whole search space consists of all attribute combinations in P(R). For each attribute set X ∈ P(R), we can generate |X| rules of the form X \ {A} → A, where |X| is the number of attributes in X. For each X, there is an equal number of negative rules X \ {A} → ¬A, but this does not affect the number of all possible rules, because X → A and X → ¬A are mutually exclusive. Therefore, the total number of possible rules is Σ_{i=2..k} i·(k choose i) = k·2^(k−1) − k (Equation (A.5)), where k = |R|. For example, if the number of attributes is 20, there are already over 10 million possible dependency rules of the desired form. In many real-world data sets, the number of attributes can be thousands, and even if most of the attribute combinations do not occur in the data, all possible rules cannot be checked.

This so-called curse of dimensionality is inherent in all rule discovery problems, but statistical dependency rules have an extra difficulty – the lack of monotonicity. Statistical dependence is not a monotonic property, because X → A = a can express independence, even if its generalization Y → A = a, Y ⊊ X, expressed dependence. It is also possible that X → A = a expresses negative (positive) dependence, even if its generalization Y → A = a expressed positive (negative) dependence. On the other hand, statistical dependence is not an anti-monotonic property either, because Y → A = a can express independence, even if its specialization X → A = a expressed dependence. In practice this means that the lack of simple dependency rules, like A → B, does not guarantee that there would not be significant dependencies in their specializations, like ACDE → B.

This is especially problematic in genetic studies, where one would like to find statistically significant dependencies between alleles and diseases. Traditionally, scientists have tested only two-attribute rules, where a disease depends on only one allele. In this way, they have already found thousands of associations, which explain medical disorders. However, it is known that many serious and common diseases, like heart disease, diabetes, and cancers, are caused by complex mechanisms involving multiple genes and environmental factors. Analyzing these dependencies is a hot topic, but it has been admitted that first we should develop efficient computational methods. As a middle step, scientists have begun to analyze interactions between known or potential factors affecting a single disease. If the attribute set is relatively small (up to a few hundred alleles) and there is enough computer capacity, it is possible to check dependencies
involving even 15 attributes. However, if all interesting genes are included, the number of attributes can be 500 000–1 000 000, and testing even the pair-wise dependencies becomes prohibitive [22, 34, 54].

Since the brute-force approach is infeasible, a natural question is to ask whether the existing data mining methods could be applied to the problem. Unfortunately, the problem is little researched, and no efficient solutions are known. In practice, the existing solutions in data mining fall into two approaches: the first approach is to search for association rules [2] first, and then select the statistically significant dependency rules in the post-processing phase. The second approach is to search for statistical dependency rules directly with some statistical goodness measure.

The first approach is quite ineffective, because the objective of association rules is quite different from that of dependency rules. In spite of its name, an association rule does not express an association in the statistical sense, but only a combination of frequently occurring attributes. The general idea of association rule discovery algorithms is to search for all rules of the form X → A, where the relative frequency of event XA is P(XA) ≥ minfr for some user-defined threshold minfr. This does not yet state that there is any connection between X and A, and therefore it is customary to require that either the confidence of the rule, P(A|X), or the lift of the rule, P(A|X)/P(A), is sufficiently large. The minimal confidence requirement indicates some connection between X and A, but it is not necessarily (positive) statistical dependence. If the confidence P(A|X) is large, then A is likely to occur, if X had occurred, but A may still be more likely, if X had not occurred. The minimal lift requirement guarantees the latter, and it can be used to select all sufficiently strong statistical dependencies among frequent rules.

The problem of this approach is that many dependency rules are missed, because they do not fit the minimum frequency requirement, while a large number of useless rules is generated in vain. In practice, the required minimum frequency threshold has to be relatively large for a feasible search. With large-dimensional data sets the threshold can be 0.60–0.80, which means that the dependency should occur on at least 60–80% of the rows. Such rules cannot have a high lift value and the strongest and most significant dependency rules are missed. In the worst case, all discovered rules can be either independence rules or express insignificant dependencies [77]. In addition, a large proportion of the rules will be redundant, and pruning them afterwards among millions of rules is time consuming. We note that there exist association rule discovery algorithms which are said to search only for "non-redundant rules". However, usually the word "redundancy" is used
in another sense, referring to rules that are not necessary for representing the collection of all frequent rules, or whose frequency can be derived from other rules (e.g. [48]). In Appendix B.2.1, we analyze the most popular of such condensed representations, closed sets [61, 9] and free sets [17], and show why they do not suit for searching for non-redundant statistical dependency rules, even if the minimum frequency threshold could be set sufficiently low. In spite of the above-mentioned statistical problems, association rule algorithms and frequency-based search in general have been popular in the past. The reason is that frequency is an anti-monotonic property, which enables an efficient pruning, namely, if a set Y is too infrequent, i.e. P (Y ) < minfr for some threshold minfr , then all of its supersets X ⊇ Y are also infrequent and can be pruned without checking. Therefore, most of the algorithms which search for statistical dependencies do also use frequencybased pruning with some minimum frequency thresholds, in addition to statistical search criteria. The second approach consists of algorithms, which search for only statistical dependency rules. None of the existing algorithms searches for the statistically most significant dependencies, but it is always possible to test the significance afterwards. Still, the search algorithm may miss some of the most significant dependencies, if they were pruned during the search. Due to the problem complexity, a common solution is to restrict the search space of possible rules heuristically and require some minimum frequency, restrict the rule length, fix the consequent attribute, or use some other criteria to prune “uninteresting” rules. The most common restriction is to search for only classification rules of the form X → C with a fixed consequent attribute C. Morishita and Sese [57] and Nijssen and Kok [60] introduced algorithms for searching for classification rules with the χ2 -measure. Both approaches utilized the convexity of the χ2 -function and determined a minimum frequency threshold, which guarantees a certain minimum χ2 -value. This technique is quite inefficient (dependening on the frequency P (C)), because the resulting minimum frequency thresholds are often too low for efficient pruning. A more efficient solution was developed by Nijssen et al. [59] for searching for the best classification rule with a closed set in the rule antecedent (i.e. rule X → C such that for all Z ) X, P (X) > P (Z)). The goodness measure was not fixed to the χ2 -measure, but other zero-diagonal convex measures like the information gain could be used, as well. The problem of this approach is that the closed sets can contain redundant attributes and the rule can be sub-optimal in the future data (see Appendix B.2.2).
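The anti-monotonicity of frequency described above is exactly what classical frequent-set miners exploit. The following minimal levelwise sketch (a simplified Apriori-style search with toy data, not the algorithm developed in this thesis) shows the pruning step: once a set falls below the frequency threshold, none of its supersets is ever generated.

```python
from itertools import combinations

# Toy data: each row is the set of attributes that are true on that row.
rows = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
min_fr = 0.4   # minimum relative frequency threshold

def frequency(itemset, rows):
    return sum(itemset <= r for r in rows) / len(rows)

items = sorted({a for r in rows for a in r})
frequent = {frozenset({a}) for a in items if frequency({a}, rows) >= min_fr}
level = set(frequent)
while level:
    # Candidate generation: join sets of the current level.  The anti-monotone
    # property allows requiring that every (size-1) subset of a candidate is frequent.
    candidates = {x | y for x in level for y in level if len(x | y) == len(x) + 1}
    candidates = {c for c in candidates
                  if all(frozenset(s) in frequent for s in combinations(c, len(c) - 1))}
    level = {c for c in candidates if frequency(c, rows) >= min_fr}
    frequent |= level

print(sorted(tuple(sorted(s)) for s in frequent))
```

The same trick is not available for statistical dependence itself, which – as discussed below – is neither monotonic nor anti-monotonic.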

The χ2-measure was also used by Liu et al. [47], in addition to minimum frequency thresholds and some other pruning heuristics, to test the goodness of rule X → C and whether the productivity of rule X → C with respect to its immediate generalizations (Y → C, X = YB for some attribute B) was significant. It is customary to test the productivity only against the immediate generalizations, because checking all 2^|X| sub-rules of the form Y → A, Y ⊊ X, is inefficient. Unfortunately, this also means that some rules may appear as productive, even if they are weaker than some more general rules. Li [45] introduced an algorithm for searching for only non-redundant classification rules with different goodness measures, including confidence and lift. No minimum frequency thresholds were used, but instead it was required that P(X|A) was larger than some predefined threshold.

Searching for general dependency rules with any consequent attribute is a more difficult and less studied problem. Xiong et al. [82] presented an algorithm for searching for positive dependencies between just two attributes using an upper bound for Pearson's correlation coefficient as a measure function. Webb's MagnumOpus software [76, 78] is able to search for general dependency rules of the form X → A with different goodness measures, including leverage, P(XA) − P(X)P(A), and lift. In addition, it is possible to test that the rules are significantly productive. In practice, MagnumOpus tests the productivity of rule X → A with respect to its immediate generalizations (Y → A, X = YB for some B) with Fisher's exact test. Otherwise, redundant rules are not pruned and the rules are not necessarily the most significant. The algorithm is quite well scalable, and if only the K best rules are searched for with the leverage, it is possible to perform the search without any minimum frequency requirement. However, we note that the leverage itself favours rules with a frequent consequent, and therefore some significant and productive rules can be missed.

All of the above-mentioned algorithms search only for positive dependencies, but there are a couple of approaches which have also searched for negative dependency rules. Antonie and Zaïane [6] introduced an algorithm for searching for negative rules using a minimum frequency threshold, a minimum confidence threshold, and Pearson's correlation coefficient as a goodness measure. Since Pearson's correlation coefficient for binary attributes is the same as √(χ2/n), where n is the data size, the algorithm should be able to search for dependencies with the χ2-measure, as well. Thiruvady and Webb [71] presented an algorithm for searching for general positive and negative dependency rules of the form (X = x) → (A = a), x, a ∈ {0, 1}, using the leverage as a goodness measure. No
minimum frequency thresholds were needed, but the rule length was restricted. Wu et al. [80] also used the leverage as a goodness measure together with a minimum frequency threshold and a minimum threshold for the certainty factor (P(A = a|X) − P(A = a))/P(A ≠ a), a ∈ {0, 1}. Koh and Pears [40] suggested a heuristic algorithm, where a minimum frequency threshold was derived from Fisher's exact test. Unfortunately, the algorithm was based on the faulty assumption that statistical dependence would be an anti-monotonic property. However, the algorithm should be able to find positive and negative dependency rules between two attributes correctly.

Searching for general dependency rules, where X can contain any number of negations, is an even more complex problem, and no efficient solutions are known. Meo [53] analyzed a related problem and developed a method for searching for attribute sets X, where the variables corresponding to X \ {A} and A are statistically dependent for any A ∈ X, and the dependence is not explained by any simpler dependencies in the immediate subsets Y, X = YB. Since the objective was to find dependencies between variables, and not events, all 2^|X| attribute value combinations had to be checked. The method was applied to select attribute sets containing dependencies from a set of all frequent attribute sets. Due to the obvious complexity, the method is poorly scalable, depending on the data distribution and the minimum frequency threshold, and in practice it may be necessary to restrict the size of X, too. In addition, some researchers (e.g. [18, 39]) have studied another related problem, where statistically significant "correlated sets" are searched for. The idea is that the probability of set X should deviate from its expectation, i.e. P(X) ≠ Π_{i=1..l} P(Ai), where X = A1 · · · Al. While interesting for their own sake, the correlated sets do not help to find statistically significant dependency rules [56].
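As a tiny illustration of the non-monotonicity discussed earlier in this section, the following sketch builds a toy distribution (invented for this example) in which A and B look pairwise independent, yet the specialization AC → B expresses a perfect positive dependency. This is exactly why the lack of simple dependency rules does not rule out significant dependencies among their specializations.

```python
# Toy data illustrating non-monotonicity: B is true exactly when A and C agree.
# Each row lists its true-valued attributes; one row per value combination.
rows = [{"A", "C", "B"}, {"A"}, {"C"}, {"B"}]

def p(event, rows):
    """Relative frequency of the conjunction of attributes in `event`."""
    return sum(event <= r for r in rows) / len(rows)

# A and B alone appear independent: P(AB) = P(A)P(B).
print(p({"A", "B"}, rows), p({"A"}, rows) * p({"B"}, rows))          # 0.25 vs 0.25

# Yet the specialization with C added shows a strong positive dependency:
print(p({"A", "C", "B"}, rows), p({"A", "C"}, rows) * p({"B"}, rows))  # 0.25 vs 0.125
```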

1.4 Results and contribution

Figures 1.1 and 1.2 summarize the research problem. Figure 1.1 (a) shows the universe U of all possible rules on the given set of attributes; set Sr is the collection of the most significant dependency rules, and its subset Rr contains only the non-redundant rules. The set of the most significant rules depends on the selected goodness measure, M . For simplicity, we assume that M is increasing by goodness, i.e. high values of M indicate a good rule. Then, Sr consists of those rules that would gain a measure value M ≥ minM in a sample of size n, if the sample were perfect (i.e. if the estimated probabilities were fully correct, Pr = P ). Similarly, Rr

consists of rules, which would be the most significant, non-redundant rules in a perfect sample of size n. Ideally, we would like to find the rules in collection Rr.

Figure 1.1: (a) Actually significant (Sr) and non-redundant (Rr) dependency rules. (b) Magnification of the dashed rectangle in (a) showing rules which appear as significant (Sd) and non-redundant (Rd) in the data.

The statistical problem is that we are given just a finite (and often noisy) real-world data set, where the observed probabilities P are only inaccurate estimates for the real probabilities Pr . As a result, the measure values M are also inaccurate, and it is possible that some spurious rules appear to be significant while some actually significant dependency rules appear to be insignificant. This is shown in Figure 1.1 (b), which contains all possible rules that occur in the data (have frequency P (XA) ≥ 1). Set Sd is the collection of all dependency rules that appear to be significant in the data, and subset Rd contains all significant rules, which appear to be non-redundant in the data. By tailoring the significance threshold minM , it is possible to decrease either the set of spurious rules Sd \ Sr or the set of missing significant rules Sr \ Sd , but not both. Similarly, stronger requirements for the non-redundancy can decrease the set of spuriously non-redundant rules Rd \ Rr , while weaker requirements can decrease the set of missing, actually non-redundant rules Rr \ Rd . Figure 1.2 demonstrates the search problem and the resulting extra errors. Set T consists of all rules which can be found with existing search

Figure 1.2: Rules discovered by traditional methods (T) and our target rules (Rd). Dashed lines refer to the boundaries of Sr and Rr in Figure 1.1.

methods. Due to problem complexity, all of them use either restricting pruning heuristics (especially the minimum frequency requirement) or inappropriate measure functions, which can catch only a small proportion of significant dependency rules in Sd . This is especially true in largedimensional, dense data sets, where it can happen that none of the most significant rules are discovered. In addition, the traditional methods produce a large number of non-significant or redundant rules. The latter can be pruned out in the post-processing phase, although it is laborious, but there is no way to restore the set of missing rules Sd \ T . In this research, the objective is to develop effective search methods for the set Rd – dependency rules which are the most significant and nonredundant in the data. We do not take a stand on how the related statistical problem should be solved, but clearly the set Rd is a better approximation for the ideal rule set Rr than T . The new search algorithms are developed on such a general level that it is possible to apply different definitions of statistical dependency and different significance measures. We have adopted a general definition of redundancy, which leaves space for extra pruning of non-redundant rules with stricter criteria, like productivity. The main result of this research is a new generic search algorithm, which

is able to perform the search efficiently compared to the existing solutions, without any of their extra requirements for the frequency or complexity of the rule. Since no sub-optimal search heuristics are used, the resulting dependency rules are globally optimal in the given data set. In the algorithm design, the emphasis has been to minimize the time complexity, without any special attention to the space requirement. However, in practice, the implementations are capable of handling all the classical benchmark data sets (including ones with nearly 20,000 attributes) with a normal desktop computer.

In addition to the classical problem of searching for only positive dependency rules, the new algorithm can also search for negative dependencies with certain significance measures (e.g. Fisher's exact test and the χ2-measure). This is an important step forward in some application domains, like gene–disease analysis, where the information on negative dependencies is also valuable.

The development of new efficient pruning properties has also required comprehensive theoretical work. In this thesis, we develop the theory for evaluating the statistical significance of dependency rules in different frameworks. Especially, we formalize the notion of well-behaving goodness measures, uniformly with the classical axioms for any "proper" measures of dependence. The most important new insight is that all well-behaving measures obtain their best values at the same points of the search space. In addition, they share other useful properties, which enable efficient pruning of redundant rules.

Most of the results of this thesis have been introduced in the following papers. The author has been the main contributor in all of them. The papers also contain extra material, which is not included in this thesis.

1. W. Hämäläinen and M. Nykänen. Efficient discovery of statistically significant association rules. In Proceedings of the 8th IEEE International Conference on Data Mining (ICDM 2008), pages 203–212, 2008.

2. W. Hämäläinen. Lift-based search for significant dependencies in dense data sets. In Proceedings of the Workshop on Statistical and Relational Learning in Bioinformatics (StReBio'09), in the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pages 12–16. ACM, 2009.

3. W. Hämäläinen. Statapriori: an efficient algorithm for searching statistically significant association rules. Knowledge and
Information Systems: An International Journal (KAIS), 23(3):373–399, June 2010.

4. W. Hämäläinen. Efficient search methods for statistical dependency rules. Fundamenta Informaticae, 2010. Accepted pending revisions.

5. W. Hämäläinen. Efficient discovery of the top-K optimal dependency rules with Fisher's exact test of significance. Submitted 2010.

6. W. Hämäläinen. General upper bounds for well-behaving goodness measures on dependency rules. Submitted 2010.

7. W. Hämäläinen. New tight approximations for Fisher's exact test. Submitted 2010.

1.5 Organization

The rest of the thesis is organized as follows. In Chapter 2, we introduce the main concepts related to statistical dependence, dependency rules, statistical significance testing in different frameworks, and the redundancy and improvement of dependency rules. In Chapter 3, we formalize the pruning problem and introduce well-behaving goodness measures. The theoretical basis for the search with any well-behaving goodness measure is given there. In Chapter 4, we analyze different strategies to implement the search. We introduce a new efficient pruning property, which can be combined with the basic branch-and-bound search. As a result, we introduce a generic search algorithm, and analyze its time and space complexity. In Chapter 5, we report experiments on classical benchmark data sets. The final conclusions are drawn in Chapter 6. At the end of the thesis are appendices, which complement the main text. Appendix A contains general mathematical results, which are used in the proofs. Appendix B contains auxiliary results (proofs for statements in the main text). Implementation details and their effect on the complexity are described in Appendix C.

Chapter 2

Statistically significant dependency rules

One of the most important problems in the philosophy of natural sciences is – in addition to the well-known one regarding the essence of the concept of probability itself – to make precise the premises which would make it possible to regard any given real events as independent. This question, however, is beyond the scope of this book.
– A.N. Kolmogorov

In spite of their intuitive appeal, the notions of statistical dependence and – especially – statistically significant dependence are ambiguous. In this thesis, we do not try to solve which are the correct definitions or measure functions for statistically significant dependency rules, but instead we try to develop generic search methods consistent with the most common interpretations. In this chapter, we define dependency rules on a general level, and consider two alternative interpretations – variable-based and value-based semantics – of what the dependency means. We recall the general idea of the statistical significance of a dependency and give statistical measures for evaluating the significance. Finally, we discuss the notions of redundancy and improvement.

2.1 Statistical dependency rules

In its general form, a dependency rule on binary data is defined as follows:

Definition 2.1 (Dependency rule) Let R be a set of binary attributes, X ⊊ R and Y ⊆ R \ X sets of binary attributes, and Dom(X) = {0, 1}^l, |X| = l, and Dom(Y) = {0, 1}^m, |Y| = m, their domains. Let Dep(ω1, ω2) be a symmetric relation which defines statistical dependence between events ω1 and ω2; Dep(ω1, ω2) = 1, if ω1 and ω2 are statistically dependent, and Dep(ω1, ω2) = 0, otherwise. For all attribute value combinations x ∈ Dom(X) and y ∈ Dom(Y), rules X = x → Y = y and Y = y → X = x are dependency rules, if Dep(X = x, Y = y) = 1.

First, we note that the direction of the rule is customary, because the relation of statistical dependence is symmetric. Often, the rule is expressed in the order where the confidence of the rule (cf(X = x → Y = y) = P(Y = y|X = x) or cf(Y = y → X = x) = P(X = x|Y = y)) is maximal.

In this thesis, we concentrate on a special case of dependency rules, where 1) the consequence Y = y consists of a single attribute-value combination, A = a, a ∈ {0, 1}, and 2) the antecedent X = x is a conjunction of true-valued attributes, i.e. (X = x) ≡ (A1 = 1, . . . , Al = 1), where X = {A1, . . . , Al}. With these restrictions, the resulting rules can be expressed in a simpler form X → A = a, where a ∈ {0, 1}, or X → A and X → ¬A. We note that with certain relations of dependency, this type of rule also includes rules of the form ¬X → A and ¬X → ¬A, because they are considered equivalent to X → ¬A and X → A, respectively.

These two restrictions on the type of rules are also commonly made in the previous research on association rules, mainly for computational reasons. Allowing several attributes in the consequent has only a small effect on the number of all possible rules, but it complicates the definition of redundancy and also the search for non-redundant rules. In this case, each attribute set Z can be divided into consequence and antecedent parts in 2^|Z| − 2 ways (instead of |Z| ways), and each rule X → Z \ X can be redundant with respect to 2^|Z| − 2^|X| − 2^|Z\X| more general rules (see Theorems B.1 and B.2). Another reason for allowing just a single attribute in the consequence is to simplify the interpretation and application of the discovered dependency rules. Allowing any number of negations in the antecedent part increases the computational complexity of the problem remarkably. Now for each set Z there are 2^|Z| different value combinations, and for each of them we can generate |Z| rules with a single attribute in the consequence. The total number of all possible rules is O(k·3^k), instead of O(k·2^k) (see Theorem B.3).
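These counts are easy to check on a small attribute set. The following sketch (purely illustrative, with an invented four-attribute set) enumerates all rules of the restricted form Z \ {A} → A and compares the total with the closed form k·2^(k−1) − k used in Section 1.3.

```python
from itertools import combinations
from math import comb

R = ["A1", "A2", "A3", "A4"]           # a toy attribute set, k = 4
k = len(R)

# Every attribute set Z with |Z| >= 2 yields |Z| rules of the restricted form
# Z \ {A} -> A (X -> A and X -> not-A are mutually exclusive, so counted once).
rules = [(tuple(sorted(set(Z) - {A})), A)
         for size in range(2, k + 1)
         for Z in combinations(R, size)
         for A in Z]

print(len(rules))                                              # 28
print(k * 2 ** (k - 1) - k)                                    # closed form: 28
print(sum(size * comb(k, size) for size in range(2, k + 1)))   # same sum: 28
```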

Let us now consider what it means that there is a statistical dependency between an attribute set X and an attribute value assignment A = a. There are at least two different interpretations, depending on whether X and A = a are interpreted as events or as random variables. Traditionally, the statistical dependence between events is defined as follows (e.g. [68, 53]):

Definition 2.2 (Independent and dependent events) Let R be as before, X ⊊ R a set of attributes, and A ∈ R \ X a single attribute. For any truth-value assignments a, x ∈ {0, 1}, events X = x and A = a are statistically independent, if P(X = x, A = a) = P(X = x)P(A = a). If the events are not independent, they are dependent. When P(X = x, A = a) > P(X = x)P(A = a), the dependence is positive, and when P(X = x, A = a) < P(X = x)P(A = a), the dependence is negative.

We note that actually the independence condition holds for the real but unknown probabilities Pr(X = x, A = a), Pr(X = x), and Pr(A = a). Therefore, the condition for their frequency-based estimates in some real-world data set is sometimes expressed as P(X = x, A = a) ≈ P(X = x)P(A = a). Because the frequency-based estimates for probabilities are likely to be inaccurate, it is customary to call events dependent only if the dependency is sufficiently strong. The strength of the statistical dependence between X = x and A = a can be measured by lift (also called interest [18], dependence [81], or degree of independence [83]):

γ(X = x, A = a) = P(X = x, A = a) / (P(X = x)P(A = a)).

Another commonly used measure is leverage (also called dependence value [53]) δ(X = x, A = a) = P (X = x, A = a) − P (X = x)P (A = a). The difference between the leverage and lift is that the first one measures the absolute deviation of the observed relative frequency P (X = x, A = a) from its expectation P (X = x)P (A = a), if the events were independent, while the lift measures a relative deviation from the expectation. If the dependency rules are searched by the strength of the dependence (like in [75, 45]), the results can be quite different, depending on which measure is used. However, both measures can also produce spurious rules, which are due to chance, and therefore we will use other goodness measures as search criteria. Statistical dependence between variables is defined as follows (e.g. [68, 53]):


Definition 2.3 (Independent and dependent variables) Let V and W be discrete random variables, whose domains are Dom(V) and Dom(W). V and W are statistically independent, if P(V = v, W = w) = P(V = v)P(W = w) for all values v ∈ Dom(V) and w ∈ Dom(W). Otherwise they are dependent.

Once again, the variables are called dependent only if the overall dependence is sufficiently strong. To estimate the strength of the overall dependence, we need a measure (like the χ2-measure or mutual information MI) which takes into account all dependencies between value combinations v and w.

For dependency rule X → A = a, we have now two alternative interpretations, which are sometimes called the value-based and variable-based semantics [14] of the rule. In the value-based semantics, only the dependence between events X = 1 and A = a is considered, while in the variable-based semantics X and A are considered as random variables (above V and W), whose dependence is measured. In the latter case, the dependencies in all four possible event combinations, XA, X¬A, ¬XA, and ¬X¬A, are evaluated.

Table 2.1: A contingency table with absolute frequencies of XA, X¬A, ¬XA and ¬X¬A. Absolute leverage ∆ = nδ is the deviation from the expectation under independence, where n is the data size and δ is the leverage.

        | A                        | ¬A                        | Σ
  X     | m(XA) = m(X)P(A) + ∆     | m(X¬A) = m(X)P(¬A) − ∆    | m(X)
  ¬X    | m(¬XA) = m(¬X)P(A) − ∆   | m(¬X¬A) = m(¬X)P(¬A) + ∆  | m(¬X)
  Σ     | m(A)                     | m(¬A)                     | n

The two interpretations are closely related, because for binary variables X and A, all value combinations X = x, A = a, x, a ∈ {0, 1}, have the same absolute value of leverage, |δ|, as shown in Table 2.1. Therefore, |δ| can now be used to measure the strength of the dependence also in the variable-based semantics. However, the four events generally have different lift values, and therefore some events may be strongly dependent, while the overall dependency can be considered weak. This can happen, if the expected frequency P(X = x)P(A = a) is small and the dependency between X = x and A = a is strong.
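To make the relations in Table 2.1 concrete, the following minimal Python sketch computes the lift and leverage of all four event combinations directly from absolute frequencies. The function name and the counts are hypothetical and chosen only for illustration; as expected, the four leverages share the same absolute value while the lifts differ.

```python
def lift_and_leverage(n_xy, n_x, n_y, n):
    """Lift and leverage of an event pair from absolute frequencies."""
    p_xy, p_x, p_y = n_xy / n, n_x / n, n_y / n
    return p_xy / (p_x * p_y), p_xy - p_x * p_y

# Hypothetical counts: n = 100, m(X) = 30, m(A) = 50, m(XA) = 30 (so P(A|X) = 1).
n, m_x, m_a, m_xa = 100, 30, 50, 30
print(lift_and_leverage(m_xa, m_x, m_a, n))                          # X,   A
print(lift_and_leverage(m_x - m_xa, m_x, n - m_a, n))                # X,  ¬A
print(lift_and_leverage(m_a - m_xa, n - m_x, m_a, n))                # ¬X,  A
print(lift_and_leverage(n - m_x - m_a + m_xa, n - m_x, n - m_a, n))  # ¬X, ¬A
```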


Example 2.4 In gene–disease data, many alleles or allele combinations (here X) occur quite rarely, but still they can have a strong effect on a certain disorder D. In an extreme case, the occurrence of X always indicates the occurrence of D, but D can also occur with other allele combinations. Now P(D|X) = 1 and the lift γ(X → D) = P(D)^{-1}, which can be quite strong. Similarly, γ(X → ¬D) = 0 indicates a strong negative dependency between X and ¬D. However, the other two events express only a weak dependency and the lift values γ(¬X → D) and γ(¬X → ¬D) are near 1. The leverage, δ(X, D) = P(X)P(¬D), is also small, because P(X) was small. In the value-based semantics, dependency rule X → D would be considered strong, but in the variable-based semantics, it would be considered weak. Since the search algorithms use goodness measures which reflect the strength of the dependency, the rule may or may not be discovered during the search, depending on the interpretation.

The two interpretations of dependency rule X → A = a also lead to different conclusions, when the dependency rule is used for prediction. Both the value-based and the variable-based semantics state that A = a is more likely to occur, if X had occurred, than if X had not occurred. In the variable-based semantics, we can also state that A ≠ a is more likely to occur, if X had not occurred, than if it had occurred. However, in the value-based semantics this dependency can be so weak that no predictions can be made on A = a or A ≠ a, given that X had not occurred.

Finally, we note that there is also a third possible interpretation concerning the dependence between X and A. Attribute set X could also be thought of as a variable, whose values are all possible 2^|X| value combinations x ∈ Dom(X). In data mining, this approach is rarely taken, due to the computational complexity of the search. Another problem is that when |X| is relatively large, many attribute value combinations do not occur in the data at all or so infrequently that the corresponding probabilities cannot be estimated accurately. By contrast, in machine learning, this interpretation is commonly used, when classifiers or Bayesian networks are learned from binary data. However, in these applications, the data sets are typically much smaller than in data mining and also the problem is easier, because the objective is not to check all sufficiently strong dependencies or even the K best dependencies, but only to find (heuristically) one sufficiently good dependency model.

2.2 Statistical significance testing

The main idea of statistical significance testing is to estimate the probability that the observed discovery has occurred by chance. If the probability is very small, we can assume that the discovery is genuine. Otherwise, it is considered spurious and discarded. The probability can be estimated either analytically or empirically. The analytical approach is used in traditional significance testing, while randomization tests estimate the probability empirically.

Traditional significance testing can be further divided into two main classes: the frequentist and Bayesian approaches. The frequentist approach is the most commonly used and best studied (see e.g. [29, Ch. 26] or [46, Ch. 10.1]). The main idea is to estimate the probability of an observed or a rarer phenomenon under some null hypothesis. When the objective is to test the significance of the dependency between events or variables X and A, the null hypothesis H0 is the independence assumption: P(X, A) = P(X)P(A). For this purpose, we define a random variable NXA which gives the absolute frequency of event XA. The task is to calculate the probability p = P(NXA ≥ m(X, A) | H0) that XA occurs at least m(X, A) times in the given data set, if X and A were actually independent.

In the classical (Neyman-Pearsonian) hypothesis testing, the p-value is compared to some pre-defined threshold α. If p ≤ α, the null hypothesis is rejected and the discovery is called significant at level α. Parameter α defines the probability of committing a type I error, i.e. accepting a spurious rule. Another parameter, β, is used to define the probability of committing a type II error, i.e. rejecting a genuine dependency rule as non-significant. The problem is how to decide suitable thresholds, and often only the p-values are reported.

Deciding threshold α is even harder in data mining, where numerous patterns are tested. For example, if we use threshold α = 0.05, then each spurious rule has a 5% chance of passing the significance test. If we test 10 000 rules whose dependencies are actually spurious, we can expect about 500 of them to pass the test. This so-called multiple testing problem is inherent in knowledge discovery, where one often performs an exhaustive search over all possible patterns. As a solution, the more patterns we test, the stricter bounds for the significance we should use. The most famous correction method is the Bonferroni adjustment [66], where the desired significance level α is divided by the number of tests m. In the dependency rule discovery, we can give an upper bound for the number of rules to be tested [75]. An alternative is to control the expected number of errors among the selected rules [10].


In this thesis, we do not try to solve this problem; instead, our objective is to develop search methods for the dependency rules with the best p-values – i.e. the most genuine dependencies.

The idea of Bayesian significance testing is quite similar to the frequentist approach, but now we assign some prior probabilities P(H0) and P(H1) to the null hypothesis H0 and the research hypothesis H1. The conditional probabilities P(NXA ≥ m(X, A) | H0) and P(NXA ≥ m(X, A) | H1) are estimated from the data, and the posterior probabilities of the hypotheses, P(H0 | NXA ≥ m(X, A)) and P(H1 | NXA ≥ m(X, A)), are calculated by Bayes' rule. The results are asymptotically similar (under some assumptions even identical) to the traditional hypothesis testing, although the Bayesian testing is sensitive to the selected prior probabilities [5].

The main problem of traditional significance testing is the underlying assumption that the data is a random or otherwise representative sample from the whole population [69]. Often, the tests are used even if this assumption is violated (e.g. all available data is analyzed), but in this case, the discoveries describe only the available data set and there are no guarantees that they would hold in the whole population [27]. This is especially typical for exploratory data analysis and data mining in general. In addition, there are problems where random sampling is not possible even in principle, because the population is infinite (e.g. the researcher is interested in the dependence between employees' smoking habits and work efficiency in general, including already dead and still unborn people) [25, 7–8].

Another restriction of most traditional significance tests is their parametric assumptions on the underlying data distribution. The problem is that the p-value cannot be estimated, unless one has assumed some form of the distribution under the null hypothesis. Typically, one selects the distribution from some parametric family like the binomial or normal distributions and estimates the missing parameters from the data. In large data sets, the exact form of the distribution is less critical, because all distributions tend to the normal distribution, and also the parameters can be estimated more accurately. However, in dependence testing, the most important factor is the assumed distribution of the frequency counts, whose expectations under independence are nP(X = x)P(A = a), x, a ∈ {0, 1}. If m(X = x) or m(A = a) is small, there are no guarantees that the corresponding random variable would be normally distributed or that the frequency estimate would be accurate.

Both of the above-mentioned problems can be solved by randomization testing (e.g. [25]). The main idea of randomization tests is to estimate the significance of the discovery empirically, by testing how often the discovery occurs in other, randomly generated data sets


which share some properties of the original data set. The most common approach is to fix a permutation scheme, which defines how the data is permuted, and to select a random subset of all possible permutations. If only a single dependency rule X → A is tested, it is enough to generate random distributions dj for attributes Ai ∈ X and A. Usually, it is required that all marginal probabilities P(Ai) and P(A) remain the same as in the original data. For each random distribution dj, a test statistic M, which measures the goodness or strength of the rule, is calculated. Let us now assume that the measure M is increasing by the goodness of the rule (a higher value indicates a better rule). If the original data set produced M-value M0 and m random distributions produced M-values M1, . . . , Mm, the empirical p-value of the rule is

p_{em} = \frac{|\{d_j \mid M_j \ge M_0\}|}{m}.

If the data set is relatively small and X is simple, it is possible to enumerate all possible distributions, where the marginal probabilities hold. This leads to an exact permutation test, which gives an exact p-value. On the other hand, if the data set is large and/or X is more complex, all possibilities cannot be checked, and the empirical p-value is less accurate. In this case, the test is called a random permutation test or an approximate permutation test. The advantage of randomization tests is that the data does not have to be a random sample from a finite population. The reason is that the permutation scheme defines a population, which is randomly sampled, when new data sets are generated. Therefore, the discoveries can be assumed to hold in the population defined by the permutation scheme. In addition, the randomization test approach does not assume any underlying parametric distribution. Still, the classical parametric test statistics can be used to estimate the p-values.[25] The only problem is that it is not always clear, how the data should be permutated. The number of random permutations plays also an important role in the testing. The more random permutations are performed, the more accurate the empirical p-values are, but in practice, extensive permutating can be too time consuming. We also note that the randomization approach does not offer any direct method for the search of significant dependency rules; it can merely be used to validate the results, when good candidates have first been found by other means. The idea of randomization tests can be extended for estimating the overall significance of all mining results. For example, Gionis et al. [30] tested the significance of the number of all frequent sets (given a minimum frequency threshold) and the number of all pair-wise correlations among
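As a concrete illustration, here is a minimal Python sketch of an approximate (random) permutation test for a single rule X → A. The permutation scheme simply shuffles the consequent column, which keeps the marginal frequencies fixed; the leverage is used as the test statistic, and the toy data and function names are purely hypothetical.

```python
import random

def leverage(xs, ys):
    """Leverage of the rule X -> A from two binary columns."""
    n = len(xs)
    p_xa = sum(x * y for x, y in zip(xs, ys)) / n
    return p_xa - (sum(xs) / n) * (sum(ys) / n)

def empirical_p(xs, ys, statistic, m=10000, seed=0):
    """Approximate permutation test: shuffle the A column m times and count how
    often the permuted data reaches the observed value of the test statistic."""
    rng = random.Random(seed)
    observed = statistic(xs, ys)
    hits = 0
    for _ in range(m):
        permuted = ys[:]
        rng.shuffle(permuted)
        if statistic(xs, permuted) >= observed:
            hits += 1
    return hits / m

# Hypothetical toy data with a clear positive dependency between X and A.
x = [1] * 30 + [0] * 70
a = [1] * 25 + [0] * 5 + [1] * 25 + [0] * 45
print(empirical_p(x, a, leverage))
```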


the most frequent attributes (measured by Pearson's correlation coefficient, given a minimum correlation threshold) using randomization tests. In this case, it is necessary to generate complete data sets randomly for testing. The difficulty is to decide what properties of the original data set should be maintained. As a solution, Gionis et al. kept both the column marginals (the P(Ai)s) and the row marginals (the numbers of 1s on each row) fixed, and new data sets were generated by swap randomization [21]. A prerequisite for this method is that the attributes are semantically similar (e.g. occurrence or absence of species) and it is sensible to swap their values. In addition, there are some pathological cases, where no or only a few permutations exist with the given row and column marginals, resulting in a poor p-value, even if the original data set contains a significant pattern [30].

2.3 Measures for statistical significance

Based on the definitions of a dependency rule and statistical dependence, a rule X → A = a can be called a dependency rule only if the dependence between events X = 1 and A = a (in the value-based semantics) or binary variables X and A (in the variable-based semantics) is sufficiently significant. For this evaluation, we need a significance measure M. In statistics, the significance of dependence in the variable-based semantics has been well studied and several measures are available. However, the significance of dependence in the value-based semantics has been only vaguely discussed in the data mining literature. For any situation, the appropriate measure depends on the assumptions made on the origin of the data. Usually, it is thought that the data is a sample from a larger (possibly infinite) population, which follows a certain distribution. Assumptions on the form of the distribution and on how its parameters are estimated lead to different measures. In the following, we will consider the most common assumptions and the related significance measures in both the variable-based and the value-based semantics.

2.3.1 Variable-based measures

In the variable-based semantics, the significance of an observed dependency between X and A is determined by an independence test. The task is to estimate the probability of the observed or a more extreme contingency table (Table 2.1), assuming that X and A were actually independent.


There is no consensus on how the extremeness relation should be defined, but intuitively, contingency table Ti is more extreme than table Tj, if the dependence between X and A is stronger in Ti than in Tj. So, a necessary condition for the extremeness of Ti over Tj is that the leverage in Ti is larger than in Tj. In the following, we will denote the relation "table Ti is equally or more extreme than table Tj" by Ti ⪰ Tj.

The probability of each contingency table Ti depends on the assumed model M. Model M defines the space of all possible contingency tables T_M (under the model assumptions) and the probability P(Ti | M) of each table Ti ∈ T_M. Because the task is to test independence, the assumed model should satisfy the independence assumption Pr(XA) = Pr(X)Pr(A) in some form. For the probabilities P(Ti | M) holds

\sum_{T_i \in T_M} P(T_i \mid M) = 1.

Now the probability of the observed contingency table T0, or of any at least equally extreme contingency table Ti, Ti ⪰ T0, is the desired p-value

p = \sum_{T_i \succeq T_0} P(T_i \mid M).    (2.1)

Classically, the models for independence testing have been divided into three main categories (sampling schemes) [8, 62], which we call the multinomial, double binomial, and hypergeometric models. In the statistics literature (e.g. [8, 72]), the corresponding sampling schemes are called double dichotomy, 2×2 comparative trial, and 2×2 independence trial. In the following, we describe the three models using the urn metaphor. Because there are two binary variables of interest, X and A, we cannot use the basic urn model with white and black balls. Instead, we assume a large (potentially infinite) urn of similar apples, which are either red or green, and each apple is either sweet or bitter. The problem is to decide whether there is a significant dependency between the apple colour and taste, when we have only a finite sample of apples available. For example, let us suppose that we have observed in our sample that red apples are more likely to be sweet than green ones, and we want to test if this observed dependency is just due to chance.

Multinomial model

In the multinomial model, we assume that the real probabilities of sweet red apples, bitter red apples, sweet green apples, and bitter green apples are defined by parameters pXA, pX¬A, p¬XA, and p¬X¬A.


The probability of red apples is pX and of green apples 1 − pX. Similarly, the probability of sweet apples is pA and of bitter apples 1 − pA. According to the independence assumption, pXA = pX pA, pX¬A = pX(1 − pA), p¬XA = (1 − pX)pA, and p¬X¬A = (1 − pX)(1 − pA). A sample of n apples is taken randomly from the urn. Now the probability of obtaining NXA sweet red apples, NX¬A bitter red apples, N¬XA sweet green apples, and N¬X¬A bitter green apples is defined by the multinomial probability

P(N_{XA}, N_{X\neg A}, N_{\neg XA}, N_{\neg X\neg A} \mid n, p_X, p_A) = \binom{n}{N_{XA}, N_{X\neg A}, N_{\neg XA}, N_{\neg X\neg A}} p_X^{N_X} (1 - p_X)^{n - N_X} p_A^{N_A} (1 - p_A)^{n - N_A}.    (2.2)
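The following short Python sketch evaluates the multinomial probability (2.2) of a single contingency table. The multinomial coefficient is built from binomial coefficients; the function name, counts, and parameter values are hypothetical.

```python
from math import comb

def multinomial_table_prob(n_xa, n_xnota, n_notxa, n_notxnota, p_x, p_a):
    """Probability of one contingency table under the multinomial independence
    model (Equation 2.2): multinomial coefficient times the marginal factors."""
    n = n_xa + n_xnota + n_notxa + n_notxnota
    n_x = n_xa + n_xnota          # row total N_X
    n_a = n_xa + n_notxa          # column total N_A
    coef = comb(n, n_xa) * comb(n - n_xa, n_xnota) * comb(n - n_xa - n_xnota, n_notxa)
    return coef * p_x**n_x * (1 - p_x)**(n - n_x) * p_a**n_a * (1 - p_a)**(n - n_a)

# Hypothetical counts; parameters estimated by the observed relative frequencies.
print(multinomial_table_prob(30, 0, 20, 50, p_x=0.3, p_a=0.5))
```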

Since the data size n is given, the contingency tables can be defined by triplets <NXA, NX¬A, N¬XA> or, equivalently, by triplets <NX, NA, NXA>. Therefore, the space of all possible contingency tables is T_M = {<NX, NA, NXA> | NX = 0, . . . , n; NA = 0, . . . , n; NXA = 0, . . . , min{NX, NA}}.

For estimating the p-value with Equation (2.1), we should still solve two problems. First, the parameters pX and pA are unknown. The most common solution is to estimate them by the observed relative frequencies, i.e. pX = P(X) and pA = P(A). Second, we should decide when a contingency table Ti is equally or more extreme than the observed contingency table T0. For this purpose, we need some measure which evaluates the overall dependence in a contingency table. An example of such measures is the odds ratio:

odds(N_{XA}, N_{X\neg A}, N_{\neg XA}, N_{\neg X\neg A}) = \frac{N_{XA} N_{\neg X\neg A}}{N_{X\neg A} N_{\neg XA}}.

The problem of the odds ratio is that it is not defined when N_{X¬A} N_{¬XA} = 0. In practice, the multinomial test is seldom used, but the multinomial model is an important theoretical model, from which other models can be derived as special cases.

Double binomial model

In the double binomial model, it is assumed that we have two large urns, one for red and one for green apples. Let us call these the red and the green urn. In the red urn, the probability of sweet apples is pA|X and of bitter apples 1 − pA|X,


and in the green urn the probabilities are pA|¬X and 1 − pA|¬X. According to the independence assumption, the probability of sweet apples is the same in both urns: pA = pA|X = pA|¬X. A sample of m(X) apples is taken randomly from the red urn and another random sample of m(¬X) apples is taken from the green urn. The probability of obtaining NXA sweet apples among the selected m(X) red apples is defined by the binomial probability

P(N_{XA} \mid m(X), p_A) = \binom{m(X)}{N_{XA}} p_A^{N_{XA}} (1 - p_A)^{m(X) - N_{XA}}.

Similarly, the probability of obtaining N¬XA sweet apples among the selected green apples is

P(N_{\neg XA} \mid m(\neg X), p_A) = \binom{m(\neg X)}{N_{\neg XA}} p_A^{N_{\neg XA}} (1 - p_A)^{m(\neg X) - N_{\neg XA}}.

Because the two samples are independent of each other, the probability of obtaining NXA sweet apples from m(X) red apples and N¬XA sweet apples from m(¬X) green apples is the product of the two binomials:

P(N_{XA}, N_{\neg XA} \mid n, m(X), m(\neg X), p_A) = \binom{m(X)}{N_{XA}} \binom{m(\neg X)}{N_{\neg XA}} p_A^{N_A} (1 - p_A)^{n - N_A},    (2.3)

where NA = NXA + N¬XA is the total number of the obtained sweet apples. We note that the double binomial probability is not exchangeable with respect to the roles of X and A, i.e. generally P(NXA, N¬XA | n, m(X), m(¬X), pA) ≠ P(NXA, NX¬A | n, m(A), m(¬A), pX). In practice, this means that the probability of obtaining m(XA) sweet red apples, m(X¬A) bitter red apples, m(¬XA) sweet green apples, and m(¬X¬A) bitter green apples is (nearly always) different in the model of the red and green urns from the model of the sweet and bitter urns.

Since m(X) and m(¬X) are given, each contingency table is defined as a pair <NXA, N¬XA> or, equivalently, <NA, NXA>. The space of all possible contingency tables is T_M = {<NXA, N¬XA> | NXA = 0, . . . , m(X); N¬XA = 0, . . . , m(¬X)}. We note that NA is not fixed, and therefore NA is generally not equal to the observed m(A).


For estimating the significance with Equation (2.1), we should once again estimate the unknown parameter pA and decide how the extremeness relation is defined. The most common solution is to estimate pA from the data and set pA = P(A). For the extremeness relation, we can use for example the odds ratio. However, if we also fix NA = m(A), it becomes easy to define when a dependency is stronger than the observed one. The positive dependence between red and sweet apples becomes stronger, when NXA increases, and the negative dependence between sweet and green apples becomes stronger, when N¬XA = m(A) − NXA decreases. Now the p-value gets a simple expression

p = \sum_{i=m(XA)}^{m(X)} \binom{m(X)}{i} \binom{m(\neg X)}{m(A) - i} P(A)^{m(A)} P(\neg A)^{m(\neg A)}.
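A direct computation of this sum is straightforward. The sketch below uses Python's exact binomial coefficients, skips terms with impossible counts, and plugs in hypothetical example frequencies; the function name is mine, not the thesis's.

```python
from math import comb

def double_binomial_p(m_x, m_a, m_xa, n):
    """p-value of positive dependence in the double binomial model when the
    observed column total m(A) is additionally fixed (the sum given above)."""
    m_notx, m_nota = n - m_x, n - m_a
    p_a = m_a / n
    const = p_a**m_a * (1 - p_a)**m_nota
    total = 0.0
    for i in range(m_xa, m_x + 1):
        if m_a - i < 0:
            break                         # remaining terms are impossible
        total += comb(m_x, i) * comb(m_notx, m_a - i) * const
    return total

print(double_binomial_p(m_x=30, m_a=50, m_xa=25, n=100))   # hypothetical counts
```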

Another common solution is to approximate the p-value with asymptotic tests, which are discussed later.

Hypergeometric model

In the hypergeometric model, there is no sampling from an urn. Instead, we can assume that we are given a finite urn of n apples, containing exactly m(X) red apples and m(¬X) green apples. We test all n apples and find that m(XA) of the red apples are sweet and m(¬XA) of the green apples are sweet. The question is how probable our urn (or the set of all at least equally extreme urns) is among all possible apple urns with m(X) red apples, m(¬X) green apples, m(A) sweet apples, and m(¬A) bitter apples. Now the urns correspond to contingency tables. The number of all possible urns with the fixed totals m(X), m(¬X), m(A), and m(¬A) is

\sum_{i=0}^{m(A)} \binom{m(X)}{i} \binom{m(\neg X)}{m(A) - i} = \binom{n}{m(A)}.

The equality follows from Vandermonde's identity (Equation A.1). (We recall that \binom{m}{l} = 0, when l > m.) Assuming that all urns with these fixed totals are equally likely, the probability of an urn with NXA sweet red apples is

P(N_{XA} \mid m(X), m(\neg X), m(A), m(\neg A)) = \binom{n}{m(A)}^{-1}.

Because all totals are fixed, the extremeness relation is also easy to define. Positive dependence is stronger than observed, when NXA > m(XA).


For the p-value, it is enough to sum the probabilities of urns containing at least m(XA) sweet red apples. The resulting p-value is

p_F = \sum_{i=0}^{J_1} \frac{\binom{m(X)}{m(XA)+i} \binom{m(\neg X)}{m(\neg X\neg A)+i}}{\binom{n}{m(A)}},    (2.4)

where J1 = min{m(X¬A), m(¬XA)}. (Instead of J1, we could give the upper range as m(A), because the zero terms disappear.) This p-value is known as Fisher's p, because it is used in Fisher's exact test, an exact permutation test, where pF is calculated. We give it a special symbol, pF, because it will be used later. For negative dependence between red and sweet apples (or positive dependence between green and sweet apples) the p-value is

p_F = \sum_{i=0}^{J_2} \frac{\binom{m(X)}{m(XA)-i} \binom{m(\neg X)}{m(\neg X\neg A)-i}}{\binom{n}{m(A)}},    (2.5)
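For reference, here is a minimal Python sketch of the positive-dependence p-value pF (Equation 2.4), written as a sum over the possible co-occurrence counts k = m(XA) + i, which is equivalent to the form above. The counts are hypothetical. If SciPy happens to be available, the commented line should give the same tail probability via the hypergeometric survival function, but that shortcut is an assumption of mine, not part of the thesis.

```python
from math import comb

def fisher_p_positive(m_x, m_a, m_xa, n):
    """Fisher's p (Equation 2.4): probability of at least m(XA) co-occurrences
    under the hypergeometric model with all marginals fixed."""
    m_notx = n - m_x
    denom = comb(n, m_a)
    upper = min(m_x, m_a)                 # larger N_XA values are impossible
    return sum(comb(m_x, k) * comb(m_notx, m_a - k)
               for k in range(m_xa, upper + 1)) / denom

# Hypothetical counts from the apple urn: 30 red, 50 sweet, 25 sweet and red.
print(fisher_p_positive(m_x=30, m_a=50, m_xa=25, n=100))

# Assumed SciPy shortcut for the same tail probability (sf(k-1) = P(K >= k)):
# from scipy.stats import hypergeom
# print(hypergeom.sf(24, 100, 50, 30))
```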

where J2 = min{m(XA), m(¬X¬A)}.

Relations between the three models

The main difference between the three models is which of the variables N, NX, and NA are considered fixed. In the multinomial model all variables except N = n are randomized. However, if the model is conditioned with NX = m(X), it leads to the double binomial model. If the double binomial model is conditioned with NA = m(A), it leads to the hypergeometric model. For completeness, we could also consider the Poisson model, where all variables, including N, are unfixed Poisson variables. If the Poisson model is conditioned with the given data size, N = n, it leads to the multinomial model [44, ch. 4.6–4.7].

Selecting the correct model and defining the extremeness relation is a controversial problem, which statisticians (mostly the Fisherian vs. Neyman-Pearsonian schools) have debated for the last century (see e.g. [84, 4, 43, 72, 36]). One problem is that even if one or both marginal totals, NX and/or NA, are kept unfixed, the observed counts m(X) and/or m(A) are still used to estimate the unknown parameters. Therefore, Fisher and his followers have suggested that we should always assume both NX and NA fixed and use Fisher's exact test or – when it is heavy to compute – a suitable asymptotic test. The opponents have countered that the results generalize better outside the data set, if some variables are kept unfixed, but, nevertheless, they have also suggested using asymptotic tests, which are conditional on the observed counts.


[Figure 2.1 here: plot of p as a function of m(XA) for the Fisher, double binomial, and multinomial models; m(X) = 30, m(A) = 50, n = 100.]

Figure 2.1: An example of p-values in different variable-based models, when m(X) and m(A) are fixed.

Figure 2.1 shows an example of different p-values as functions of m(XA), when m(X) = 30, m(A) = 50, and n = 100. In all models, the extremeness relation has been defined as D ≥ δ, where D is a random variable for the leverage. At the left end of the curves the leverage is δ = 0, and at the right end it is maximal, δ = 0.15. Since the double binomial model is asymmetric with respect to X and A, the values are slightly different, when m(X) and m(A) are reversed. Generally, Fisher's exact test gives more cautious p-values than the multinomial and double binomial models. However, the values approach each other, when the data size is increased.

Asymptotic measures

We have seen that the p-values in the multinomial and double binomial models are quite difficult to calculate. However, the p-value can often be approximated easily using asymptotic measures. With certain assumptions, the resulting p-values converge to the correct p-values, when the data size n (or m(X) and m(¬X)) tends to infinity.


In the following, we introduce two commonly used asymptotic measures for independence testing: the χ2-measure and mutual information. In statistics, the latter corresponds to the log likelihood ratio [58]. The main idea of asymptotic tests is that instead of estimating the probability of the contingency table as such, we calculate some better behaving test statistic M. If M gets value val, we estimate the probability P(M ≥ val) (assuming that large M-values indicate a strong dependency). In the case of the χ2-test, the test statistic is the χ2-measure:

\chi^2 = \sum_{i=0}^{1} \sum_{j=0}^{1} \frac{n(P(X=i, A=j) - P(X=i)P(A=j))^2}{P(X=i)P(A=j)} = \frac{n(P(X,A) - P(X)P(A))^2}{P(X)P(\neg X)P(A)P(\neg A)}.    (2.6)

So, in principle, each term tests how much the observed frequency m(X = i, A = j) deviates from its expectation nP(X = i)P(A = j) under the independence assumption. If the data size n is sufficiently large and none of the expected frequencies is too small, the χ2-measure follows approximately the χ2-distribution with one degree of freedom. This can be derived from the double binomial model as follows: NXA and N¬XA are binomial variables with expected values µ1 = m(X)pA and µ2 = m(¬X)pA and standard deviations σ1 = \sqrt{m(X)p_A(1-p_A)} and σ2 = \sqrt{m(\neg X)p_A(1-p_A)}. If the unknown parameter pA is estimated by P(A), we can calculate z-scores for NXA and N¬XA:

z_1 = \frac{N_{XA} - m(X)P(A)}{\sqrt{m(X)P(A)P(\neg A)}}

and

z_2 = \frac{N_{\neg XA} - m(\neg X)P(A)}{\sqrt{m(\neg X)P(A)P(\neg A)}}.

The z-score measures how many standard deviations the observed frequency deviates from its expectation. If m(X) and m(¬X) are sufficiently large, then both z1 and z2 follow the standard normal distribution N(0, 1). Therefore, their squares, and also the sum of the squares χ2 = z1^2 + z2^2, follow the χ2-distribution. As a classical rule of thumb [28], the χ2-measure can be used only if all expected frequencies nP(X = x)P(A = a), x ∈ {0, 1}, a ∈ {0, 1}, are at least 5. However, the approximations can still be poor in some situations, when the underlying binomial distributions are skewed, e.g. if P(A) is near 0 or 1, or if m(X) and m(¬X) are far from each other [84, 4]. According to Carriere [20], this is quite typical for data in medical science.


One reason for the inaccuracy of the χ2-measure is that the original binomial distributions are discrete while the χ2-distribution is continuous. A common solution is to make a continuity correction and subtract 0.5 from the expected frequency nP(X)P(A). According to Yates [84], the resulting continuity-corrected χ2-measure can give a good approximation to Fisher's pF, if the underlying hypergeometric distribution is not markedly skewed. However, according to Haber [33], the resulting χ2-value can underestimate the significance, while the uncorrected χ2-value overestimates it.

For the χ2-measure, we can also give alternative expressions, which are sometimes easier to analyze:

\chi^2 = \frac{n\delta^2}{P(X)P(\neg X)P(A)P(\neg A)}    (2.7)
       = n(\gamma(X,A) - 1)(\gamma(\neg X, \neg A) - 1).    (2.8)

From these equations we see that the χ2-measure is symmetric for positive and negative dependencies, and it favours rules where both γ(X, A) and γ(¬X, ¬A) are large or – in the case of negative dependence – small.

Mutual information is another popular asymptotic measure, which has been used to test independence. It is defined as

MI = m(XA) log(P(XA)) + m(X¬A) log(P(X¬A)) + m(¬XA) log(P(¬XA)) + m(¬X¬A) log(P(¬X¬A)) − m(X) log(P(X)) − m(¬X) log(P(¬X)) − m(A) log(P(A)) − m(¬A) log(P(¬A)).    (2.9)

Mutual information is actually an information-theoretic measure, but in statistics, 2MI is known as the log likelihood ratio. The likelihood ratio can be derived from the multinomial model as follows: The likelihood of the observed counts m(XA), . . . , m(¬X¬A) is defined by

L(p) = \binom{n}{m(XA), m(X\neg A), m(\neg XA), m(\neg X\neg A)} p_{XA}^{m(XA)} p_{X\neg A}^{m(X\neg A)} p_{\neg XA}^{m(\neg XA)} p_{\neg X\neg A}^{m(\neg X\neg A)},

where p = <pXA, pX¬A, p¬XA, p¬X¬A> contains the unknown parameters. The maximum likelihood estimates for the parameters are p1 = <P(XA), P(X¬A), P(¬XA), P(¬X¬A)>. Under the independence assumption, the parameters are p = <pX pA, pX p¬A, p¬X pA, p¬X p¬A>, and the maximum likelihood estimate is p0 = <P(X)P(A), P(X)P(¬A), P(¬X)P(A), P(¬X)P(¬A)>.


Now the ratio of the likelihood of the data under the independence assumption and without it is

\lambda = \frac{L(p_0)}{L(p_1)} = \left(\frac{P(X)P(A)}{P(XA)}\right)^{m(XA)} \left(\frac{P(X)P(\neg A)}{P(X\neg A)}\right)^{m(X\neg A)} \left(\frac{P(\neg X)P(A)}{P(\neg XA)}\right)^{m(\neg XA)} \left(\frac{P(\neg X)P(\neg A)}{P(\neg X\neg A)}\right)^{m(\neg X\neg A)}.

The log likelihood ratio is −2 log(λ) = 2M I. It follows asymptotically the χ2 -distribution [79], and often it gives similar results to the χ2 -measure [73]. However, sometimes the two tests can give totally different results [4].
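Both asymptotic measures are easy to compute directly from a 2×2 contingency table. The sketch below evaluates the χ2-measure (2.6) and the mutual information (2.9) in their equivalent per-cell forms, with natural logarithms and hypothetical counts; empty cells contribute zero to MI. The function name is mine.

```python
from math import log

def chi2_and_mi(m_xa, m_xnota, m_notxa, m_notxnota):
    """The chi^2-measure (2.6) and mutual information (2.9) of a 2x2 table."""
    n = m_xa + m_xnota + m_notxa + m_notxnota
    p = lambda m: m / n
    px, pa = p(m_xa + m_xnota), p(m_xa + m_notxa)
    chi2 = n * (p(m_xa) - px * pa) ** 2 / (px * (1 - px) * pa * (1 - pa))
    cells = [(m_xa, px * pa), (m_xnota, px * (1 - pa)),
             (m_notxa, (1 - px) * pa), (m_notxnota, (1 - px) * (1 - pa))]
    # MI = sum over cells of m(cell) * log( P(cell) / (P(row)P(col)) ).
    mi = sum(m * log(p(m) / e) for m, e in cells if m > 0)
    return chi2, mi

print(chi2_and_mi(25, 5, 25, 45))   # hypothetical counts, n = 100
```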

2.3.2 Value-based measures

In the value-based semantics, the idea is that we would like to find events X and A = a which express a strong dependency, even if the general dependency between variables X and A is relatively weak. In this case, the strength of the dependency is usually measured by the lift, because the leverage has the same absolute value for all events XA, X¬A, ¬XA, ¬X¬A. However, the lift alone is an unsuitable measure, because it gets its maximal value, when m(XA = a) = m(X) = m(A = a) = 1 – i.e. when the rule occurs on just one row [35]. Such a rule is quite likely due to chance and hardly interesting. Therefore, we should either test the null hypothesis that the lift is γ ≤ 1 (no positive dependence) [11], test that the lift is less than some threshold γ0 > 1 [41], or simply calculate the probability of observing such a large lift value, if X and A were actually independent (significance testing).

The p-value is defined as in the variable-based testing by Equation (2.1), but now the extremeness relation should not depend on the leverage but on the lift. A necessary condition for the extremeness of table Ti over Tj is that the lift is larger in Ti than in Tj. However, since the lift is largest when NX and/or NA are smallest (and NXA = NX or NXA = NA), it is sensible to require that also NXA is larger in Ti than in Tj. If both NX and NA are fixed, then the lift is larger than observed if and only if the leverage is larger than observed, and it is enough to consider tables where NXA ≥ m(XA). However, if either NX, NA, or both are unfixed, then we should always check the lift value, which is the random variable

G = \frac{n N_{XA}}{N_X N_A},

and compare it to the observed lift γ.


Next, we will first survey how the value-based significance of dependence has been defined in the previous research. Then we will analyze the problem using the three classical models. Finally, we derive an alternative binomial test and the corresponding asymptotic measure for the value-based significance of dependence.

Value-based significance in the previous research

In the previous research on association rules, some authors [24, 41, 42, 19, 52] have speculated how to test the null hypothesis P(A|X) ≤ P(A), which is equivalent to testing γ ≤ 1. For some reason, it has been taken for granted that the exact probability is defined by a binomial test in the part of the data where X is true. This is equivalent to adopting the double binomial model, but inspecting just one of the urns – the red apples. So, one tries to decide whether there is a dependency between red colour and sweetness by taking a sample of m(X) red apples from the red urn. It is assumed that NXA ∼ Bin(m(X), pA), and the unknown parameter pA is estimated from the data, as usual. For positive dependence, the p-value is defined as [24]

p = \sum_{i=m(XA)}^{m(X)} \binom{m(X)}{i} P(A)^i P(\neg A)^{m(X)-i}    (2.10)

and for negative dependence as [24]

p = \sum_{i=0}^{m(XA)} \binom{m(X)}{i} P(A)^i P(\neg A)^{m(X)-i}.

We see that NX = m(X) is the only variable which has to be fixed – even N can be unfixed. In [42], it was explicitly noted that NA is unfixed, and therefore, in the positive case, i goes from m(XA) to m(X), not just to min{m(X), m(A)}. The idea is that when NXA ≥ m(XA), then NXA/m(X) ≥ P(A|X), and if we already had P(A|X) > P(A), then also G ≥ γ. Similarly, in the negative case, G ≤ γ. So, the test correctly checks all cases where the lift is at least as large (as small) as observed.

Since the binomial p is quite difficult to calculate, it is common to estimate it asymptotically by the z-score. In the case of positive dependence, the binomial variable NXA has expected value µ = m(X)P(A) and standard deviation σ = \sqrt{m(X)P(A)P(\neg A)}. The corresponding z-score is [42, 19]


z = \frac{m(XA) - \mu}{\sigma} = \frac{m(XA) - m(X)P(A)}{\sqrt{m(X)P(A)P(\neg A)}} = \frac{\sqrt{m(X)}\,(P(A|X) - P(A))}{\sqrt{P(A)P(\neg A)}}.    (2.11)

If m(X) is sufficiently large and P(A) is not too near 1 or 0, the z-score follows the standard normal distribution. However, when the expected frequency m(X)P(A) is low (as a rule of thumb, < 5), the binomial distribution is positively skewed. This means that the z-score overestimates the significance.

The problem of this measure (and the original binomial model) is that two rules with different antecedents X are not comparable. So, all rules (with different X) are thought to be from different populations and are tested in different parts of the data. We note that this idea of checking just one urn also underlies the J-measure [70]:

J = m(XA) \log \frac{P(XA)}{P(X)P(A)} + m(X\neg A) \log \frac{P(X\neg A)}{P(X)P(\neg A)},    (2.12)

which is often used to evaluate association rules. It is a reduced version of the mutual information, where all terms containing ¬X have been dropped.

One solution to the comparison problem is to normalize the z-score and divide it by its maximum value

\max\{z(X \to A)\} = \frac{\sqrt{m(X)}\,P(\neg A)}{\sqrt{P(A)P(\neg A)}}.

The result is

cf = \frac{P(A|X) - P(A)}{P(\neg A)},    (2.13)

which is the same as Shortliffe's certainty factor [67] for positive dependence. For negative dependence, the certainty factor is

cf_{neg} = \frac{P(A|X) - P(A)}{P(A)}.

According to [13], the certainty factor can produce quite accurate results when used for prediction purposes. However, it is very sensitive to redundant rules, because all rules with confidence 1.0 gain the maximum score, even if they occur on just one row. In addition, the certainty factor does not suit search purposes, because the upper bound is always the same.
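For illustration, the following Python sketch evaluates the single-urn measures discussed above: the binomial p-value (2.10), its z-score (2.11), and the certainty factor (2.13) for a positive-dependence rule X → A. The function name is mine. With the counts of the two rules compared in Example 2.6 below, the binomial p-values and z-scores should agree (up to rounding) with the pbin2 and z2 entries of Table 2.2.

```python
from math import comb, sqrt

def single_urn_measures(m_x, m_a, m_xa, n):
    """Binomial p-value (2.10), z-score (2.11) and certainty factor (2.13)
    of rule X -> A, all evaluated only in the part of the data where X holds."""
    p_a = m_a / n
    p_a_given_x = m_xa / m_x
    p_bin2 = sum(comb(m_x, i) * p_a**i * (1 - p_a)**(m_x - i)
                 for i in range(m_xa, m_x + 1))
    z2 = sqrt(m_x) * (p_a_given_x - p_a) / sqrt(p_a * (1 - p_a))
    cf = (p_a_given_x - p_a) / (1 - p_a)
    return p_bin2, z2, cf

print(single_urn_measures(m_x=30, m_a=50, m_xa=30, n=100))   # rule X -> A of Example 2.6
print(single_urn_measures(m_x=60, m_a=50, m_xa=50, n=100))   # rule Y -> A of Example 2.6
```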


Finally, we note that like the double binomial model, this single binomial model is not exchangeable, in the sense that generally p(X → A) ≠ p(A → X). The same holds for the corresponding z-score, J-measure, and certainty factor. This can be counter-intuitive when the task is to search for statistical dependencies, and in the search these measures should be used with care. In addition, with this binomial model the significance of the positive dependence between X and A is generally not the same as the significance of the negative dependence between X and ¬A. With the corresponding z-score, the significance values are related, and z_pos(X → A) = −z_neg(X → ¬A), where z_pos denotes the z-score of positive dependence and z_neg the z-score of negative dependence. With the certainty factor and the J-measure, the significance of the positive dependence between X and A and the significance of the negative dependence between X and ¬A are equal.

Value-based significance under the classical models

Let us now analyze the value-based significance in the classical models. For simplicity, we consider only positive dependence. We assume that the extremeness relation is defined by comparing the lift variable G to the observed lift γ and the frequency variable NXA to the observed frequency m(XA), i.e. Ti ⪰ T0, if Gi ≥ γ and NXA,i ≥ m(XA), where Gi is the lift in table Ti and NXA,i is the frequency in Ti.

In the multinomial model, only the data size N = n is fixed. Each contingency table, described by a triplet <NX, NA, NXA>, has probability P(NXA, NX − NXA, NA − NXA, n − NX − NA + NXA | n, pX, pA), defined by Equation (2.2). The p-value is obtained when we sum over all possible triplets where G ≥ γ:

p = \sum_{N_X=0}^{n} \sum_{N_{XA}=m(XA)}^{N_X} \sum_{N_A=N_{XA}}^{Q_1} P(N_{XA}, N_X - N_{XA}, N_A - N_{XA}, n - N_X - N_A + N_{XA} \mid n, p_X, p_A),

where Q_1 = \frac{n N_{XA}}{\gamma N_X}. (We note that the terms are zero, if NX < NXA.)

In the double binomial model, NX = m(X) is also fixed. Each contingency table can be described by a pair <NA, NXA>, and it has probability P(NXA, NA − NXA | n, m(X), m(¬X), pA) by Equation (2.3).


Now we should sum over all possible pairs where G ≥ γ:

p = \sum_{N_{XA}=m(XA)}^{n} \sum_{N_A=N_{XA}}^{Q_2} P(N_{XA}, N_A - N_{XA} \mid n, m(X), m(\neg X), p_A),

where Q_2 = \frac{n N_{XA}}{\gamma\, m(X)}.

In the hypergeometric model, also NA = m(A) is fixed. As noted before, the extremeness relation is now the same as in the variable-based case, and the p-value is defined by Equation (2.4). This is an important observation, because it means that Fisher's exact test tests the significance also in the value-based semantics. The same is not true for the first two models, where rule X → A can get different p-values in the variable-based and value-based semantics.

Alternative binomial model

When NX and/or NA are unfixed, the p-values are quite heavy to compute. Therefore, we will now derive a simple binomial model, where it is enough to sum over just one variable. The binomial probability can be further estimated by an equivalent z-score, or the z-score can be used as an asymptotic test measure as such. Unlike in the traditional binomial model, we will test the rule in the whole data set, which means that the significance values of different rules are comparable.

Let us suppose that we have a large urn containing N apples. The probability of red apples is pX, of sweet apples pA, and of sweet red apples pXA. According to the independence assumption, pXA = pX pA. Therefore, the urn contains N pXA = N pX pA sweet red apples. A sample of n apples is taken randomly from the urn. The exact probability of obtaining NXA sweet red apples is defined by the hypergeometric distribution:

P(N_{XA} \mid N, n, p_{XA}) = \frac{\binom{N p_{XA}}{N_{XA}} \binom{N - N p_{XA}}{n - N_{XA}}}{\binom{N}{n}}.

When N tends to infinity, the distribution approaches the binomial distribution, and the probability of NXA given n and pXA becomes

P_b(N_{XA} \mid n, p_{XA}) = \binom{n}{N_{XA}} p_{XA}^{N_{XA}} (1 - p_{XA})^{n - N_{XA}}.

When the unknown parameter pXA = pX pA is estimated from the data, the probability becomes

p_b(N_{XA} \mid n, P(X)P(A)) = \binom{n}{N_{XA}} (P(X)P(A))^{N_{XA}} (1 - P(X)P(A))^{n - N_{XA}}.


Since NXA is the only variable which occurs in the probability, the extremeness relation is defined simply by Ti ⪰ T0 ⇔ NXA ≥ m(XA). When the unknown parameter pX pA is estimated from the data, the p-value of rule X → A is

p_{bin} = \sum_{i=m(XA)}^{n} \binom{n}{i} (P(X)P(A))^i (1 - P(X)P(A))^{n-i}.    (2.14)

This is in fact the classical binomial test in the whole data set, while the traditionally used binomial (Equation (2.10)) is a binomial test in the part of the data where X is true. Next we show that pbin also gives an upper bound for the p-value in the multinomial model.

Lemma 2.5

p_{mul}(X \to A \mid n, p_X p_A) \le \sum_{i=m(XA)}^{n} \binom{n}{i} (p_X p_A)^i (1 - p_X p_A)^{n-i}.

Proof In the multinomial model, the exact p-value of rule X → A is

p = \sum_{N_X=0}^{n} \sum_{N_{XA}=m(XA)}^{N_X} \sum_{N_A=N_{XA}}^{Q_1} \binom{n}{N_{XA}, N_X - N_{XA}, N_A - N_{XA}, n - N_X - N_A + N_{XA}} (p_X p_A)^{N_{XA}} (p_X(1 - p_A))^{N_X - N_{XA}} ((1 - p_X)p_A)^{N_A - N_{XA}} ((1 - p_X)(1 - p_A))^{n - N_X - N_A + N_{XA}},

where Q_1 = \frac{n N_{XA}}{\gamma N_X}. Each term can be expressed in the form

\binom{n}{N_{XA}} (p_X p_A)^{N_{XA}} B(N_{XA}, N_X, N_A),

where

B(N_{XA}, N_X, N_A) = \binom{n - N_{XA}}{N_X - N_{XA}, N_A - N_{XA}, n - N_X - N_A + N_{XA}} (p_X(1 - p_A))^{N_X - N_{XA}} ((1 - p_X)p_A)^{N_A - N_{XA}} ((1 - p_X)(1 - p_A))^{n - N_X - N_A + N_{XA}}.

For the exact p-value, we should sum B(NXA, NX, NA) over all values NX = NXA, . . . , n and NA = NXA, . . . , Q_1. We get an upper bound for the p-value, if we instead sum NA over the whole range (without caring about the G ≥ γ condition): NA = NXA, . . . , n. Let C(NX, NA | NXA) be the resulting coefficient:

C(N_X, N_A \mid N_{XA}) = \sum_{N_X=N_{XA}}^{n} \sum_{N_A=N_{XA}}^{n} B(N_{XA}, N_X, N_A).

By the multinomial theorem (Theorem A.3), the coefficient is C(NX, NA | NXA) = (1 − pX pA)^{n−NXA}, which proves the upper bound. □

Since NXA is a binomial variable with expected value µ = nP(X)P(A) and standard deviation σ = \sqrt{nP(X)P(A)(1 - P(X)P(A))}, the corresponding z-score is

z(X \to A) = \frac{m(X,A) - \mu}{\sigma} = \frac{m(X,A) - nP(X)P(A)}{\sqrt{nP(X)P(A)(1 - P(X)P(A))}} = \frac{\sqrt{nP(XA)}\,(\gamma(X,A) - 1)}{\sqrt{\gamma(X,A) - P(X,A)}}.    (2.15)

The same z-score was also given by Lallich [42] for the case where NX and NA are unfixed. Because the discrete binomial distribution is approximated by the continuous normal distribution, the continuity correction can be useful, like with the χ2-measure. We note that this binomial model and the corresponding z-score are exchangeable, which is intuitively a desired property. However, the statistical significance of the positive dependence between X and A is generally not the same as the significance of the negative dependence between X and ¬A. For example, the z-score for the negative (or positive) dependence between X and ¬A is

z(X \to \neg A) = \frac{m(X\neg A) - nP(X)P(\neg A)}{\sqrt{nP(X)P(\neg A)(1 - P(X)P(\neg A))}}.

Figure 2.2 shows an example of different p-values as functions of m(XA) (and thus of the lift), when m(X) = 30, m(A) = 50, and n = 100. At the left end of the curves the lift is γ = 1, and at the right end it is maximal, γ = 2.0. Since the double binomial and the traditional binomial (bin2, Equation (2.10)) models are asymmetric with respect to X and A, the values are slightly different, when m(X) and m(A) are reversed. Generally, the new binomial model, bin1, gives the most cautious p-values and the multinomial and double binomial models give the smallest p-values. However, the values approach each other, when the data size is increased.
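The alternative binomial p-value (2.14) and its z-score (2.15) are again simple to evaluate. The Python sketch below (function name mine) does so for the two rules compared in Example 2.6 below; the results should essentially reproduce the pbin1 and z1 entries of Table 2.2.

```python
from math import comb, sqrt

def pbin_and_z(m_x, m_a, m_xa, n):
    """The alternative binomial p-value (2.14) and its z-score (2.15), which
    test the rule X -> A in the whole data set of n rows."""
    p0 = (m_x / n) * (m_a / n)                     # P(X)P(A) under independence
    p_bin = sum(comb(n, i) * p0**i * (1 - p0)**(n - i)
                for i in range(m_xa, n + 1))
    z = (m_xa - n * p0) / sqrt(n * p0 * (1 - p0))
    return p_bin, z

print(pbin_and_z(m_x=30, m_a=50, m_xa=30, n=100))   # rule X -> A of Example 2.6
print(pbin_and_z(m_x=60, m_a=50, m_xa=50, n=100))   # rule Y -> A of Example 2.6
```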


[Figure 2.2 here: plot of p as a function of m(XA) for the bin1, bin2, Fisher, double binomial, and multinomial models; m(X) = 30, m(A) = 50, n = 100.]

Figure 2.2: An example of p-values in different value-based models, when m(X) and m(A) are fixed.

In this figure, the single binomial model (bin2) looks fine. However, there are pathological cases where it and the related z-score and J-measure give rankings opposite to those of the other significance measures.

Example 2.6 Let us compare two rules, X → A and Y → A, in the value-based semantics. The frequencies are n = 100, m(A) = 50, m(X) = m(XA) = 30, m(Y) = 60, and m(YA) = 50, i.e. P(A|X) = 1 and P(Y|A) = 1. The results are given in Table 2.2. All of the traditional association rule measures (pbin2, its z-score z2, and the J-measure) favour rule X → A, while all the other measures (the new binomial pbin1 and its z-score z1, the multinomial pmul, the double binomial pdouble, and Fisher's pF) rank rule Y → A better. In the classical models, the difference between the rules is quite remarkable.

2.4 Redundancy and improvement

In Section 1.2 we already discussed the problem of redundancy. Given the p-values of rules, we define the redundancy as follows. (The notion is generalized later in Definition 3.2.)


Table 2.2: Comparison of p-values and asymptotic measures for example rules X → A and Y → A.

             X → A       Y → A
  pbin1      1.06e-4     2.21e-5
  pmul       8.86e-13    1.01e-19
  pF         1.60e-12    7.47e-19
  pdouble    2.05e-13    7.35e-20
  z1         4.20        4.36
  pbin2      9.31e-10    8.08e-8
  z2         5.48        5.16
  J          36.12       14.56

Definition 2.7 (Redundancy (with respect to p)) Let p(X → A = a) be the p-value of rule X → A = a. Rule X → A = a is redundant, if there exists a rule Y → A = a such that Y ⊊ X and p(Y → A = a) ≤ p(X → A = a). Otherwise, rule X → A = a is non-redundant.

Redundancy can also be evaluated with the asymptotic measures, but it requires more care, because asymptotic measures like the χ2-measure and the z-score tend to overestimate the significance when the frequencies are small. Therefore, it is possible that a rule appears non-redundant when evaluated with an asymptotic measure, but redundant when the exact p-values are calculated. As a solution, we suggest using the asymptotic measures during the search to filter out rules which are definitely redundant, and checking the redundancy of all remaining infrequent rules using the exact p-values. If all expected counts (m(X)P(A), m(X)P(¬A), m(¬X)P(A), and m(¬X)P(¬A) for rule X → A = a) are sufficiently large, then the check with the exact p-values can be omitted.

The above definition of redundancy concerns only the observed redundancy in the given data set. It is based on the assumption that the observed probabilities P would be accurate estimates for the real but unknown probabilities Pr. However, it is possible that a rule appears in the data as non-redundant, even if it is actually redundant (see Figure 1.1). We call such rules spuriously non-redundant. To eliminate spuriously non-redundant rules, we should estimate the significance of the claim that rule X → A = a really improves a more general rule Y → A = a, Y ⊊ X.

Let us now denote the improvement of rule X → A = a over rule Y → A = a by Imp(X → A = a | Y → A = a). In practice, the improvement can be defined by any goodness measure M.


If large values of M indicate a good rule, then Imp(X → A = a | Y → A = a) = M(X → A = a) − M(Y → A = a), and Imp > 0, if X → A = a improves Y → A = a. The task is to estimate the probability p = P(Imp(X → A = a | Y → A = a) > 0), assuming the null hypothesis that the attributes X \ Y do not improve rule Y → A = a. If the resulting p-value is sufficiently small, we can assume that the observed improvement is not due to chance.

Unfortunately, there is no generally applicable theory for estimating the improvement of dependency rules in either the variable- or the value-based semantics. In the association rule research, a related problem concerning the significance of productivity has been addressed. We will now briefly review solutions to this problem and analyze why they do not suit the general improvement testing. Traditionally, the improvement has been defined by productivity (e.g. [38, 75]):

Definition 2.8 (Productivity) Rule X → A = a is productive, if P(A = a|X) > P(A = a|Y) for all Y ⊊ X.

So, here the goodness measure M is the confidence, and it is required that a productive rule has a larger confidence than any of its generalizations. The significance of productivity is tested separately for all Y → A = a, Y ⊊ X, and all p-values should be below some fixed threshold maxp. In each test, the null hypothesis is that P(A = a|X) = P(A = a|Y). If we denote the extra attributes in rule X → A = a by Q = X \ Y, the condition means that Q and A are independent in the set where Y is true. The significance is estimated by calculating p(Q → A = a | Y), i.e. the p-value of rule Q → A = a in the set where Y holds. If the significance measure is Fisher's pF, the significance of productivity of YQ → A = a over Y → A = a is [75]

p(Q \to A = a \mid Y) = \sum_{i=0}^{J} \frac{\binom{m(YQ)}{m(YQA=a)+i} \binom{m(Y\neg Q)}{m(Y\neg QA=a)-i}}{\binom{m(Y)}{m(YA=a)}}.    (2.16)

When the χ2-measure is used to estimate the significance of productivity, the equation is [47]

\chi^2(Q \to A \mid Y) = \frac{m(Y)(P(Y)P(YQA) - P(YQ)P(YA))^2}{P(YQ)P(Y\neg Q)P(YA)P(Y\neg A)}.    (2.17)
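A small Python sketch of both productivity tests follows; the function names are mine. The upper limit of the sum in (2.16) is not spelled out above, so the sketch simply takes it as the largest i for which both binomial coefficients are still nonzero, which leaves the result unchanged. The example call uses the counts of Example 2.9 below and should give roughly 0.00040 and 12, as reported in the text.

```python
from math import comb

def productivity_p_fisher(m_y, m_yq, m_ya, m_yqa):
    """Significance of productivity of YQ -> A over Y -> A by Fisher's p
    (Equation 2.16): Fisher's exact test restricted to the rows where Y holds."""
    m_ynotq, m_ynotqa = m_y - m_yq, m_ya - m_yqa
    j_max = min(m_yq - m_yqa, m_ynotqa)      # further terms have zero coefficients
    return sum(comb(m_yq, m_yqa + i) * comb(m_ynotq, m_ynotqa - i)
               for i in range(j_max + 1)) / comb(m_y, m_ya)

def productivity_chi2(m_y, m_yq, m_ya, m_yqa, n):
    """Significance of productivity by the chi^2-measure (Equation 2.17)."""
    p = lambda m: m / n
    num = m_y * (p(m_y) * p(m_yqa) - p(m_yq) * p(m_ya)) ** 2
    return num / (p(m_yq) * p(m_y - m_yq) * p(m_ya) * p(m_y - m_ya))

# Counts of Example 2.9: Y -> A with m(Y)=60, m(YA)=50; YQ = X with m(YQ)=m(YQA)=30.
print(productivity_p_fisher(m_y=60, m_yq=30, m_ya=50, m_yqa=30))      # ~0.00040
print(productivity_chi2(m_y=60, m_yq=30, m_ya=50, m_yqa=30, n=100))   # 12.0
```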


Let us now analyze why this approach is unsuitable for testing the improvement of dependency rules. First of all, the improvement of a statistical dependency cannot be measured by the confidence or the lift alone. It is true that productivity is a necessary condition for non-redundancy with all commonly used goodness measures (as we will see in the next chapter), but it is not a sufficient condition. A rule may have the maximal possible confidence, cf = 1, and still be just due to chance. In the variable-based semantics, we should also take into account the complementary rules ¬X → A ≠ a and ¬(XQ) → A ≠ a. Typically, the confidence of the complementary rule decreases, when the confidence of rule X → A = a increases in the specializations.

The following example demonstrates that a productive rule can still be redundant, which means that the significance of the productivity does not guarantee any improvement. We calculate the significance with both Fisher's pF and the χ2-measure. Since pF can be used in both the variable- and the value-based semantics, the example shows that the traditional productivity testing cannot identify spuriously non-redundant rules even in the value-based semantics.

Example 2.9 Let us reconsider the rules X → A (= YQ → A) and Y → A in Example 2.6. Rule X → A is clearly productive with respect to Y → A. Let us now calculate the significance of productivity using Fisher's pF. Since m(YQ) = m(YQA), there is just one term:

p(Q \to A \mid Y) = \frac{\binom{m(Y\neg Q)}{m(Y\neg Q\neg A)}}{\binom{m(Y)}{m(YA)}} = \frac{\binom{30}{10}}{\binom{60}{50}} = 0.00040.

The value is so small that we can assume that the productivity is significant. However, rule X → A is redundant (pF = 1.60 · 10^{−12}) with respect to Y → A (pF = 7.47 · 10^{−19}). The same happens with the χ2-measure. Now the significance of the productivity is χ2(Q → A|Y) = 12, which is fairly good (about p = 0.0005). Still, rule Y → A is more significant (with χ2 = 66.7) than YQ → A (with χ2 = 42.9).

A natural question arises, whether we could just replace the confidence with another goodness measure and use the same scheme for significance testing. However, this is not possible either, if we want to keep the classical models of significance testing. The problem is that the above tests for the significance of productivity use a model where just one of the two urns is checked.


Namely, the significance of Q → A = a is tested only in the Y urn, and the ¬Y urn is totally ignored. For example, let us suppose we suspect that there is a stronger dependency between big red apples and sweetness than between red apples and sweetness. Now the traditional association rule researchers would test the dependency between the size and taste of apples in the red urn alone, and then decide if rule big red → sweet is significantly better than rule red → sweet. In effect, this is equivalent to removing all small red apples to the green urn and testing how much the sweetness in the red urn was improved. What is ignored is how much the bitterness in the green urn was degraded. It is clear that this model does not suit the variable-based semantics. However, it is unclear whether it can be used in the value-based semantics.

Since we do not have any solution to this problem, we will concentrate on searching for the most significant non-redundant rules. However, our intuition is that the condition for the significance of the improvement can be expressed by the redundancy coefficient

\rho = \frac{p(XQ \to A = a)}{p(X \to A = a)},

where ρ > 1 indicates a non-redundant rule, and larger ρ values could be used to prune out spuriously non-redundant rules. If this is the case, then the forthcoming search algorithms are suitable for searching genuinely non-redundant rules as well.


Chapter 3

Pruning insignificant and redundant rules

We are to admit no more causes of natural things than such as are both true and sufficient to explain their appearances. Therefore, to the same natural effects we must, so far as possible, assign the same causes.
    I. Newton

In our problem, the search space for dependency rules consists of all possible rules of the form X → Ai or X → ¬Ai, where X ⊊ R can be any attribute set and Ai ∈ R \ X can be any single attribute. In the enumeration problem, the task is to find all sufficiently good, non-redundant rules, while in the optimization problem, the task is to find the K best non-redundant rules. Both the goodness and the redundancy of rules are defined with respect to some general goodness measure M. The main problem is how to prune the exponential search space effectively without losing any significant rules.

In this chapter, we introduce the theoretical basis for the search, without yet going into algorithmic details. First, we define the general goodness measure and the related concept of redundancy and formalize the search problem. Then we introduce a class of well-behaving goodness measures, which covers all commonly used measures for statistical dependencies. We prove upper (or lower) bounds, which can be used to estimate the best possible M-value for an arbitrary rule X → A = a, when only some information on X and A is known. We show that the upper (or lower) bounds for all well-behaving measures as well as the lower bounds for Fisher's pF are met in the same points of the search space.


This important discovery means that the same search strategy can be applied to all these measures to prune out useless areas of the search space. Finally, we show how the concept of redundancy can be utilized in extra pruning.

3.1 Basic concepts

Let us first define a general goodness measure for dependency rules. For simplicity, we consider rules of the form X → A = a, where a ∈ {0, 1}, i.e. rules X → A and X → ¬A.

Definition 3.1 (Goodness measure) Let R be a set of binary attributes and U = {X → A = a | X ⊊ R, A ∈ R \ X, a ∈ {0, 1}} the set of all possible rules which can be constructed from attributes R. Let f(NX, NXA, NA, N): ℕ⁴ → ℝ be some statistical measure function, which measures the significance of the positive dependency between X and A = a, given the absolute frequencies NX = m(X), NXA = m(XA = a), NA = m(A = a), and the data size N = n. Function M: U → ℝ is a goodness measure for dependency rules, if M(X → A = a) = f(m(X), m(XA = a), m(A = a), n). Measure M is called increasing (by goodness), if large values of M(X → A = a) indicate that X → A = a is a good rule, and, respectively, decreasing, if low values indicate a good rule.

In the above definition, we have defined the statistical measure function on parameters NX, NXA, NA, and N. First, we note that in practice, some of these parameters can be considered constant. For example, if the data size N = n is given, it can be omitted. If the consequence A = a is also fixed, we can use a simpler function fA=a(NX, NXA).

Second, we note that even if we have defined the function f in the whole ℕ⁴, only some parameter value combinations can occur in any real data set. For any real frequencies, 0 ≤ n, 0 ≤ m(X) ≤ n, 0 ≤ m(A = a) ≤ n, and 0 ≤ m(XA = a) ≤ min{m(X), m(A = a)} hold. In addition, for any non-trivial rule, 0 < n, 0 < m(X) < n, and 0 < m(A = a) < n must hold. If n = 0, the data set would not exist. If m(A = a) or m(X) were 0, the corresponding rule X → A = a would not occur in the data at all. On the other hand, if m(X) = n or m(A = a) = n, then either X or A = a would occur on all rows of data, and the rule could express only independence. Therefore, it suffices that the function is defined in the set of all legal parameter values.

Third, we note that the actual function can be defined on other parameters, if they can be derived from NX, NXA, NA, and N. Examples of commonly occurring derived parameters are the leverage δ and the confidence cf.


For example, when the data size n and the consequence A are fixed, the χ²-measure can be defined e.g. by the following two functions:

f1(NX, NXA) = n(NXA − NX·P(A))² / (NX(n − NX)P(A)(1 − P(A)))

f2(NX, δ) = n³δ² / (NX(n − NX)P(A)(1 − P(A))).

Functions f1 and f2 can be transformed to each other by the equalities

f1(NX, NXA) = f2(NX, (NXA − NX·P(A))/n) and
f2(NX, δ) = f1(NX, NX·P(A) + nδ).

These transformations are often useful when the behaviour of the function is analyzed.
The redundancy with respect to an arbitrary goodness measure M is defined as follows:

Definition 3.2 (Redundancy (with respect to M)) Let M be an increasing (decreasing) goodness measure. Rule X → A = a is redundant, if there exists a rule Y → A = a such that Y ⊊ X and M(Y → A = a) ≥ M(X → A = a) (M(Y → A = a) ≤ M(X → A = a)). Otherwise, rule X → A = a is non-redundant.

Now we can formalize the search problem, where either all sufficiently good (enumeration problem) or only the K best (optimization problem) non-redundant dependency rules are searched for.

Definition 3.3 (Search problem) Let M be an increasing (a decreasing) goodness measure and minM (maxM) a user-defined threshold. In the optimization problem, minM = min{M(·)} (maxM = max{M(·)}), unless another initial value is given. In addition, the maximal number of the best rules to be searched, K > 0, is given in the optimization problem. Let U = {X → A = a | X ⊊ R, A ∈ R \ X, a ∈ {0, 1}} be the set of all possible rules. The problem is to find a set S ⊆ U such that for all rules X → A = a ∈ S the following hold:
1. M(X → A = a) ≥ minM (or ≤ maxM for a decreasing M);
2. if we have an optimization problem, M(X → A = a) ≥ MK (or ≤ MK), where MK is the M-value of the Kth best rule in S; and
3. there does not exist any rule Y → A = a ∈ U such that Y ⊊ X and M(Y → A = a) ≥ M(X → A = a) (or M(Y → A = a) ≤ M(X → A = a)).
In addition, it is required in the optimization problem that |S| ≤ K.
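To make Definition 3.1 and the two equivalent χ²-parameterizations above concrete, here is a minimal sketch (ours, not part of the thesis; Python is used purely for illustration and the function names are our own) that implements the measure both as f(NX, NXA, NA, N) and in the δ-based form, and checks that the two forms agree:

def chi2(NX, NXA, NA, N):
    # the chi^2-measure as a function of the four absolute frequencies
    return N * (N * NXA - NX * NA) ** 2 / (NX * (N - NX) * NA * (N - NA))

def chi2_delta(NX, delta, NA, N):
    # the same measure as a function of leverage delta = P(XA) - P(X)P(A)
    PA = NA / N
    return N ** 3 * delta ** 2 / (NX * (N - NX) * PA * (1 - PA))

# hypothetical frequencies: m(X)=40, m(XA)=30, m(A)=60, n=200
NX, NXA, NA, N = 40, 30, 60, 200
delta = (NXA - NX * NA / N) / N
assert abs(chi2(NX, NXA, NA, N) - chi2_delta(NX, delta, NA, N)) < 1e-9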

We note that in the optimization problem, the discovered set of rules can contain fewer than K rules, if the data set does not contain enough non-redundant positive dependencies or the dependencies are not sufficiently good.
In practice, the searched set S covers only a small fraction of the whole search space U. The problem is to find S from U such that as large an area of U \ S as possible can be left unchecked. For this purpose, we need effective pruning strategies. The following pruning strategies are based on the assumption that the search proceeds from more general rules to more specific rules. This is a reasonable assumption, because then the redundancy or non-redundancy of discovered rules is always known.
The basic search strategy is the following: The search space is traversed by generating sets of attributes Z ⊆ R in a cumulative order, i.e. all subsets Y ⊊ Z are checked before Z. For each set Z, all possible consequences A = a, A ∈ R, a ∈ {0, 1}, are checked. If neither rule Z \ {A} → A = a nor any of its specializations (ZQ) \ {A} → A = a, Q ⊆ R \ Z, belongs to the set of non-redundant, significant rules S, then consequence A = a can be pruned as impossible in the subspace defined by sets ZQ. If no consequence A = a, A ∈ R, a ∈ {0, 1}, can produce a non-redundant, significant rule with an antecedent (ZQ) \ {A} (i.e. all consequences are impossible), then all sets ZQ can be pruned without any further checking.
So, formally, the problem is to determine when the value of the relation

Possible(Z, A, a) = 0, if (ZQ) \ {A} → A = a ∉ S for all Q ⊆ R \ Z, and
Possible(Z, A, a) = 1, otherwise,

is definitely zero. For this purpose, we have to estimate an upper bound (or a lower bound, if M is decreasing) for the M-value of the best possible rule (ZQ) \ {A} → A = a, Q ⊆ R \ Z. If the upper bound is too low compared to minM, MK, or M(Y → A = a), Y ⊊ Z \ {A}, then we know that Possible(Z, A, a) = 0.
The available information for determining the upper bound UB(M((ZQ) \ {A} → A = a)) depends on the set Z and whether A ∈ Z. In practice, we need upper/lower bounds in three different cases: 1) when Z = {A}, 2) when A ∉ Z, and 3) when Z = X ∪ {A} for some X ⊆ R \ {A}. The resulting upper bounds are denoted by ub1, ub2, and ub3 and the lower bounds by lb1, lb2, and lb3.


1. In the beginning of the search, only the absolute frequencies m(A) and m(¬A) are known for all attributes A ∈ R. Therefore, we should estimate ub1 = UB(M(X → A = a)) (or the corresponding lower bound lb1) for any possible X ⊆ R \ {A} and a ∈ {0, 1}.
2. When m(X), A ∉ X, and m(A = a) are known, we should estimate ub2 = UB(M(XQ → A = a)) (or lb2) for any possible Q ⊆ R \ (X ∪ {A}). As a special case, this gives also an upper/lower bound for M(X → A = a), because Q can be an empty set.
3. When m(X), m(XA = a), and m(A = a) are known, we should estimate ub3 = UB(M(XQ → A = a)) (or lb3) for any possible Q ⊆ R \ (X ∪ {A}).
These upper/lower bounds can already be used for pruning out some of the forthcoming redundant rules. However, it is still possible that some forthcoming rule X → A = a is not redundant with respect to any known rules Y → A = a, Y ⊊ X, but it will be redundant with respect to another forthcoming rule Z → A = a, Z ⊊ X. Even in this case, it is sometimes possible to infer the redundancy before either Z or X is checked.
In the following, we will first derive upper and lower bounds for all three cases which occur in the search. Then we will explain extra principles for pruning out redundant rules.
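The following fragment (our own illustration; the interface and names are hypothetical, not the thesis's implementation) shows how such an upper bound turns into a concrete pruning decision for an increasing measure M:

def possible(ub, minM, MK, best_more_general):
    # A consequence stays possible only if the bound can still beat every
    # relevant threshold: the user-given minimum, the K-th best value so far,
    # and the best M-value of a more general rule with the same consequence
    # (otherwise every specialization would be insignificant or redundant).
    return ub >= minM and ub >= MK and ub > best_more_general

# hypothetical values: the bound cannot reach the K-th best rule, so prune
print(possible(ub=12.3, minM=3.84, MK=15.0, best_more_general=10.2))  # False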

3.2 Upper and lower bounds for well-behaving measures

In this section, we prove general upper/lower bounds, which hold for any well-behaving goodness measure M . The notion of well-behaving measures is defined by the classical axioms for any “proper” goodness measures [63, 49]. In practice, the axioms hold for a large class of popular goodness measures used for evaluating the goodness of statistical dependencies, classification rules, or association rules.

3.2.1 Well-behaving measures

We recall that for any rule X → A, the measure value M (X → A) can be determined as a function of four variables: absolute frequencies NX = m(X), NXA = m(XA), NA = m(A), and the data size N = n. Let us now assume that the measure M is increasing by goodness, meaning that high values of M (X → A) indicate that X → A is a good rule. According


to the classical axioms by Piatetsky-Shapiro [63], the following properties should hold for any proper measure M measuring the goodness of a positive dependency between X and A:

Axiom (i) M is minimal, when N·NXA = NX·NA,
Axiom (ii) M is monotonically increasing with NXA, when NX, NA, and N remain unchanged, and
Axiom (iii) M is monotonically decreasing with NX (or NA), when NXA, NA (or NX), and N remain unchanged.

The first axiom simply states that M(X → A) gets its minimum value when X and A are independent. In addition, it is (implicitly) assumed that M gets the minimum value also for negative dependencies, i.e. when N·NXA < NX·NA. The second axiom states that M increases when the dependency becomes stronger (leverage δ increases) and the rule becomes more frequent. The third axiom states that M decreases when the dependency becomes weaker (δ decreases).
Piatetsky-Shapiro [63] noticed that under these conditions M gets its maximal value for any fixed m(X), when m(XA) = m(X). In addition, he assumed that M would get its global maximum, when m(XA) = m(X) = m(A). However, the latter does not necessarily hold, because the axioms do not tell how to compare rules X → A and XQ → A, where P(A|X) = P(A|XQ) = 1. In this case, the more general rule, X → A, has at least as large NXA and NX as XQ → A has. Major and Mangano [49] suggested a fourth axiom, which solves this problem:

Axiom (iv) M is monotonically increasing with NX, when cf = NXA/NX is fixed, NA and N are fixed, and cf > NA/N.

According to this axiom, a more general rule is better, when two rules have the same (or equally frequent) consequence and the confidence is the same. In addition, it was required that the dependency should be positive. However, based on our derivations, we assume that the same property holds also for negative dependencies, and M is non-increasing only when there is independence. In Appendix B.3, we show this for the χ2 -measure and mutual information. In the following, we extend the axioms for negative dependencies and prove general upper bounds which hold for any measure M following these axioms. The upper bounds are the same as derived by Morishita and Sese


[57] in the case of the χ²-measure. In their work, the upper bounds were derived by showing that the χ²-measure is a convex function of NX and NXA for a fixed consequence A. Similar results could be achieved for other convex measures, but checking the axioms is simpler than the convexity proofs. In addition, there are non-convex goodness measures which still follow the axioms (an example of a non-convex and non-concave well-behaving measure is the z-score, Equation (2.15), as shown in Appendix B.4).
For simplicity, assume that M is increasing by goodness; for a decreasing measure, the properties are reversed (minimum vs. maximum, increasing vs. decreasing).

Definition 3.4 (Well-behaving goodness measure) Let M be an increasing goodness measure defined by function f(NX, NXA, NA, N). Let f2(NX, δ, NA, N) be another function which defines the same measure: f2(NX, δ, NA, N) = f(NX, NX·NA/N + δN, NA, N) and f(NX, NXA, NA, N) = f2(NX, (N·NXA − NX·NA)/N², NA, N).
Let S ⊆ ℕ⁴ be the set of all legal parameter values (NX, NXA, NA, N) for an arbitrary data set. Measure M is called well-behaving, if it has the following properties in set S:
(i) f2 gets its minimum value, when δ = 0.
(ii) If NX, NA, and N are fixed, then
  (a) f2 is a monotonically increasing function of δ, when δ > 0 (positive dependence), and
  (b) f2 is a monotonically decreasing function of δ, when δ < 0 (negative dependence).
(iii) If NXA = m(XA = a), NA = m(A = a), and N = n are fixed, then
  (a) f is a monotonically decreasing function of NX, when NX < n·m(XA = a)/m(A = a) (positive dependence), and
  (b) f is a monotonically increasing function of NX, when NX > n·m(XA = a)/m(A = a) (negative dependence).
(iv) If NA = m(A = a) and N = n are fixed, then for all cf1, cf2 ∈ [0, 1]
  (a) f(NX, cf1·NX, m(A = a), n) is monotonically increasing with NX, when cf1 > m(A = a)/n (positive dependence), and
  (b) f(NX, m(A = a) − cf2(n − NX), m(A = a), n) is monotonically decreasing with NX, when cf2 > m(A = a)/n (negative dependence).

The first two conditions are obviously equivalent to the classical axioms (i) and (ii). The only difference is that the behaviour is expressed in terms of leverage δ. This enables the measure to be defined for negative dependencies as well. In addition, it is often easier to check that the conditions hold for a desired function when it is expressed as a function of δ. In practice, this can be done by differentiating f2(NX, δ, NA, N) with respect to δ, where NX, NA, and N are considered constants. If M is an increasing function, then the derivative f2′ = f2′(NX, δ, NA, N) should be f2′ = 0, when δ = 0, f2′ > 0, when δ > 0, and f2′ < 0, when δ < 0. For decreasing M, the signs of the derivative are reversed. We note that it is enough that the function is defined and differentiable in the set of legal values (as noted in Section 3.1).
The third condition is also equivalent to the classical axiom (iii), when extended to both positive and negative dependencies. Before point NX = n·m(XA = a)/m(A = a), M measures positive dependence, and after it, negative dependence. The third condition can be checked by differentiating f(NX, m(XA = a), m(A = a), n) (or – to be exact – the corresponding continuous and differentiable extension of f) with respect to NX, where m(XA = a), m(A = a), and n are constants. For an increasing M, the derivative f′ = f′(NX, m(XA = a), m(A = a), n) should be f′ = 0, when NX = n·m(XA = a)/m(A = a) (independence), f′ < 0, when NX < n·m(XA = a)/m(A = a) (positive dependence), and f′ > 0, when NX > n·m(XA = a)/m(A = a) (negative dependence). For decreasing M, the signs are reversed.
Similarly, the fourth condition is equivalent to the classical axiom (iv), when extended to both positive and negative dependencies. In the case of positive dependence, cf1 corresponds to confidence P(A = a|X), which is kept fixed. In the case of negative dependence, cf2 corresponds to confidence P(A = a|¬X), which is kept fixed. (Now the equation is NXA = m(A = a) − cf2(n − NX) ⇔ m(A = a) − NXA = cf2(n − NX) ⇔ N¬XA = cf2·N¬X.) We require that condition (a) holds for positive dependence (P(A = a|X) > P(A = a)) and (b) for negative dependence (P(A = a|¬X) > P(A = a) ⇔ P(A = a|X) < P(A = a)). However, according to our analysis of the χ²-measure and mutual information (Appendix B.3), both (a) and (b) hold everywhere, where P(A = a|X) ≠ P(A = a) and P(A = a|¬X) ≠ P(A = a). It is still unproved whether this holds for any well-behaving measure, but for the upper bound proofs the above conditions are sufficient.


In practice, the fourth condition can be checked by differentiating functions f(NX, cf1·NX, m(A = a), n) and f(NX, m(A = a) − cf2(n − NX), m(A = a), n) (the corresponding continuous and differentiable extensions) with respect to NX. For increasing M, the derivative f′(NX, cf1·NX, m(A = a), n) should be f′ > 0, when cf1 > m(A = a)/n, and the derivative f′(NX, m(A = a) − cf2(n − NX), m(A = a), n) should be f′ < 0, when cf2 > m(A = a)/n. For decreasing M, the signs are reversed.
As an example, we show that the χ²-measure (Equation (2.6)), mutual information (Equation (2.9)), two versions of the z-score (Equations (2.15) and (2.11)), and the J-measure (Equation (2.12)) are well-behaving. The last three measures are defined only for positive dependencies.

Theorem 3.5 Let S ⊆ ℕ⁴ be defined by constraints 0 < N, 0 < NX < N, 0 < NA < N, and 0 ≤ NXA ≤ min{NX, NA}. Measure M is well-behaving, if it is defined by function

(a) χ²(NX, NXA, NA, N) = N(N·NXA − NX·NA)² / (NX(N − NX)NA(N − NA));

(b) MI(NX, NXA, NA, N) = NXA log(N·NXA/(NX·NA)) + (NX − NXA) log(N(NX − NXA)/(NX(N − NA))) + (NA − NXA) log(N(NA − NXA)/((N − NX)NA)) + (N − NX − NA + NXA) log(N(N − NX − NA + NXA)/((N − NX)(N − NA)));

(c) z1(NX, NXA, NA, N) = √N(N·NXA − NX·NA) / √(NX·NA(N² − NX·NA)), when N·NXA > NX·NA, and 0 otherwise;

(d) z2(NX, NXA, NA, N) = (N·NXA − NX·NA) / √(NX·NA(N − NA)), when N·NXA > NX·NA, and 0 otherwise; and

(e) J(NX, NXA, NA, N) = NXA log(NXA/NA) + (NX − NXA) log((NX − NXA)/(N − NA)) − NX log(NX/N), when N·NXA > NX·NA, and 0 otherwise.

Proof Appendix B.3. □

An example of a non-behaving measure is Shortliffe’s certainty factor (Equation (2.13)), which does not satisfy the fourth condition.
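The differentiation checks described above can be automated with a computer algebra system. The following sketch (ours; it assumes the χ²-form of Theorem 3.5(a) and the SymPy library) verifies conditions (i) and (ii) by differentiating the δ-parameterized form with respect to δ:

import sympy as sp

NX, NA, N = sp.symbols('NX NA N', positive=True)
delta = sp.symbols('delta', real=True)
NXA = NX * NA / N + delta * N   # the delta-parameterization of Definition 3.4
chi2 = N * (N * NXA - NX * NA) ** 2 / (NX * (N - NX) * NA * (N - NA))

d = sp.simplify(sp.diff(chi2, delta))
print(d)   # expected: 2*N**5*delta/(NX*NA*(N - NX)*(N - NA)), up to ordering

The derivative vanishes at δ = 0 and has the sign of δ (the denominator is positive for legal parameter values), so f2 is minimal at independence and monotone on both sides, as conditions (i) and (ii) require.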

3.2.2 Possible frequency values

Before we go to the theoretical results, we introduce a graphical representation, which simplifies the proofs.


Let us now consider the set of all legal frequency values, when the data size n and consequence A = a (with corresponding frequency m(A = a)) are fixed. All legal values of NX and NXA can be represented in a two-dimensional space spanned by variables NX and NXA. Figure 3.1 shows a graphical representation of the space.

Figure 3.1: Two-dimensional space of absolute frequencies NX and NXA, when A = a is fixed. In a given data set of size n, all points (m(X), m(XA = a)) lie in the shaded areas.

In any data set of size N = n, all possible frequency combinations (m(X), m(XA = a)), where m(XA = a) ≤ m(X), must lie in the triangle {(0, 0), (n, n), (n, 0)}. If also the consequence A = a is fixed, with absolute frequency m(A = a), the area of possible combinations is restricted to the shaded areas in Figure 3.1. Boundary line [(0, m(A = a)), (n, m(A = a))] follows from the fact that m(XA = a) ≤ m(A = a), and line [(m(A ≠ a), 0), (n, m(A = a))] from the fact that m(A ≠ a) ≥ m(XA ≠ a) ⇔ m(XA = a) ≥ m(X) − m(A ≠ a). Line [(0, 0), (n, m(A = a))] is called the independence line, because on that line NXA = P(A = a)·NX, i.e. m(XA = a) = P(A = a)·m(X), and the corresponding X and A = a are statistically independent. If the point lies above the independence line, the dependency is positive (m(XA = a) > P(A = a)·m(X)), and below the line, it is negative (m(XA = a) < P(A = a)·m(X)).
Figure 3.2 shows a point (m(X), m(XA = a)) corresponding to rule X → A = a. In this case, the dependency is positive, because (m(X), m(XA = a)) lies above the independence line.

Figure 3.2: Point (m(X), m(XA = a)) corresponding to rule X → A = a. The vertical difference ∆ from the independence line measures the absolute leverage.

The slope of line [(0, 0), (n, n·P(A = a|X))] is the rule confidence P(A = a|X) = m(XA = a)/m(X). The vertical difference between point (m(X), m(XA = a)) and the independence line, marked by ∆, defines the absolute leverage ∆ = nδ. If the dependency is negative and the point lies below the independence line, the leverage is negative.
Figure 3.3 shows how the knowledge on a rule X → A = a can be utilized to determine the possible frequency values of more specific positive dependency rules XQ → A = a. Since m(XQA = a) ≤ m(XA = a), all points (m(XQ), m(XQA = a)) must lie under the line [(0, m(XA = a)), (n, m(XA = a))]. Because the dependencies are positive, they also have to lie above the independence line. In the next section, we will show that for a well-behaving goodness measure M, point (m(XA = a), m(XA = a)) defines an upper bound (or lower bound) for the M-value of any positive dependency rule XQ → A = a.
Figure 3.4 shows the area where all possible points for negative dependency rules XQ → A = a must lie. Once again, m(XQA = a) ≤ m(XA = a), and because the dependence is negative, the points must lie under the independence line.

Figure 3.3: When a point (m(X), m(XA = a)), corresponding to rule X → A = a, is given, the points associated to more specific positive dependency rules XQ → A = a lie in the shaded area.

In addition, the points are restricted by line [(m(X), m(XA = a)), (m(XA ≠ a), 0)]. The reason is that m(XQA = a) = m(XA = a) − m(X¬QA = a), where m(X¬QA = a) ∈ [0, m(XA = a)]. On the other hand, m(XQ) = m(X) − m(X¬QA = a) − m(X¬QA ≠ a) ≤ m(X) − m(X¬QA = a). So, the line contains the maximal possible values NX = m(XQ) and the corresponding NXA = m(XQA = a) for any m(X¬QA = a) ∈ [0, m(XA = a)]. At point (m(X), m(XA = a)), m(X¬QA = a) = 0 holds, and at point (m(XA ≠ a), 0), m(X¬QA = a) = m(XA = a) holds. In the following, we will show that point (m(XA ≠ a), 0) defines an upper bound for the M-value of any negative dependency rule XQ → A = a.
Figures 3.5 and 3.6 show the four axioms graphically, when M is increasing. According to axioms (ii) and (iii), function f increases when it departs from the independence line either horizontally or vertically (Figure 3.5). According to axiom (iv), f increases on lines NXA = cf1·NX, when cf1 > P(A = a), and decreases on lines m(A = a) − NXA = cf2(n − NX), when cf2 > P(A = a) (Figure 3.6).
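The geometry above can also be stated procedurally. This small sketch (ours; the names are illustrative) checks whether a pair (NX, NXA) is a legal frequency combination for a fixed consequence and classifies the dependence by the sign of the leverage:

def is_legal(NX, NXA, mA, n):
    # feasible region of Figure 3.1: 0 < NX < n, NXA at most min(NX, m(A=a)),
    # and NXA at least NX - m(A != a)
    return 0 < NX < n and 0 <= NXA <= min(NX, mA) and NXA >= NX - (n - mA)

def leverage(NX, NXA, mA, n):
    # signed vertical distance from the independence line NXA = P(A=a)*NX, scaled by 1/n
    return NXA / n - NX * mA / (n * n)

# hypothetical counts: m(X)=40, m(XA=a)=30, m(A=a)=60, n=200 -> legal, positive dependence
print(is_legal(40, 30, 60, 200), leverage(40, 30, 60, 200) > 0)  # True True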

Figure 3.4: When a point (m(X), m(XA = a)), corresponding to rule X → A = a, is given, the points associated to more specific negative dependency rules XQ → A = a lie in the shaded area.

Figure 3.5: Arrows show directions where f increases (Axioms (ii) and (iii)).

Figure 3.6: Arrows show directions where f increases (Axiom (iv)).

3.2.3 Bounds for well-behaving measures

Next we prove some useful upper bounds for well-behaving measures. For simplicity, we consider only increasing measures, but for decreasing measures, the lower bounds are met at the same points. First we note a couple of trivial properties which follow from the definition of well-behaving measures.

Theorem 3.6 Let M be a well-behaving, increasing measure. Let S be the set of legal values, as before. When N = n and NA = m(A = a) are fixed, M is defined by function f(NX, NXA, m(A = a), n). For positive dependencies the following hold:
(i) f attains its maximal values in set S on the border defined by points (0, 0), (m(A = a), m(A = a)), and (n, m(A = a)).
(ii) f attains its (global) maximum in set S at point (m(A = a), m(A = a)).
For negative dependencies the following hold:


(i) f attains its maximal values in set S on the border defined by points (0, 0), (m(A ≠ a), 0), and (n, m(A = a)).
(ii) f attains its maximum in set S at point (m(A ≠ a), 0).

Proof Let us first consider positive dependencies. (i) Since M is well-behaving, f is an increasing function of δ, when δ ≥ 0, at any point NX. Therefore, it gets its maximal value on the mentioned border. (ii) When NXA = m(XA = a) is fixed, M being well-behaving, f is a decreasing function of NX, when NX ≤ n·m(XA = a)/m(A = a). Therefore, f is decreasing on line [(m(A = a), m(A = a)), (n, m(A = a))]. On the other hand, we know that for well-behaving M, f(NX, cf·NX, m(A = a), n) is increasing with NX, when cf > m(A = a)/n. When cf = 1, f(NX, cf·NX, m(A = a), n) coincides with line [(0, 0), (m(A = a), m(A = a))]. Therefore, f gets its maximum value, when NX = NXA = m(A = a).
For negative dependencies, the proof is similar. The only notable exception is that now f is increasing on line [(0, 0), (m(A ≠ a), 0)] and decreasing on line [(m(A ≠ a), 0), (n, m(A = a))]. □

This result can already be used for pruning in two ways. In the beginning, some of the possible consequences A = a may be pruned out. Given a minimum threshold minM, A = a cannot occur in the consequence of any sufficiently good positive dependency rule, if ub1 = f(m(A = a), m(A = a), m(A = a), n) < minM. Similarly, A = a cannot occur in the consequence of any sufficiently good negative dependency rule, if ub1 = f(m(A ≠ a), 0, m(A = a), n) < minM. We note that attribute A can still occur in the antecedent of good rules.

Table 3.1: Upper bound ub1 = f(m(A = a), m(A = a), m(A = a), n) for common measures.

measure   ub1
χ²        n
MI        −m(A = a) log(P(A = a)) − m(A ≠ a) log(P(A ≠ a))
z1        √m(A ≠ a) / √(1 + P(A = a))
z2        √m(A ≠ a)
J         −m(A = a) log(P(A = a))
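As a sketch (ours, not the thesis implementation) of how the ub1 entries arise directly from Theorem 3.6, the bound is simply the measure evaluated at the extreme point (m(A = a), m(A = a), m(A = a), n); for mutual information this reproduces the closed form in Table 3.1:

import math

def mi(NX, NXA, NA, N):
    # MI as in Theorem 3.5(b); the convention 0*log(0) = 0 is used
    def term(a, b):
        return 0.0 if a == 0 else a * math.log(a / b)
    return (term(NXA, NX * NA / N) + term(NX - NXA, NX * (N - NA) / N)
            + term(NA - NXA, (N - NX) * NA / N)
            + term(N - NX - NA + NXA, (N - NX) * (N - NA) / N))

def ub1(f, mA, n):
    # best possible value when m(X) = m(XA=a) = m(A=a) (Theorem 3.6)
    return f(mA, mA, mA, n)

mA, n = 60, 200   # hypothetical values
closed_form = -mA * math.log(mA / n) - (n - mA) * math.log((n - mA) / n)
print(abs(ub1(mi, mA, n) - closed_form) < 1e-9)  # True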

Table 3.1 gives examples of ub1 with common goodness measures for positive dependencies. For the χ2 -measure, the upper bound is equal to


max{χ²(·)} (the maximal possible value in the given data set), and no pruning is possible. For the mutual information MI, ub1 is a function of P(A = a), as desired. The function gets its best value, when P(A = a) = 0.5. If P(A = a) deviates too much from 0.5, ub1 < minM, and A = a cannot occur in the consequence of any significant rule. Because MI is symmetric with respect to A = a and A ≠ a, also A ≠ a is an impossible consequence. For the z-scores, z1 and z2, ub1 depends on P(A = a), but the best value is achieved, when P(A = a) is minimal. If P(A = a) is too large, ub1 < minM, and A = a is not a possible consequence. For the J-measure, ub1 is also a function of P(A = a), but now the best value is achieved, when P(A = a) = e⁻¹. If P(A = a) deviates too much from e⁻¹, then A = a is not a possible consequence.
The second pruning opportunity occurs, when only m(X) is known, but m(XA = a) is unknown. Now we can estimate an upper bound ub2 for both X → A = a and all its specializations XQ → A = a, by substituting the best possible value for NXA. In the case of positive dependencies, the best possible value for NXA is min{m(X), m(A = a)}, and in the case of negative dependencies, it is max{0, m(X) − m(A ≠ a)}. In practice, this means that when m(X) < m(A = a), the best possible M-value for positive dependence (ub1 at point (m(A = a), m(A = a))) cannot be achieved any more, and ub2 = f(m(X), m(X), m(A = a), n) gives a tighter upper bound than ub1. In the case of negative dependencies, the same happens when m(X) < m(A ≠ a), and point (m(A ≠ a), 0) is no longer reachable. Now ub2 = f(m(X), 0, m(A = a), n) < ub1 can be used for pruning.

Table 3.2: Upper bound ub2 = f(m(X), m(X), m(A = a), n) for common measures, when m(X) < m(A = a).

measure   ub2
χ²        n·P(X)P(A ≠ a) / (P(¬X)P(A = a))
MI        n log( (P(A = a) − P(X))^(P(A = a) − P(X)) / (P(A = a)^P(A = a) · P(¬X)^P(¬X)) )
z1        √m(X) · P(A ≠ a) / √(P(A = a)(1 − P(X)P(A = a)))
z2        √(m(X)P(A ≠ a)) / √P(A = a)
J         −m(X) log(P(A = a))

Table 3.2 gives examples of ub2 with common goodness measures for positive dependencies, when m(X) < m(A = a) and thus ub2 < ub1 . In the


beginning, when only the P(Ai)s, Ai ∈ R, are known, these upper bounds can be used to prune out possible consequences Ai = ai for rule B → Ai = ai and any of its specializations, given that P(B) < P(Ai = ai).
Based on Theorem 3.6, we can prove the following observation, which enables extra pruning with some goodness measures, like the mutual information. The importance of this observation is that it can prune out attributes also from the rule antecedents, while upper bounds ub1 and ub2 can prune only possible consequences.

Observation 3.7 Let M be a well-behaving increasing (decreasing) goodness measure defined by function f, and let N = n be fixed. If
(i) f is exchangeable, i.e. f(NX, NXA, NA, n) = f(NA, NXA, NX, n),
(ii) f(NA, NA, NA, n) is monotonically increasing (decreasing) in the interval [0, Nopt], Nopt < n, and
(iii) f(m(A), m(A), m(A), n) < minM (or > maxM) for some m(A) ≤ Nopt,
then A cannot occur in the antecedent of any significant rule.

Proof Let the conditions be true for some A, m(A) ≤ Nopt. Then for any rule XA → B = b, B ∈ R, B ≠ A, b ∈ {0, 1}, and X ⊆ R \ {A, B}, the following holds:

M(XA → B = b) = f(m(XA), m(XAB = b), m(B = b), n)
              = f(m(B = b), m(XAB = b), m(XA), n)    (Condition (i))
              ≤ f(m(B = b), m(B = b), m(B = b), n)   (Theorem 3.6)
              ≤ f(m(A), m(A), m(A), n) ≤ minM        (Conditions (ii) and (iii)). □

When the measure is MI, the optimal value is Nopt = n/2. When NA ≤ n/2, function f(NA, NA, NA, n) is increasing, and when NA > n/2, it is decreasing. Therefore, all consequences C = c, m(C = c) ≤ m(A), have an at most as large upper bound for any M(Y → C = c) as M(X → A) has. Because MI is exchangeable, M(XA → B = b) = M(B = b → XA) and m(XA) ≤ m(A). Therefore, any A which is not a possible consequence and for which m(A) ≤ n/2 cannot occur in the antecedent of any rule XA → B = b. We note that it is not necessary that f(NA, NA, NA, n) attains its global optimum at point Nopt (as the measure MI does); it is sufficient that Nopt is the optimum in the range [0, Nopt]. Another note is that this observation cannot be applied to the J-measure, even if f(NA, NA, NA, n) is


increasing in the interval [0, n/e]. The problem is that the J-measure is not exchangeable; generally J(X → A) ≠ J(A → X).
The next theorem gives an upper bound for any positive or negative dependency rule, when a more general rule is already known. The resulting upper bound ub3 is always at least as tight as the corresponding ub2.

Theorem 3.8 Let n and m(A = a) be fixed and M, S, and f be as before. Given m(X) and m(XA = a) and an arbitrary attribute set Q ⊆ R \ (X ∪ {A}),
(a) for positive dependency XQ → A = a it holds that f(m(XQ), m(XQA = a), m(A = a), n) ≤ f(m(XA = a), m(XA = a), m(A = a), n), and
(b) for negative dependency XQ → A = a it holds that f(m(XQ), m(XQA = a), m(A = a), n) ≤ f(m(XA ≠ a), 0, m(A = a), n).

Proof (a) (Positive dependence) Figure 3.3 shows the area where possible points (m(XQ), m(XQA = a)) for positive dependence can lie. With any NX ≤ m(X), the maximum is achieved on the border defined by points (0, 0), (m(XA = a), m(XA = a)), and (m(X), m(XA = a)) (δ is maximal). On line [(0, 0), (m(XA = a), m(XA = a))], f is increasing, and on line [(m(XA = a), m(XA = a)), (m(X), m(XA = a))], it is decreasing. Therefore, the global maximum is achieved at point (m(XA = a), m(XA = a)).
(b) (Negative dependence) Figure 3.4 shows the area where possible points (m(XQ), m(XQA = a)) for negative dependence can lie. Once again, f gets its maximal value for any NX, when −δ is maximal. Therefore, the maximum must lie on the border defined by points (0, 0), (m(XA ≠ a), 0), and the intersection point pint. On line [(0, 0), (m(XA ≠ a), 0)], f is increasing, and the maximal value is achieved at point (m(XA ≠ a), 0). Therefore, it suffices to show that f gets its maximum on line [pint, (m(XA ≠ a), 0)] at the same point, (m(XA ≠ a), 0). Figure 3.7 shows the proof idea. For any point p1 on line [pint, (m(XA ≠ a), 0)], we can define a line of the form m(A = a) − NXA = cf2(n − NX), which goes through p1. Because p1 is under the independence line, cf2 > P(A = a), and line m(A = a) − NXA = cf2(n − NX) intersects the NX-axis at some point p2 in the interval [(0, 0), (m(XA ≠ a), 0)]. According to the definition of well-behaving measures, f is


decreasing on line m(A = a) − NXA = cf2(n − NX), and therefore f gets a better value at point p2 than at point p1. On the other hand, we already know that f gets a better value at (m(XA ≠ a), 0) than at any point p2. Therefore, f must attain its upper bound at point (m(XA ≠ a), 0). □


Figure 3.7: Proof idea (Theorem 3.8). Function f is better at p2 than at p1 and better at point (m(XA ≠ a), 0) than at p2.

Theorem 3.8 (upper bound ub3) enables more effective pruning than Theorem 3.6 (upper bounds ub1 and ub2), because now pruning is possible even if m(X) > m(A = a) or m(X) > m(A ≠ a). The resulting upper bounds ub3 are also tight in the sense that there can be rules XQ → A = a which reach their ub3 values. Table 3.3 gives examples of ub3 with common goodness measures for positive dependencies.
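A small numerical sketch (ours; the counts are hypothetical) of how the three bounds tighten for the χ²-measure as more information about the antecedent becomes available:

def chi2(NX, NXA, NA, N):
    return N * (N * NXA - NX * NA) ** 2 / (NX * (N - NX) * NA * (N - NA))

mA, mX, mXA, n = 60, 40, 30, 200       # m(A=a), m(X), m(XA=a), data size
ub1 = chi2(mA, mA, mA, n)              # only m(A=a) known
ub2 = chi2(mX, mX, mA, n)              # m(X) known and m(X) < m(A=a)
ub3 = chi2(mXA, mXA, mA, n)            # m(XA=a) known (Theorem 3.8)
print(ub1 >= ub2 >= ub3)               # True: each bound is at least as tight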

3.3 Lower bounds for Fisher's pF

We assume that Fisher’s pF is also a well-behaving measure, but it is difficult to prove. Therefore, we will prove the required lower bounds lb1 , lb2 , and lb3 directly.


Table 3.3: Upper bound ub3 = f(m(XA = a), m(XA = a), m(A = a), n) for common measures.

measure   ub3
χ²        n·P(XA = a)P(A ≠ a) / ((1 − P(XA = a))P(A = a))
MI        n log( P(¬XA = a)^P(¬XA = a) / (P(A = a)^P(A = a) · (1 − P(XA = a))^(1 − P(XA = a))) )
z1        √m(XA = a) · P(A ≠ a) / √(P(A = a)(1 − P(XA = a)P(A = a)))
z2        √(m(XA = a)P(A ≠ a)) / √P(A = a)
J         −m(XA = a) log(P(A = a))

For the first two cases (lb1 and lb2), the proofs are quite simple and likely to be known also in the previous research. However, to the best of our knowledge, the proof for the third lower bound lb3 is a new result.

Theorem 3.9 Let us notate pabs = m(A)!m(¬A)!/n!, where n is the number of rows. For any attribute A ∈ R and X ⊆ R \ {A}, pF(X → A) ≥ pabs and pF(X → ¬A) ≥ pabs.

Proof pF can be expressed as

pF(X → A) = pabs · Σ_{i=0}^{J1} C(m(X), m(XA) + i) · C(m(¬X), m(¬X¬A) + i),

where C(m, l) denotes the binomial coefficient.

Since the sum is always ≥ 1, the minimum value is pabs. The case pF(X → ¬A) is similar. □

This means that lb1 = pabs is the absolute minimum value for any rule where either A or ¬A is the consequence. The minimum value is achieved, when m(X) = m(A = a) = m(XA = a). The pabs-value is lowest, when m(A) is closest to n/2. On the other hand, if m(A) or m(¬A) is very low, pabs can be so large that no rule with consequence A or ¬A could be significant. On the algorithmic level this means that A and ¬A can be marked as impossible consequences, and A can occur at most in the antecedent part of a significant rule.

Theorem 3.10 Let R be as before. For all X ⊊ R, A ∈ R \ X, a ∈ {0, 1}, and Q ⊆ R \ (X ∪ {A}), the following holds:


If m(X) ≤ m(A = a), then

pF(XQ → A = a) ≥ m(¬X)!m(A = a)! / (n!(m(A = a) − m(X))!).

Proof pF(XQ → A = a) is of the form t0 + ... + tJ1, where

ti = pabs · C(m(XQ), m(XQA = a) + i) · C(n − m(XQ), m(¬(XQ)A ≠ a) + i).

Therefore, pF ≥ tJ1, where J1 = min{m(XQA ≠ a), m(¬(XQ)A = a)}. Because m(XQ) ≤ m(X) ≤ m(A = a), J1 = m(XQA ≠ a) and pF has a lower bound tJ1 = pabs · C(n − m(XQ), m(A ≠ a)). Because C(m, l) is an increasing function of m, tJ1 ≥ pabs · C(m(¬X), m(A ≠ a)), which is equal to m(¬X)!m(A = a)!/(n!(m(A = a) − m(X))!). □

This result defines the second lower bound lb2. However, it applies only when m(X) ≤ m(A = a). If m(X) > m(A = a), it is possible that there is a superset XQ such that m(XQ) = m(A = a) = m(XQA = a) and pF achieves its absolute minimum value pabs.
When m(XA = a) is known, the following result gives a tighter (larger) lower bound lb3:

Theorem 3.11 Let R be as before. For all X ⊊ R, A ∈ R \ X, a ∈ {0, 1}, and Q ⊆ R \ (X ∪ {A}), the following holds:

pF(XQ → A = a) ≥ pabs · C(n − m(XA = a), m(A ≠ a)).

Proof In this proof, we use the fact that for a positive dependency rule X → A = a, m(XA = a) > m(X)m(A = a)/n holds (i.e. γ > 1). We notice that it is enough to show that pF(X → A = a) ≥ pabs · C(n − m(XA = a), m(A ≠ a)), because for any Q ⊆ R \ (X ∪ {A}),

C(n − m(XQA = a), m(A ≠ a)) ≥ C(n − m(XA = a), m(A ≠ a)).

Let us notate pF = pabs pX . Because pabs is a constant, when the consequent is fixed, it can be omitted. For clarity, we present the proof for case a = 1. The same result is achieved for a = 0, when A and ¬A are reversed. If m(X) = m(XA), then pX (X → A) contains just one term, which is equal to its lower bound. Otherwise, pX (X → A) is a sum of several terms,


but it is enough to show that the first term t0 satisfies:

t0 = C(m(X), m(XA)) · C(m(¬X), m(¬X¬A)) ≥ C(n − m(XA), m(¬A))
⇔ m(X)!m(¬A)! / (m(XA)!m(X¬A)!(m(¬A) − m(X¬A))!) ≥ (n − m(XA))! / m(¬X)!
⇔ [m(X) · ... · (m(XA) + 1)][m(¬A) · ... · (m(¬A) − m(X¬A) + 1)] ≥ [(n − m(XA)) · ... · (m(¬X) + 1)] · m(X¬A)!
⇔ ∏_{i=0}^{m(X¬A)−1} (m(X) − i)(m(¬A) − i) ≥ ∏_{i=0}^{m(X¬A)−1} (n − m(XA) − i)(m(X¬A) − i).

This is true, because for all i = 0, ..., m(X¬A) − 1,

(m(X) − i)(m(¬A) − i) ≥ (n − m(XA) − i)(m(X¬A) − i)
⇔ m(X)m(¬A) − i·m(X) − i·m(¬A) + i² ≥ m(X)(n − m(XA)) − m(XA)(n − m(XA)) − i(n − m(XA)) − i·m(X) + i·m(XA) + i²
⇔ −m(X)m(A) + i·m(A) + m(X)m(XA) + n·m(XA) − m(XA)² − 2i·m(XA) ≥ 0.

Because n·m(XA) > m(X)m(A) and i·m(A) − 2i·m(XA) ≥ −i·m(XA), it is sufficient to show that

m(X)m(XA) − m(XA)² − i·m(XA) ≥ 0 ⇔ m(XA)m(X¬A) ≥ i·m(XA) ⇔ m(X¬A) ≥ i,

which is true. □

Table 3.4: Lower bounds lb1, lb2, and lb3 for Fisher's pF.

lb1 = pabs = m(A)!m(¬A)!/n!
lb2 = m(¬X)!m(A = a)! / (n!(m(A = a) − m(X))!)
lb3 = pabs · C(n − m(XA = a), m(A ≠ a))

The lower bounds lb1, lb2, and lb3 for Fisher's pF are summarized in Table 3.4. In addition, we can use Observation 3.7 to prune some attributes completely before the search begins. Like MI, pF is exchangeable, and f(NA, NA, NA, n) gets its optimum value in the whole range [0, n], when NA = n/2. Therefore, attribute A cannot occur in the antecedent or consequence of any significant rule, if pabs > maxM and P(A) ≤ 0.5.
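To summarize the bounds computationally, the following sketch (ours; not the Kingfisher implementation) evaluates lb1, lb2, and lb3 of Table 3.4 with exact integer arithmetic; C(m, l) is the binomial coefficient:

from math import comb, factorial

def p_abs(mA, n):
    # lb1: the absolute minimum of Fisher's pF for consequence A (Theorem 3.9)
    return factorial(mA) * factorial(n - mA) / factorial(n)

def lb2(mX, mA, n):
    # Theorem 3.10; applicable when m(X) <= m(A=a)
    return factorial(n - mX) * factorial(mA) / (factorial(n) * factorial(mA - mX))

def lb3(mXA, mA, n):
    # Theorem 3.11: p_abs * C(n - m(XA=a), m(A != a))
    return p_abs(mA, n) * comb(n - mXA, n - mA)

# hypothetical counts: n=50, m(A=a)=15, m(X)=10, m(XA=a)=8
print(p_abs(15, 50) <= lb2(10, 15, 50) <= lb3(8, 15, 50) <= 1.0)  # True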

3.4 Pruning redundant rules

Next, we will introduce extra pruning strategies which can be used with both well-behaving measures and Fisher's pF. These strategies can be applied in the special case when a rule X → A = a with P(A = a|X) = 1 has been discovered. Such rules are examples of minimal rules. Generally, the minimality is defined as follows:

Definition 3.12 (Minimality (with respect to M)) Let M be an increasing (a decreasing) goodness measure. Rule X → A = a is minimal, if for all rules Z → A = a such that Z ⊋ X, M(Z → A = a) ≤ M(X → A = a) (M(Z → A = a) ≥ M(X → A = a)) holds. Otherwise, rule X → A = a is non-minimal.

This means that all specializations of a minimal rule X → A = a are at most as good as X → A = a itself. We notice that a minimal rule does not have to be significant or non-redundant itself. Finding a minimal rule is always desirable, because it means that there is no reason to check any of its specializations.
Based on the definition of well-behaving measures (Axiom (iv)), a rule with confidence P(A = a|X) = 1 is always minimal. The same holds for Fisher's pF by Equation (3.11). In addition, there can be minimal rules with lower confidences, but they are harder to identify and we do not have any special strategies for utilizing them. In practice, they behave like any other non-redundant rules, whose specializations can be pruned out based on the upper or lower bounds of their M-values.
Let us now concentrate on minimal rules with P(A = a|X) = 1. First, we notice that for all specializations XQ → A = a, P(A = a|XQ) = 1 and P(A ≠ a|X) = 0 hold. This trivial property means that we do not have to check any more special rules with consequence A = a or A ≠ a. However, we can also prune out all positive dependency rules of the form XQA → B = b, where B ∉ X. The following theorem shows that all such rules will be either redundant or insignificant, if P(A = a|X) = 1. In previous research, Li [45] has shown this result already for several measures (including confidence and lift), when the consequences are positive-valued. Here we extend it to cover both positive and negative consequences and all well-behaving measures as well as Fisher's pF.

Theorem 3.13 Let M be a well-behaving measure or Fisher's pF. Let R be as before. If P(A = a|X) = 1, then all positive dependency rules XQA → B = b, where B ∉ X, Q ⊆ R \ (X ∪ {A, B}), and b ∈ {0, 1}, are either redundant or insignificant.


Proof First we notice that if P(¬A|X) = 1, then P(XQAB = b) = 0 for all Q. Therefore, rule XQA → B = b has zero frequency and cannot be significant. So it is enough to consider the case when P(A|X) = 1.
When P(A|X) = 1, also P(A|XZ) = 1 for all Z ⊆ R \ (X ∪ {A}). Therefore, P(XQAB = b) = P(XQB = b) and P(XQA) = P(XQ). Now rule XQA → B = b has the same M-value as the more general rule XQ → B = b, and is therefore redundant:

M(XQA → B = b) = f(m(XQA), m(XQAB = b), m(B = b), n) = f(m(XQ), m(XQB = b), m(B = b), n) = M(XQ → B = b).

The same happens when the goodness measure is pF. Now pabs = C(n, m(B = b))⁻¹ and

pF(XQA → B = b) = pabs · Σ_{i=0}^{J1} C(m(XQA), m(XQAB = b) + i) · C(n − m(XQA), m(¬(XQA)B ≠ b) + i).

This is equivalent to

pF(XQ → B = b) = pabs · Σ_{i=0}^{J1} C(m(XQ), m(XQB = b) + i) · C(n − m(XQ), m(¬(XQ)B ≠ b) + i),

because J1 = min{m(XQAB ≠ b), m(¬(XQA)B = b)} = min{m(XQB ≠ b), m(¬(XQ)B = b)} and m(XQA) = m(XQ), m(XQAB = b) = m(XQB = b), and m(¬(XQA)B ≠ b) = m(¬(XQ)B ≠ b). Therefore XQA → B = b is redundant with respect to XQ → B = b. □

4.1

Basic branch-and-bound strategy

The whole search space for dependency rules on attributes R can be represented by a complete enumeration tree. A complete enumeration tree lists all possible attribute sets in P(R). From each set X ∈ P(R), we can generate rules X \ {Ai } → Ai = ai , Ai ∈ X and ai ∈ {0, 1}, and therefore a complete enumeration tree also represents implicitly the set of all possible dependency rules. Figures 4.1 and 4.2 show two examples of complete enumeration trees, when R = {A, ..., E}. In the following, we will call the trees with the largest branches on the left (Figure 4.1) type 1 trees, and the trees with the largest branches on the right (Figure 4.2) type 2 trees. Other 69

70

4 Search algorithms for non-redundant dependency rules

kinds of enumeration trees are also possible, but these two types of trees are easiest to generate and search. Each node v in an enumeration tree corresponds to set X = {A1 , . . . , Al } ⊆ R, where A1 , . . . , Al are the labels occurring on the path from the root node to node v. In the following, we will call the nodes corresponding to sets Yi , where Yi Ai = X for some Ai ∈ X, the parent nodes of node v. For example, in Figures 4.1 and 4.2, nodes corresponding to sets AC, AD and CD are the parents of the node corresponding to set ACD. The parent node, under which v is actually linked, is called the immediate parent of v. In Figure 4.1, the node corresponding to set AC is the immediate parent of the node corresponding to set ACD, but in Figure 4.2, the immediate parent corresponds to set CD. In the node for ACD, we can generate rules AC → D, AC → ¬D, AD → C, AD → ¬C, CD → A, and CD → ¬A. In addition, we should check, which consequences A, ¬A, . . . , E, ¬E are possible in the supersets of ACD (i.e. in sets ACDE, ABCD, and ABCDE). If a consequence was impossible in any of the parent nodes, then it is impossible also in the child node. If no consequence is possible in a node, then the node and the subtree under it can be pruned out.

Type 1 tree A

B C C D D

E

E

E

D

E

D E

B

E

E

C

D E

D

E

C

E

E

D E

D

E

E

E

E

Figure 4.1: A complete enumeration tree of type 1 on attributes A, ..., E. Arrows show the parent nodes of the node corresponding to set ACD.

In practice, the enumeration tree can be generated – either level by level or branch by branch – as the search proceeds. However, it is not necessary to generate the enumeration tree explicitly (as a data structure where the attribute sets are stored), but the search usually still proceeds in the same manner. For example, in the classical breadth-first search (like the Apriori algorithm for frequent sets [3, 51]), one first checks all sets containing a single attribute (1-sets), then all sets containing two attributes (2-sets), etc. This corresponds to traversing an enumeration tree level by level, from top to bottom.



Figure 4.2: A complete enumeration tree of type 2 on attributes A, .., E. Arrows show the parent nodes of the node corresponding to set ACD.

In the following algorithms, the enumeration tree is actually generated, because we need a storage structure anyway to keep a record of the possible consequences in the previous-level sets. When the attributes are ordered, the parent nodes of any node (see an example in Figures 4.1 and 4.2) are easily located, e.g. with binary search.
Since the size of the search space (the complete enumeration tree) is in the worst case exponential, the main problem is how to traverse as small an enumeration tree as possible without losing any significant dependency rules. The main strategy is an application of the branch-and-bound method, where new subtrees are generated only if they can contain sufficiently good, non-redundant rules with respect to already discovered rules. For this purpose, we need the upper bounds (lower bounds) for the goodness measure M, which define the best possible M-value for any rule (XQ) \ {Ai} → Ai = ai, Ai ∈ R, ai ∈ {0, 1}, Q ⊆ R \ X, given the information which is available in set X.
The branch-and-bound search can be implemented either in a breadth-first or a depth-first manner. The main algorithm for the breadth-first search is given in Algorithm 1 and for the depth-first search in Algorithms 2 (the main program) and 3 (recursive function). For clarity, we use the notation v.set to refer to the set which contains all attributes from the root of the


enumeration tree to node v. In practice, it is enough to save just the last added attribute into v.
In both search strategies, the first task is to create an empty enumeration tree t and add all single attributes as its children. For all Ai ∈ R, we create a node si and determine the possible consequences Aj = aj, Aj ∈ R, aj ∈ {0, 1}, for set {Ai} and its supersets X ∪ {Ai} (function initialize(si)). In this phase, it is already possible to prune out some attributes, if they cannot occur in any significant dependency rules. After that, the breadth-first search begins to expand the tree level by level, while the depth-first search expands it branch by branch. We do not yet go into the details of how a set X can be expanded, but the idea is that all attribute sets should occur at most once in the enumeration tree. This is checked in condition Possible(XAi). In addition, if we already know that all rules in X's supersets will be redundant or insignificant, then Possible(XAi) returns zero for all Ai. In practice, the expansion can be coded in a more elegant way. For example, if the order of attributes is fixed, then we can add an attribute Ai to set X only if Ai is in the canonical order after all attributes in X.
We note that in the breadth-first search, condition (num ≥ l + 1) merely guarantees that we have at least l + 1 l-sets before any (l + 1)-sets are generated. The reason is that each (l + 1)-set has l + 1 parent sets, and each parent set exists in the tree only if its supersets can produce non-redundant, significant rules. For clarity, we assume that all discovered rules are stored into a separate collection, and therefore all nodes containing only minimal rules or having too small upper bounds (too large lower bounds) for any non-redundant, significant rules can be deleted. An alternative is to keep also the discovered rules in the tree until the search ends. In this case, we have to mark the leaf nodes which cannot have any child nodes.
All rules which can be derived from set X = v.set or its supersets are evaluated by process(v). Function process checks all v's parent nodes and combines their information on possible consequences, calculates X's frequency, estimates upper or lower bounds for M((XQ) \ {Ai} → Ai = ai) for all possible consequences Ai = ai, and decides which consequences remain possible. If no consequence is possible, node v can be removed. Otherwise, the goodness of rules X \ {Ai} → Ai = ai for all possible consequences Ai = ai, where Ai ∈ X, is estimated by measure M, and the redundancy or minimality of the rules is checked.
For efficiency, an important concern is in which order the enumeration tree should be traversed. Should we proceed in a breadth-first manner (level by level, from top to bottom) or in a depth-first manner (branch


Algorithm 1 BreadthFirst
1  create root t
2  for all Ai ∈ R
3    add child node vi for set {Ai}
4    initialize(vi)
5  l ← 1; num ← |R|
6  while (num ≥ l + 1)
7    for all nodes w at level l
8      X ← w.set
9      for all Ai ∈ R \ X
10       if (Possible(XAi))
11         create child node v for set XAi
12         if (process(v) = 0)
13           delete node v
14   l ← l + 1
15   num ← number of created sets
16 output discovered rules

Algorithm 2 DepthFirst
1  create root t
2  for all Ai ∈ R
3    add child node vi for set {Ai}
4    initialize(vi)
5  for all children vi
6    dfsearch(vi)
7  output discovered rules

by branch, left to right)? Should we use a type 1 tree or a type 2 tree? And should the attributes be ordered in an ascending or descending order by their frequency? Assuming that the search proceeds from left to right and from top to bottom, we have four alternatives:
1. type 1 tree, when the attributes are in an ascending order, i.e. P(A) ≤ P(B) ≤ ... ≤ P(E);
2. type 1 tree, when the attributes are in a descending order, i.e. P(A) ≥ P(B) ≥ ... ≥ P(E);


Algorithm 3 dfsearch(v)
1  if (process(v) = 0)   // message to the call level that v can be deleted
2    return 0
3  X ← v.set
4  for all Ai ∈ R \ X
5    if (Possible(XAi))
6      create child node w for set XAi
7      if (dfsearch(w) = 0)
8        delete node w
9  return 1

3. type 2 tree, when the attributes are in an ascending order, i.e. P(A) ≤ P(B) ≤ ... ≤ P(E); and
4. type 2 tree, when the attributes are in a descending order, i.e. P(A) ≥ P(B) ≥ ... ≥ P(E).
All four alternatives are possible with the breadth-first search, but a type 1 tree is not sensible with the depth-first search. The problem is that in the type 1 tree, a set (e.g. ACD in Figure 4.1) would be checked before all of its parents had been checked (sets AD and CD in the figure). This causes unnecessary extra work, if some of the parents contain better rules or would be deleted as useless. In the type 2 tree, this is avoided, but the depth-first search can still be inefficient, if we want to find only the K best non-redundant rules. The reason is that the best rules tend to occur on the first levels (with the highest frequencies), and there is no need to check the first branches in depth, if the K best rules are to be found on the first levels.
In all four alternatives, the most important concern is to prevent the largest subtree from growing, since it can contain half of the nodes. When the largest subtree has the least frequent attribute in its root, it is also likely to stay small. Therefore, we prefer alternatives 1 and 4 by default. Alternatives 1 and 4 differ from each other in only one aspect. If we proceed through a level from left to right, then alternative 4 first checks the sets of the most common attributes, while alternative 1 begins with the sets of the rarest attributes. The latter can be useful, if we search for only the K best rules. If the measure M favours rules with a large lift but smaller frequency compared to more frequent rules with a smaller lift, then it is possible to find larger (smaller, if M is decreasing) M-values and update


minM (maxM) in the beginning of the level. The final reason to prefer alternative 1 is that it enables effective extra pruning together with the breadth-first search, as explained in the next section. Therefore, we will consider only alternative 1 (a type 1 tree with attributes in ascending order by frequency) in the following.

4.2 The Lapis Philosophorum principle

The basic branch-and-bound search prunes possible consequences only in the subtrees of a given node. However, it is also possible to prune consequences in the parent nodes and propagate the results to other subtrees. This requires that the node is processed before any children are generated for its parents, except the immediate parent. Figure 4.1 shows an example. When we proceed level by level, from left to right, set ACD is processed before any children are created for its parent sets AD and CD. If set ACD is now removed (e.g. it has a zero frequency, which means that rule AD → ¬C was minimal), then C and ¬C become impossible consequences in the node for AD and its subtrees. Similarly, if we find in the node for ACD that no rule QCD → A (for any Q ⊆ R \ {A, C, D}) could be significant and non-redundant, then A can be marked as an impossible consequence in the node for CD and its subtrees. This simple principle prunes so effectively that it is called Lapis Philosophorum, the legendary Philosopher's Stone.

Principle 4.1 (Lapis Philosophorum) Let qXA be a node corresponding to set XA and qX a node corresponding to set X, as shown in Figure 4.3. If any of the following conditions holds in node qXA, consequence A = a can be marked as impossible in node qX and all nodes qXQ in its subtrees:
(i) Node qXA does not exist (i.e. no consequence was possible in set XA or its supersets).
(ii) In node qXA we find that rule X → A = a is minimal.
(iii) In node qXA we find that rule XQ → A = a would be insignificant or redundant, i.e.
  (a) for an increasing goodness measure M, upper bound ub = UB(M(XQ → A = a)) < minM or ub ≤ max{M(Y → A = a) | Y ⊊ X};
  (b) for a decreasing goodness measure M, lower bound lb = LB(M(XQ → A = a)) > maxM or lb ≥ min{M(Y → A = a) | Y ⊊ X}.



Figure 4.3: Lapis Philosophorum principle. If consequence A = a is impossible in node qXA, then it is impossible in nodes qX and qXQ. Here |X| = l. Dashed lines represent paths.

The principle is based on the fact that all supersets XQA, Q ⊆ R \(X ∪ {A}), lie in the subtree under qXA , and qXQ is needed only as a parent node for rules XQ → A = a. The Lapis Philosophorum principle is especially effective, when we search positive dependency rules of the form X → A. Now all rules with the least frequent consequences Ai occur in the first subtrees on the left. If set X occurs on the right-most subtrees, it is likely to have larger frequency, P (X) > P (A). Therefore, the upper bound ub2 (Table 3.2) cannot be used for pruning. On the other hand, the upper bound ub3 (Table 3.3) cannot be used for pruning, either, because P (XA) is not known in node qX . So, in the worst case, all possible consequences in node qX are such Ai that P (Ai ) < P (X). The basic breadth-first search does not offer any means to recognize, when Ai becomes an impossible consequence for X and its supersets. In the worst case, none of the consequences Ai is possible, and X is expanded in vain. With the Lapis Philosophorum principle this is


recognized immediately and node qX can be deleted.
If we also search for positive dependencies of the form X → ¬A (i.e. negative dependencies between X and A with certain measures), then infrequent consequences A = a can occur also in the right subtrees. However, if A occurs on the right side of the node qX, then A is added under qX, and ub3 (Table 3.3) can be used for pruning out consequence ¬A. If A occurs on the left side of the node qX, then there are two alternatives: if P(¬A) > P(X), then we can use ub2 (Table 3.2) for pruning; if P(¬A) ≤ P(X), then we can use only Lapis Philosophorum.
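A compact sketch (ours; deliberately simplified bookkeeping, not the actual Kingfisher code) of the propagation step: when any of the conditions of Principle 4.1 holds in node qXA, the possibility flag for consequence A = a is cleared in the parent node qX, so every future subtree under qX skips it:

def lapis_philosophorum(q_X, A, a, node_exists, rule_minimal, ub, minM, best_general):
    # Conditions (i)-(iii) of Principle 4.1 for an increasing goodness measure M
    impossible = ((not node_exists) or rule_minimal
                  or ub < minM or ub <= best_general)
    if impossible:
        # mark A = a impossible in q_X; inherited by all nodes q_XQ below it
        q_X['possible'][(A, a)] = False
    return impossible

# hypothetical node: q_XA was deleted, so condition (i) applies
q_X = {'possible': {('A', 1): True}}
print(lapis_philosophorum(q_X, 'A', 1, node_exists=False, rule_minimal=False,
                          ub=0.0, minM=3.84, best_general=0.0))   # True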

4.3 Algorithm

Next, we give an efficient breadth-first algorithm for searching for the best non-redundant rules of the form X → A = a, a ∈ {0, 1}, with a well-behaving goodness measure M. The algorithm is called Kingfisher, because it was originally developed for searching for dependency rules with Fisher's pF. However, the same search method applies to any well-behaving goodness measure M. The algorithm is designed for searching for positive dependencies, but with certain measures, like Fisher's pF and the χ²-measure, the positive dependence between X and A = a is the same as the negative dependence between X and A ≠ a, and therefore both positive and negative dependencies are discovered. With other measures, like the z-score (Equation (2.15)), the positive dependence between X and ¬A also indicates a negative dependence between X and A, but the significance is not necessarily the same, as discussed in Section 2.3.2.
The same algorithm can be used for both the optimization problem (searching for the K best rules) and the enumeration problem (searching for all sufficiently good rules). For simplicity, we present the algorithm only for the optimization problem, where both minM (or maxM for a decreasing M) and K are given. The initial value of minM can be set to min{M(·)} (the global minimum), in which case no rules are pruned out based on significance until the first K rules have been found. After that, the minM-value is updated always when a new rule is added to the collection of the best K rules. In practice, a larger initial threshold minM is beneficial, because it can prune the first levels before any sufficiently significant rules are found.
If the algorithm is applied to the enumeration problem, then only the threshold minM is required. The only difference to the optimization problem is that the threshold is not updated during the search. In addition, one should decide where to store all discovered rules. In the following

78

4 Search algorithms for non-redundant dependency rules

gorithm, the rules are stored into a separate rule collection, but it can be quite space consuming, if the number of discovered rules is large. One solution is to store the rules into the enumeration tree, but then the processed levels of the tree cannot be pruned as radically as with the separate rule collection. The pseudocode for the Kingfisher algorithm is given in Algorithms 4, 5, 6, 7, and 8. The node corresponding to attribute set X is denoted by N ode(X). In each node v = N ode(X), we use the following fields: • v.set set X; in practice, it is enough to store just the last attribute in the path from the root to node v. • v.children table of pointers to v’s child nodes. The size of the table is denoted by |v.children|. • v.ppossible and v.npossible bit vectors, whose jth bits indicate, whether consequence Aj or ¬Aj is a possible consequent in node v or its descendants. For simplicity, we assume that both vectors have |R| bits. • v.pbest and v.nbest tables for the best M -values of rules Y → Aj and Y → ¬Aj , Y ⊆ X, Aj ∈ X. For simplicity, we assume that both tables have |R| elements and index j corresponds to consequence Aj . In practice the tables can be implemented more compactly by tables of |X| elements. The main idea of the algorithm is the following: 1. Use Observation 3.7 to determine the maximal absolute frequency value minfr such that even the best possible rule X → A = a with m(A = a) = m(X) = m(XA = a) < minfr cannot be significant. Prune out all attributes which cannot occur in significant rules. (Algorithm 4, line 1 and Algorithm 5, lines 3–4.) This step is possible only with some goodness measures like MI and pF , satisfying the conditions of Observation 3.7. 2. Order the remaining attributes Ai ∈ R into an ascending order by frequency and add them to the enumeration tree. Use upper bounds ub1 (Table 3.1) to determine consequences, which are not possible in any node N ode(Ai ). For all Ai , use upper bounds ub2 (Table 3.2) to determine possible consequences Aj = aj such that XAi → Aj = aj can be significant. The possible consequences are marked into tables ppossible and npossible in node N ode(Ai ). If Aj is possible, then N ode(Ai ).ppossible[j] = 1, and if ¬Aj is possible, then N ode(Ai ).npossible[j] = 1. (Algorithm 5, lines 5–14.)

4.3 Algorithm

79

Algorithm 4 Kingfisher(R, r, minM , K ) Input: set of attributes R, data set r, initial threshold minM , maximal number of best rules K Output: the best K non-redundant dependency rules for which M ≥ minM Method: 1 determine minfr using Observation 3.7 2 t ←check1sets(R,r,minM ,minfr ) 3 l ←2 4 while (number of (l − 1)-sets ≥ l) // check l-sets 5 for i = 1 to |R| 6 bfs(t.children[i], l, 0) 7 l ←l + 1 8 output the K best rules

Algorithm 5 check1sets(R, r, minM , minfr ) Input: set of attributes R, data set r, thresholds minM and minfr Output: root of an enumeration tree t containing the first level of nodes Method: 1 for ∀Ai ∈ R 2 calculate absolute frequency m(Ai ) 3 if (m(Ai ) < minfr ) 4 R ←R \ {Ai } 5 order R into an ascending order by frequency 6 create root node t 7 for ∀Ai ∈ R 8 create node v = N ode(Ai ) 9 add v to t.children 10 for ∀Aj ∈ R // initialize best-tables for all consequences Aj and ¬Aj 11 v.pbest[j] ←min{M (·)} // minimal possible M -value 12 v.nbest[j] ←min{M (·)} // is Aj or ¬Aj a possible consequence for set XAi ? 13 v.ppossible[j] ←possible(Aj , 1, s, Ai ) 14 v.npossible[j] ←possible(Aj , 0, s, Ai ) 15 return t

80

4 Search algorithms for non-redundant dependency rules

Algorithm 6 bfs(st, l, len) Input: root of a subtree st, goal level l, path length len Output: discovered new best rules at level l in subtree st are stored into collection brules Method: 1 if (len = l − 2) // combine (l − 1)-sets to create new l-sets 2 for i = 1 to |st.children| − 1 3 Y ←st.children[i].set 4 for j = i + 1 to |st.children| 5 Z ←st.children[j].set 6 X ←Y ∪ Z 7 create node child = N ode(X) 8 add child to st.children[i].children 9 initialize possible- and best-tables // all possible-values are set to 1 and best-values to min{M (·)} 10 if (checknode(child)=0) 11 delete child // use Lapis Philosophorum 12 for ∀ parent nodes v = N ode(Ym ) where (X = Ym Am ) 13 v.ppossible[m] ←0 14 v.npossible[m] ←0 15 if (N ode(Y ).children = ∅) 16 delete node N ode(Y ) // st’s last child has never child nodes 17 delete st.children[|st.children|] 18 else for i = 1 to i = |st.children| 19 bfs(st.children[i], l, len + 1) 20 if (|st.children| = 0) // if all st’s children were deleted 21 delete node st

4.3 Algorithm

81

Algorithm 7 checknode(vX ) Input: node vX = N ode(X) Output: return 0, if node vX can be removed, and 1 otherwise; discovered new best rules in vX are stored into collection brules Method: 1 ismin ←0 // no minimal rules, yet 2 for ∀Y ( X, where |Y | = |X| − 1 // all parent sets 3 parY ←searchset(Y ) 4 if (parY not found) return 0 5 updatetables(vX , parY ) // update possible- and best-tables 6 if (no possible consequences left) return 0 7 calculate m(X) 8 if ((m(X) = 0) or (∃parY such that m(Y ) = m(X))) 9 ismin ←1 // minimal rule found 10 for ∀Ai ∈ R // update possible consequents 11 vX .ppossible[i] ←(vX .ppossible[i] & possible(Ai , 1, vX , X)) 12 vX .npossible[i] ←(vX .npossible[i] & possible(Ai , 0, vX , X)) 13 if ((Ai ∈ X) and (vX .ppossible[i])) // check pos. rule 14 val←M (X \ {Ai } → Ai ) 15 if ((val ≥ minM ) and (val > vX .pbest[i])) 16 add rule X → Ai to brules; vX .pbest[i] ← val 17 update minM // nothing to do, until K rules found 18 if ((Ai ∈ X) and (vX .npossible[i])) // check neg. rule 19 val←M (X \ {Ai } → ¬Ai ) 20 if ((val ≥ minM ) and (val > vX .nbest[i])) 21 add rule X → ¬Ai to brules; vX .nbest[i] ← val 22 update minM // nothing to do, until K rules found 23 if (ismin) // pruning by minimality 24 for ∀Ai ∈ R \ X 25 vX .ppossible[i] ←0; vX .npossible[i] ←0 26 for ∀parm = N ode(Ym ) where (Ym Am = X) 27 if ((P (Am |Ym ) = 1) or (P (¬Am |Ym ) = 1)) 28 vX .ppossible[m] ←0; vX .npossible[m] ←0 // by minimality 29 parm .ppossible[m] ←0; parm .npossible[m] ←0 // by Lapis P. 30 if (no possible consequences left) return 0 31 return 1

82

4 Search algorithms for non-redundant dependency rules

Algorithm 8 Auxiliary functions updatetables(s, v) // update possible- and best-tables in node s given parent node v for i = 1 to i = |R| s.ppossible[i] ←(v.ppossible[i] & s.ppossible[i]) s.npossible[i] ←(v.npossible[i] & s.npossible[i]) s.pbest[i] ←max{v.pbest[i], s.pbest[i]} s.nbest[i] ←max{n.pbest[i], s.nbest[i]} possible(Aj , aj , v, X) // can rule (XQ) \ {Aj } → Aj = aj be significant and non-redundant? // ub2 (Table 3.2) and ub3 (Table 3.3) are implemented by UB2 and UB3 if (((Aj ∈ / X) and (m(X) < minfr )) or ((Aj ∈ X) and (m(X \ {Aj }Aj = aj ) < minfr ))) return 0 // rule would be too infrequent if (Aj ∈ / X) ub ←UB2(m(X), m(Aj = aj )) else ub ←UB3(m(X), m(Aj = aj ), m(XAj = aj )) if ((ub < minM ) or ((aj = 1) and (ub ≤ v .pbest[j ])) or ((aj = 0) and (ub ≤ v .nbest[j ])) return 0 return 1 searchset(Y ) return N ode(Y )

4.3 Algorithm

83

3. Expand attribute sets as long as new non-redundant, significant rules can be found. (Algorithm 4, lines 4–7.) • Create l-sets from (l − 1)-sets. (Algorithm 6, lines 2–9.) • For each l-set X, initialize possible consequences in N ode(X), given possible consequences in its parent nodes N ode(Ym ), where X = Ym Am for some Am ∈ X. Consequence Aj = aj , Aj ∈ R, is possible in N ode(X) only if it is possible in all parent nodes N ode(Ym ). Initialize the best M -values for all consequences Aj = aj , where Aj ∈ X, using the best-values in the parent nodes. (Algorithm 7, lines 2–5 and Algorithm 8, function updatetables.) • Calculate frequency m(X) and check if P (Am = am |Ym ) = 1 for any parent set Ym . Use upper bounds ub2 (Table 3.2) and ub3 (Table 3.3) to decide, whether any rule (XQ) \ {Aj } → Aj = aj can be a non-redundant, significant rule. (Upper bound ub2 is used, when Aj ∈ / X, and ub3, when Aj ∈ X.) (Algorithm 7, lines 7–12 and Algorithm 8, function possible.) We note that in function possible (Algorithm 8), the frequency comparison does not prune out anything else than the upper bound comparison, if the threshold minfr was determined using Observation 3.7. However, this points allows the use of other minimum frequency thresholds, if desired. For example, a common requirement is that all statistically valid rules should occur on at least five rows of data. • If Aj ∈ X and Aj = aj was possible, calculate M (X → Aj = aj ). If it is sufficiently good (among the best K rules and better than the parent rules with consequence Aj = aj ), add it to the rule collection and update N ode(X).pbest[j] (if aj = 1) or N ode(X).nbest[j] (if aj = 0). Update also minM , if K rules have been found. (Algorithm 7, lines 13–22.) • If minimal rules were found, mark all redundant consequences as impossible as explained in Section 3.4. (Algorithm 7, lines 23–25 and 28.) • Use the Lapis Philosophorum principle to propagate information on possible consequences to parents. (Algorithm 7, line 29 and Algorithm 6, lines 12–14.) We note that when Fisher’s pF is used as a goodness measure, then the lower bounds in Table 3.4 are used instead of the upper bounds. Because

84

4 Search algorithms for non-redundant dependency rules

pF is a decreasing measure, all comparisons concerning the M -value are reversed, and minM is replaced by maxM . Another note concerns the interchangeability of the goodness measure M . If M is interchangeable, then M (A → B) = M (B → A) and it is sufficient to report the dependency just once. In the implementation, we report only the rule with the larger confidence. In addition, with some measures like the χ2 -measure, MI , and Fisher’s pF , also holds M (A → ¬B) = M (B → ¬A). In this case, we also report the rule, which has the larger confidence. In both cases, the best-values are updated for both consequences of equivalent rules. Technical details concerning the implementation of the enumeration tree, efficient frequency counting, and calculating Fisher’s pF (which can easily be laborious) are described in Appendix C.

4.4

Complexity

Next, we analyze the worst case time and space complexity of the Kingfisher algorithm. Both time and space complexity are exponential in the number of attributes. This is not surprising, because already a simpler problem – searching for the best classification rules of the form X → C = c, c ∈ {0, 1}, with common goodness measures like the χ2 -measure – is N P -hard [57]. One problem in the complexity analysis is that the complexity depends on the data distribution and the effect of new pruning strategies is impossible to analyze. In Chapter 5, we will see that in practice the algorithm is quite feasible with the classical benchmark data sets. The following theorem gives the worst case time complexity of the Kingfisher algorithm. We assume that the maximal transaction length (maximal number of 1s on any row), L, is given. Parameter L defines the maximal level (depth) and, thus, the maximal size of the generated enumeration tree. Typically, L NX NA , and 0, otherwise.

Proof In the proofs, we assume that N = n is fixed. We will simplify the XA , P (X) = NNX , and P (A) = NNA . functions by substituting P (XA) = NN 2 (a) Conditions (i) and (ii): For χ the alternative expression is f2 (P (X), δ, P (A), n) =

nδ 2 . P (X)(1 − P (X))P (A)(1 − P (A))

The derivative with respect to δ is f20 =

2nδ , P (X)(1 − P (X))P (A)(1 − P (A))

which satisfies the conditions (i) and (ii). Condition (iii): When P (XA), P (A), and n are fixed, f can be expressed as f (P (X), P (XA), P (A), n) =

n g(P (X)), P (A)(1 − P (A))

2

(X)P (A)) . The first factor is constant, and where g(P (X)) = (P (XA)−P P (X)(1−P (X)) therefore it is sufficient to differentiate g(P (X)) with respect to P (X).

g 0 (P (X)) =

−2P (XA)P (X)2 P (A) + P (X)2 P (A)2 − P (XA)2 + 2P (X)P (XA)2 . P (X)2 (1 − P (X))2

The denominator is [P (XA) − P (X)P (A)][−P (X)P (A) − P (XA) + 2P (X)P (XA)] = [P (XA) − P (X)P (A)][P (X)(P (XA) − P (A)) + P (XA)(P (X) − 1)]. The first factor is leverage δ and the second factor is always negative. Therefore, g 0 < 0, when δ > 0, and g 0 > 0, when δ < 0. Condition (iv): Let as first check the case, where NXA = cf1 NX . Now f becomes f (P (X), cf1 P (X ), P (A), n) =

nP (X )(cf1 − P (A))2 . P (A)(1 − P (A))(1 − P (X ))

B.3 Proofs for good behaviour

129

This is clearly an increasing function of P (X), when cf1 6= P (A). Let us then check case NXA = m(A) − cf2 (n − NX ). Now f becomes f (P (X), P (A) − cf2 (1 − P (X )), P (A), n) =

n(1 − P (X ))(P (A) − cf2 )2 . P (X )P (A)(1 − P (A))

This is clearly a decreasing function of P (X), when cf2 6= P (A). (b) In mutual information, the base of the logarithm is not defined, but usually it is assumed to be 2. However, transformation to the natural logarithm causes only a constant nominator ln(2), which disappears in differentiation. Therefore we will use the natural logarithms for simplicity. We recall that the derivative of a term of form g(x) ln(g(X)) is g 0 (x) ln(g(x)) + g 0 (x). Condition (i) and (ii): M I can be expressed as function f2 : f2 (P (X), δ, P (A), n) = n[(P (X)P (A) + δ) ln(P (X)P (A) + δ)+ (P (X)(1 − P (A)) − δ) ln(P (X)(1 − P (A)) − δ) + ((1 − P (X))P (A) − δ) ln((1 − P (X))P (A) − δ) + ((1 − P (X))(1 − P (A)) + δ) ln((1 − P (X)) (1−P (A))+δ)−P (A) ln(P (A))−(1−P (A)) ln(1−P (A))−P (X) ln(P (X)) − (1 − P (X)) ln(1 − P (X))]. The derivative of f2 with respect to δ is ¶ µ (P (X)P (A) + δ)((1 − P (X))(1 − P (A)) + δ) 0 . f2 = n ln ((P (X)(1 − P (A)) − δ)((1 − P (X))P (A) − δ) This is the same as n times the logarithm of the odds ratio odds, for which holds odds = 1, when δ = 0, odds > 1, when δ > 0, and odds < 1, when δ < 0. Therefore, the logarithm is zero, when δ = 0, negative, when δ < 0, and positive, when δ > 0. Condition (iii): When P (XA), P (A), and n are fixed, f can be expressed as g(P (X)) = n[P (XA) ln(P (XA)) + (P (X) − P (XA)) ln(P (X) − P (XA)) + (P (A) − P (XA)) ln(P (A) − P (XA)) + (1 − P (X) − P (A) + P (XA)) ln (1 − P (X) − P (A) + P (XA)) − P (A) ln(P (A)) − (1 − P (A)) ln(1 − P (A)) − P (X) ln(P (X)) − (1 − P (X)) ln(1 − P (X))]. The derivative of g with respect to P (X) is µ ¶ (P (X) − P (XA))(1 − P (X)) 0 g = n ln . (1 − P (X) − P (A) + P (XA))P (X)

B Auxiliary results

130

(P (X)−P (XA))(1−P (X)) Since q = (1−P (X)−P (A)+P (XA))P (X) = 1, when P (XA) = P (X)P (A), q > 1, when P (XA) < P (X)P (A), and q < 1, when P (XA) > P (X)P (A), the logarithm is zero, when X and A = a are independent, negative, when X and A = a are positively dependent, and positive, when X and A = a are negatively dependent.

Condition (iv): When NXA = cf1 NX , f becomes f (P (X), cf1 P (X ), P (A), n) = n[P (X )cf1 ln(P (X )cf1 ) + P (X )(1 − cf1 ) ln(P (X)(1 − cf1 )) + (P (A) − P (X )cf1 ) ln(P (A) − P (X )cf1 ) + (1 − P (X ) − P (A) + P (X)cf1 ) ln(1 − P (X ) − P (A) + P (X )cf1 ) − P (X ) ln P (X ) − (1 − P (X)) ln(1 − P (X)) − P (A) ln P (A) − (1 − P (A)) ln(1 − P (A))]. The derivative of f is f 0 = n[cf1 ln(P (X )cf1 )+(1 −cf1 ) ln(P (X )(1 −cf1 ))−cf1 ln(P (A)−P (X ) cf1 )−(1 −cf1 ) ln(1 −P (X )−P (A)+P (X )cf1 )−ln(P (X ))+ln(1 −P (X ))]. We should show that f 0 > 0, when cf1 > P (A). To find the lowest value of f 0 , we set g(cf1 ) = f 0 (P (X )) and differentiate g with respect to cf1 . ¶ · µ P (X)cf1 (1 − P (X ) − P (A) + P (X )cf1 ) g (cf1 ) = n ln P (X)(1 − cf1 )(P (A) − P (X )cf1 ) ¸ P (X)(1 − cf1 ) P (X)cf1 − + P (A) − P (X)cf1 1 − P (X) − P (A) + P (X)cf1 · µ ¶ ¸ P (XA)P (¬X¬A) P (XA)P (¬X¬A) − P (¬XA)P (X¬A) = n ln + P (X¬A)P (¬XA) P (¬XA)P (X¬A) · ¸ δ = n ln(odds) + . P (¬XA)P (X¬A) 0

The first term is the logarithm of the odds ratio, which is > 0, when δ > 0. So g 0 > 0, when δ > 0. Therefore, g is an increasing function of cf1 and gets its minimal value, when δ = 0 and cf1 = P (A). When we substitute this to f 0 , we get f 0 = ln(1) = 0. This is the minimal value of f 0 , which is achieved only, when δ = 0; otherwise f 0 > 0, as desired. Let us then check case NXA = m(A) − cf2 (n − NX ). Now P (XA) = P (A) − cf2 (1 − P (X )), P (X¬A) = P (X)(1 − cf2 ) − P (A) + cf2 , P (¬XA) =

B.3 Proofs for good behaviour

131

cf2 (1 − P (X )), P (¬X¬A) = (1 − P (X))(1 − cf2 ), and f becomes f (P (X), P (A) − cf2 (1 − P (X )), P (A), n) = n[(P (A) − cf2 (1 − P (X ))) ln(P (A) − cf2 (1 − P (X ))) + (P (X )(1 − cf2 ) − P (A) + cf2 ) ln(P (X )(1 − cf2 ) − P (A) + cf2 ) + cf2 (1 − P (X )) ln(cf2 (1 − P (X))) + (1 − P (X))(1 − cf2 ) ln((1 − P (X ))(1 − cf2 )) − P (X ) ln P (X ) − (1 − P (X)) ln(1 − P (X)) − P (A) ln(P (A)) − (1 − P (A)) ln(1 − P (A))] = n[(P (A) − cf2 (1 − P (X ))) ln(P (A) − cf2 (1 − P (X ))) + (P (X )(1 − cf2 ) − P (A) + cf2 ) ln(P (X )(1 − cf2 ) − P (A) + cf2 ) + cf2 (1 − P (X )) ln(cf2 ) + (1 − P (X))(1 − cf2 ) ln(1 − cf2 ) − P (X ) ln P (X ) − P (A) ln(P (A)) − (1 − P (A)) ln(1 − P (A))] The derivative of f with respect to P (X) is · 0

f = n cf2 ln

P (X )(1 − cf2 ) − P (A) + cf2 P (A) − cf2 (1 − P (X )) + (1 − cf2 ) ln cf2 1 − cf2 − ln(P (X))] .

We should show that f 0 < 0, when cf2 > P (A). To find the largest value of f 0 , we set g(cf2 ) = f 0 (P (X )) and differentiate g with respect to cf2 .

¶ µ ¶ · µ P (X) − P (A) + cf2 (1 − P (X )) P (A) − cf2 (1 − P (X )) − ln g 0 (cf2 ) = n ln cf2 1 − cf2 ¸ 1 − P (A) P (A) + − P (A) − cf2 (1 − P (X )) P (X) − P (A) + cf2 (1 − P (X )) ¶ ¸ · µ P (A) P (XA)(1 − cf2 ) 1 − P (A) − . = n ln + cf2 P (X ¬A) P (XA) P (X¬A)

Because the dependence is negative, −P (XA) > −P (X)P (A) and P (X)(1− P (A)) < P (X¬A). Therefore, the sum of the last two terms is < 0. The first term becomes ln(odds), when we substitute cf2 = P (A|¬X ). For negative dependence, also ln(odds) < 0, and therefore g 0 < 0. Because g is decreasing with cf2 , f 0 gets its maximum value, when cf2 is minimal, i.e. P (A). When we substitute this to f 0 , it becomes f 0 = 0. This is the maximal value of f 0 , which is achieved only when cf2 = P (A) and δ = 0. Otherwise, f 0 < 0, as desired. (c) Now it is enough to check the conditions only for the positive dependence. Conditions (i) and (ii): z1 can be expressed as √ nδ , f2 (P (X), δ, P (A), n) = p P (X)P (A)(1 − P (X)P (A))

B Auxiliary results

132

when δ > 0, and f2 = 0, otherwise. This is clearly an increasing function of δ and gets its minimum value 0, when δ ≤ 0. Condition (iii): When P (XA), P (A), and n are fixed, f can be expressed as √ n g(P (X)), f (P (X), P (XA), P (A), n) = p P (A) where g(P (X)) = √P (XA)−P (X)P (A) . The derivative of g with respect to P (X)(1−P (X)P (A))

P (X) is

g0 =

−P (X)P (A)(1 − P (XA)) − P (XA)(1 − P (X)P (A)) 3

2(P (X)(1 − P (X)P (A))) 2

.

Because P (X)P (A) < 1 and P (XA) ≤ 1, g 0 < 0 always. Condition (iv): When NXA = cf1 NX , f becomes p nP (X )(cf1 − P (A)) , f (P (X), cf1 P (X ), P (A), n) = p P (A)(1 − P (X )P (A)) which is clearly an increasing function of P (X), when cf1 ≥ P (A). (d) Conditions (i) and (ii): z2 can be expressed as √ nδ p , f2 (P (X), δ, P (A), n) = P (X)P (A)(1 − P (A)) when δ > 0, and f2 = 0, otherwise. This is clearly an increasing function of δ and gets its minimum value 0, when δ ≤ 0. Condition (iii): When P (XA), P (A), and n are fixed, f can be expressed as √ n g(P (X)), f (P (X), P (XA), P (A), n) = p P (A)(1 − P (A)) where g(P (X)) =

P (XA)−P (X)P (A) √ . P (X)

The derivative of g with respect to

P (X) is g 0 (P (X)) = which is < 0 always.

−P (X)P (A) − P (XA) 3

2P (X) 2

,

B.3 Proofs for good behaviour

133

Condition (iv): When P (XA) = cf1 P (X ), f becomes p nP (X )(cf1 − P (A)) , f (P (X), cf1 P (X ), P (A), n) = p P (A)(1 − P (A)) which is clearly an increasing function of P (X), when cf1 ≥ P (A). (e) Like in the mutual information, we will use the natural logarithm for simplicity. Conditions (i) and (ii): J can be expressed as ¶ · µ P (X)P (A) + δ f2 (P (X), δ, P (A), n) = n (P (X)P (A) + δ) ln P (A) ¶ µ ¸ P (X)(1 − P (A)) − δ − P (X) ln(P (X)) . +(P (X)(1 − P (A)) − δ) ln (1 − P (A)) The derivative of f2 with respect to δ is ¶ µ ¶ ¸ · µ P (X)(1 − P (A)) − δ P (X)P (A) + δ 0 − ln − ln(P (X)) f2 = n ln P (A) (1 − P (A)) ¶ µ (P (X)P (A) + δ)(1 − P (A)) . = n ln (P (X)(1 − P (A)) − δ)P (X)P (A) The argument of the logarithm is 1, if δ = 0, and otherwise it is > 1. Therefore, f20 ≥ 0, when δ ≥ 0. Condition (iii): When P (XA), P (A), and n are fixed, f can be expressed as µ g(P (X)) = n[P (XA) ln

P (XA) P (A)



µ + (P (X) − P (XA)) ln

P (X) − P (XA) (1 − P (A))



− P (X) ln(P (X))].

The derivative of g with respect to P (X) is ¶ ¶ µ µ P (X) − P (XA) P (X) − P (XA) 0 − ln(P (X))] = n ln . g = n[ln (1 − P (A)) P (X)(1 − P (A)) The argument of the logarithm is < 1 and thus g 0 < 0, if there is a positive dependency. Condition (iv): When NXA = cf1 NX , f becomes ¶ · µ P (X )cf1 + P (X ) f (P (X), cf1 P (X ), P (A), n) = n P (X )cf1 ln P (A) ¶ µ ¸ P (X )(1 − cf1 ) − P (X ) ln(P (X )) . (1 − cf1 ) ln (1 − P (A))

B Auxiliary results

134 The derivative of f with respect to P (X) is

¶ ¶ · µ µ P (X )cf1 P (X )(1 − cf1 ) + (1 − cf1 ) ln f = n cf1 ln P (A) (1 − P (A)) ¶ ¶¸ · µ µ cf1 (1 − P (A)) 1 − cf1 + ln . − ln(P (X))] = n cf1 ln (1 − cf1 )P (A) (1 − P (A)) 0

We should show that f 0 > 0, when cf1 > P (A). To find the lowest value of f 0 , we set g(cf1 ) = f 0 (P (X )) and derivative g with respect to cf1 . ¶ µ cf1 (1 − P (A)) 0 . g (cf1 ) = n ln (1 − cf1 )P (A) Clearly g 0 = 0, if cf1 = P (A), and g 0 > 0, if cf1 > P (A). When we substitute the minimum value cf1 = P (A) to f 0 , we get f 0 = n[P (A) ln(1)+ ln(1)] = 0. When cf1 > P (A), f 0 > 0, and f is an increasing function of P (X). 2

B.4

Non-convexity and non-concavity of the zscore

Lemma B.5 The z-score defined by function √ N (N NXA − NX NA ) , z1 (NX , NXA , NA , N ) = p NX NA (N 2 − NX NA ) when N NXA > NX NA , and 0, otherwise, is non-convex and non-concave function of NX and NXA . Proof By the definition of convexity, function f (x, y) is non-convex, if for some points (x1 , y1 ) and (x2 , y2 ) and some θ ∈]0, 1[ holds f ((1 − θ)x1 + θx2 , (1 − θ)y1 + θy2 ) ≥ (1 − θ)f (x1 , y1 ) + θf (x2 , y2 ). On the other hand, function f (x, y) is non-concave, if for some points (x1 , y1 ) and (x2 , y2 ) and some θ ∈]0, 1[ holds f ((1 − θ)x1 + θx2 , (1 − θ)y1 + θy2 ) ≤ (1 − θ)f (x1 , y1 ) + θf (x2 , y2 ). Therefore, it is enough to give one example for both cases.

B.4 Non-convexity and non-concavity of the z-score

135

Let us now notate x = P (X), y = P (XA), a = P (A), and N = n. Now the z-score can be expressed as √ n f (x, y) = √ g(x, y), a where

y − xa . g(x, y) = p x(1 − xa)

It is enough to show that g(x, y) is non-convex and non-concave in some points (x1 , y1 ) and (x2 , y2 ). For the non-convexity, let us consider points (x1 , y1 ) = (0, 0) and (x2 , y2 ) = (x2 , x2 ) and θ = 21 . Now both points are on line y = x and g(0, 0) = 0. The condition becomes ¶ √ µ √ x2 (1 − a) x2 (1 − a) 1 0 + x2 0 + x2 ≥ (g(0, 0) + g(x2 , x2 )) = √ , = √ g 2 2 2 2 − x2 a 2 1 − x2 a √ √ ⇔ 2 1 − x2 a ≥ 2 − x2 a ⇔ 4(1 − x2 a) ≥ 2 − x2 a ⇔ 2 ≥ 3x2 a, which is true at least for all x2 ≤ 32 . For the non-concavity, let us consider points (x1 , y1 ) = (a, y1 ) and (x2 , y2 ) = (x2 , x2 a), where x2 > a, and θ = 21 . Now point (x2 , x2 a) is on the independence line and g(x2 , x2 a) = 0. The condition becomes ¶ µ a − y1 a + x2 y1 + x2 a , =p g 2 2 (a + x2 )(2 − a2 − ax2 ) 1 a − y1 ≤ (g(a, y1 ) + g(x2 , x2 a)) = p 2 2 a(1 − a2 ) p p ⇔ 2 a(1 − a2 ) ≤ (a + x2 )(2 − a2 − ax2 ) ⇔ 4a(1 − a2 ) ≤ (a + x2 )(2 − a2 − ax2 ) ⇔ 2(a − x2 )(1 − a2 ) − a(a2 − x22 ) ≤ 0 ⇔ (a − x2 )(2(1 − a2 ) − a − x2 ) ≤ 0 The last inequality is always true, because x2 > a, and thus the first factor is negative, while the second factor is always positive, because now 2(1 − a2 ) − a − x2 ≥ 2(1 − x22 − x2 ) ≥ 0 for all x2 . 2

136

B Auxiliary results

Chapter C Implementation details C.1

Implementing the enumeration tree

The most natural solution is to implement the enumeration tree as a trie (prefix tree). This solution is commonly used also in the search for frequent item sets (e.g. [32, 50]). In a trie implementation, each node of the tree contains a label corresponding to one attribute Ai ∈ R. If we denote the labels by v.label, then node v corresponds to attribute set v1 .label, v2 .label, . . . , vl .label = v.label, where the path from the root to node v consists of nodes v1 , . . . , vl . In the Kingfisher algorithm, all data is stored into the leaf nodes on level l. When level l+1 is processed, new leaf nodes are created, and level l nodes become inner nodes. When no more children are created for a level l node, its data content can be removed, and it becomes a structure node. The structure nodes encode only the trie structure, i.e. all sets which are stored into the tree. In practice, they contain just the node label and pointers to the child nodes. The simplest solution to implement the structure nodes and data nodes is to use an extra node type for all the data content. The trie itself consists of only structure nodes, where the level l nodes have pointers to the corresponding data nodes. When the data content can be removed, it is enough to delete the corresponding data node. In this way, no extra space is wasted for data fields in the inner nodes. Let us now consider the implementation of the trie structure. The objective is that the structure should be space efficient but implement the following operations efficiently: • searchset(Y ) returns a pointer to a node corresponding to set Y . This is needed when the parents are searched. 137

138

C Implementation details

• deletenode(v, w) deletes node v, removes its pointer in the parent node w, and possibly compacts the child pointers in the parent node. • addnode(v, w, Ai ) adds a node v with label Ai as w’s child. In Kingfisher, the tree is always traversed in a breadth-first-manner and the nodes are deleted in the returning. Level l nodes could be accessed without traversing the upper parts of the tree (if the last level leafs were linked), but we anyway have to keep record on the corresponding attribute sets, which can be read from the root–node paths. In addition, we can delete nodes from the previous levels, if they become leafs after removals. This makes the tree structure more compact, but also saves the traversal time. Operation searchset(Y ) can be implemented in |Y | steps, if the child with a given label can be found in a constant time. This is easy to implement, if each node contains a table of pointers for all possible children, whether they existed or not. In a normal trie this would mean that we have to use tables of k pointers in each node. However, in the enumeration tree the maximal number of children depends on the node. Only the root can have k children and all other nodes have less children. Because the attributes are ordered, we know that any node with label Ai can have at most k − i children with labels Ai+1 , . . . , Ak . Therefore, we would need a table of size k − i for a node with label Ai . This can still be too space consuming, because many attribute combinations do not occur in the data at all or they would correspond to insignificant or redundant rules. Therefore, we have implemented a more time consuming table, which contains only the pointers of the existing children. Because the children are kept in an order (by their labels), a child with a given label can be searched with a binary search in O(log(d)) time, where d is the number of children. Adding a child can be done in a constant time, because the children are added in the correct order. (We have to keep record on the next free position in the children table.) We may also have to increase the children table, if there are no free positions left. (We note that the number of children to be added is not known beforehand.) For this, we use the following strategy: In the beginning, the children table has size 1, and every time the table has to be increased, its size is doubled. We can easily see that the resulting table size is at most 2d. This means that adding d children to a node takes O(d) time. When all children have been added, we still remove the extra space from the end of the table. The cost is already included to the previous O(d) time. When a child is removed, the table is compacted. In the worst case, this means that we have to shift all d pointers, which takes O(d) time. In

C.1 Implementing the enumeration tree

139

the worst-case asymptotic analysis, we cannot refer to the actual d, and the time requirements of the three operations become O(log(k)) (search child), O(k) (add child), and O(k) (delete child). We note that adding or removing all (up to k) children takes also O(k) total time, and the amortized worst case complexity of each addition and deletion operation is O(1). The trie structure could be implemented more compactly by succinct tree structures [37]. In a succinct representation, each node takes only 2 + dlog2 (k)e bits, where k is the maximum number of children [12]. The overall space requirement is close to the information theoretic lower bound. In the same time, many primitive operations, like searching a child with a given label, can be implemented in a constant time, assuming a word-RAM model (a random access machine with Ω(log(N )) bit word, where N is the total number of nodes in the tree). However, the traditional succinct trees are static structures, which do not support efficient addition or deletion of nodes. As a solution, we could apply the techniques of [65] for dynamic trees, the whole trie structure with 2N + N dlog2 (k)e + ´ ³ and implement N log log(N ) bits, where N is the number of nodes. In this structure, O log(N ) the addition and´deletion of a node as well as the search for the ith child ³ ) k take O loglog(N log(N ) time. Because N is very large (in the worst case O(2 )), this causes an overhead to the current time requirement. In practice, the data nodes are the most space consuming part of the enumeration tree. The problem is that each node has to contain two possible-tables, one for positive and one for negative consequences, and in the worst case all k attributes are possible consequences. When possibletables are implemented as bit vectors, they require in the worst case 2k bits. The further the search continues, the more consequences become impossible, but the tables are still difficult to implement more compactly. In the current implementation of Kingfisher, we use the following heuristic to compact the tables: Because the most frequent consequences Ai = ai become impossible first, we have ordered positive consequences into an ascending order and negative consequences into a descending order by frequency P (Ai ). Now the zero bits are likely to occur in the end of bit vectors, and they can be cut off. When the possibility of consequence Ai = 1 is checked, it is enough to check that index i occurs in the table, i.e. i ≤ len, where len is the length of the vector, and ppossible[i] = 1. For negative consequence Ai = 0, we have to check that k − i + 1 ≤ len and npossible[k − i + 1] = 1. Hence, checking and updating the ith bit can be done in a constant time. However, searching the cut point (after which all bits are zero) takes in the worst case O(log(k)) time, when implemented by the binary search. The bit vectors are still quite space consuming, and

140

C Implementation details

in the future research, our aim is to implement them more efficiently, for example by hash tables or Bloom filters.

C.2

Efficient frequency counting

The data set consists of n rows and each row contains k binary-valued attributes A1 , . . . , Ak ∈ R. In the frequency counting, the problem is to calculate the absolute frequency m(X), X ⊆ R. Since we allow only positive-valued attributes in the rule antecedents, it is enough to consider the case, where Ai = 1 for all Ai ∈ X. The simplest representation for the data set is a n × k bit-matrix r. If attribute Aj occurs on the ith row, then r[i, j] = 1, and otherwise 0. In practice, there are two basic approaches, how to construct the bit-matrix. In the horizontal data layout, r consists of n bit-vectors, each of them k bits long. In the vertical data layout, r consists of k bit-vectors, each of them n bits long. In both approaches, the asymptotic time complexity of frequency counting is the same, but in practice the difference can be remarkable. The reason is that in the vertical data layout, frequency counting can be implemented by efficient bitwise logical operations. Let us now analyze the time requirement for frequency counting, when the data is stored into n vectors of k bits or k vectors of n bits. Each bit vector is implemented by m c integers, where m is the length of the bit vector (either k or n) and constant c is the number of bits in the machine word (e.g. 32 or 64, depending on the architecture). Each integer can be checked fast by bitwise logical operations. In practice, we can consider the bitwise logical operations on integers as atomic operations which take constant time. When the horizontal layout is used, counting the frequency of each l-set requires in the worst case n · min{l, kc } time steps. If l < k/c, the attributes of an l-set can occur in at most l integers, and checking one row takes at most l steps. The checking is repeated on all n rows. In the vertical layout, the frequency of X = A1 , . . . , Al is the same as the number of 1-bits in the bitwise and of vectors v(A1 ), . . . , v(Al ). This observation was made already 1996 by Yen and Chen [85], but in that time the memory sizes were so small that the technique could not be utilized for large data sets. Counting the frequency of one l-set takes nc (l + 1) time steps, which consist of nl c bitwise and operations and counting the number n of 1-bits in the result ( c steps) on each of l rows. When we compare the requirements we see that the vertical layout is in practice more efficient: nc (l + 1) < n · min{ kc , l} for all l < k − 1. The

C.3 Efficient implementation of Fisher’s exact test

141

asymptotic time complexity is still the same O(nl).

C.3

Efficient implementation of Fisher’s exact test

Fisher’s exact test (goodness measure pF ) is often the preferred method for significance testing, but it is not trivial to calculate in a large data set. In the following, we consider the technical problems and introduce feasible solutions.

C.3.1

Calculating Fisher’s pF

Calculating Fisher’s pF (Equation 2.4) can be a time consuming operation, because the value is a sum of several terms, each containing binomial coefficients. In addition, the evaluation can easily cause an overflow or underflow. For the latter problem, a common solution is to use logarithms in the middle steps, when¡ each binomial ¢ term is evaluated. For P example, toPevaluate Pm−l coefl ficient c = ml , we calculate ln(c) = m ln(i) − ln(i) − i=1 i=1 i=1 ln(i), ln(c) and take the inverse of the result, i.e. c = e . This approach is used for example in MagnumOpus [76] ([74]), when the significance of the productivity is tested by Fisher’s pF . However, the technique is unnecessarily worksome, because the same factorials are usually calculated several times during the search. In Kingfisher, we use a P more efficient strategy, where the logarithms m of all factorials, ln(m!) = i=1 ln(i), m = 1, . . . , n, are calculated just once and stored into a table. Thus, each term in pF can be evaluated in constant time. Evaluating pF (Y → A = a) takes J = min{m(Y A 6= a), m(¬Y A = a)} ≤ n4 steps. Therefore, the asymptotic time complexity is O(n). The initialization of the factorial table takes also O(n), because (i + 1)! = (i + 1)i!, i.e. ln((i + 1)!) = ln(i!) + ln(i + 1). The asymptotic complexity of evaluating pF is still quite large. In Kingfisher, we have two solutions to improve the efficiency. The first solution is to use exact pF -values, but calculate them only, when a rule can be significant and non-redundant. This possibility is first checked by calculating only the first term of pF . Since the first term contributes most to the overall pF -value, it gives a good lower bound for the expected goodness. If the lower bound is sufficiently good, then the rest of the terms are evaluated. The second solution is to use a tight approximation for pF as a goodness measure, instead of the exact pF . In the following, we introduce three approximations which can be used for this purpose. All of them are upper bounds for pF , but they approach pF fast, when the dependency becomes

C Implementation details

142

stronger. The loosest of the upper bounds are especially suitable as goodness measures, because they can be evaluated in a constant time. However, if the dependency is really weak, these approximations may be too inaccurate. In this case, the first terms of pF can be calculated exactly and the approximation is used to bound only the rest of the terms. In practice, it is usually sufficient to calculate just one or two terms exactly.

C.3.2

Upper bounds for pF

The following theorem gives two useful upper bounds, which can be used to approximate Fisher’s pF . The first upper bound is more accurate, but it contains an exponent, which makes it more difficult to evaluate. The latter upper bound is always easy to evaluate and also intuitively appealing. Theorem C.1 Let us notate pF = p0 + p1 + . . . + pJ and qi = For positive dependency rule X → A with lift γ(X, A) holds à pF ≤ p0

1 − q1J+1 1 − q1

pi pi−1 ,

i ≥ 1.

!

µ ¶ 1 − P (A)γ(X, A) − P (X)γ(X, A) + P (X)P (A)γ(X, A)2 ≤ p0 1 + . γ(X, A) − 1 Proof Each pi can be expressed as pi = pabs ti , where pabs = m(A)!m(¬A)! n! F . is constant. Therefore, it is enough to show the result for pX = ppabs pX = t0 + t1 + ... + tJ = t0 + q1 t0 + q1 q2 t0 + ... + q1 q2 ...qJ t0 , where ti ti−1 (m(XA) + i − 1)!(m(¬X¬A) + i − 1)!(m(X¬A) − i + 1)!(m(¬XA) − i + 1)! = (m(XA) + i)!(m(¬X¬A) + i)!(m(X¬A) − i)!(m(¬XA) − i)! (m(X¬A) − i + 1)(m(¬XA) − i + 1) = . (m(XA) + i)(m(¬X¬A) + i)

qi =

Since qi decreases when i increases, the largest value is q1 . We get an upper bound pX = t0 + q1 t0 + q1 q2 t0 + ... + q1 q2 ...qJ t0 ≤ t0 (1 + q1 + q12 + q13 + ... + q1J ).

C.3 Efficient implementation of Fisher’s exact test The sum of the geometric series is t0 On the other hand, t0

1−q1J+1 1−q1 ,

143

which is the first upper bound.

µ ¶ t0 1 − q1J+1 q1 ≤ = t0 1 + . 1 − q1 1 − q1 1 − q1

m(X¬A)m(¬XA) Let us insert q1 = (m(XA)+1)(m(¬X¬A)+1) , and express the frequencies using lift γ = γ(X, A). For simplicity, we use notations x = P (X) and a = P (A). Now m(XA) = nxaγ, m(X¬A) = nx − nxaγ, m(¬XA) = na − nxaγ and m(¬X¬A) = n(1 − x − a + xaγ). We get

m(X¬A)m(¬XA) q1 = 1 − q1 (m(XA) + 1)(m(¬X¬A) + 1) − m(X¬A)m(¬XA) n2 xa − n2 xa2 γ − n2 x2 aγ + n2 x2 a2 γ 2 = 2 n xaγ + 2nxaγ + n − nx − na + 1 − n2 xa nxa − nxa2 γ − nx2 aγ + nx2 a2 γ 2 = . nxaγ + 2xaγ + 1 − x − a + 1/n − nxa

The nominator is ≥ nxaγ − nxa, because 2xaγ + 1 − x − a + 1/n ≥ 0 ⇔ P (XA) + P (¬X¬A) + 1/n ≥ 0. Therefore 1 − aγ − xγ + xaγ 2 q1 ≤ . 1 − q1 γ−1 2 In the following, we will denote the looser (simpler) upper bound by ub1 and the tighter upper bound (sum of the geometric series) by ub2. In ub1, the first term of pF is always exact and the rest are approximated, while in ub2, the first two terms are always exact and the rest are approximated. We note that ub1 can be expressed equivalently as ¶ µ ¶ µ P (X¬A)P (¬XA) P (XA)P (¬X¬A) = p0 1 + , ub1 = p0 δ(X, A) δ(X, A) where δ(X, A) is the leverage. This is expression is closely related to the odds ratio odds(X, A) =

P (XA)P (¬X¬A) δ =1+ , P (X¬A)P (¬XA) P (X¬A)P (¬XA)

C Implementation details

144

which is often used to measure the strength of the dependency. The odds ratio can be expressed equivalently as odds(X, A) =

ub1 . ub1 − 1

We see that when the odds ratio increases (dependency becomes stronger), the upper bound decreases. In practice, it gives a tight approximation to Fisher’s pF , when the dependency is sufficiently strong. The error is difficult to bind tightly, but the following theorem gives a loose upper bound for the error, when ub2 is used for approximation. Theorem C.2 When pF is approximated by ub2, the error is bounded by µ 2 ¶ q1 . err ≤ p0 1 − q1 Proof Upper bound ub2 ³ can2 ´cause error only, if J > 1. If J = 0, ub2 = p0 1−q and if J = 1, ub2 = p0 1−q11 = p0 (1 + q1 ) = p0 + p1 = pF . Let us now assume that J > 1. The error is err = ub2 − pF = p0 (1 + q1 + . . . + q1J − 1 − q1 − q1 q2 − . . . − q1 q2 . . . qJ ) = p0 (q12 + q13 + . . . q1J − q1 q2 − . . . − q1 q2 . . . qJ ).

It has an upper bound à err
1, because

C.3 Efficient implementation of Fisher’s exact test

145

m(X¬A)m(¬XA) (m(XA) + 1)(m(¬X¬A) + 1) P (X¬A)P (¬XA) xa(1 − aγ)(1 − xγ) . = < P (XA)P (¬X¬A) xaγ(1 − x − a + xaγ)

q1 =

This is ≤ γ1 , because 1 − aγ − xγ + xaγ 2 ≤ 1 − x − a + xaγ ⇔ 0 ≤ x(γ − 1) + a(γ − 1) − xaγ(γ − 1) = (x + a − P (XA))(γ − 1).

A sufficient condition for q1 ≤ 1 ≤ γ

√ 5−1 2

is that

√ √ 2 5−1 5+1 ⇔γ≥ √ . = 2 2 5−1 2

This result also means that ub2 ≤ 2pF , when the lift is as large as required. The simpler upper bound, ub1, can cause a somewhat larger error than ub2, but it is even harder to analyze. However, we note that ub1 = pF only, when J = 0. When J = 1, there is already some error, but in practice the difference is marginal. The following theorem gives guarantees for the accuracy of ub1, when γ ≥ 2. Theorem C.4 If pF is approximated with ub1 and γ(X, A) ≥ 2, the error is bounded by err ≤ p0 . Proof The error ´ err = ub1 − pF = ub1 − ub2 + ub2 − pF , where ³ 2 is q1 ub2 − pF ≤ p0 1−q1 by Theorem C.2. When γ ≥ 2, ub1 (being a decreasing function of γ) is µ ub1 = p0

γ(1 − aγ − xγ + xaγ 2 ) γ−1

¶ ≤ p0 2(1 − a − x + 2xa).

C Implementation details

146 Therefore, the error is bounded by

! q12 1 − q1J+1 + err ≤ p0 2(1 − a − x + 2xa) − 1 − q1 1−q ! Ã (1 − q12 − q1J+1 ) = p0 2 − 2a − 2x + 4xa − (1 − q1 ) ! Ã 2(1 − q1 ) − 2a(1 − q1 ) − 2x(1 − q1 ) + 4xa(1 − q1 ) − 1 + q12 + q1J+1 = p0 1 − q1 ! Ã 1 − 2q1 + q12 + q1J+1 2a(1 − q1 ) − 2x(1 − q1 ) + 4xa(1 − q1 ) . = p0 1 − q1 Ã

When γ ≥ 2, q1 ≤ 12 , and thus 1 − 2q1 + q12 + q1J+1 ≤ 1 − q1 ⇔ 0 ≤ q1 − q12 − q1J+1 ⇔ 0 ≤ q1 (1 − q1 − q1J ). Therefore, ¶ µ (1 − q1 )(1 − 2a − 2x + 4xa) = p0 (1 − 2a − 2x + 4xa). err ≤ p0 1 − q1 1−2a−2x+4xa ≤ 1, because 2a+2x−4xa = 2a(1−x)+2x(1−a) ≥ 0. Therefore err ≤ p0 . 2 Our experimental results support the theoretical analysis, according to which both upper bounds, ub1 and ub2, give tight approximations to Fisher’s pF , when the dependency is sufficiently strong. However, if the dependency is weak, we may need a more accurate approximation. A simple solution is to include more larger terms p0 + p1 + . . . pl−1 to the approximation and estimate an upper bound only for the smallest terms pl + . . . pJ using the sum of the geometric series. The resulting approximation and the corresponding error bound are given in the following theorem. We omit the proofs, because they are essentially identical with the previous proofs for Theorems C.1 and C.2. Theorem C.5 For positive dependency rule X → A holds ! Ã J−l+1 1 − ql+1 , pF ≤ p0 + . . . pl−1 + pl 1 − ql+1

C.3 Efficient implementation of Fisher’s exact test where ql+1 =

(m(X¬A) − l)(m(¬XA) − l) (m(XA) + l + 1)(m(¬X¬A) + l + 1)

and l + 1 ≤ J. The error of the approximation is à err ≤ p0

C.3.3

147

2 ql+1 1 − ql+1

! .

Evaluation

Figure C.1 shows the typical behaviour of the new upper bounds, when the strength of the dependency increases (i.e. m(XA) increases and m(X) and m(A) remain unchanged). In addition to upper bounds ub1 and ub2, we consider a third upper bound, ub3, based on Theorem C.5, where the first three terms of pF are evaluated exactly and the rest is approximated. All three upper bounds approach each other and the exact pF -value, when the dependency becomes stronger. 0.5 p ub1 ub2 ub3

0.4

0.3

0.2

0.1

0 60

80

100

120

140

160

180

200

Figure C.1: Exact pF and three upper bounds as functions of m(XA), when m(X) = 200, m(A) = 250, and n = 1000. The strength of the dependency increases on the x-axes. Figure C.2 shows a magnified area from Figure C.1. In this area, the dependencies are weak, and the upper bounds diverge from the exact pF . The reason is that in this area the number of approximated terms is also the largest. For example, when m(XA) = 55, pF contains 146 terms, and

C Implementation details

148

when m(XA) = 65, it contains 136 terms. In these points the lift is 1.1 and 1.3, respectively. The difference between ub1 and ub2 is marginal, but ub3 clearly improves ub1. 0.5 p ub1 ub2 ub3

0.4

0.3

0.2

0.1

0 55

60

65

70

75

Figure C.2: A magnified area from Figure C.1 showing the differences, when the dependency is weak.

Because the new upper bound gives accurate approximations for strong dependencies, we evaluate the approximations only for the potentially problematic weak dependencies. As an example, we consider two data sets, where the data size is either n = 1000 or n = 10000. For both data sets, we have three test cases: 1) when P (X) = P (A) = 0.5, 2) when P (X) = 0.2 and P (A) = 0.25, and 3) when P (X) = 0.05 and P (A) = 0.2. (The second case with n = 1000 is shown in Figures C.1 and C.2.) For all test cases we have calculated the exact pF , three versions of the upper bound, ub1, ub2 and ub3, and the p-value achieved from the one-sided χ2 -measure. The χ2 -based p-values were calculated with an online Chi-Square Calculator [31]. The values are reported for the cases, where pF ≈ 0.05, pF ≈ 0.01, and pF ≈ 0.001. Because the data is discrete, the exact pF -values always deviate somewhat from the reference values. The results for the first data set (n = 1000) are given in Table C.1 and for the second data set (n = 10000) in Table C.2. As expected, the χ2 -approximation works best, when the data size is large and the distribution is balanced (case 1). According to the classical rule of thumb, the χ2 -approximation can be used, when all expected counts are ≥ 5 [28]. This requirement is not satisfied in the third case in the smaller data set. The

C.3 Efficient implementation of Fisher’s exact test

149

resulting χ2 -based p-values are also the least accurate, but the χ2 -test produced inaccurate approximations also for the case 2, even if the smallest expected frequency was 50. In the smaller data set, the χ2 -approximation overperformed the new upper bounds only in the first case, when pF ≈ 0.05. If we had calculated the first four terms exactly, the resulting ub4 would have already produced a better approximation. In the larger data set, the χ2 -measure gave more accurate results for the first two cases, when pF ≈ 0.05 and for the case 1, when pF ≈ 0.01. When pF ≈ 0.001, the new upper bounds gave always more accurate approximations. If we had calculated the first eight terms exactly, the resulting ub8 would have overperformed the χ2 -approximation in case 1 with pF ≈ 0.01 and case 2 with pF ≈ 0.05. Calculating eight exact terms is quite reasonable compared to all 2442 terms, which have to be calculated for the exact pF in case 1. With 15 exact terms, the approximation for the case 1 with pF ≈ 0.05 would have also been more accurate than the χ2 -based approximation. However, in so large data set (especially with an exhaustive search), a p-value of 0.05 (or even 0.01) is hardly significant. Therefore, we can conclude that for practical search purposes the new upper bounds give better approximations to the exact pF than the χ2 -measure.

150

C Implementation details

Table C.1: Comparison of the exact pF value, three upper bounds, and the p-value based on the χ2 -measure, when n = 1000. The best approximations are emphasized. E(m(XA)) is the expected frequency of XA under the independence assumption. 1) n = 1000, m(X) = 500, m(A) = 500, E(m(XA)) = 250 m(XA) pF ub1 ub2 ub3 p(χ2) 263 0.0569 0.0696 0.0674 0.0617 0.050 269 0.0096 0.0107 0.0105 0.0100 0.0081 275 0.00096 0.00103 0.00101 0.00100 0.00080 2) n = 1000, m(X) = 200, m(A) = 250, E(m(XA)) = 50 m(XA) pF ub1 ub2 ub3 p(χ2) 60 0.0429 0.0508 0.0484 0.0447 0.0340 63 0.0123 0.0137 0.0132 0.0125 0.088 68 0.00089 0.00094 0.00092 0.00089 0.00050 3) n = 1000, m(X) = 50, m(A) = 200, E(m(XA)) = 1 m(XA) pF ub1 ub2 ub3 p(χ2) 15 0.0559 0.0655 0.0605 0.0565 0.0349 17 0.0123 0.0135 0.0128 0.0124 0.0056 19 0.00194 0.00205 0.00198 0.00194 0.00050

C.3 Efficient implementation of Fisher’s exact test

151

Table C.2: Comparison of the exact pF value, three upper bounds, and the p-value based on the χ2 -measure, when n = 10000. The best approximations are emphasized. E(m(XA)) is the expected frequency of XA under the independence assumption. 1) n = 10000, m(X) = 5000, m(A) = 5000, E(m(XA)) = 2500 m(XA) pF ub1 ub2 ub3 p(χ2) 2541 0.0526 0.0655 0.0647 0.0621 0.0505 2559 0.0096 0.0109 0.0109 0.0106 0.0091 2578 0.00097 0.00105 0.00104 0.00102 0.00090 2) n = 10000, m(X) = 2000, m(A) = 2500, E(m(XA)) = 500 m(XA) pF ub1 ub2 ub3 p(χ2) 529 0.0504 0.0623 0.0611 0.0579 0.047 541 0.0100 0.0113 0.0112 0.0108 0.0090 554 0.00109 0.00118 0.00116 0.00114 0.00090 3) n = 10000, m(X) = 500, m(A) = 2000, E(m(XA)) = 10 m(XA) pF ub1 ub2 ub3 p(χ2) 115 0.0498 0.0608 0.0583 0.0541 0.0427 121 0.0105 0.0118 0.0115 0.0109 0.0080 128 0.00106 0.00114 0.00112 0.00108 0.00070

152

C Implementation details

References [1] C.C. Aggarwal and P.S. Yu. A new framework for itemset generation. In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS 1998), pages 18–24. ACM Press, 1998. [2] R. Agrawal, T. Imielinski, and A.N. Swami. Mining association rules between sets of items in large databases. In Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pages 207–216. ACM Press, 1993. [3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, VLDB’94, pages 487–499. Morgan Kaufmann, 1994. [4] A. Agresti. A survey of exact inference for contingency tables. Statistical Science, 7(1):131–153, 1992. [5] A. Agresti and Y. Min. Frequentist performance of Bayesian confidence intervals for comparing proportions in 2 × 2 contingency tables. Biometrics, 61:515–523, 2005. [6] M.-L. Antonie and O. R. Za¨ıane. Mining positive and negative association rules: an approach for confined rules. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD’04), pages 27–38. Springer-Verlag, 2004. [7] A. Asuncion and D.J. Newman. UCI machine learning repository, 2007. [8] G.A. Barnard. Significance tests for 2 × 2 tables. 34(1/2):123–138, 1947.

Biometrika,

[9] Y. Bastide, N. Pasquier, R. Taouil, G. Stumme, and L. Lakhal. Mining minimal non-redundant association rules using frequent closed item153

154

References sets. In Proceedings of the First International Conference on Computational Logic (CL’00), volume 1861 of Lecture Notes in Computer Science, pages 972–986. Springer-Verlag, 2000.

[10] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57(1):289–300, 1995. [11] Y. Benjamini and M. Leshno. Statistical methods for data mining. In The Data Mining and Knowledge Discovery Handbook, pages 565–87. Springer, 2005. [12] D. Benoit, E.D. Demaine, J.I. Munro, R. Raman, V. Raman, and S.S. Rao. Representing trees of higher degree. Algorithmica, 43(4):275–292, 2005. [13] F. Berzal, I. Blanco, D. S´anchez, and M. Amparo Vila Miranda. A new framework to assess association rules. In Proceedings of the 4th International Conference on Advances in Intelligent Data Analysis (IDA’01), pages 95–104. Springer-Verlag, 2001. [14] J. Blanchard, F. Guillet, R. Gras, and H. Briand. Using informationtheoretic measures to assess association rule interestingness. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), pages 66–73. IEEE Computer Society, 2005. [15] C. Borgelt. Apriori v5.14 software, 2010. http://www.borgelt.net/ apriori.html. Retrieved 7.6. 2010. [16] C. Borgelt and R. Kruse. Induction of association rules: Apriori implementation. In Proceedings of the 15th Conference on Computational Statistics (COMPSTAT 2002). Physica Verlag, Heidelberg, Germany, 2002. [17] J.-F. Boulicaut, A. Bykowski, and C. Rigotti. Approximation of frequency queris by means of free-sets. In Proceedings of the 4th European Conference Principles of Data Mining and Knowledge Discovery (PKDD’00), volume 1910 of Lecture Notes in Computer Science, pages 75–85. Springer-Verlag, 2000. [18] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In Proceedings ACM SIGMOD International Conference on Management of Data, pages 265– 276. ACM Press, 1997.

References

155

[19] D. Bruzzese and C. Davino. Visual post-analysis of association rules. Journal of Visual Languages & Computing, 14:621–635, December 2003. [20] K.C. Carriere. How good is a normal approximation for rates and proportions of low incidence events? Communications in Statistics – Simulation and Computation, 30:327–337, 2001. [21] G.W. Cobb and Y.-P. Chen. An application of Markov chain Monte Carlo to community ecology. The American Mathematical Monthly, 110:265–288, April 2003. [22] H.J. Cordell. Detecting gene–gene interactions that underlie human diseases. Nature Reviews Genetics, 10:392–404, June 2009. [23] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. The MIT Press, Cambridge, Massachussetts, London, England, 1990. [24] L. Dehaspe and H. Toivonen. Discovery of relational association rules. In S. Dˇzeroski and N. Lavraˇc, editors, Relational Data Mining, pages 189–212. Springer-Verlag, Berlin, Heidelberg, 2001. [25] E.S. Edgington. Randomization Tests. Marcel Dekker, Inc., New York, third edition, 1995. [26] FIMI. Frequent itemset mining dataset repository. http://fimi.cs. helsinki.fi/data/. Retrieved 10.2. 2009. [27] P.D. Finch. Description and analogy in the practice of statistics. Biometrika, 66:195–206, 1979. [28] R.A. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1925. [29] D. Freedman, R. Pisani, and R. Purves. Statistics. Norton & Company, London, 4th edition, 2007. [30] A. Gionis, H. Mannila, T. Mielik¨ainen, and P. Tsaparas. Assessing data mining results via swap randomization. ACM Transactions on Knowledge Discovery from Data, 1(3):14:1–14:32, 2007. [31] Inc GraphPad Software. Quickcalcs online calculators for scientists. http://www.graphpad.com/quickcalcs/contingency1.cfm. Retrieved 18.5. 2010.

156

References

[32] L. Guner and P. Senkul. Frequent itemset minning with trie data structure and parallel execution with PVM. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, Proceedings of the 14th European PVM/MPI User’s Group Meeting (PVM/MPI 2007), volume 4757 of Lecture Notes in Computer Science, pages 289– 296. Springer, 2007. [33] M. Haber. A comparison of some continuity corrections for the chisquared test on 2× 2 tables. Journal of the American Statistical Association, 75(371):510–515, 1980. [34] L.W. Hahn, M.D. Ritchie, and J.H. Moore. Multifactor dimensionality reduction software for detecting gene–gene and gene–environment interactions. Bioinformatics, 19:376–382, 2003. [35] M. Hahsler, K. Hornik, and T. Reutterer. Implications of probabilistic data modeling for mining association rules. In From Data and Information Analysis to Knowledge Engineering. Proceedings of the 29th Annual Conference of the Gesellschaft f¨ ur Klassifikation, Studies in Classification, Data Analysis, and Knowledge Organization, pages 598–605. Springer-Verlag, 2006. [36] J. V. Howard. The 2×2 table: A discussion from a bayesian viewpoint. Statistical Science, 13(4):351–367, 1998. [37] G. Jacobson. Space-efficient static trees and graphs. In Proceedings of the 30th Annual Symposium on Foundations of Computer Science (SFCS’89), pages 549–554. IEEE Computer Society, 1989. [38] R.J. Bayardo Jr., R. Agrawal, and D. Gunopulos. Constraint-based rule mining in large, dense databases. Data Mining and Knowledge Discovery, 4(2/3):217–240, 2000. [39] A. Kirsch, M. Mitzenmacher, A. Pietracaprina, G. Pucci, E. Upfal, and F. Vandin. An efficient rigorous approach for identifying statistically significant frequent itemsets. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS’09), pages 117–126. ACM, 2009. [40] Y.S. Koh and R. Pears. Efficiently finding negative association rules without support threshold. In Advances in Artificial Intelligence, Proceedings of the 20th Australian Joint Conference on Artificial Intelligence (AI 2007), volume 4830 of Lecture Notes in Computer Science, pages 710–714. Springer, 2007.

[41] S. Lallich, O. Teytaud, and E. Prudhomme. Association rule interestingness: Measure and statistical validation. In F. Guillet and H.J. Hamilton, editors, Quality Measures in Data Mining, volume 43 of Studies in Computational Intelligence, pages 251–275. Springer, 2007.
[42] S. Lallich, B. Vaillant, and P. Lenca. Parametrised measures for the evaluation of association rule interestingness. In Proceedings of the 11th Symposium on Applied Stochastic Models and Data Analysis (ASMDA'05), pages 220–229, 2005.
[43] E.L. Lehmann. The Fisher, Neyman-Pearson theories of testing hypotheses: One theory or two? Journal of the American Statistical Association, 88:1242–1249, December 1993.
[44] E.L. Lehmann and J.P. Romano. Testing Statistical Hypotheses. Texts in Statistics. Springer, New York, 3rd edition, 2005.
[45] J. Li. On optimal rule discovery. IEEE Transactions on Knowledge and Data Engineering, 18(4):460–471, 2006.
[46] B.W. Lindgren. Statistical Theory. Chapman & Hall, Boca Raton, U.S.A., 4th edition, 1993.
[47] B. Liu, W. Hsu, and Y. Ma. Pruning and summarizing the discovered associations. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 125–134. ACM Press, 1999.
[48] G. Liu, J. Li, and L. Wong. A new concise representation of frequent itemsets using generators and a positive border. Knowledge and Information Systems, 17(1):35–56, 2008.
[49] J.A. Major and J.J. Mangano. Selecting among rules induced from a hurricane database. Journal of Intelligent Information Systems, 4:39–52, January 1995.
[50] G.S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases (VLDB 2002), pages 346–357. Morgan Kaufmann, 2002.
[51] H. Mannila, H. Toivonen, and A.I. Verkamo. Efficient algorithms for discovering association rules. In Papers from the AAAI Workshop on Knowledge Discovery in Databases (KDD'94), pages 181–192. AAAI Press, 1994.

[52] N. Megiddo and R. Srikant. Discovering predictive association rules. In Proceedings of the 4th International Conference on Knowledge Discovery in Databases and Data Mining, pages 274–278. AAAI Press, 1998.
[53] R. Meo. Theory of dependence values. ACM Transactions on Database Systems, 25(3):380–406, 2000.
[54] J.H. Moore, F.W. Asselbergs, and S.M. Williams. Bioinformatics challenges for genome-wide association studies. Bioinformatics, 26(4):445–455, 2010.
[55] J.H. Moore and M.D. Ritchie. The challenges of whole-genome approaches to common diseases. JAMA: The Journal of the American Medical Association, 291(13):1642–1643, 2004.
[56] S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In Revised Papers from Large-Scale Parallel Data Mining, Workshop on Large-Scale Parallel KDD Systems, SIGKDD, volume 1759 of Lecture Notes in Computer Science, pages 127–144. Springer-Verlag, 2000.
[57] S. Morishita and J. Sese. Transversing itemset lattices with statistical metric pruning. In Proceedings of the nineteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems (PODS'00), pages 226–236. ACM Press, 2000.
[58] J. Neyman and E.S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika, 20A(3/4):263–294, 1928.
[59] S. Nijssen, T. Guns, and L. De Raedt. Correlated itemset mining in ROC space: a constraint programming approach. In Proceedings of the 15th ACM SIGKDD conference on Knowledge Discovery and Data Mining (KDD'09), pages 647–656. ACM Press, 2009.
[60] S. Nijssen and J.N. Kok. Multi-class correlated pattern mining. In Proceedings of the 4th International Workshop on Knowledge Discovery in Inductive Databases, volume 3933 of Lecture Notes in Computer Science, pages 165–187. Springer-Verlag, 2006.
[61] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In Proceedings of the 7th International Conference on Database Theory (ICDT'99), volume 1540 of Lecture Notes in Computer Science, pages 398–416. Springer-Verlag, 1999.
[62] E.S. Pearson. The choice of statistical tests illustrated on the interpretation of data classed in a 2 × 2 table. Biometrika, 34(1/2):139–167, 1947.
[63] G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W.J. Frawley, editors, Knowledge Discovery in Databases, pages 229–248. AAAI/MIT Press, 1991.
[64] R.J. Bayardo Jr. and R. Agrawal. Mining the most interesting rules. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'99), pages 145–154. ACM Press, 1999.
[65] K. Sadakane and G. Navarro. Fully-functional succinct trees. In Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'10), pages 134–149. Society for Industrial and Applied Mathematics, 2010.
[66] J.P. Shaffer. Multiple hypothesis testing. Annual Review of Psychology, 46:561–584, 1995.
[67] E.H. Shortliffe and B.G. Buchanan. A model of inexact reasoning in medicine. Mathematical Biosciences, 23:351–379, 1975.
[68] C. Silverstein, S. Brin, and R. Motwani. Beyond market baskets: Generalizing association rules to dependence rules. Data Mining and Knowledge Discovery, 2(1):39–68, 1998.
[69] T.M.F. Smith. On the validity of inferences from non-random samples. Journal of the Royal Statistical Society. Series A (General), 146(4):394–403, 1983.
[70] P. Smyth and R.M. Goodman. An information theoretic approach to rule induction from databases. IEEE Transactions on Knowledge and Data Engineering, 4(4):301–316, 1992.
[71] D.R. Thiruvady and G.I. Webb. Mining negative rules using GRD. In Advances in Knowledge Discovery and Data Mining, Proceedings of the 8th Pacific-Asia Conference (PAKDD 2004), volume 3056 of Lecture Notes in Computer Science, pages 161–165. Springer, 2004.

[72] G.J.G. Upton. A comparison of alternative tests for the 2 × 2 comparative trial. Journal of the Royal Statistical Society. Series A (General), 145(1):86–105, 1982.
[73] R. Vilalta and D. Oblinger. A quantification of distance bias between evaluation metrics in classification. In Proceedings of the Seventeenth International Conference on Machine Learning (ICML'00), pages 1087–1094. Morgan Kaufmann Publishers Inc., 2000.
[74] G. Webb. Personal communication, 2009.
[75] G.I. Webb. Discovering significant patterns. Machine Learning, 68(1):1–33, 2007.
[76] G.I. Webb. MagnumOpus software. http://www.giwebb.com/index.html. Retrieved 10.2.2009.
[77] G.I. Webb. Discovering significant rules. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'06), pages 434–443. ACM Press, 2006.
[78] G.I. Webb and S. Zhang. K-optimal rule discovery. Data Mining and Knowledge Discovery, 10(1):39–79, 2005.
[79] S.S. Wilks. The likelihood test of independence in contingency tables. The Annals of Mathematical Statistics, 6(4):190–196, 1935.
[80] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3):381–405, 2004.
[81] X. Wu, C. Zhang, and S. Zhang. Efficient mining of both positive and negative association rules. ACM Transactions on Information Systems, 22(3):381–405, 2004.
[82] H. Xiong, S. Shekhar, P.-N. Tan, and V. Kumar. Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (KDD'04), pages 334–343. ACM, 2004.
[83] Y.Y. Yao and N. Zhong. An analysis of quantitative measures associated with rules. In Proceedings of the Third Pacific-Asia Conference on Methodologies for Knowledge Discovery and Data Mining (PAKDD'99), pages 479–488. Springer-Verlag, 1999.

[84] F. Yates. Test of significance for 2 × 2 contingency tables. Journal of the Royal Statistical Society. Series A (General), 147(3):426–463, 1984.
[85] S.-J. Yen and A.L.P. Chen. An efficient approach to discovering knowledge from large databases. In Proceedings of the Fourth International Conference on Parallel and Distributed Information Systems (PDIS'96), pages 8–18. IEEE Computer Society, 1996.

Index

J-measure, 34
χ²-test, 30
z-score, 30, 33, 38
2 × 2 independence trial, 24
anti-monotonic property, 6
asymptotic tests, 29
binomial model, 36
Bonferroni adjustment, 20
branch-and-bound search, 71
breadth-first search, 71
certainty factor, 10, 34
closed set, 8, 125
closure, 124
comparative trial, 24
contingency table, 18
curse of dimensionality, 6
dependence, 17
dependence value, 17
dependency rule, 2, 15
dependent events, 17
dependent variables, 18
depth-first search, 71
double binomial model, 25, 35
double dichotomy, 24
enumeration tree, 69
enumeration problem, 5, 47
exchangeable measure, 26, 35, 38
extremeness relation, 23, 35
Fisher's p, 28, 63, 141
Fisher's exact test, 28, 36, 141
free set, 8, 125
generator, 125
goodness measure, 46
  decreasing, 46
  increasing, 46
  well-behaving, 49, 51
horizontal data layout, 140
hypergeometric model, 27, 36
hypothesis testing, 20
improvement, 40
independence, 17
independent events, 17
independent variables, 18
interest, 17
Lapis Philosophorum principle, 75
leverage, 9, 17
lift, 7, 17
log likelihood ratio, 30, 31
measures, 23
minimal generator, 125
minimality, 67
monotonic property, 6
multinomial model, 24, 35
multiple testing problem, 20
mutual information, 31
non-redundant rule, 3, 39, 47
odds ratio, 25
optimization problem, 5, 47
permutation test, approximate, 22
permutation test, exact, 22
permutation test, random, 22
Poisson model, 28
prefix tree, 137
productivity, 5, 41
randomization test, 20, 21
redundancy, 3
  with respect to M, 47
  with respect to p, 39
redundancy coefficient, 43
redundant rule, 4, 39, 47
sampling scheme, 24
search problem, 47
significance of productivity, 41
significance testing, Bayesian, 21
significance testing, frequentist, 20
significance testing, traditional, 20
significance, value-based, 33, 35
significance, variable-based, 23
spurious dependency, 2
spurious discovery, 20
spurious rule, 20
spuriously non-redundant rule, 40
statistical dependence, 17
statistical independence, 17
statistical significance, 20
succinct tree structure, 139
swap randomization, 23
trie, 137
type I error, 20
type II error, 20
urn metaphors, 24
value-based semantics, 18, 32
variable-based semantics, 18, 23
vertical data layout, 140
word RAM model, 139
