Association Mining in Time-Varying Domains

A Dissertation
Presented to the Graduate Faculty of the
University of Louisiana at Lafayette
In Partial Fulfillment of the Requirements for the Degree
Doctor of Philosophy

Antonin Rozsypal Fall 2003

© Antonin Rozsypal 2003. All Rights Reserved.

Association Mining in Time-Varying Domains Antonin Rozsypal

APPROVED:

Vijay V. Raghavan, Co-Chair
Distinguished Professor of Computer Science

Miroslav Kubat, Co-Chair
Associate Professor of Electrical and Computer Engineering, University of Miami

Chee-Hung Henry Chu
Associate Professor of Computer Science

Jongpil Yoon
Assistant Professor of Computer Science

Nabendu Pal
Professor of Mathematics

C. E. Palmer
Dean of the Graduate School

This thesis is dedicated to Marie and Marie, my beloved wife and daughter.

ACKNOWLEDGEMENTS

I would like to thank my advisors, Dr. Miroslav Kubat and Dr. Vijay Raghavan, for their continuous support, advice, encouragement, and all their time throughout my doctoral studies. Their patient guidance helped me to discover an amazing world of science with its thirst for knowledge. I also appreciate valuable feedback from my other committee members: Dr. Henry Chu, Dr. Jongpil Yoon, and Dr. Nabendu Pal. I would also like to thank Dr. William Edwards, who served on my prospectus committee. I would like to thank all my friends and colleagues in CACS, especially Wei Kian Chen and Dr. Ryan Benton, who taught me how to balance work and life and "maintain sanity," as Ryan liked to say. I am especially thankful to my friend, Dr. Jared Pearce, who took the time to carefully proofread my prospectus document. I would like to thank all others, not listed here, who contributed to this work in one way or another. My biggest thanks go to my wife, Marie, for her love, support, and encouragement at all times. I really value her decision to leave her home, parents, and friends and join me in a new country and culture. I would like to thank my parents for their continuous moral support. Last, but not least, I am grateful to our daughter, Marie, who has been with us for one year already, for enlightening our lives and giving me the strength to continue in my work.

TABLE OF CONTENTS

DEDICATION
ACKNOWLEDGEMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

1 INTRODUCTION
  1.1 KDD
    1.1.1 Data Selection
    1.1.2 Data Preprocessing
    1.1.3 Transformation
    1.1.4 Data Mining
    1.1.5 Interpretation/Evaluation
  1.2 Data Mining
    1.2.1 Data Mining Categories
    1.2.2 Data Mining Algorithms
    1.2.3 Association Mining
  1.3 Motivation
  1.4 Approach and Contributions
  1.5 Thesis Organization

2 RELATED WORK
  2.1 Problem Definition
  2.2 Algorithms for Itemset Mining
    2.2.1 Apriori
    2.2.2 AprioriTid
    2.2.3 Partitioning
    2.2.4 Other Association Mining Algorithms
  2.3 Time-Varying Domains
    2.3.1 Preliminary Work
    2.3.2 DEMON
      2.3.2.1 Significance of Deviation between Two Datasets
      2.3.2.2 Automated Pattern Detection
  2.4 Time-Varying Domains in Other Disciplines
    2.4.1 Change-Point Detection
    2.4.2 FLORA

3 FLORA IN ASSOCIATION MINING
  3.1 Definitions
  3.2 FLORA-Based Association Mining
  3.3 Context Change Detection
    3.3.1 Ganti's Heuristic
    3.3.2 History-Based Heuristics
      3.3.2.1 Emerged and Vanished Itemsets
      3.3.2.2 Jaccard Coefficient
      3.3.2.3 Frequency Distance
      3.3.2.4 Window vs. Increment
    3.3.3 Multivariate Binomial Heuristic
  3.4 Window Adjustment
  3.5 Window Update
  3.6 Complexity Analysis
  3.7 Summary

4 EXPERIMENTAL EVALUATION
  4.1 Data Generators
    4.1.1 IBM-Generator
    4.1.2 RK-Generator
  4.2 Evaluation Metrics
    4.2.1 Frequency Distance
    4.2.2 Penrose Distance
    4.2.3 Distance Measure Comparison
    4.2.4 Context Boundary Detection
  4.3 Component Testing
    4.3.1 Ganti's Heuristic Complexity
    4.3.2 History-Based Heuristic Assumptions
    4.3.3 Multivariate Heuristic Assumptions
  4.4 System-Wide Testing
    4.4.1 Abrupt Change
    4.4.2 Gradual Change
    4.4.3 Discussion

5 CONCLUSION AND FUTURE WORK
  5.1 System Summary
  5.2 Conclusion
  5.3 Future Work

BIBLIOGRAPHY

APPENDICES
  A HEURISTIC ASSUMPTIONS
  B EXPERIMENTS WITH ABRUPT CHANGE DOMAINS
    B.1 Accuracy of Context Approximation
    B.2 Window Size
  C EXPERIMENTS WITH GRADUAL CHANGE DOMAINS
    C.1 Accuracy of Context Approximation
    C.2 Window Size

ABSTRACT

BIOGRAPHICAL SKETCH

LIST OF TABLES

2.1 Notation used in the Apriori and AprioriTid algorithms.
4.1 The input parameters of the new data generator.
4.2 P-values of normality tests for the change detection variables.
4.3 P-values of the normality tests for five randomly selected itemsets.
4.4 Properties of the four test domains.

LIST OF FIGURES

2.1 Partitioning algorithm
2.2 FLORA2
3.1 Time-varying domain
3.2 Essence of FLORA
3.3 A system with fixed-size window
3.4 The course of a distance measure in a domain with four contexts
3.5 Emerged and Vanished itemsets
3.6 An example of a 2-variate normal distribution
4.1 Abrupt-change domain created by the IBM-generator
4.2 Gradual-change domain created by the IBM-generator
4.3 The course of the frequency distance and the Penrose distance
4.4 Run-times of the three groups of change detection heuristics
4.5 The course of the four distance measures
4.6 The immediate window size for two variants of the multivariate change detection heuristic
4.7 The abbreviations used in all graphs depicting results of the system-wide experiments
4.8 Comparison of average heuristic errors on RK-generator big-change data
4.9 Comparison of average heuristic errors on RK-generator small-change data
4.10 Comparison of average heuristic errors on IBM-generator big-change data
4.11 Comparison of average heuristic errors on IBM-generator small-change data
4.12 Percentage of detected true changes on RK-generator big-change data
4.13 Percentage of spurious changes on RK-generator big-change data
4.14 Percentage of detected true changes on RK-generator small-change data
4.15 Percentage of spurious changes on RK-generator small-change data
4.16 Percentage of detected true changes on IBM-generator big-change data
4.17 Percentage of spurious changes on IBM-generator big-change data
4.18 Percentage of detected true changes on IBM-generator small-change data
4.19 Percentage of spurious changes on IBM-generator small-change data
4.20 The numbers of emerged/vanished frequent itemsets in both the test domains: the RK-domain and the IBM-domain
4.21 The system's error on RK-generator data with gradual big change
4.22 The system's error on RK-generator data with gradual small change
4.23 The system's error on IBM-generator data with gradual big change
4.24 The system's error on IBM-generator data with gradual small change
4.25 Percentage of detected true changes on RK-generator big-change data with gradual change
4.26 Percentage of spurious changes on RK-generator big-change data with gradual change
4.27 Percentage of detected true changes on RK-generator data with gradual small change
4.28 Percentage of spurious changes on RK-generator small-change data with gradual change
4.29 Percentage of detected true changes on IBM-generator big-change data with gradual change
4.30 Percentage of spurious changes on IBM-generator big-change data with gradual change
4.31 Percentage of detected true changes on IBM-generator small-change data with gradual change
4.32 Percentage of spurious changes on IBM-generator small-change data with gradual change
5.1 System with and without reusing the knowledge of previous contexts

A.1 Histograms of four distance measures in the RK-domain
A.2 Histograms of four distance measures in the IBM-domain
A.3 Histograms of the observed distributions for five randomly selected itemsets in the RK-domain
A.4 Histograms of the observed distributions for five randomly selected itemsets in the IBM-domain
B.1 Error of baseline algorithms in domains with abrupt change
B.2 Error of Ganti's heuristic in domains with abrupt change
B.3 The error rate for the history-based heuristic with the emerged/vanished conservative distance measure. The increment is compared with the last block only.
B.4 The error rate for the history-based heuristic with the emerged/vanished progressive approach. The increment is compared with the last block only.
B.5 The error rate for the history-based heuristic with the Jaccard distance measure approach. The increment is compared with the last block only.
B.6 The error rate for the history-based heuristic with the frequency distance measure approach. The increment is compared with the last block only.
B.7 The error rate for the history-based heuristic with the emerged/vanished conservative approach. The entire window is being compared with the increment.
B.8 The error rate for the history-based heuristic with the emerged/vanished progressive approach. The entire window is being compared with the increment.
B.9 The error rate for the history-based heuristic with the Jaccard distance measure approach. The entire window is being compared with the increment.
B.10 The error rate for the history-based heuristic with the frequency distance measure approach. The entire window is being compared with the increment.
B.11 The error rate for the multivariate heuristic.
B.12 Window size for the baseline approaches in abrupt-change domains


B.13 The changing size of the window for Ganti's change detection heuristic in abrupt-change domains.
B.14 The changing size of the window for the history-based heuristic with the emerged/vanished conservative approach in abrupt-change domains. The increment is compared with the last block only.
B.15 The changing size of the window for the history-based heuristic with the emerged/vanished progressive approach in abrupt-change domains. The increment is compared with the last block only.
B.16 The changing size of the window for the history-based heuristic with the Jaccard distance measure approach. The increment is compared with the last block only.
B.17 The changing size of the window for the history-based heuristic with the frequency distance measure approach in abrupt-change domains. The increment is compared with the last block only.
B.18 The changing size of the window for the history-based heuristic with the emerged/vanished conservative approach in abrupt-change domains. The entire window is being compared with the increment.
B.19 The changing size of the window for the history-based heuristic with the emerged/vanished progressive approach in abrupt-change domains. The entire window is being compared with the increment.
B.20 The changing size of the window for the history-based heuristic with the Jaccard distance measure approach in abrupt-change domains. The entire window is being compared with the increment.
B.21 The changing size of the window for the history-based heuristic with the frequency distance measure approach in abrupt-change domains. The entire window is being compared with the increment.
B.22 The changing size of the window for the multivariate heuristic in abrupt-change domains.
C.1 Error for baseline algorithms in gradual-change domains
C.2 The error rate for Ganti's detection heuristic.
C.3 The error rate for the history-based heuristic with the emerged/vanished conservative approach. The increment is compared with the last block only.


C.4 The error rate for the history-based heuristic with the emerged/vanished progressive approach. The increment is compared with the last block only.
C.5 The error rate for the history-based heuristic with the Jaccard distance measure approach. The increment is compared with the last block only.
C.6 The error rate for the history-based heuristic with the frequency distance measure approach. The increment is compared with the last block only.
C.7 The error rate for the history-based heuristic with the emerged/vanished conservative approach. The entire window is being compared with the increment.
C.8 The error rate for the history-based heuristic with the emerged/vanished progressive approach. The entire window is being compared with the increment.
C.9 The error rate for the history-based heuristic with the Jaccard distance measure approach. The entire window is being compared with the increment.
C.10 The error rate for the history-based heuristic with the frequency distance measure approach. The entire window is being compared with the increment.
C.11 The error rate for the multivariate heuristic.
C.12 The changing size of the window for the baseline approaches in gradual-change domains
C.13 The changing size of the window in gradual-change domains for Ganti's detection heuristic.
C.14 The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished conservative approach.
C.15 The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished progressive approach.
C.16 The changing size of the window in gradual-change domains for the history-based heuristic with the Jaccard distance measure approach.
C.17 The changing size of the window in gradual-change domains for the history-based heuristic with the frequency distance measure approach.
C.18 The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished conservative approach. The entire window is being compared with the increment.


C.19 The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished progressive approach. The entire window is being compared with the increment.
C.20 The changing size of the window in gradual-change domains for the history-based heuristic with the Jaccard distance measure approach. The entire window is being compared with the increment.
C.21 The changing size of the window in gradual-change domains for the history-based heuristic with the frequency distance measure approach. The entire window is being compared with the increment.
C.22 The changing size of the window in gradual-change domains for the multivariate heuristic.

LIST OF ALGORITHMS

2.1 Procedure for generating all association rules from large itemsets
2.2 apriori-gen – the candidate-generating step of the Apriori algorithm
2.3 Algorithm Apriori
2.4 AprioriTid – a modified version of the Apriori algorithm
2.5 Determining statistically significant deviation between two datasets
2.6 FLORA2 – Window Adjustment Heuristic
3.1 The essence of association mining in time-varying domains
3.2 Modified version of Ganti's algorithm
3.3 History-Based Change Detector
3.4 Multivariate Binomial Change Detector
3.5 Main loop of the FLORA algorithm

CHAPTER 1: INTRODUCTION

The rapid development of computers in the second half of the 20th century enabled individuals and organizations to store and maintain huge amounts of data. The volume of data stored daily has been increasing every year, and this trend will almost certainly continue; [20] estimated that the amount of information in the world doubles every twenty months. With such an accumulation of data, traditional statistical and database-management techniques are becoming insufficient for many tasks. Statistical techniques were not intended for large volumes of data and suffer from efficiency problems, and existing database query languages are unable to answer simple queries such as "Give me all fraudulent uses of credit cards." This situation calls for novel techniques for the automated and effective discovery of new, implicit, or surprising knowledge hidden in these data. These circumstances gave rise to data mining, which is part of a broader area known as Knowledge Discovery in Databases (KDD). (The terms KDD and data mining are sometimes used interchangeably. In this work, the term data mining is used for the actual application of a data mining algorithm to already preprocessed data, whereas KDD refers to the whole process, starting with data collection and ending with interpretation of the obtained results.)

A popular subfield of data mining is association mining, used for finding frequently co-occurring items in transactional databases. A single transaction in this type of database can be defined, for example, as the list of items a customer pays for at a checkout register. The stored transactions are later analysed for items frequently found in the same market baskets. Such information can be used for placing associated items on neighboring shelves, advertising them in the same catalogues, or avoiding discounts on more than one of the associated items. However, the application of association mining goes well beyond market-basket analysis and includes areas like the stock

market, the medical field, and the Internet.

Many papers have been published dealing with various aspects of association mining, mainly focusing on the issue of efficiency. Interestingly, most of this research has considered data a static entity. The basic tenet of this dissertation, in contrast, is that data evolves over time. First, the database is not a static set of transactions: new transactions are added as customers register their purchases at the checkout desk. Second, customer buying habits change over time – they are affected by fashion, season, or the introduction of a new product. This dissertation first introduces a technique for detecting changes in associations and then evaluates this technique in a series of experiments under various conditions.

1.1 KDD

    The exchange (NYSE) gets a mere $130 million a year from selling its tape of share prices to outsiders, who then reconfigure the information and sell it on, earning, by some estimates, over $20 billion.
    The Economist, June 17, 2000

The essential task of KDD is "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" ([20]). However, before any analysis can be performed on the given data, the data must first be preprocessed and transformed into a suitable format and, after the mining step, interpreted and possibly visualized for the end user. The discipline of KDD can be naturally divided into the following five stages ([19]):

• Data Selection
• Data Preprocessing
• Transformation
• Data Mining
• Interpretation/Evaluation

1.1.1 Data Selection

The first step in analyzing any problem is to select relevant data that sufficiently represent the given task. Here, the emphasis is placed not on the amount of data but on the quality of the data selected. Selecting irrelevant data can considerably slow down the whole KDD process; moreover, it can lead to false conclusions. For example, an analysis of a time-variant system requires the selection of recent data, and a task involving time series needs data with an appropriate sampling rate. In addition, it is necessary to make sure that the units used in the data are consistent (for instance, Canadian dollars vs. U.S. dollars, British gallons vs. American gallons, etc.).

1.1.2 Data Preprocessing

After obtaining all the necessary data, the preprocessing stage takes place. Preprocessing consists of two main subtasks. The first subtask deals with incomplete data: for example, a customer forgets to fill in his phone number, or a patient refuses to provide his ethnic origin. In such cases the record is considered incomplete, and the preprocessing stage needs to take care of it. There are two options for handling this task:

• Remove incomplete examples – the most straightforward solution. However, with every deleted example, a piece of valuable information is lost.
• Estimate missing values – requires further processing. The three most common approaches are to estimate the values by (1) averaging, (2) filling in the most frequent value in the domain of the missing attribute, or (3) determining values from other attributes. This option keeps all the information; however, the estimation process is not guaranteed to be one hundred percent accurate.

The second obstacle is noisy data. This may be a measurement error, a typo, or purposely misleading information. Handling noise is difficult, since there is no certain way to obtain the true values, and not all noisy instances may be accounted for. However, there are at least two ways in which noise can be detected and corrected, based on simple rules or known facts:

• Remove outliers; e.g., values outside a predefined interval tend to be noise.
• Correct typos based on known facts; for instance, a husband is always male, never female.
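As a small, self-contained illustration of these preprocessing steps, the Python sketch below removes values outside a predefined interval and then fills missing values by averaging (for a numeric attribute) and by the most frequent value (for a categorical attribute). The field names, the toy records, and the [0, 120] interval are invented for the example; this is not code from the system described later in the dissertation.

# Sketch: handling noisy and incomplete records (hypothetical fields "age" and "state").
records = [
    {"age": 34,   "state": "LA"},
    {"age": None, "state": "LA"},    # missing numeric value
    {"age": 29,   "state": None},    # missing categorical value
    {"age": 930,  "state": "TX"},    # value outside any sensible interval: noise
]

# Noise handling: values outside a predefined interval are treated as outliers and removed.
records = [r for r in records if r["age"] is None or 0 <= r["age"] <= 120]

# Missing-value estimation (1): average the known values of a numeric attribute.
known_ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)

# Missing-value estimation (2): take the most frequent value of a categorical attribute.
known_states = [r["state"] for r in records if r["state"] is not None]
most_frequent_state = max(set(known_states), key=known_states.count)

for r in records:
    if r["age"] is None:
        r["age"] = mean_age
    if r["state"] is None:
        r["state"] = most_frequent_state

print(records)   # three cleaned records remain

Removing the outliers before estimating the missing values keeps the noise from distorting the computed averages.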

1.1.3 Transformation

In this stage, the data is transformed into a form suitable for the mining process. Depending on the goal of the task, a subset of useful attributes is selected to decrease the dimensionality of the dataset. This, in turn, leads to a denser space to explore. There are two main approaches to attribute selection:

• Wrapper approach – select a subset of attributes, perform data mining on a testing dataset, and test the quality of the induced knowledge. The approach then compares the quality of the induced knowledge for different subsets of attributes.
• Filter approach – select the attributes before learning, based on a certain set of criteria (for instance, in a classification problem, attributes that have a high correlation with the decision attribute).

Other techniques for data transformation include selecting representative examples, turning multi-value attributes into binary ones (thresholding), grouping several attributes into one category or breaking a single attribute into many (for example, Coke, Sprite, Pepsi, etc. can be represented as a soft drink), and normalizing data.
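As one possible illustration of the filter approach, the sketch below ranks binary attributes by the absolute value of their correlation with the decision attribute and keeps the k best ones. The toy dataset, the attribute names, and the choice of k are made up for the example; real filter criteria vary.

# Sketch of the filter approach to attribute selection (toy data, Pearson correlation).
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

# Each row: values of attributes a1, a2, a3 and the decision attribute (class label).
data = [
    (1, 0, 1, 1),
    (1, 1, 0, 1),
    (0, 0, 1, 0),
    (0, 1, 0, 0),
]
labels = [row[-1] for row in data]
scores = {}
for j, name in enumerate(["a1", "a2", "a3"]):
    column = [row[j] for row in data]
    scores[name] = abs(pearson(column, labels))

k = 2  # keep the two attributes most correlated with the class
selected = sorted(scores, key=scores.get, reverse=True)[:k]
print(selected)   # a1 correlates perfectly with the class, so it is always kept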

1.1.4 Data Mining

This is the part of KDD that receives the most attention. In the data mining step the preprocessed data is actually analyzed by a specific data mining algorithm. According to [19], such algorithms include searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, clustering, and association mining. A more detailed overview of data mining methods is given in Section 1.2.

1.1.5 Interpretation/Evaluation

Any knowledge that cannot be interpreted or explained has very little value for the end user. Therefore, once the data is analyzed, the results need to be interpreted and presented in an accessible manner. In many cases, graphs or tables serve this purpose, since graphical information is fairly straightforward to grasp.

1.2 Data Mining

This section focuses on the main tasks of data mining and the common ways to accomplish them; some basic algorithms are also discussed. (The following categorization of the main goals, principles, and basic algorithms comes from [19].) Data mining can be viewed as an intersection of many domains rather than as a pure discipline. The four main disciplines contributing to data mining include:

• Database Management: since data mining deals with huge volumes of data, an efficient way of accessing and maintaining data is necessary.

• Artificial Intelligence (AI): the field of AI contributes to tasks involving knowledge encoding or search techniques.
• Machine Learning (sometimes perceived as a subfield of AI): machine learning provides algorithms for inducing knowledge from given data.
• Statistics: statistics provides tools for measuring significance, constructing models of given data, estimating probabilities, and many other tasks.

The two main goals of data mining are verification and discovery. In verification, the system is used for verifying the user's hypotheses. In discovery, the main goal is to search for unknown knowledge hidden in the data. This dissertation is concerned with discovery, so the rest of this section focuses on it. The discovery part of data mining can be further divided into prediction, where the system learns a concept that is later used for predicting the future behavior of the domain, and description, where the system extracts knowledge or a model of the given entity in a user-accessible way. (The association mining task is an instance of a descriptive system.)

1.2.1 Data Mining Categories

The discovery part of data mining comprises several categories. As stated above, data mining can be categorized into two groups: predictive and descriptive data mining. However, the distinction between the two is sometimes not clear: some predictive models are easy to understand and can therefore also be used for description (for instance, decision trees), while some descriptive models may have a certain predictive power. This section lists the main data mining models and provides their commonly agreed categorization.

The goal of classification is to learn a function that maps an instance into one of

several prespecified classes. The function is commonly learned through the analysis of a set of classified training examples. Primarily, classification is a prediction technique.

Regression is a method of estimating a variable's dependence on other variables. [19] describes a special type of regression, nonlinear regression, as "a family of techniques for prediction that fit linear or nonlinear combinations of basis functions (sigmoids, splines, polynomials) to combinations of the input variables." If a linear function is used instead of a general one, a special type of regression is obtained, called linear regression. Regression belongs to the class of predictive methods.

A common descriptive method is clustering. Clustering is specified as categorizing a given set of data into a finite set of clusters, where the set of clusters may or may not be mutually exclusive and exhaustive. The generated clusters can also be represented in a richer structure, such as hierarchical categories.

Dependency modeling describes significant dependencies between variables. The dependencies have two levels: the structural level describes which variables are locally dependent, and the quantitative level specifies the strength of the dependencies. (Association mining can be viewed as a special kind of dependency modeling.)

Change and deviation detection is used for analyzing changes in data and finding significant deviations based on previously measured values or properties of the data.

1.2.2 Data Mining Algorithms

This section gives a brief overview of some of the most popular data mining algorithms, mainly those dealing with classification.

One type of classification algorithm, the decision tree, is a structure used for classifying an object based on univariate or multivariate splits. Each node of the tree contains a test, and the outcome of the test is used to decide the branch to be followed.

The classification of a new instance starts at the root node of the tree, and the class label of the object is decided by the leaf the object reaches. Many tree-learning and rule-induction algorithms have been described in the machine learning and applied statistics literature; for further details, the reader may refer to [57] or [11]. Classification trees are primarily used for prediction modeling, both for classification and regression.

Perhaps the best-known nonlinear-regression technique is the feedforward neural network. A typical neural network is a collection of nodes (neurons) arranged in layers, and every node in a layer is connected via links to all the nodes in the immediate neighborhood. A weight is associated with every link. Each node is a stand-alone unit that sums up the weighted values on its inputs and passes the result through a given test. The class of an instance to be classified is decided by the neuron with the highest output value in the last layer. Neural networks are a very powerful regression tool; however, the learned concept is hidden in the learned weights and is thus almost impossible to comprehend.

Instance-based methods are based on the idea of selecting representative examples from a training set. A new object is then classified according to its similarity to the stored examples, where similarity is measured with a given distance metric, in many cases the Euclidean distance. The closest stored example (or examples) is used to decide the object's class label. Similar to the nonlinear regression methods, instance-based classifiers can be asymptotically powerful in terms of approximation accuracy. However, they are very difficult to interpret because the model is hidden in the stored data and not explicitly formulated.

1.2.3 Association Mining

Yet another instance of data mining tasks is association mining. The goal of association mining is to search for frequently co-occurring patterns in transactional databases. The original application of association mining was in market basket analysis (see [3]). A market basket is defined as the list of items a customer buys, as registered at the checkout desk. Association mining searches for items that are frequently found in the same market baskets. Supermarkets can benefit from knowing these highly correlated items by placing associated items on neighboring shelves, advertising them in the same catalogues, or avoiding discounts on more than one associated item. Associations can be further transformed into rules of the form: if a customer purchases items A and B, he is likely to also purchase item C.

The applications of association mining go well beyond the original scope of market basket analysis. Association mining may be applied to any area where the data can be expressed in the form of transactions. In the Internet environment, a transaction may be formed by the links pointing to a Web page, and frequently co-occurring links then signal correlations among Web sites. In a medical application, the items can be the records in a patient's medical history, and associations among them may help in finding frequently co-occurring ailments. In stock market analysis, a transaction may contain the holdings in a mutual-fund portfolio, and the discovered associations can reveal companies that tend to be held at the same time.

Association mining takes a transactional database as input and returns a list of frequent itemsets that have been found in at least θ % of the transactions. The seminal paper by [3] introduced an algorithm, Apriori, that recursively generates the desired itemsets. The algorithm starts with finding all frequent itemsets, each consisting of a single item. In every consecutive step, say k, the itemsets of size k − 1


are combined to create candidate itemsets of size k. These newly generated candidate itemsets can be further pruned without the need to consult the database. The idea behind this pruning is that every frequent itemset of size k must have all of its subsets of size k − 1 among the itemsets of size k − 1. Only those candidates that "survive" the process are checked against the entire database. The process continues for gradually increasing itemset sizes until no new frequent itemsets are found.

Apriori is known to be slow when applied to large databases. Therefore, several expediting solutions have been proposed. For example, [15] focuses on frequently updated databases and proposes an incremental technique, [4] explores the possibilities of parallel search for associations, and [65] puts forth a sampling approach.

Most of the existing approaches consider the database to be a uniform entity. As such, the associations obtained from different parts of the database are assumed to be very similar, and all differences found are attributed to statistical fluctuations. In contrast, another research strand suggests that this assumption does not hold in all real-world databases: the associations may evolve over time as a result of changes in purchasing habits, swings of fashion, the introduction of a new product, or seasonal changes.

This time-varying paradigm is explored by [22]. The authors advocate the idea that real-world databases evolve by adding and deleting batches of transactions, and propose a way of efficiently maintaining such types of data. Ganti's focus is to create descriptive models of the domain using the three most popular data mining concepts: decision trees, clustering, and association mining. The user can decide whether the whole database or only the most recent "window" of records is to be used for model induction. In addition, a calendar choosing only certain blocks can be specified; for example, the analyst may require a model of data collected on all Mondays. Moreover, the authors propose a system for automatic discovery of interesting calendars.


Section 2.3.2 provides more details about Ganti’s approach.

1.3 Motivation

The main focus of this dissertation is time-evolving databases. Associations, usually called itemsets, in such databases tend to vary in importance as a result of the many factors influencing customers' buying habits. Such factors may include seasonal changes, swings of fashion, or the introduction of a new product or the discontinuation of an old one. Some itemsets may have only a local importance, whereas others can be found throughout the entire database. Such local itemsets, even though strong during their own period, may not have sufficient frequency in the whole database and thus may not be detected by traditional techniques. Moreover, analysts may be interested in the immediate supports of itemsets in a given time frame, rather than in their average supports.

1.4 Approach and Contributions

The proposed approach deals with domains where the market baskets arrive in batches. At each moment the system relies on the most recent batches, since the older ones may be outdated. The number of batches considered "recent" is likely to change over time. In addition, the notion of a context is introduced, denoting a stable part of the database.

The proposed system uses a window of the most recent market baskets that changes its size over time. The window grows in periods of stability, yielding a more precise description of the current context. Conversely, the window shrinks when a change is detected, thus allowing faster adaptation to the new context. The main emphasis of the system is on the accuracy of the induced model with respect to the current context.

The system is able to detect context boundaries and to provide a list of frequent


itemsets local to a given context. The system is also able to generate a list of frequent itemsets obtained from the entire database. The two principal research questions related to the proposed system are:

1. How to detect a change in context?
2. How to adjust the window upon change detection?

These questions are answered in Chapter 3, and the proposed system's performance is evaluated in Chapter 4. The proposed system involves:

• A variable-sized window for association mining.
• A change detection heuristic; the dissertation models and discusses three heuristics for measuring significant changes in the database.
• A window adjustment operator; the dissertation models and discusses three operators for window size adjustment.

The thesis contributions include the following:

• A system that is concerned with model accuracy with respect to the current context.
• A system that is able to partition the database into uniform segments by detecting context boundaries.

1.5 Thesis Organization

Chapter 1 introduces data mining as a subfield of the broader area of KDD. The focus then narrows in Chapter 2: first, detailed attention is paid to association mining in general, and then the specific issue of time-evolving domains is discussed. Chapter 3 gives the details of the proposed system for association mining in time-evolving databases. The proposed method is then experimentally evaluated in Chapter 4. Finally, Chapter 5 concludes the dissertation with a discussion of overall conclusions and directions for future work.

CHAPTER 2: RELATED WORK

Association mining deals with large databases comprising market-basket type transactions. The goal is to find strong associations among items – the items that frequently occur in the same transactions. This chapter provides a definition of the association mining task and then discusses the basic algorithms developed for mining associations.

2.1 Problem Definition

The problem of association mining was first defined in Agrawal's seminal paper ([3]):

Definition 2.1.1 Let I = {i1, i2, . . . , im} be a set of literals, called items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Associated with each transaction is a unique identifier, called its TID. We say that a transaction T contains X, a set of items in I, if X ⊆ T. Such an X is called an itemset. An itemset X has support s in the transaction set D if s % of the transactions in D contain X.

Definition 2.1.2 An association rule is an implication of the form X ⇒ Y, where X ⊂ I, Y ⊂ I, and X ∩ Y = ∅. The rule X ⇒ Y holds in the transaction set D with confidence c if, among the transactions in D that contain X, c % also contain Y. The rule X ⇒ Y has support s in the transaction set D if s % of the transactions in D contain X ∪ Y.

Informally stated, a transaction T is the set of items a customer purchases at the checkout desk. An itemset X is a set of items being bought together; its support s is the fraction of baskets containing all items from X.
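Stated as formulas (a direct restatement of the two definitions above, written in LaTeX notation with support expressed as a fraction rather than a percentage):

\[
\mathrm{supp}(X) \;=\; \frac{|\{\, T \in D : X \subseteq T \,\}|}{|D|},
\qquad
\mathrm{conf}(X \Rightarrow Y) \;=\; \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)},
\qquad
\mathrm{supp}(X \Rightarrow Y) \;=\; \mathrm{supp}(X \cup Y).
\]

These are the two quantities thresholded by the user-specified minsup and minconf in the problem statement quoted below.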


Algorithm 2.1 Procedure for generating all association rules from large itemsets.
Require: L is the set of all large itemsets
Ensure: R is the set of all association rules generated from L
for all l ∈ L do
  for all a ⊂ l such that a ≠ ∅ do
    if support(l)/support(a) ≥ minconf then
      add the rule a ⇒ (l − a) to R
    end if
  end for
end for

An association rule X ⇒ Y with confidence c and support s encodes the information that c percent of the customers buying the items from X also buy the items from Y (for instance, 80 % of the customers buying bread and butter also buy milk). The percentage of baskets containing the items from both X and Y is the support associated with {bread, butter} ⇒ {milk} (40 % of all customers buy bread, butter, and milk).

[3] provides the following definition for the problem of mining association rules: "Given a set of transactions D, the problem of mining association rules is to generate all association rules that have support and confidence greater than the user-specified minimum support (called minsup) and minimum confidence (called minconf) respectively."

The problem of mining all association rules can be naturally decomposed into two subtasks:

1. Find all itemsets with support above the user-specified minsup. The support of an itemset X is the percentage of transactions that contain X. Itemsets with support above minsup are called large, or frequent, itemsets; all other itemsets are called small, or infrequent, itemsets.
2. Use the large itemsets to generate the desired rules. The rule-generating process is shown in Algorithm 2.1.


The first step – generating large itemsets – is the more challenging task, where efficiency is the main concern. The main research issues are (1) pruning the search space and (2) effective access to the storage medium. Some approaches to the subtask of finding all large itemsets are discussed in Section 2.2.

The second step – generating association rules – is more straightforward. The number of itemsets generated tends to be relatively small compared to the size of the database, and thus generating all the rules in the main memory does not pose any substantial difficulty.
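To make the second subtask concrete, the Python sketch below mirrors Algorithm 2.1: given the large itemsets together with their supports, it emits every rule whose confidence reaches minconf. The item names, the support values, and minconf are toy values invented for the example.

# Sketch of rule generation from large itemsets (mirrors Algorithm 2.1).
# Supports are fractions of all transactions; all values are illustrative.
from itertools import combinations

support = {
    frozenset({"bread"}): 0.60,
    frozenset({"butter"}): 0.55,
    frozenset({"milk"}): 0.70,
    frozenset({"bread", "butter"}): 0.50,
    frozenset({"bread", "milk"}): 0.45,
    frozenset({"butter", "milk"}): 0.45,
    frozenset({"bread", "butter", "milk"}): 0.40,
}
minconf = 0.8

rules = []
for l in support:                          # every large itemset l
    for size in range(1, len(l)):          # every non-empty proper subset a of l
        for a in map(frozenset, combinations(l, size)):
            # a is in the table because every subset of a large itemset is large
            confidence = support[l] / support[a]
            if confidence >= minconf:
                rules.append((set(a), set(l - a), round(confidence, 2)))

for antecedent, consequent, c in rules:
    print(antecedent, "=>", consequent, c)
# e.g. {'bread', 'butter'} => {'milk'} has confidence 0.40 / 0.50 = 0.8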

2.2 Algorithms for Itemset Mining

The first research issue in itemset generation is the avoidance of exploring an exponentially large search space. Potentially, all subsets of I are candidates for large itemsets. Without any additional knowledge, a naive search algorithm would generate 2^|I| − 1 itemset candidates and check their support. Since a store may deal in thousands of items, this problem clearly becomes intractable. Fortunately, techniques for significant pruning of the search space have been developed, thus making the task computationally manageable.

The second issue is the effectiveness of accessing the database storage medium. The size of the database typically ranges from hundreds of thousands of transactions to several million. Despite the fact that the capacity of the main memory of today's computers grows at a substantial rate, databases of this size are still too large to fit in the main memory. Hence, the databases need to be stored on an external medium, typically a hard drive, which has a substantially longer access time than the operational memory. This, in turn, places a constraint on association mining algorithms to minimize the number of passes over the database.

k-itemset – An itemset having k items.
Lk – Set of large k-itemsets (those with minimum support). Each member of this set has two fields: (1) itemset and (2) support count.
Ck – Set of candidate k-itemsets (potentially large itemsets). Each member of this set has two fields: (1) itemset and (2) support count.
C̄k – Set of candidate k-itemsets when the TIDs of the generating transactions are kept associated with the candidates.

Table 2.1: Notation used in the Apriori and AprioriTid algorithms.

2.2.1 Apriori

Probably the first algorithm capable of dealing with large data was the Apriori algorithm, introduced in [6]. Since then, Apriori has become a de facto benchmark for measuring the performance of any association mining algorithm. The basic intuition employed in Apriori for pruning the itemset search space can be formulated as: "Any non-empty subset of a frequent itemset must be frequent as well." Conversely, all supersets of an infrequent itemset are also infrequent. Therefore, if at some stage of the mining process an itemset X is discovered to be infrequent, none of its extensions can be frequent, and they do not have to be tested for the minimum support condition.

The dissertation first introduces the notation used in the Apriori algorithm. Every database record takes the form ⟨TID, item*⟩, where TID is a unique identifier of the transaction and item* is the list of items the transaction comprises. The number of items in an itemset is called its size, and an itemset of size k is called a k-itemset. All items in a transaction and/or an itemset are kept in lexicographic order. The notation c[1] · c[2] · · · c[k] is used to represent a k-itemset consisting of items c[1], c[2], . . . , c[k]. The notation used in the algorithm is summarized in Table 2.1.

Apriori mines the database in passes, each pass dedicated to itemsets of size k (k


Algorithm 2.2 apriori-gen – the candidate-generating step of the Apriori algorithm
Require: Lk−1, the set of large (k − 1)-itemsets
Ensure: Ck, the set of candidate k-itemsets, Ck ⊇ Lk
{First, in the join step, Lk−1 is joined with Lk−1}
insert into Ck
  select p.item1, p.item2, . . . , p.itemk−1, q.itemk−1
  from Lk−1 p, Lk−1 q
  where p.item1 = q.item1, . . . , p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1;
{The second step, prune, weeds out candidates that do not have all their (k − 1)-subsets in Lk−1}
for all itemsets c ∈ Ck do
  for all (k − 1)-subsets s of c do
    if s ∉ Lk−1 then
      delete c from Ck
    end if
  end for
end for

being the pass number). The algorithm terminates when no large itemsets are found in the current pass. In the first pass, the algorithm simply counts the occurrences of individual items and determines the large 1-itemsets. Any subsequent pass, say k, consists of two steps. In the first step, all the candidate k-itemsets, Ck, are generated from the large (k − 1)-itemsets, Lk−1. This step consists of two substeps:

• join substep – creates all possible candidates by joining pairs of (k − 1)-itemsets that share a common (k − 2)-prefix and differ in the last item only.
• prune substep – weeds out all the candidates that do not have all of their (k − 1)-subsets in Lk−1.

This candidate-generating step is summarized in Algorithm 2.2. The second step, counting, involves one pass over the database. Here, the real support of the candidates is counted, and only those with support over the user-specified


Algorithm 2.3 Algorithm Apriori
Require: L1 = {large 1-itemsets}
Ensure: ∪k Lk contains all large itemsets
for (k = 2; Lk−1 ≠ ∅; k++) do
  Ck = apriori-gen(Lk−1)  {new candidates}
  for all transactions t ∈ D do
    Ct = subset(Ck, t)  {candidates contained in t}
    for all candidates c ∈ Ct do
      c.count++
    end for
  end for
  Lk = {c ∈ Ck | c.count ≥ minsup}
end for

minsup are retained. The critical part is looking up the candidates in each of the transactions. The authors employ a structure called a hash tree to reduce the number of candidates that need to be considered for each transaction. The items contained in the transaction are used for descending the hash tree. Each leaf of the tree contains a set of candidates with a common prefix found in the transaction. These candidates are then searched for in the transaction and, in case of success, their counts are incremented by one. The pseudocode for Apriori is given in Algorithm 2.3.

To illustrate the operation of Apriori, the dissertation examines one epoch of the algorithm. Let L3 be {{1, 2, 3}, {1, 2, 4}, {1, 3, 4}, {1, 3, 5}, {2, 3, 4}}. After the join substep, C4 will be {{1, 2, 3, 4}, {1, 3, 4, 5}}. The prune substep will delete the itemset {1, 3, 4, 5} because the itemset {1, 4, 5} is not in L3. C4 will finally contain only the itemset {1, 2, 3, 4}, and the database will be scanned for its support. If the support is above minsup, the itemset will be placed in L4; otherwise it will be discarded.
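The join and prune substeps can also be written down compactly. The Python sketch below reproduces apriori-gen on the L3 example just given; it is only an illustration, not the implementation used later in this dissertation, and it represents itemsets as sorted tuples.

# Sketch of apriori-gen (join + prune), applied to the L3 example above.
from itertools import combinations

def apriori_gen(L_prev, k):
    """Generate candidate k-itemsets from the large (k-1)-itemsets L_prev."""
    L_prev = set(L_prev)
    candidates = set()
    # Join substep: merge pairs sharing the first k-2 items (items kept sorted).
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune substep: drop candidates having an infrequent (k-1)-subset.
    pruned = set()
    for c in candidates:
        if all(s in L_prev for s in combinations(c, k - 1)):
            pruned.add(c)
    return pruned

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3, 4))
# -> {(1, 2, 3, 4)}; (1, 3, 4, 5) is pruned because (1, 4, 5) is not in L3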

2.2.2 AprioriTid

Algorithm AprioriTid is a variant of Apriori. It uses the same procedure for generating candidates as Apriori (see Algorithm 2.2). The main difference between the two algorithms is the way they count the support of the candidates after the generating step. Unlike Apriori, which needs to scan the database once in each pass to count the support, AprioriTid uses a new structure, named C̄k, for this purpose. Each member of C̄k takes the form ⟨TID, {Xk}⟩, where each Xk is a potentially large k-itemset present in the transaction with identifier TID. In the first pass, the C̄1 structure is equivalent to the database. In any subsequent pass, say k, the algorithm works as follows: (1) a set of candidates Ck is generated from the set of large (k − 1)-itemsets Lk−1 by the apriori-gen procedure (Algorithm 2.2); (2) every entry in the C̄k−1 structure is checked to determine whether a candidate c ∈ Ck can be formed out of that entry, and if so, an appropriate entry is added to the C̄k structure. The algorithm's pseudocode is given in Algorithm 2.4.

The main reason for introducing this algorithm was the fact that it does not need to scan the database repeatedly. Instead, it uses the information stored in the C̄k structure to obtain the counts of the candidates. Because the C̄k structure keeps only those entries that may contribute to the support of the large itemsets, it is likely to be much smaller than the entire database. Hence, it may fit in the main memory and considerably speed up the association mining process by avoiding disk accesses.

Experimental results in [6] suggest that the AprioriTid algorithm performs better in the later passes (higher k); however, it is slower than Apriori in the earlier passes. Therefore, the authors introduce a new algorithm, called AprioriHybrid, that uses both Apriori and AprioriTid: it starts with Apriori and, after a specified number of passes, switches to AprioriTid.


Algorithm 2.4 AprioriTid – a modified version of the Apriori algorithm
Require: L1 = {large 1-itemsets}
Ensure: ∪k Lk contains all large itemsets
for (k = 2; Lk−1 ≠ ∅; k++) do
  Ck = apriori-gen(Lk−1)  {new candidates}
  C̄k = ∅
  for all entries t ∈ C̄k−1 do
    Ct = {c ∈ Ck | (c − c[k]) ∈ t.set-of-itemsets ∧ (c − c[k − 1]) ∈ t.set-of-itemsets}
    for all candidates c ∈ Ct do
      c.count++
    end for
    if Ct ≠ ∅ then
      C̄k += ⟨t.TID, Ct⟩
    end if
  end for
  Lk = {c ∈ Ck | c.count ≥ minsup}
end for
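For illustration only, the following Python sketch performs one AprioriTid-style counting pass using the C̄k−1 entries instead of rescanning the database. The transactions, item names, and the minimum support count are invented for the example, and C̄ is written as C_bar in the code.

# Sketch of one AprioriTid counting pass (pass k) over the C_bar structure.
def apriori_tid_pass(C_bar_prev, Ck, k, minsup_count):
    """Count candidate k-itemsets against C_bar(k-1) and build C_bar(k)."""
    counts = {c: 0 for c in Ck}
    C_bar_k = {}
    for tid, itemsets in C_bar_prev.items():        # entries <TID, set of (k-1)-itemsets>
        Ct = set()
        for c in Ck:
            items = sorted(c)
            s1 = frozenset(items[:-1])                   # c minus its last item
            s2 = frozenset(items[:-2] + items[-1:])      # c minus its second-to-last item
            if s1 in itemsets and s2 in itemsets:        # both pieces occur in transaction tid
                counts[c] += 1
                Ct.add(c)
        if Ct:
            C_bar_k[tid] = Ct
    Lk = {c for c in Ck if counts[c] >= minsup_count}
    return Lk, C_bar_k

# C_bar_1: for each transaction, the large 1-itemsets it contains (toy data).
C_bar_1 = {
    100: {frozenset({"a"}), frozenset({"b"}), frozenset({"c"})},
    200: {frozenset({"a"}), frozenset({"c"})},
    300: {frozenset({"b"}), frozenset({"c"})},
}
C2 = {frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"b", "c"})}
L2, C_bar_2 = apriori_tid_pass(C_bar_1, C2, 2, minsup_count=2)
print(L2)   # {a, c} and {b, c} reach the minimum support count of 2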

2.2.3 Partitioning

The issue of multiple passes over the database was further addressed in the work of [61] and explored by their Partitioning algorithm. The need for multiple passes comes from the fact that the set of all potentially frequent itemsets, as generated from all available items, is too large to fit in the main memory (as mentioned before, the number of all itemset candidates is 2^n − 1, n being the number of items). Therefore, most of the existing association mining algorithms exploit the property that all the supersets of a small itemset are small as well. Applying this idea requires the set of large itemsets to be built gradually, from size 1 to the maximal length. However, suppose a relatively small superset of the potentially large itemsets is given; in this case, the support counting can be done in exactly one pass over the database. The Partitioning algorithm uses two scans of the database to obtain all large itemsets.

Figure 2.1: The Partitioning algorithm. The database is divided into smaller parts, each of them being mined for large itemsets (yielding local sets L1, L2, ..., Lk). The union L of all the local large itemsets is finally checked for support in one scan over the database.

itemsets. In the first pass, the Partition algorithm logically divides the database into a series of non-overlapping partitions, the size of each partition being less than the available main memory. Each of the partitions is mined for all local large itemsets. The term local means that the generated large itemsets are large with respect to the given partition. At the end of the first phase, the locally large itemsets from all the partitions are merged together to form a superset of the global large itemsets. An itemset is global if it is large with respect to the entire database. In the second pass, the exact support for the members of this superset is determined, and only those with support above minsup are reported to the user. The idea of the Partitioning algorithm is summarized in Figure 2.1.
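A minimal Python sketch of the two-phase idea follows; it assumes a brute-force miner for the local phase (practical only for tiny partitions), and the names are illustrative rather than Savasere's.

from collections import Counter
from itertools import combinations

def local_large(partition, minsup):
    # Itemsets that are large within one partition (brute force for brevity).
    counts = Counter()
    for trans in partition:
        for k in range(1, len(trans) + 1):
            for itemset in combinations(sorted(trans), k):
                counts[frozenset(itemset)] += 1
    return {i for i, c in counts.items() if c >= minsup * len(partition)}

def partition_mining(database, n_partitions, minsup):
    # Phase 1: union of local large itemsets; Phase 2: one global counting scan.
    size = len(database) // n_partitions
    parts = [database[i * size:(i + 1) * size] for i in range(n_partitions)]
    candidates = set().union(*(local_large(p, minsup) for p in parts))
    counts = Counter()
    for trans in database:
        for c in candidates:
            if c <= trans:
                counts[c] += 1
    return {c for c in candidates if counts[c] >= minsup * len(database)}

db = [frozenset(t) for t in ([{1, 2}, {1, 2, 3}, {2, 3}] * 4 + [{4, 5}] * 4)]
print(partition_mining(db, n_partitions=2, minsup=0.3))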

2.2.4 Other Association Mining Algorithms

Since the baseline algorithm presented in [3] tends to be slow when applied to large databases, considerable research effort has focused on accelerating the computations. The goal of this section is to present a brief overview of the work done in this area and


to point out some interesting approaches. [12] introduced DIC, a variant of the Apriori algorithm. DIC does not strictly separate the generating and counting steps. Instead, whenever a candidate's support during counting reaches the minsup, the algorithm generates additional candidates based on it and starts counting their support while continuing the scan over the database. A preprocessing technique based on the use of lattice structures was put forth by [2]. The Apriori algorithm benefits from reducing the size of candidate sets. However, under some circumstances (prolific frequent patterns, long patterns, or low minsup) the number of candidates can still be enormous. Moreover, repeatedly scanning the database and matching the candidates against every transaction is a substantial burden. These issues were addressed by [25], who developed a structure, the so-called FP-tree, for a highly condensed representation of the database. The authors propose a technique for efficiently searching the FP-tree for frequent itemsets. Another algorithm, based on TID (Transaction ID) intersections, was developed by [74]. The idea is to group related items into clusters representing potential maximal frequent itemsets.2 Clearly, every subset of a maximal frequent itemset must be frequent, and the exact support of all these subsets can be counted in one scan over the database. [14] applies the idea of association mining to exploring traversal patterns in a WWW-like environment. The authors develop a new technique to reduce the number of passes over the database. Another technique aiming at lowering the computational costs was proposed by [65], who suggests picking a sample of the database and finding all association rules in this sample. Provided the database is homogeneous and the sample is sufficiently large, the discovered itemsets are likely to be frequent in the whole database. The set of candidates is then verified by one pass over the database. If a

An itemset is called maximal frequent itemset if none of its supersets is frequent.


large itemset is suspected to be missing, another pass over the database is required. In many cases, the database is not located at one site; instead it can be spread among many centers. Consider, for example, the sales data of a large retail chain. Such a chain may have many regional centers, each maintaining its own records. Since each center can maintain millions of transactions, it would be very costly to download all of them to one centralized location. Instead, [4] explores the possibilities of parallel search and presents three algorithms with almost ideal scale-up properties. The issue of lowering communication overhead in a parallel environment is further addressed by [17] and their DMA algorithm. All the techniques introduced so far are of a batch nature. They assume the database to be a static entity that does not evolve over time. If new transactions are appended to or obsolete ones are deleted from the database, the association mining algorithm needs to be re-run on the entire database. This approach is sufficient for domains with only rare updates. However, in domains with frequent changes, when only a small portion of the database is added or deleted, the batch techniques pose a huge computational overhead. [15] address the issue of database additions with an incremental association mining algorithm and later, in [16], extend the technique to deletion and general update. This research direction was further explored by [64], who employ the so-called negative border3 to reduce the number of scans over the database, and [9], who claim to need at most one scan over the existing database and exactly one scan over the new database in the case of addition. [10] and [43] question the fundamental concept of association mining – finding all associations present in the database. The number of large itemsets found, they claim, can be too high for the user to comprehend. They solve this issue by applying a

3 This concept was introduced by [65]. The negative border, Bd−, is a set of minimal infrequent itemsets such that every subset of an itemset I ∈ Bd− is frequent.


heuristic to retain only those “interesting” itemsets. This research strand was also addressed in the works of [13], [62], and [54]. [52] focus on the issue of interactive systems where previously discovered associations may be needed by the next query. Therefore, they implement a knowledge cache for storing associations. Similar to the previous work, [55] stores found associations in a hash table. The notion of itemset tree, as presented in [24] and [37], addresses the fact that users are often interested only in a portion of itemsets containing some specified items. The itemset tree enables response to such targeted queries. In addition, the itemset tree is easy to update whenever a new batch of transactions is added, thus making it especially suitable for frequently updated databases.

2.3 Time-Varying Domains

Another research strand in association mining suggests that the set of large itemsets may vary in time, as a result of changes in purchasing habits. Such changes may be caused by swings of fashion, the introduction of a new product, or seasonal changes.

2.3.1 Preliminary Work

Most of the research in association mining in time-varying databases aimed at reducing the overhead of rediscovering association rules in the presence of data updates. Fewer researchers have addressed the issue of evolution of the concept underlying the database. Still, some effort has been spent on this area. One may mention the work of [15], which was primarily developed for maintaining evolving databases. In Cheung’s approach the database is updated by blocks of transactions that arrive in batches. Such a mechanism can be used for monitoring evolution in the data or to update the


knowledge, based on more recent batches. However, the system is very slow in reacting to abrupt changes. [1] suggests an approach for discovering localized associations using a clustering technique. This method is not explicitly targeted for time-varying domains, however the extension should be quite straightforward – itemsets local to one time frame are likely to be in the same cluster. Yet another technique, developed by [58], was originally meant to speed up the process of association mining. The approach accepts data that arrives in blocks. At the end the algorithm reports two groups of itemsets – large and emerged large. Large itemsets were found to be large in all blocks of the database, while emerged large itemsets were missing in some of the blocks, but later “emerged”. Together the large and emerged large itemsets form a superset of the knowledge obtained by a traditional association mining algorithm. This approach can be used for time-varying domains, however its reaction to vanished itemsets can be quite slow. [5] introduced so-called active data mining. In this approach the database is partitioned according to time periods and each of the periods is mined for a set of association rules. These rules are collected in a rulebase and the user can create trigger conditions on specific patterns in the history of the rules. The issue of evolving rules was further explored in the work of [44]. Similarly to the previous approach, the database is partitioned according to some user specifications and each of the partitions is mined for association rules. The algorithm looks for changes in rules’ supports and confidences that cannot be “explained” by other changes. These so-called fundamental rules are reported to the user. The concept of cyclicity in itemset supports was addressed by [53]. The database is, again, assumed to be partitioned into disjoint segments. The proposed system is able to find itemsets that periodically “emerge” above the minsup but their overall support may not be sufficiently high. The concept of “seasonal” patterns was further extended by [60] who proposed a system that uses the


idea of mining association rules from parts of the database being selected via a calendar algebra. An idea similar to this was put forth also by [12] and [42]. Another research strand suggests that even some of the items can be found only in specific time intervals. Thus, the average support of itemsets comprising these emerged items may not have sufficient support throughout the entire database. For example, in a bookstore sales database a recently published bestseller will not have as high support as a less favorite book that has been published earlier. This issue was investigated in the works of [7] and [41]. [59] proposed a window-based approach for mining time-evolving databases. Specifically, the association mining task is not performed over the whole database but only over the transactions that can be seen through a user-specified window.

2.3.2 DEMON

The most recent work directly addressing the issue of systematically evolving data was developed by [22]. The authors point out that real-world databases do not evolve in an arbitrary way; rather, the evolution can be described as the addition and deletion of record batches at a time. The database is considered as a series of blocks, and the framework describes a way of efficiently maintaining this type of data. The main goal of the work is to create descriptive models of the data using the most popular concepts in data mining – decision trees, clustering, and association mining. The models can be created from the entire database or from a window of the most recent records. In the case of the window, the user has to specify how many blocks from the most recent history should be used. The work also uses another type of selection constraint, the block selection predicate, which is a form of a calendar. The calendar is a binary sequence, each 1 denoting that the block is used for model creation, 0 meaning that the block is skipped. In the case of the most recent window, the calendar can be either relative to the beginning of the database or relative to the beginning of the window. For example, the analyst may need a model of data collected on all Mondays to analyze the sales after the weekend, or he may be interested in modeling data collected on all Mondays in the past month, or the required model should include the same days of the week as today from the past month. The use of a calendar brings up another issue. The user or analyst using the system must know exactly which blocks from the database he is interested in. This is not always the case, and the user may be interested in discovering new calendar patterns. Therefore, the authors propose a system for automatic detection of interesting selection constraints.
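A block selection predicate of this kind is easy to express; the short Python sketch below is illustrative only (the block names and dates are hypothetical, not taken from DEMON).

from datetime import date, timedelta

def select_blocks(blocks, calendar):
    # Apply a calendar-style block selection predicate: calendar[i] == 1 keeps block i.
    return [b for b, keep in zip(blocks, calendar) if keep]

# Hypothetical daily blocks over four weeks; the calendar keeps Mondays only.
start = date(2003, 9, 1)                      # a Monday
blocks = [f"block-{start + timedelta(days=i)}" for i in range(28)]
mondays = [1 if (start + timedelta(days=i)).weekday() == 0 else 0 for i in range(28)]
print(select_blocks(blocks, mondays))         # the four Monday blocks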

2.3.2.1 Significance of Deviation between Two Datasets

A system capable of detecting patterns in block series needs to be able to measure deviations between sets of blocks and to determine when a deviation is significant. The authors have developed such a framework in [23]. They propose a general framework for computing a measure of difference between datasets, called deviation. The framework is applicable to any concept whose models can be expressed in the form of a structural and a measure component. The structural component identifies the "interesting" regions, and the measure component quantifies the subset of the data that is mapped to each of the regions. For instance, in the association mining model, the structural component is the set of large itemsets and the measure components are their supports. Given two datasets and models induced from these datasets, the system creates the least common structural component and computes the measure component for the


Algorithm 2.5 Determining statistically significant deviation between two datasets
Require: M1, M2 are models induced by D1, D2, δ is a deviation measure, and m, n are user-specified constants.
Ensure: p̂d = P(F > δ(M1, M2)).
Draw random samples S1¹, S1², . . . , Sn¹, Sn² of size m % each from D1.
Compute D = d1, . . . , dn where di = δ(Mi¹, Mi²) and Mi¹, Mi² are induced by Si¹, Si².
Let p̂d be the fraction of deviations in D that exceed δ(M1, M2).
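The bootstrapping idea behind Algorithm 2.5 can be sketched as follows; this Python fragment is illustrative only, with a toy deviation measure and toy "models" standing in for the association mining models.

import random

def bootstrap_p_value(d1, observed_deviation, deviation, induce, n=50, m=0.1):
    # Estimate P(F > delta(M1, M2)) by drawing n pairs of samples (fraction m)
    # from D1 and measuring how often their mutual deviation exceeds the observed one.
    size = max(1, int(m * len(d1)))
    devs = []
    for _ in range(n):
        s1, s2 = random.sample(d1, size), random.sample(d1, size)
        devs.append(deviation(induce(s1), induce(s2)))
    return sum(d > observed_deviation for d in devs) / n

# Toy usage: "models" are sample means, the deviation is their absolute difference.
random.seed(0)
d1 = [random.gauss(0, 1) for _ in range(1000)]
p = bootstrap_p_value(d1, observed_deviation=0.5,
                      deviation=lambda a, b: abs(a - b),
                      induce=lambda s: sum(s) / len(s))
print(p)   # a small value suggests a deviation of 0.5 would be unusual between like samples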


pD (X) follows a normal distribution with the following parameters:

µ_X = f_D(X),    σ_X = sqrt( f_D(X) · (1 − f_D(X)) / |D| )

Informally, the Lemma states that the actual probability of an itemset can be estimated by its frequency in the database. The properties of the estimate depend on the frequency of the itemset and the size of the database. The Lemma may now be used to compare the frequencies of itemsets in the window and the block. First, the derivation deals with the case of a single itemset. Let X be an itemset and let fW (X) and fB (X) denote the frequencies of X in the window and the block, respectively. Let nW and nB denote the sizes of the two samples. If nW and nB are large, dX = fW (X) − fB (X) can be approximated by a normal distribution with the following parameters:

µ_dX = f_W(X) − f_B(X),    s_dX = sqrt( f_W(X)(1 − f_W(X))/n_W + f_B(X)(1 − f_B(X))/n_B )    (3.9)

If the distribution of X is the same in the window and the block, the next statistic follows a normal distribution N (0, 1), and confidence intervals can be used to decide whether the difference of the two frequencies is statistically significant:

z_X = ( f_W(X) − f_B(X) ) / s_dX    (3.10)



Figure 3.6: An example of a 2-variate normal distribution.

At this point, the method naturally extends to the multiple itemset scenario. Assume that the differences of all the itemsets follow a normal distribution. Under the simplifying assumption that the occurrences of individual itemsets are pairwise independent, the itemset frequencies together form a multivariate normal distribution (Figure 3.6 shows an example of a 2-variate normal distribution). The estimated mean of the distribution is

D = (d_X1, . . . , d_Xk)    (3.11)

and the estimated covariance matrix is

Σ = (s_ij),  where s_ij = s²_di for i = j, and s_ij = 0 for i ≠ j    (3.12)


Algorithm 3.4 Multivariate Binomial Change Detector
Require: Frequencies of large itemsets in the window and the block
Ensure: Change detection
Calculate the z_k statistic using Equation 3.13
if z_k > χ²(0.05; k) then
  return 'Change detected'
end if

[32] derives that for a random observation X from N_k(X̄, Σ), the quantity (X − X̄)' Σ⁻¹ (X − X̄) follows a χ² distribution with k degrees of freedom. Applying this result to the above scenario with k itemsets leads to a statistic:

z_k = D' Σ⁻¹ D = Σ_{i=1..k} d_i · s_dX_i⁻² · d_i
    = Σ_{i=1..k} ( f_W(X_i) − f_B(X_i) )² / ( f_W(X_i)(1 − f_W(X_i))/n_W + f_B(X_i)(1 − f_B(X_i))/n_B )    (3.13)

The derived statistic z_k follows a χ² distribution with k degrees of freedom. Hence, the fact that z_k satisfies the expression z_k > χ²(0.05; k) can be interpreted as evidence that there is a significant difference between the concepts that underlie the window and the block. The change detection heuristic is summarized in Algorithm 3.4. Another research question is which itemsets should be considered as the algorithm input. As was shown before (Section 3.3.2), not all itemsets that are large in one context are large in another context. Generally, three answers to this question can be thought of:

1. Consider the union of large itemsets in both the models and substitute missing frequency values with zero.
2. Consider the union of large itemsets in both the models and scan the database once more to obtain the missing values.
3. Consider only itemsets that are large in both the models.

Intuitively, the first solution will artificially increase the z_k statistic, which in turn leads to biased decisions. Most probably, this solution would lead to many false change detections. The other two solutions both have advantages and drawbacks, and it is hard to say directly which one is better. The evaluation of the last two approaches is carried out in Chapter 4.
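A small Python sketch of the multivariate change test follows, using the third (intersection) input and assuming SciPy's chi-square quantile function for the critical value; the itemset frequencies in the usage example are hypothetical.

from scipy.stats import chi2

def multivariate_change(freq_window, freq_block, n_w, n_b, alpha=0.05):
    # Compute the z_k statistic of Equation 3.13 over itemsets frequent in both
    # models and compare it with the chi-square critical value.
    common = set(freq_window) & set(freq_block)     # only itemsets large in both models
    z_k = 0.0
    for itemset in common:
        fw, fb = freq_window[itemset], freq_block[itemset]
        variance = fw * (1 - fw) / n_w + fb * (1 - fb) / n_b
        z_k += (fw - fb) ** 2 / variance
    return z_k > chi2.ppf(1 - alpha, df=len(common)), z_k

# Hypothetical frequencies of three itemsets in the window and the new block.
w = {("a",): 0.30, ("a", "b"): 0.12, ("c",): 0.25}
b = {("a",): 0.18, ("a", "b"): 0.05, ("c",): 0.26}
print(multivariate_change(w, b, n_w=20000, n_b=1000))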

3.4 Window Adjustment

The FLORA-inspired Algorithm 3.1 suggests growing the window of transactions in periods of stability by adding new increments. When a concept drift is suspected, the window needs to react to this change by removing older blocks that may not be relevant to the new concept. This work introduces three operators for window size control (a small sketch follows the discussion below):

1. Harsh operator – if a change alarm is issued by the change detection heuristic, remove all blocks except for the latest one.
2. Reluctant operator – if a change alarm is issued by the change detection heuristic, do nothing. Only if the change is confirmed in the next step, remove all old blocks except for the two latest.
3. Opportunistic operator – if a change is detected by the change-detection heuristic, remove 50 % of the oldest blocks. If the change is detected in the next step as well, remove all the remaining blocks except for the latest two.

The harsh operator reacts immediately to a detected drift. This approach is designed to work well in domains with sudden abrupt changes, where keeping old blocks in the window would introduce a high bias in the induced model. However, this operator may easily overreact to false alarms from the change detection heuristic.


Preventing this danger is the task of the other two operators, reluctant and opportunistic. However, these operators may be slow to react to a sudden change and prolong the period of adjustment.
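The three operators can be summarized in a few lines of Python; the sketch below is illustrative and treats the window simply as a list of blocks, oldest first.

def adjust_window(blocks, operator, change_now, change_before):
    # Apply one window-adjustment operator after consulting the change-detection heuristic.
    if operator == "harsh" and change_now:
        return blocks[-1:]                       # keep only the latest block
    if operator == "reluctant" and change_now and change_before:
        return blocks[-2:]                       # act only on a confirmed change
    if operator == "opportunistic" and change_now:
        if change_before:
            return blocks[-2:]                   # confirmed: drop everything but the latest two
        return blocks[len(blocks) // 2:]         # suspected: drop the oldest 50 %
    return blocks

window = ["b1", "b2", "b3", "b4", "b5", "b6"]
print(adjust_window(window, "opportunistic", change_now=True, change_before=False))
# ['b4', 'b5', 'b6'] -- half of the oldest blocks removed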

3.5 Window Update

Each epoch of the FLORA algorithm concludes by merging the window and the new increment (apart from the cases when a change is detected and the window content is discarded). This step has two phases: (1) merge the two datasets, and (2) update the frequent itemsets in the window. Section 3.2 suggests accomplishing the second phase by running Apriori on the updated window and argues that the window size is relatively small; hence, the computational costs will not be prohibitive. This section proposes a much cheaper way of updating the window description, based on the work of [61] (see Section 2.2.3). The Partitioning algorithm divides the database into a series of non-overlapping partitions that can fit into main memory and obtains frequent itemsets that are local to each of them. The work then proves that for an itemset to be frequent in the whole database, it must be frequent in at least one of the partitions. The actual frequency of all such candidates is then verified in one pass over the database. This idea can be easily adapted to the window-update task. The dataset – the updated window – is naturally divided into two partitions: the old window and the increment. All the frequent itemsets in both the partitions are already known and, according to Savasere, no other itemset can be frequent in the new window. The real support of these candidates can be easily calculated by scanning the updated window. The effort spent in scanning can be further reduced by dividing the candidate itemsets into three groups according to their frequency in the old window and the increment (a sketch of this update follows the list below):


Algorithm 3.5 Main loop of the FLORA algorithm
1: read a new block
2: check the supports of itemsets in the increment
3: calculate the difference between the window and the increment
4: if the difference is significantly large then
5:   reduce the window size by deleting from it some older blocks
6: end if
7: add the block to the window
8: update the supports accordingly

1. Itemsets frequent in both partitions – no scan is needed; the support can be obtained by adding their local supports in the two partitions.
2. Itemsets frequent in the old window and infrequent in the increment – only the increment needs to be scanned for the support of these itemsets.
3. Itemsets infrequent in the old window and frequent in the increment – itemset supports need to be obtained only from the old window.
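The following Python sketch illustrates this grouping; the itemset counts in the toy example are illustrative rather than the output of a full Apriori run.

def update_window_supports(old_counts, inc_counts, old_window, increment, minsup):
    # Merge the frequent itemsets of the old window and the increment,
    # counting only what is missing, following the partitioning argument above.
    n_new = len(old_window) + len(increment)

    def count(itemset, transactions):
        return sum(1 for t in transactions if itemset <= t)

    merged = {}
    for itemset in set(old_counts) | set(inc_counts):
        if itemset in old_counts and itemset in inc_counts:      # group 1: no scan
            total = old_counts[itemset] + inc_counts[itemset]
        elif itemset in old_counts:                              # group 2: scan the increment
            total = old_counts[itemset] + count(itemset, increment)
        else:                                                    # group 3: scan the old window
            total = inc_counts[itemset] + count(itemset, old_window)
        if total >= minsup * n_new:
            merged[itemset] = total
    return merged

old_window = [frozenset(t) for t in ({1, 2}, {1, 2, 3}, {2, 3}, {1, 2})]
increment = [frozenset(t) for t in ({2, 3}, {3, 4}, {2, 3, 4}, {3, 4})]
old_counts = {frozenset({1, 2}): 3, frozenset({2}): 4, frozenset({2, 3}): 2}
inc_counts = {frozenset({3, 4}): 3, frozenset({2, 3}): 2, frozenset({3}): 4}
print(update_window_supports(old_counts, inc_counts, old_window, increment, minsup=0.5))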

3.6 Complexity Analysis

This section aims at comparing the theoretical computational costs of the FLORA algorithm when employing the three change detection heuristics. The analysis shows that a costly detection heuristic can become a bottleneck of the whole algorithm. Algorithm 3.5 depicts the main loop the system goes through. Let TI, TW be the sets of transactions in the increment and the window, respectively. Let l be the maximum transaction size; let FW be the set of frequent itemsets in the window and FI be the set of itemsets with high support in the new block. Let H (History) be the number of blocks in the window and m, n be the parameters of the Ganti detection heuristic (Algorithm 3.2).

Analysis:
1: The system reads the new block, at most |TI| · l items, from the storage medium. Cost: O(|TI| · l).
2: Checking the itemset supports in the increment requires one run of an association mining algorithm (Apriori). Cost: O(2^|I| · |TI|).
3: The calculation of the difference measure requires a loop over all itemsets. Cost: O(|FW ∪ FI|).
4: The complexities of the three change detection heuristics are:
   a) In Ganti's heuristic, n (typically 50) pairs of bootstrap samples of size m are generated and checked for itemset supports; the procedure requires 2n runs of Apriori. Cost: 2n · O(2^|I| · m).
   b) In the history-based heuristics, the significance of the difference is assessed by a t-test using the history of previous differences. Cost: O(|H|).
   c) In the multivariate binomial heuristic, the change detection procedure involves checking the significance of the computed z_k statistic (a table look-up). Cost: O(1).
5: In case a change is suspected, a certain percentage of outdated transactions needs to be removed from the window. This can be done in constant time (pointer redirection). Cost: O(1).
6: No operation. Cost: O(1).
7: Adding a block of transactions to the window is a constant-time operation, since all the transactions are already in memory and merging them involves only pointer redirection. Cost: O(1).
8: Updating the supports of itemsets using the window-update algorithm means scanning the new window for the supports of all candidate itemsets. Cost: O(|FW ∪ FI| · (|TW| + |TI|)).

The above analysis shows that the critical points in the system are all the steps that involve the use of Apriori. The rest of the steps have a polynomial complexity in the number of frequent itemsets and/or the size of the database, but Apriori suffers from an exponential complexity. When the system employs either the history-based or the multivariate binomial change detection heuristic, each loop of the system involves only one run of Apriori. Hence, the complexity of one loop is O(2^|I| · |TI|). On the other hand, the use of the original Ganti's change detection heuristic (Algorithm 2.5) requires 2n + 1 runs of Apriori. This means that Ganti's heuristic slows down the system by a factor of roughly 2n as compared to the other heuristics. The modified Ganti's heuristic (Algorithm 3.2) can substantially reduce the computational costs. The bootstrap table is calculated once for each stable context and needs to be recalculated only when encountering a new context. In the case of a stable context, Step 4 of the loop reduces to a look-up in the bootstrap table and the complexity becomes the same as that of the other heuristics. However, in the case of a change alert, regardless of whether a true or a spurious change was detected, the bootstrap procedure must be run with the above-mentioned costs.

3.7 Summary

This chapter has proposed a window-based system for association mining in time-varying domains. The system's main goal is to achieve high accuracy in approximating the current context, where a context is defined as a continuous series of transactions with a stable set of frequent itemsets. The main components of the system are: (1) a variable-sized window that grows in periods of stability and shrinks whenever a change in context is detected, (2) a change detection heuristic, and (3) a window adjustment operator. This chapter has discussed three change detection heuristics: Ganti's heuristic, the history-based heuristic, and the multivariate heuristic. The history-based heuristic employs a difference measure, and this dissertation works with four such measures: emerged/vanished conservative, emerged/vanished progressive, Jaccard coefficient, and frequency distance. Each of these measures can operate in two modes: (1) only the last block in the window is compared to the increment, and (2) all the blocks within the window are compared to the increment. In total, this thesis operates with ten change detection heuristics that belong to three main groups. As for the window adjustment operator, three such operators have been proposed: harsh, reluctant, and opportunistic. Considering all these change detection heuristics and window adjustment operators, the proposed system can operate in thirty different settings. The accuracy of context approximation is likely to be different for every setting and is experimentally evaluated in Chapter 4.

CHAPTER 4: EXPERIMENTAL EVALUATION

This chapter describes several experiments conducted to evaluate the quality of the proposed technique. Researchers in many areas run experiments on datasets with easily controlled properties to better target specific features of their methods. The following section, 4.1, introduces a time-varying modification of the standard IBM-generator and a new data generator for concept-drift modeling. Section 4.2 discusses two measures for evaluating the accuracy of concept approximation and two measures for assessing the quality of context change detection. Section 4.3 experimentally evaluates individual components of the proposed system: Section 4.3.1 compares the computational costs due to the change detection heuristics, and Sections 4.3.2 and 4.3.3 verify the normality assumptions made by the history-based and multivariate heuristics (see Section 3.3). The next section, 4.4, provides results for experiments with abrupt and gradual concept drift. Each of these experiments is followed by a brief discussion, and the chapter concludes with an overall discussion.

4.1 Data Generators

4.1.1 IBM-Generator

Most of the research in the association mining field uses a generator of testing data which was introduced by [6]. Essentially, the generating algorithm works in two steps. In the first step, the generator creates a set of large itemsets – patterns – along with their probabilities. These patterns are used in the second step, transaction generation. For each transaction, its length is drawn from a Poisson distribution, and a pattern is randomly selected from the repository according to its probability. The transaction then comprises the selected pattern, and if the pattern's length is less than the desired length


of the transaction, the rest of the transaction is filled with randomly picked items. The properties of the dataset to be generated can be controlled by a set of parameters, such as the average transaction length, the number of patterns, or the average pattern length. However, Agrawal's generator was designed to generate databases with a context that does not evolve through the course of the database. Although there are ways of forcing the generator to produce databases with different contexts, the contexts and the differences between them cannot be easily determined. Moreover, the actual probability of each itemset is not known, and the only option is to estimate the probability by the itemset's frequency in a sufficiently large sample. This work proposes a way to modify the IBM generator for creating time-varying datasets. The modified generator is able to create two types of data: (1) with abrupt change, where the dataset consists of several stable contexts and the transition between them is abrupt, and (2) with gradual change, where the dataset consists of several stable contexts with an intermediate transition period between every two adjacent contexts. In the transition period, the former context gradually evolves into the latter one over a certain number of blocks. This dissertation adopts an approach where different contexts are generated by varying the number of patterns. Each context consists of two groups of patterns. The first group is constant throughout the database and represents the globally frequent itemsets. The second group is unique to each context, representing, thus, the locally frequent itemsets. Since the transactions arrive in blocks, this situation can be modeled by generating two datasets: the first one with a constant number of patterns and the second one with varying pattern counts. These two datasets are then mixed in such a way that the first half of each block contains transactions from the first database and the second half is filled with transactions from the second database. This situation is

IBM Generator: Abrupt Change

Figure 4.1: Abrupt-change domain created by the IBM-generator. The database comprises two contexts, each consisting of several blocks. The lower part of each block contains transactions defining globally frequent itemsets (constant throughout the database). The transactions in the upper part of each block define locally frequent itemsets.

IBM Generator: Gradual Change

Figure 4.2: Gradual-change domain created by the IBM-generator. The database consists of two contexts and a transition period (in the middle). The varying part of each block in the transition period contains a part from both the contexts. By modifying the relative sizes of these two parts, the context gradually evolves from the former context to the latter one.

illustrated by Figure 4.1. An abrupt change in context is modeled by changing the number of patterns in the second database. Modeling a gradual change requires a little more effort. A transition period between two contexts is generated by mixing varying proportions of transactions from the two contexts. The first block of the transition period comprises 90 % transactions from the first context and 10 % transactions from the second context. The ratio in the second block is 80 % transactions from the first context to 20 % transactions from the second context, and so on until the last block of the transition period consists of 10 % transactions from the first context and 90 % transactions from the second context. This process is illustrated by Figure 4.2.
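A minimal Python sketch of the mixing scheme for the transition period follows; the item ranges and transaction shapes are hypothetical and stand in for the IBM-generator's output.

import random

def gradual_transition(context_a, context_b, n_blocks, block_size, rng=random.Random(0)):
    # Block i of the transition period mixes a decreasing share of context A with an
    # increasing share of context B (90/10, 80/20, ..., 10/90 for nine blocks).
    blocks = []
    for i in range(1, n_blocks + 1):
        share_b = i / (n_blocks + 1)                   # fraction taken from context B
        n_b = int(round(share_b * block_size))
        block = rng.sample(context_a, block_size - n_b) + rng.sample(context_b, n_b)
        rng.shuffle(block)
        blocks.append(block)
    return blocks

# Hypothetical "transactions": context A favours items 1-100, context B items 200-300.
rng_a, rng_b = random.Random(1), random.Random(2)
ctx_a = [frozenset(rng_a.sample(range(1, 101), 5)) for _ in range(5000)]
ctx_b = [frozenset(rng_b.sample(range(200, 301), 5)) for _ in range(5000)]
transition = gradual_transition(ctx_a, ctx_b, n_blocks=9, block_size=1000)
print(len(transition), len(transition[0]))             # 9 blocks of 1000 transactions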

n         number of transactions to be generated
<A, B>    range the items are taken from
λ         average size of a transaction
µ         mean of the itemset-generating normal distribution
σ         standard deviation of the itemset-generating normal distribution

Table 4.1: The input parameters of the new data generator.

4.1.2 RK-Generator

The modifications introduced to the IBM-generator allow the user to create a time-varying dataset. However, the control over the properties of the dataset is only partial. The difference between the contexts is controlled indirectly by means of varying the number of patterns, and the distances between contexts need to be determined experimentally. Furthermore, the probabilities of itemsets are not known in advance and can only be approximated by the itemset frequencies in a sample dataset. This section proposes a new data generator that is capable of overcoming the above-mentioned difficulties. Before the generating process starts, each item is assigned its probability of appearance in a transaction. Since a domain usually contains several thousand items, it would be a tedious task to assign the probabilities manually. Instead, each item is assigned a probability by a normal distribution with user specified parameters. The generator creates each transaction in two steps. First, its length is determined as a random number drawn from a user specified Poisson distribution. Second, the transaction is filled with items according to their probability. A summary of generator inputs is given in Table 4.1. The generator is able to create two types of data. The first type is one with a static concept that is permanent for the entire database. The second type works with a gradually changing concept. The user specifies the parameters of two concepts, and the generator gradually, in ten steps, decreases the fraction of transactions within the first


concept and increases the fraction of transactions within the second concept. Thus, the first subpart of the database contains 95 % of the transactions within the first concept, and only 5 % of the transactions within the second concept. The next subpart consists of 85 % of the transactions within the first concept and 15 % of the transactions within the second concept, until the last, tenth subpart of the database contains 5 % of the transactions within the first concept and 95 % of the transactions within the second concept. The first type of data can be used to model domains with abrupt concept drift, while the second can model the more realistic, gradual shift from one concept to another. The next advantage of this generator is that the probability of an itemset being in a transaction can be exactly calculated from the properties of the generating distribution. The following Lemma gives a formula for calculating the probability p(I) of an itemset I:

Lemma 4.1 Let µ and σ be the parameters of the normal distribution N(µ, σ) the items are drawn from. Let ϕ(x) = 1/sqrt(2πσ²) · e^(−0.5·((x−µ)/σ)²) be the probability density function of the distribution N(µ, σ) and φ(x) be its cumulative distribution function. Let <A, B> be the item ID range from which the items are selected, and let the constant c be defined as c = 1/(φ(B) − φ(A)). Let λ be the average length of transactions.

Then the probability of an item x to be in a transaction T is:

p(x) = ϕ(x) · c · λ  if x ∈ <A, B>,  and p(x) = 0 otherwise.


The probability of an itemset I to appear in a transaction T is:

p(I) = Π_{x∈I} p(x)

Proof: The probability of a number x to be drawn from a normal distribution N(µ, σ) is given by the probability density function ϕ(x). Because we are interested only in items from the range <A, B>, all the items outside of this region are discarded. Therefore, the function ϕ is modified in such a way that both tails outside of the region are forced to be zero. Since the integral of any probability density function must satisfy ∫_{−∞}^{+∞} ϕ(x) dx = 1, the function ϕ(x) must be adjusted by dividing it by the term ∫_A^B ϕ(x) dx = φ(B) − φ(A); hence the constant c.

Next, because a transaction contains λ items on average, the probability that an item is found in a transaction is λ-times its probability to be drawn from the normal distribution. Because all the items in an itemset are chosen independently, the probability that the itemset is found in a transaction is the product of the probabilities of the single items.
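Lemma 4.1 translates directly into code; the following Python sketch evaluates p(x) and p(I) for illustrative parameter values that are not tied to any particular experiment in this chapter.

import math

def item_probability(x, mu, sigma, a, b, lam):
    # p(x) from Lemma 4.1: normal density at x, rescaled by the constant
    # c = 1/(phi(B) - phi(A)) and by the average transaction length lambda.
    if not a <= x <= b:
        return 0.0
    pdf = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / math.sqrt(2 * math.pi * sigma ** 2)
    cdf = lambda t: 0.5 * (1 + math.erf((t - mu) / (sigma * math.sqrt(2))))
    c = 1.0 / (cdf(b) - cdf(a))
    return pdf * c * lam

def itemset_probability(itemset, mu, sigma, a, b, lam):
    # p(I) = product of p(x) over the items of I (independence assumption).
    result = 1.0
    for x in itemset:
        result *= item_probability(x, mu, sigma, a, b, lam)
    return result

print(itemset_probability({4990, 5010}, mu=5000, sigma=500, a=1, b=10000, lam=20))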

4.2 Evaluation Metrics

The next step in evaluating the quality of concept approximation, achieved by the proposed algorithm, is to measure the deviation between the induced concept and the theoretical concept underlying the database. This section proposes two metrics for quantifying the difference between any two concepts in Sections 4.2.1 and 4.2.2, and discusses their advantages and drawbacks in Section 4.2.3.


Section 4.2.4 defines two measures for assessing the quality of context boundary detection.

4.2.1 Frequency Distance

The first distance metric inputs the frequencies of large itemsets in two given models and computes their difference. The differences are subsequently summed up and normalized by the total number of itemsets. Since not all itemsets are common to both models, itemsets not found in a model are assigned zero frequency in that particular model by definition.¹ The formal definition of the metric is as follows:

Definition 4.2.1 Let A, B be two sets of transactions; let L_A and L_B be the sets of large itemsets (induced or theoretical) in A and B. Let L = L_A ∪ L_B. Let f_A(X), f_B(X) be the frequencies of X in A, B, respectively. Let a generalized frequency, f'_Z(X), Z ∈ {A, B}, be defined:

f'_Z(X) = f_Z(X) if X ∈ L_Z, and f'_Z(X) = 0 otherwise.

Let the distance between A, B be:

dist(A, B) = (1/|L|) · sqrt( Σ_{X∈L} ( f'_A(X) − f'_B(X) )² )    (4.1)

¹ A similar distance metric was developed in [23].
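In code, the frequency distance reduces to a few lines; the Python sketch below uses hypothetical itemset frequencies and treats a model simply as a dictionary mapping itemsets to their frequencies.

import math

def frequency_distance(freq_a, freq_b):
    # dist(A, B) from Definition 4.2.1: itemsets missing from one model get
    # generalized frequency 0; the summed squared difference is normalized by |L|.
    union = set(freq_a) | set(freq_b)
    sq = sum((freq_a.get(i, 0.0) - freq_b.get(i, 0.0)) ** 2 for i in union)
    return math.sqrt(sq) / len(union)

a = {("milk",): 0.40, ("milk", "bread"): 0.15}
b = {("milk",): 0.35, ("beer",): 0.20}
print(frequency_distance(a, b))   # distance between two small induced models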

4.2.2 Penrose Distance

The following distance is commonly used in the statistical literature (for instance, [47]). In addition to working with the frequencies of itemsets in the two models, the difference is normalized by the joint standard deviation of the itemset frequency in the two models.² Similarly to the previous distance, the normalized difference is then summed up and divided by the number of itemsets. The Penrose distance is defined as:

Definition 4.2.2 Let A, B be two sets of transactions; let L_A and L_B be the sets of large itemsets (induced or theoretical) in A and B. Let L = L_A ∪ L_B. Let f_A(X), f_B(X) be the frequencies of X in A, B, respectively. Let a generalized frequency f'_Z(X), Z ∈ {A, B} be defined:

f'_Z(X) = f_Z(X) if X ∈ L_Z, and f'_Z(X) = 0 otherwise.

Let the Penrose distance be:

dist_P(A, B) = (1/|L|) Σ_{X∈L} ( f'_A(X) − f'_B(X) )² / ( σ²_A(X) + σ²_B(X) )
             = (1/|L|) Σ_{X∈L} ( f'_A(X) − f'_B(X) )² / ( f'_A(X)(1 − f'_A(X))/|A| + f'_B(X)(1 − f'_B(X))/|B| )    (4.2)

² The standard deviation has been derived in Formula 3.9.

4.2.3 Distance Measure Comparison

(a) Frequency distance   (b) Penrose distance. Both panels plot the approximation error against the number of transactions (x 1000).

Figure 4.3: The course of the frequency distance and the Penrose distance. The dataset consists of one stable context. The graphs depict the accuracy of context approximation after “seeing” certain number of blocks. The error of approximation is measured by the two distances.

The purpose of the distance used as an error measure is to evaluate how well the concept behind a database is approximated. On the one hand, in the case of the frequency distance, the error is the absolute difference between the two concepts. On the other hand, the Penrose distance considers both the absolute difference between the two concepts and the sizes of the two databases. The size of the databases is employed indirectly through the standard deviation – the larger the database, the smaller the standard deviation. Which measure is better suited to the role of error measure is best illustrated by an example. A typical example showing the course of the error throughout the database is depicted in Figure 4.3. The dataset used in this experiment comprises one hundred thousand transactions, all having the same underlying concept. The experiment used a steadily growing window, where in each increment one thousand transactions were added. At each step the whole window was mined for associations, and the induced concept description was compared with the theoretical concept. Both of the above graphs depict the course of the error as measured by the two distance measures


previously defined. It is noteworthy that both error measures were calculated at the same time for the same induced concept description. The left subgraph reveals that the frequency distance decreases as the window comprises more transactions. The concept description becomes more precise with every piece of additional information, and the absolute difference between the concept and the induced concept description becomes smaller. The Penrose distance considers both the concept difference and the size of the window. The reader can see that the distance between the concept and the induced model grows at first and only later starts decreasing, which is contrary to what one would expect from an error measure. The conclusion from this simple experiment is that the frequency distance seems to be more suitable for the purposes of an error measure. On the other hand, the Penrose distance seems to be concerned with the relative difference between two concepts rather than with the absolute difference between two concepts. Hence, the Penrose distance may be more suitable for measuring the distance between two datasets, especially datasets of different sizes. The rest of this chapter uses the frequency distance as the error measure.

4.2.4 Context Boundary Detection

Another important factor giving evidence of the system's quality is its ability to correctly detect context change. Clearly, the main contributor to the system's ability to detect a change is the change detection heuristic used; however, the system's performance can also be affected by the window adjustment operator. A well-performing system needs to be sensitive – detect all true changes – and robust – avoid detecting spurious changes. Since the context boundaries are known in the test domains, it is possible to determine whether a reported change is a real or a spurious one.


This section proposes two measures to assess the quality of change detection. The first one, the percentage of true changes, expresses the sensitivity of the system. The measure divides the number of changes detected where expected by the number of real changes in context. In the abrupt-change domains, a change is expected between every two contexts. In the gradual-change domains, a change is expected in every block of the transitional period. The second measure, the percentage of spurious changes, evaluates the system's robustness. The number of changes detected in stable parts of the database is divided by the number of blocks in these stable parts.

4.3 Component Testing

The aim of this section is to test the proposed change detection heuristics from several aspects. Section 4.3.1 compares the computational costs of the three change detection heuristics. Section 4.3.2 focuses on the history-based heuristic variations and verifies the claims made in Section 3.3.2. The last part of this section, Section 4.3.3, checks the validity of the normality assumptions made by the multivariate heuristic.

4.3.1 Ganti's Heuristic Complexity

Theoretical analysis performed in Section 3.6 suggests that the original Ganti's change detection heuristic slows down the FLORA algorithm approximately 2n-times, typically n = 50, as compared to the other heuristics. This section aims at experimental verification of this claim. The experiment used the RK-generator to create a dataset consisting of three pairs of contexts, each context being 20 blocks long. Figure 4.4 plots the time needed for each step of the algorithm throughout the database, as required by the three groups of


Run Time, RKbig: Time (s) plotted against Block; one curve each for the Ganti, Multivariate, and History heuristics.

Figure 4.4: Run-times of the three groups of change detection heuristics. The graph compares the average time required for processing one block of transactions when using different change detection heuristics.

change detection heuristics. The black solid line represents Ganti's heuristic, the dashed line displays the multivariate heuristic, and the gray solid line depicts the history-based heuristic, represented by the emerged/vanished conservative version. However, very similar results would be obtained had a different distance measure been used. The above-mentioned theoretical analysis predicts that Ganti's change detection heuristic would have 100-times higher computational costs. In the experiment, Ganti's heuristic exhibited approximately two orders of magnitude higher complexity than the two other heuristics, which is in accordance with the prediction.

4.3.2 History-Based Heuristic Assumptions

Apart from Ganti’s heuristic, all the other change detection heuristics are built upon the assumption that a certain variable(s) is/are normally distributed. This section aims at verifying this claim. The normality assumption can be tested in a number of ways. The first could be a visual assessment based on a histogram of a given dataset. In addition, many numerical tests have been developed in the field of Statistics. Perhaps, the most common tests are: χ2 Goodness-of-Fit test, Shapiro-Wilks test, test for kurtosis, and test for skewness. The χ2 Goodness-of-Fit test divides the range of the variable into a series of equally probable classes and compares the number of observations in each class to the number expected. The Shapiro-Wilks test is based upon comparing the quantiles of the fitted normal distribution to the quantiles of the data. The skewness test looks for lack of symmetry in the data. The kurtosis test looks for a distributional shape which is either flatter or more peaked than the normal distribution. The outcome of these tests is the so-called “P-value” which determines how likely it is that the observed values are normally distributed. Typically, a random variable with P-values less than five per cent is considered non-normally distributed. The first experiment checked the assumption that the number of emerged and vanished frequent itemsets, the jaccard coefficient, and the frequency distance are normally distributed in the case the new increment is being compared with the previous block (see Section 3.3.2). The experiment was conducted on domains created by both the RK-generator and the IBM-generator. Each domain comprised 760 blocks of 1, 000 transactions and the minimum threshold was set to one per cent. The histograms of the variable distributions are depicted in Figure A.1 for the RK-generator and Figure A.2 for a dataset created by the IBM-generator. The respective P-values are recorded in

                 Emerged   Vanished   Jaccard   Frequency
G-o-F            0         0          0.204     0.164
Shapiro-Wilks    0.121     0.349      0.977     0.271
Skewness         0.497     0.829      0.604     0.594
Kurtosis         0.389     0.334      0.176     0.297
(a) RK data.

                 Emerged   Vanished   Jaccard   Frequency
G-o-F            0.012     0.001      0.223     0
Shapiro-Wilks    0         0          0.008     0
Skewness         0         0          0.321     0
Kurtosis         0.018     0.002      0.291     0
(b) IBM data.

Table 4.2: P-values of normality tests for the change detection variables.

The P-values that exceed the required minimum of five per cent are highlighted. The results clearly show that the variables in question pass the normality tests for the RK-domain, except for the numbers of emerged and vanished itemsets, which did not pass the χ² Goodness-of-Fit test. However, a closer look at the histograms of these variables reveals that they follow the estimated normal distribution quite closely. Hence, the change detection heuristic is expected to work satisfactorily with any of these distance measures. On the other hand, only the Jaccard coefficient performed well on the IBM-dataset. Also, a visual examination of the histograms (Figure A.2) confirms that the distribution of the variables departs significantly from the normal distribution shape. Experiments in Sections 4.4.1 and 4.4.2 further investigate whether the t-test employed in the change detection heuristic is robust enough to handle the non-normal distribution of the distance measures. Most probably, the test will suffer from many false alarms. And, according to the normality test, the Jaccard coefficient is expected

to achieve the best performance. The second experiment focuses on the case the entire window and the increment are the inputs of the four distance measures. According to Section 3.3.2.4, the Jaccard coefficient is expected to increase and the frequency distance is expected to decrease, as the window grows. No claim has been made regarding the number of emerged and vanished itemsets, however, they are likely to evolve as well. The experiment was run on two domains created by the RK-generator and the IBM-generator. For each domain there were ten datasets, each generated with a different random seed for the data generator. Each dataset comprised 20 blocks of 1, 000 transactions and the minimum threshold was set to one per cent. The growing-window environment was simulated by measuring the difference between the increment and all the previous blocks. The average results from the ten runs are depicted in Figure 4.5. The left-column graphs show the average difference in the RK-domain and the right column depicts the IBM-domain. Each row provides the results for one distance measure. As expected, the values of the Jaccard coefficient increase as a function of the window size and the frequency distance values decrease with the growing window. Similarly, the number of vanished itemsets follows the same trend as the frequency distance and decreases as a function of the window size, however, the number of emerged itemsets decreases in the RK-domain and increases in the IBM-domain. The observed behavior suggests that the change detection heuristic based on the Jaccard coefficient and the frequency distance will have a superior performance in the window-vs.-increment variant as compared to the last-block-vs.-increment one. As for the emerged/vanished heuristic, the performance will most probably be better in the RK-domain and less effective in the IBM-domain.

Panels (top to bottom): Emerged, Vanished, Jaccard Coefficient, Frequency dist. Each panel plots its measure against Block; the left column shows the RK-domain and the right column the IBM-domain.

Figure 4.5: The course of the four distance measures when comparing the growing window with the increment. The left-column graphs deal with the RK-domain and the right-column graphs depict the results for the IBM-domain.

4.3.3 Multivariate Heuristic Assumptions

This experiment is intended to check on the normal distribution of itemset frequencies. Section 3.3.3 describes a change detection heuristic that is based on a multivariate normal distribution of itemset frequencies. Although it is difficult to confirm the multivariate normality of a sample, it is possible to check the necessary condition for multivariate normality, the normality of every single variable. The following experiment used two databases, one for each generator. Both databases consisted of 200 blocks, each containing 1,000 transactions. All the itemsets satisfying the minimum threshold condition, i.e., having support greater than one per cent, were stored for every block. From those itemsets frequent in all the blocks, five were randomly selected for every database. The reason for such a selection was to reduce the number of itemsets to be tested. Furthermore, the observed distribution of an itemset that was infrequent in some of the blocks would incorrectly represent the itemset's underlying distribution. The experiment used the same testing methods as in Section 4.3.2. The frequency histograms of the five selected itemsets are provided in Figures A.3 and A.4. Table 4.3 lists the P-values of the four above-mentioned normality tests: χ² Goodness-of-Fit, Shapiro-Wilks, test for skewness, and test for kurtosis. Again, those values exceeding the required five per cent are printed in bold face. The results of the tests reveal that the frequencies of the randomly selected itemsets approximately follow a normal distribution. With the exception of the χ² Goodness-of-Fit test, the itemsets passed all the other tests. Nevertheless, this test verified only the necessary condition for the desired multivariate normality. Since the distribution of single itemset frequencies is not perfectly normal, one may expect a somewhat lowered accuracy of the multivariate heuristic. Section 3.3.3 proposes three ways to handle the fact that not every itemset frequent

                 Itemset1   Itemset2   Itemset3   Itemset4   Itemset5
G-o-F            0          0          0          0          0
Shapiro-Wilks    0.217      0.254      0.061      0          0.143
Skewness         0.076      0.906      0.522      0.075      0.429
Kurtosis         0.029      0.729      0.652      0.666      0.974
(a) RK data.

                 Itemset1   Itemset2   Itemset3   Itemset4   Itemset5
G-o-F            0          0          0          0          0
Shapiro-Wilks    0.180      0.093      0.060      0.844      0.147
Skewness         0.140      0.079      0.289      0.724      0.773
Kurtosis         0.453      0.351      0.661      0.125      0.448
(b) IBM data.

Table 4.3: P-values of the normality tests for five randomly selected itemsets.

in the window is frequent in the increment as well. The second experiment targets the three possible inputs of the multivariate heuristic:

1. Consider the union of large itemsets in both the models and substitute missing frequency values with zero.
2. Consider the union of large itemsets in both the models and scan the database once more to obtain the missing values.
3. Consider only itemsets that are large in both the models.

The experiment was performed on two datasets, one generated by the RK-generator and the other one created by the IBM-generator, each consisting of one stable context. The datasets contained 100 blocks of 1,000 transactions. Since the type 1 solution clearly exaggerates the difference between the window and the increment, only the type 2 and type 3 solutions were tested. The resulting course of the window size is depicted in

(a) Multivariate, type 2   (b) Multivariate, type 3. Both panels plot Window Size against Block.

Figure 4.6: The immediate window size for the two variants of the multivariate change detection heuristic. The test datasets were created by the RK-generator and the IBM-generator and consisted of one stable context. The solid line represents the RK-domain and the dashed line represents the IBM-domain.

Figure 4.6. Each graph consists of two curves: the solid line represents the RK-domain and the dashed line depicts the IBM-domain. The observed behavior gives evidence that the type 2 variant – using all itemsets frequent in at least one model – strongly overreacts and reports many false alarms. In the test datasets, consisting of 100 blocks, the heuristic reported 75 spurious changes for the RK-domain and 87 spurious changes for the IBM-domain. On the other hand, the type 3 variant – considering only those itemsets frequent in both the models – correctly reported no change. Nevertheless, this experiment only tested the robustness of the proposed heuristic, i.e., its resistance to noise in the data. The ability to correctly detect a change in context is the goal of the next experiments. Since the type 2 variation of the heuristic exhibited a poor performance in this test, the remaining experiments consider only the type 3 variant.

4.4 System-Wide Testing

The goal of this section is to experimentally evaluate the accuracy of context approximation achieved by the proposed system (see Section 3.2) when employing the different change detection heuristics in conjunction with the three window adjustment operators. The dissertation has proposed the following change detection heuristics:

• Ganti's change detection heuristic; the heuristic is based on the use of a bootstrapping technique for obtaining the database's underlying distribution (Section 3.3.1).

• History-based heuristics (Section 3.3.2); a group of heuristics based on the use of the history of previous differences. The thesis proposes four difference measures that can be used for assessing the difference between the window and the increment: emerged/vanished conservative, emerged/vanished progressive, Jaccard coefficient, and frequency distance. Each of the difference measures can operate in two modes: (1) only the last block from the window is compared to the increment, and (2) all blocks from the window are compared to the increment. A schematic sketch of this family of tests is given after this list.

• Multivariate change detection heuristic (Section 3.3.3); the heuristic relies on the fact that the frequency of every itemset follows a binomial distribution. It creates a multivariate model of both the window and the increment and compares these models with a statistical test.
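To make the structure of the history-based heuristics concrete, the following sketch shows one possible skeleton. It is not the dissertation's implementation; the decision rule used here (the current difference exceeding the mean of the stored history by k standard deviations) is an assumption standing in for the exact criterion of Section 3.3.2, and all names are illustrative.

import statistics

class HistoryBasedDetector:
    def __init__(self, difference, k=3.0, min_history=5):
        self.difference = difference   # callable: (window_model, increment_model) -> float
        self.k = k                     # sensitivity of the (assumed) decision rule
        self.min_history = min_history
        self.history = []              # differences observed for previous increments

    def change_suspected(self, window_model, increment_model):
        d = self.difference(window_model, increment_model)
        alarm = False
        if len(self.history) >= self.min_history:
            mean = statistics.mean(self.history)
            std = statistics.pstdev(self.history)
            alarm = d > mean + self.k * std
        # after an alarm the history is restarted; otherwise the new difference
        # is appended and becomes part of the reference history
        self.history = [] if alarm else self.history + [d]
        return alarm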

4.4.1 Abrupt Change

These experiments model the situation where the change between two contexts is abrupt. The experiments worked with pairs of alternating contexts. For both the RK-generator and the IBM-generator, two domains were considered: one with relatively big differences between adjacent contexts (d_RKbig = 252, d_IBMbig = 232) and one with relatively small differences between adjacent contexts (d_RKsmall = 54, d_IBMsmall = 108). The differences were measured by the Penrose distance (equation 4.2); for the RK-generator the distance is calculated from the underlying probabilistic distributions, while for the IBM-generator the distances are obtained as averages of measurements from several random data generations (standard deviations for IBM data were typically between 5 and 8).
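Equation 4.2 is not repeated in this section; the sketch below assumes the standard form of the Penrose distance as described by Manly [47], i.e., the per-variable squared mean difference scaled by the variable's variance and averaged over the number of variables, and the names are illustrative only.

import numpy as np

def penrose_distance(mean1, mean2, variances):
    # Penrose distance between two contexts described by per-variable means
    # and a common per-variable variance (assumed standard form, not eq. 4.2 verbatim).
    mean1, mean2, variances = map(np.asarray, (mean1, mean2, variances))
    p = mean1.size
    return float(np.sum((mean1 - mean2) ** 2 / variances) / p)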

(a) RK-generator big-change domain.
               Context1   Context2
Transactions   20,000     20,000
Items          10,000     10,000
Trans. length  20         20
Distr. mean    4,000      6,000
Distr. std     500        500

(b) RK-generator small-change domain.
               Context1   Context2
Transactions   20,000     20,000
Items          10,000     10,000
Trans. length  20         20
Distr. mean    4,900      5,100
Distr. std     500        500

(c) IBM-generator big-change domain.
               Context1   Context2
Transactions   20,000     20,000
Items          10,000     10,000
Trans. length  20         20
Patterns       500        2,000

(d) IBM-generator small-change domain.
               Context1   Context2
Transactions   20,000     20,000
Items          10,000     10,000
Trans. length  20         20
Patterns       500        800

Table 4.4: Properties of the four test domains. Subtables (a) and (c) describe the big-change domains, and subtables (b) and (d) the small-change domains; (a) and (b) were produced with the RK-generator, (c) and (d) with the IBM-generator. Each domain is composed of two repetitions of the two contexts.

The domain specifications can be found in Table 4.4. For each domain there were 10 datasets, each generated with a different initial seed for the random generator. The data was presented to the algorithm in batches of 1,000 transactions. The minimum support was always set to one per cent. The resulting error was calculated as an average from 10 runs, each run covering one dataset, using the frequency distance measure (equation 4.1). First, the experiments were conducted for the three baseline approaches: the first of them did not use any window; instead, all the transactions seen so far were used for inducing the model; the second approach used a fixed-size window of 2,000 transactions (two blocks); and the third one used a fixed-size window of ten blocks.


The remaining experiments used the proposed system with a variable-size window. All the combinations of change detection heuristics and window adjustment operators were tested. The first setting used the Ganti's change detection heuristic. The next group of experiments tested all the variants of the history-based heuristic: emerged/vanished conservative, emerged/vanished progressive, Jaccard, and frequency; both for a setting comparing the last block with the increment and for a setting comparing the whole window with the increment. The last experiment focused on the multivariate heuristic. The error of every experiment was averaged by summing over the entire test data and dividing by the number of blocks. These averaged errors for all the above-mentioned settings are depicted in Figures 4.8, 4.9, 4.10, and 4.11. Each graph displays the average errors in one of the four domains. The white, light gray, and dark gray bars represent the three possible window adjustment operators; the bars are grouped by the change detection heuristic used, and the abbreviations are explained in Figure 4.7. The next set of graphs, Figures 4.12 through 4.19, reveals the percentage of true and spurious changes (Section 4.2.4) detected by the system when employing the different change detection heuristics and window adjustment operators. The bars follow the same color coding and grouping, and the abbreviations are again explained in Figure 4.7. In this case, the graphs do not contain information for the baseline approaches since these approaches do not involve change detection. The detailed results, depicting the course of the error, are provided in Appendix B, Figure B.1 through Figure B.11. The immediate window size is summarized by Figure B.12 through Figure B.22 in Appendix B.

CHAPTER 4. EXPERIMENTAL EVALUATION

84

nowin: baseline approach where the entire database is used for model induction.
fixwin 2k: baseline approach with a fixed-size window of two blocks.
fixwin 10k: baseline approach with a fixed-size window of ten blocks.
ganti: variable-size window with Ganti's change detection heuristic.
E/V cons: variable-size window, history-based change detection heuristic with emerged/vanished conservative difference measure, only the last block is compared with the increment.
E/V progressive: variable-size window, history-based change detection heuristic with emerged/vanished progressive difference measure, only the last block is compared with the increment.
jaccard: variable-size window, history-based change detection heuristic with Jaccard coefficient difference measure, only the last block is compared with the increment.
frequency: variable-size window, history-based change detection heuristic with frequency difference measure, only the last block is compared with the increment.
cons. win: variable-size window, history-based change detection heuristic with emerged/vanished conservative difference measure, all the window blocks are compared with the increment.
prog. win: variable-size window, history-based change detection heuristic with emerged/vanished progressive difference measure, all the window blocks are compared with the increment.
jaccard win: variable-size window, history-based change detection heuristic with Jaccard coefficient difference measure, all the window blocks are compared with the increment.
freq. win: variable-size window, history-based change detection heuristic with frequency difference measure, all the window blocks are compared with the increment.
multivariate: variable-size window, multivariate change detection heuristic.

Figure 4.7: The abbreviations used in all graphs depicting results of the system-wide experiments.


Figure 4.8: Comparison of average heuristic errors on RK-generator big-change data. The error is calculated as an average over all the blocks in the dataset. The three leftmost bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti's heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the error for the multivariate heuristic.


Figure 4.9: Comparison of average heuristic errors on RK-generator small-change data. The error is calculated as an average over all the blocks in the dataset. The three leftmost bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the error for the multivariate heuristic.


Figure 4.10: Comparison of average heuristic errors on IBM-generator big-change data. The error is calculated as an average over all the blocks in the dataset. The three left-most bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti's heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the error for the multivariate heuristic.


Figure 4.11: Comparison of average heuristic errors on IBM-generator small-change data. The error is calculated as an average over all the blocks in the dataset. The three left-most bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the error for the multivariate heuristic.


Figure 4.12: Percentage of detected true changes on RK-generator big-change data. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.13: Percentage of spurious changes on RK-generator big-change data. This graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.14: Percentage of detected true changes on RK-generator small-change data. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.15: Percentage of spurious changes on RK-generator small-change data. This graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.16: Percentage of detected true changes on IBM-generator big-change data. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.17: Percentage of spurious changes on IBM-generator big-change data. This graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.18: Percentage of detected true changes on IBM-generator small-change data. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


Figure 4.19: Percentage of spurious changes on IBM-generator small-change data. This graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the results for the multivariate heuristic.


The reader can see that the no-window and fixed-window approaches performed poorly on the big-change domains and were usually outperformed by the variable-size window algorithms. On the small-change domains, however, the difference was not very pronounced. Interestingly, having a mix of several contexts (the no-window approach) produces a lower error than having too short a window (the fixed-size window of two blocks). The Ganti's change detection showed a very good performance on all the domains. However, the price for this is a very high complexity (see Section 4.3.1). The history-based heuristics exhibit a wide range of behaviors, depending on the distance measure used. The emerged/vanished conservative version is as good as Ganti's heuristic on the RK-domain; however, it reaches a very high error on the IBM-domain. This can be explained by Figure 4.20. The left graph displays the values of the emerged and vanished variables on the RK-domain. The right graph plots the two variables on the IBM-domain. Both the test datasets consisted of four contexts, each comprising 20 blocks. In the RK-domain, both the variables signal the context change at the same moment. However, in the IBM-domain, only one variable marks each change. Since the conservative scenario requires the change to be detected in both the variables, no change alert is ever issued in the IBM-domain and the window grows without limits. This issue is solved by the emerged/vanished progressive heuristic. Nevertheless, here the number of false alarms piles up significantly. This, in turn, leads to an unnecessarily short window and a high error. In the last-block-vs.-increment version, the Jaccard and frequency distance measures trigger many false alarms as well and suffer from a higher error. However, the window-vs.-increment scenario brings a substantial advantage. The two heuristics considerably improve their detection accuracy and, in some cases, even outperform the Ganti's heuristic. The multivariate heuristic achieves very good results on all domains with the exception of the RK-small-change domain.


Figure 4.20: The numbers of emerged/vanished frequent itemsets in both the test domains: the RK-domain and the IBM-domain. The test data consisted of four contexts, each twenty blocks long. The numbers of emerged and vanished itemsets were taken as an average from ten runs.

Graphs depicting the window size reveal that the multivariate heuristic does not detect any changes in the RK-small-change domain, thus leading to an increased error. As for the choice of the window adjustment operator, in the RK-domain the reluctant and opportunistic operators seem to be the best choice for heuristics suffering from many spurious changes (the history-based last-block-vs.-increment heuristics). In contrast, the harsh operator gives the best results for stable heuristics (Jaccard and frequency in the window-vs.-increment scenario, and the multivariate heuristic). In this case the reluctant and opportunistic operators tend to have a considerably higher error at the time of context change because they wait one block longer for the change to be confirmed. This waiting causes the window to contain outdated transactions, which leads to an incorrect concept approximation. On the other hand, the IBM-domain seems to require the use of the reluctant or the opportunistic operator, with the reluctant operator achieving a slightly lower error.

4.4.2 Gradual Change

The second set of experiments deals with a more realistic domain, where the database gradually evolves from one concept to another. The experiment conditions were similar to those of the previous set of experiments (Section 4.4.1). The algorithms were run on two domains, a big-change domain and a small-change domain, for both the RK-generator and the IBM-generator. All the domains comprised two alternating contexts and, in addition to the setup of the former experiment, there was a region of twenty thousand transactions between any two contexts that gradually evolved the concept from the old one to the new one (see Section 4.1 for more details). The error of every experiment was averaged by summing over the entire test data and dividing by the number of blocks. These averaged errors for all the possible combinations of change detection heuristics and window adjustment operators are depicted in Figures 4.21, 4.22, 4.23, and 4.24. Each graph displays the average errors in one of the four domains. The white, light gray, and dark gray bars represent the three possible window adjustment operators; the bars are grouped by the different change detection heuristics used, and the abbreviations are explained in Figure 4.7. The next set of graphs, Figures 4.25 through 4.32, reveals the percentage of true and spurious changes (Section 4.2.4) detected by the system when employing the different change detection heuristics and window adjustment operators. The bars follow the same color coding and grouping, and the abbreviations are again explained in Figure 4.7. In this case, the graphs do not contain information for the baseline approaches since these approaches do not involve change detection. Figure C.1 through Figure C.11 in Appendix C give details about the experiment's error, and Figure C.12 through Figure C.22 in Appendix C reveal the window's size throughout the experiment.


window’s size throughout the experiment. The conclusions from this set of experiments are somewhat different than those from the previous experiment (Section 4.4.1). A non-window approach still produces a high error; however a fixed-size window becomes more acceptable in this setting. Especially, a long window can accommodate enough transactions to satisfactorily approximate the current context, and since the concept changes gradually, the transactions in the window are highly relevant, even in the changing environment. Nonetheless, a variable-size window can still bring a performance gain. The Ganti’s heuristic proved a little less sensitive, especially in the RK-small-change domain where no change was detected. Also, in the IBM-domains the algorithm was prone to grouping several contexts together. Similarly, the emerged/vanished conservative heuristic was not able to recognize context change in the IBM-domains. The window-vs.-increment version exhibits an interesting behavior in the RK-domains. Contexts tend to be grouped together in the way that each group contains a portion of the previous transition period, the stable context, and a part of the following transition period. The emerged/vanished progressive heuristic is, on the other hand, too sensitive and keeps the window shorter than necessary. However, the window-vs-increment variation improves its performance. The Jaccard and frequency heuristics, again, work better in the window-vs-increment version and give the best results of all the heuristics. The multivariate heuristic, however, is able to detect only very big changes (the RK-big-change domain) and steadily grows the window in other cases. As for the window adjustment operator, each domain seems to need a different operator. In the RK-domain the reluctant operator brings an advantage in the form of lower error, whereas in the IBM-domain, the harsh operator wins in most cases.


Figure 4.21: The system’s error on RK-generator data with gradual big change when employing all the proposed change detectors and window adjustment operators. The three left-most bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic. RKsmall-grad, Average Error 0.012 0.01


Figure 4.22: The system’s error on RK-generator data with gradual small change when employing all the proposed change detectors and window adjustment operators. The three left-most bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic.


Figure 4.23: The system’s error on IBM-generator data with gradual big change when employing all the proposed change detectors and window adjustment operators. The three left-most bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic. IBMsmall-grad, Average Error 0.012 0.01


Figure 4.24: The system’s error on IBM-generator data with gradual small change when employing all the proposed change detectors and window adjustment operators. The three left-most bars represent the baseline approaches: no window, short fixed-size window, and long fixed-size window. The next bar shows the error for the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic.


Figure 4.25: Percentage of detected true changes on RK-generator big-change data with gradual change. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic.


Figure 4.26: Percentage of spurious changes on RK-generator big-change data with gradual change. The graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the multivariate heuristic.


Figure 4.27: Percentage of detected true changes on RK-generator data with gradual small change. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the historybased heuristic. The right-most bar represents the multivariate heuristic.


Figure 4.28: Percentage of spurious changes on RK-generator small-change data with gradual change. The graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic.


Figure 4.29: Percentage of detected true changes on IBM-generator data with gradual big change. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the multivariate heuristic.


Figure 4.30: Percentage of spurious changes on IBM-generator big-change data with gradual change. The graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar represents the multivariate heuristic.


Figure 4.31: Percentage of detected true changes on IBM-generator data with gradual small change. The points of change were known in advance and this graph shows what percentage of them was detected by the system. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar shows the multivariate heuristic.


Figure 4.32: Percentage of spurious changes on IBM-generator small-change data with gradual change. The graph evaluates the system’s robustness against spurious changes. The percentage of spurious changes is defined as the relative frequency of reported changes in blocks where no change was expected. The left-most bar represents the Ganti’s heuristic. The next eight bars deal with all the variations of the history-based heuristic. The right-most bar depicts the multivariate heuristic.

4.4.3 Discussion

The experiments described in this section examined the behavior of different variations of the proposed system. The first conclusion is that it is indeed worthwhile to introduce a variable-size window. The next important lesson is that the Ganti's heuristic achieves a quite good approximation of the current context; however, the same, and sometimes better, results can be obtained using a much cheaper alternative. The first experiment set showed that the Ganti's change detector, the window-vs.-increment Jaccard and frequency detectors, and the multivariate detection heuristic performed equally well, with the exception of the RK-small-change domain, where the multivariate heuristic failed to detect context changes. The second set of experiments revealed that the overall winners are the Ganti's and the window-vs.-increment Jaccard and frequency heuristics. Since the Ganti's change detector requires up to a hundred times higher computational costs (Section 4.3.1), the best choice is either the Jaccard or the frequency detector in the window-vs.-increment version. The results of the experiments also suggest that the choice of the window adjustment operator depends on the change detection heuristic used and the domain mined for associations. In case the heuristic is prone to spurious changes, it is better to use either the reluctant or the opportunistic operator, with the reluctant operator having a slight edge. In case the heuristic is stable enough, it is better to use the harsh operator.

CHAPTER 5: CONCLUSION AND FUTURE WORK

This dissertation has explored the field of association mining with particular attention to association mining in time-varying domains. The thesis summarizes the current state of the art and continues with proposing a new approach to deal with domains in which concepts evolve over time. The proposed system achieves high accuracy of context approximation in time-varying domains due to the following features: a variable-size window of the most recent transactions; three heuristics (Ganti's, history-based, and multivariate) that detect a change in concept; and three operators (harsh, reluctant, and opportunistic) that adjust the window size after a change has been detected. The proposed method has been experimentally verified in two types of domains – abrupt-change and gradual-change domains.

5.1 System Summary

The dissertation proposes a novel window-based system for dealing with time-varying databases. A set of the most recent transactions is retained in a window and the window is then used for inducing a model of the current context, where a context is defined as a continuous set of transactions with a stable set of frequent itemsets. In periods of stability the window size grows, which yields a higher accuracy of the induced model, and the window shrinks when a change in context is suspected. The thesis adopts and modifies a change detection heuristic proposed in [23] and introduces two statistics-based heuristics for context change detection. The first heuristic, Ganti's heuristic, is based on a bootstrapping technique that allows for obtaining the context's underlying distribution. This distribution is used to assess the significance of the difference between the window and the increment – a block of new transactions. The second heuristic, history-based, compares the difference between the window and the increment to a history of previous differences.


A change in context is suspected if the current difference is significantly higher than the differences observed in the course of the previous increments. To assess the difference between the window and the increment, the thesis proposes four difference measures: the first two are based on the numbers of emerged and vanished itemsets, i.e., itemsets that are frequent in the increment but not in the window, and itemsets frequent in the window but not frequent in the increment, respectively; the third measure uses the Jaccard coefficient; and the fourth measure employs the absolute difference in itemset supports. The third change detection heuristic, multivariate, builds on properties of the association mining task. The dissertation demonstrates that the supports of frequent itemsets follow a multivariate binomial distribution. The heuristic employs a statistical test to determine whether the distributions underlying the window and the increment are identical or not. The size of the window grows in periods of stability and shrinks when a context change is detected. The number of transactions removed from the window is driven by a window adjustment operator. The dissertation suggests three such operators. The first operator, harsh, removes all transactions from the window when a change is detected. The second operator, reluctant, does not react upon the first change alert, but if a second change alert is issued when the next increment arrives, all transactions except the last block are removed from the window. This operator was designed to overcome possible spurious changes. The third operator, opportunistic, removes 50% of the transactions from the window upon the first change and removes the rest of the window content when a second change alert comes in the next step.
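For illustration, the four difference measures admit a very compact formulation. The sketch below is not the dissertation's code: win and inc are assumed to map each frequent itemset to its support, and the normalization used for the frequency distance (equation 4.1 is not reproduced here) is an assumption.

def emerged_vanished(win, inc):
    emerged = len(set(inc) - set(win))     # frequent in the increment only
    vanished = len(set(win) - set(inc))    # frequent in the window only
    return emerged, vanished               # conservative: both must signal a change;
                                           # progressive: either one is enough

def jaccard_difference(win, inc):
    # one minus the Jaccard coefficient of the two sets of frequent itemsets
    union = len(set(win) | set(inc))
    if union == 0:
        return 0.0
    return 1.0 - len(set(win) & set(inc)) / union

def frequency_distance(win, inc):
    # average absolute difference of supports over the itemsets frequent in both
    shared = set(win) & set(inc)
    if not shared:
        return 0.0
    return sum(abs(win[i] - inc[i]) for i in shared) / len(shared)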

5.2 Conclusion

The performance of the proposed system was tested on two types of synthetically generated data. The first type was the widely accepted IBM data generator, as introduced in [3]. Although the generator was not originally designed for creating time-varying data, the dissertation proposed a modification, by generating the database piece-wise with different parameter settings, that allowed such data to be generated. Since the modified generator still did not allow full control over the data properties, a new generator – the RK-generator – designed specifically for time-varying domains was introduced. The data was generated in a way that allowed one to control its properties in terms of the difference between any two contexts. In addition, it was possible to theoretically calculate the support of any itemset, or to compute the set of all frequent itemsets, in a given context. Both data generators were able to create two types of data: (1) with abrupt changes, and (2) with gradual changes, where the change between two contexts took several blocks. The system was then evaluated under different settings. The accuracy of the current context approximation was assessed by comparing the obtained set of frequent itemsets to the expected set (for the IBM-generator the expected set contained frequent itemsets obtained from a sufficiently large data sample; the expected set for the RK-generator was calculated theoretically). The experiments evaluated the accuracy and context boundary detection achieved by the system when employing the different change detection heuristics in combination with the three window adjustment operators. The experiments were performed on both abrupt and gradual types of data. The results of the experiments show that for the abrupt-change data (for both the RK-generator and the IBM-generator) the system maintained the highest overall accuracy of context approximation when using the Ganti's heuristic, the history-based heuristic with the Jaccard and frequency difference measures (when comparing the entire window with the increment), and the multivariate change detection heuristic.


These heuristics were able to detect most of the true changes in context while reporting a minimum of spurious changes. The other heuristics were also capable of detecting most of the changes; however, they reported increased amounts of spurious changes as well. As for the window size adjustment, the reluctant operator seems to perform the best on a wide range of datasets. However, in the cases where the change detection heuristic gives a stable performance (with a minimum of spurious changes), the harsh operator seems to be more appropriate. Generally speaking, the reluctant and opportunistic operators can, to some extent, eliminate the effect of spurious changes; however, these operators introduce a higher error at the time of context change. The results for the gradual-change domain are somewhat different. Application of the multivariate change detection heuristic caused the system to achieve low accuracy in context approximation, since the heuristic was not able to detect any change. The use of the history-based heuristic with the Jaccard and frequency difference measures still exhibited the highest accuracy and, in the small-change domains, the system employing the history-based heuristic was able to outperform the one with the Ganti's heuristic. The choice of the window adjustment operator depends on the explored domain. In the RK-domain the reluctant operator helped to achieve the highest accuracy, while in the IBM-domain the harsh operator served the best.

5.3 Future Work

This section discusses some suggestions for work beyond the scope of the dissertation.


Other Test Beds. One may object that the test data used in the experiments are far removed from real-world data, and therefore the experiments' results do not say much about the system's performance under realistic conditions. The best proof of the system's correctness would be a test with real-world data. However, real-world data are very often proprietary, which means they are difficult and expensive to obtain. Due to this fact, a real-world test is beyond the scope of the dissertation.

Recurring Contexts. Common experience suggests that many of the patterns in real life change in cycles. The four seasons follow each other in a fixed order and then cyclically reappear. Many biological or economic systems go through cycles of development with recurring patterns. The reuse of knowledge from the previous cycles may be employed to shorten the learning time needed for adaptation to a new context. This reasoning is inspired by the more advanced implementation of the FLORA framework – the FLORA3 algorithm [72]. Some preliminary experiments, illustrated by Figure 5.1, suggest that this approach indeed has the potential to further improve the system's accuracy of context approximation. The black line shows the error rate of the current system; the algorithm has to "re-learn" the context approximation each time a change is detected. The gray line, on the other hand, depicts the situation where the system remembers the properties of every encountered context. In the model domain the first two contexts are unique and the system needs to learn the model from the beginning in both cases. However, the third and subsequent contexts are very similar to the first two contexts, thus allowing the FLORA3 approach to achieve a considerably faster adaptation to the new contexts.

Hierarchical Changes. In addition to recurring contexts, the patterns may reappear in cycles with different periods. Some of the cycles may be relatively short (mere days in some cases) and not very significant.


Figure 5.1: System with and without reusing the knowledge of previous contexts. The dataset consists of three repetitions of a pair of contexts. The existing system needs to “learn” context approximation each time a change is detected. The new system is able to reuse some knowledge of previous contexts and, in case of a recurring context, start with significantly better initial approximation.

Other cycles may be longer (weeks, months, or years) and introduce changes that are more significant. The intended direction of future research will explore this phenomenon. The expected outcome of this research strand should be a tree-like hierarchy of contexts, with the finest granulation in the leaves and the whole domain in the root.

Multivariate Change Detection Heuristic Fine-Tuning. The way the multivariate heuristic was derived assumes certain simplifications that might harm its performance. The following issues require further study:

1. The assumption that itemset frequencies are pairwise independent (equation 3.12) does not always hold. For instance, itemsets X and Y with X ⊂ Y will most probably be correlated. However, the number of itemsets of this kind tends to be relatively small compared to all the large itemsets.

2. The multivariate heuristic, as introduced in this work, employs a χ2 test for comparing the two multivariate normal distributions. However, the statistical literature suggests some other tests for this comparison (for instance, Hotelling's T2 test; a sketch of this alternative is given after this list).


3. The approximation of the binomial distribution with a normal distribution (Lemma 3.2) is limited by a minimal frequency constraint. Because association mining tends to deal with low-support itemsets, the approximation may not be close enough. Perhaps a non-parametric test would be more appropriate in this case.
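As an illustration of the alternative mentioned in item 2, the following sketch computes the classical two-sample Hotelling's T2 statistic and its F-transformed P-value. It is not part of the dissertation's implementation; it assumes equal covariance matrices and enough blocks for the pooled covariance matrix to be invertible, and the names are illustrative only.

import numpy as np
from scipy import stats

def hotelling_t2(X, Y):
    """X, Y: (n_samples, p) arrays of per-block itemset frequency vectors."""
    n1, p = X.shape
    n2, _ = Y.shape
    d = X.mean(axis=0) - Y.mean(axis=0)
    # pooled covariance matrix (requires n1 + n2 - 2 > p for invertibility)
    S = ((n1 - 1) * np.cov(X, rowvar=False) +
         (n2 - 1) * np.cov(Y, rowvar=False)) / (n1 + n2 - 2)
    S = np.atleast_2d(S)
    t2 = (n1 * n2) / (n1 + n2) * d @ np.linalg.solve(S, d)
    # transform T^2 to an F statistic with (p, n1 + n2 - p - 1) degrees of freedom
    f = t2 * (n1 + n2 - p - 1) / (p * (n1 + n2 - 2))
    p_value = stats.f.sf(f, p, n1 + n2 - p - 1)
    return t2, p_value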

BIBLIOGRAPHY

[1] C. Aggarwal, C. Procopius, and P. Yu, “Finding localized associations in market basket data,” IEEE Transactions on Knowledge and Data Engineering, vol. 14, pp. 51–62, 2002. [2] C. Aggarwal and P. Yu, “A new approach to online generation of association rules,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, pp. 527–540, 2002. [3] R. Agrawal, T. Imielinski, and A. Swami, “Mining association rules between sets of items in large databases,” in Proceedings of ACM SIGMOD, 1993, pp. 207–216. [4] R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Verkamo, “Fast discovery of association rules,” in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Eds. AAAI Press, 1996, ch. 12, pp. 307–328. [5] R. Agrawal and G. Psaila, “Active data mining,” in First International Conference on Knowledge Discovery and Data Mining (KDD-95), U. Fayad and R. Uthurusamy, Eds. Montreal, Quebec, Canada: AAAI Press, Menlo Park, CA, USA, 1995, pp. 3–8. [Online]. Available: citeseer.nj.nec.com/agrawal95active.html [6] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules,” in Proceedings of the 20th International Conference on Very Large Databases, VLDB, 1994, pp. 487–499. [7] J. Ale and G. Rossi, “An approach to discovering temporal association rules,” in Proceedings of the 2000 ACM symposium on Applied Computing. Como, Italy: ACM press, 2000, pp. 294–300. [8] P. Auer and M. Warmuth, “Tracking the best disjunction,” Machine Learning, vol. 32, pp. 127–150, 1998. [9] N. Ayan, A. Tansel, and M. Arkun, “An efficient algorithm to update large itemsets with early pruning,” in Knowledge Discovery and Data Mining, 1999, pp. 287–291. [Online]. Available: citeseer.nj.nec.com/article/ayan99efficient.html [10] R. Bayardo and R. Agrawal, “Mining the most interesting rules,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, 1999, pp. 145–154.


[11] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Belmont, California: Wadsworth, 1984. [12] S. Brin, R. Motwani, J. Ullman, and S. Tsur, “Dynamic itemset counting and implication rules for market basket data,” in SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13–15, 1997, Tucson, Arizona, USA, J. Peckham, Ed. ACM Press, May 1997, pp. 255–264. [Online]. Available: http://citeseer.nj.nec.com/brin97dynamic.html [13] S. Chakrabarti, S. Sarawagi, and B. Dom, “Mining surprising patterns using temporal description length,” in Twenty-Fourth International Conference on Very Large databases VLDB’98, A. Gupta, O. Shmueli, and J. Widom, Eds. New York, NY: Morgan Kaufmann, 1998, pp. 606–617. [Online]. Available: citeseer.nj.nec.com/chakrabarti98mining.html [14] M. Chen, J. Park, and P. Yu, “Data mining for path traversal patterns in a web environment,” in Sixteenth International Conference on Distributed Computing Systems, 1996, pp. 385–392. [Online]. Available: citeseer.nj.nec.com/article/chen96data.html [15] D. Cheung, J. Han, V. Ng, and C. Wong, “Maintenance of discovered association rules in large databases,” in Proceedings of the 12th International Conference on Data Engineering, New Orleans, Louisiana, 1996, pp. 106–114. [Online]. Available: citeseer.nj.nec.com/cheung96maintenance.html [16] D. Cheung, S. Lee, and B. Kao, “A general incremental technique for maintaining discovered association rules,” in Database Systems for Advanced Applications, 1997, pp. 185–194. [Online]. Available: citeseer.nj.nec.com/cheung97general.html [17] D. Cheung, V. Ng, A. Fu, and W. Fu, “Efficient mining of association rules in distributed databases,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 911–922, 1996. [18] P. Domingos, “Exploiting context in feature selection,” in 13th International Conference on Machine Learning (ICML96) – Workshop on Learning in Context-Sensitive Domains, M. Kubat and W. Widmer, Eds., Bari, Italy, July 1996, pp. 15–20. [19] U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, “From data mining to knowledge discovery in databases,” AI Magazine, vol. 17, pp. 37–54, 1996. [Online]. Available: citeseer.nj.nec.com/fayyad96from.html


[20] W. Frawley, G. Piatetsky-Shapiro, and C. Matheus, “Knowledge discovery in databases - an overview,” AI Magazine, vol. 13, pp. 57–70, 1992. [Online]. Available: citeseer.nj.nec.com/frawley92knowledge.html [21] J. Freeman, “An unknown change point and goodness of fit,” The Statistician, vol. 35, pp. 335–344, 1986. [22] V. Ganti, J. Gehrke, and R. Ramakrishnan, “Demon: Mining and monitoring evolving data,” IEEE Transactions on Knowledge and Data Engineering, vol. 13, no. 1, pp. 51–63, January/February 2001. [23] V. Ganti, J. Gehrke, R. Ramakrishnan, and W.-Y. Loh, “A framework for measuring differences in data characteristics,” Journal of Computer and System Sciences, vol. 64, pp. 542–578, 2002. [24] A. Hafez, J. Deogun, and V. Raghavan, “The item-set tree: A data structure for data mining,” in Data Warehousing and Knowledge Discovery: First International Conf., (DaWaK’99), M. Mohania and A. M. Tjoa, Eds. Florence, Italy: Springer, Aug. 1999, pp. 183–192. [Online]. Available: http://www.cacs.louisiana.edu/Publications/Raghavan/hdr99.pdf [25] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in 2000 ACM SIGMOD Intl. Conference on Management of Data, W. Chen, J. Naughton, and P. Bernstein, Eds. ACM Press, May 2000, pp. 1–12. [Online]. Available: citeseer.nj.nec.com/han99mining.html [26] M. Harries, “Batch learning in domains with hidden changes in context,” Ph.D. dissertation, School of Computer Science and Engineering, University of New South Wales, Australia, 1999. [27] M. Harries, C. Sammut, and K. Horn, “Extracting hidden context,” Machine Learning, vol. 32, pp. 101–126, 1998. [28] D. Helmbold and P. Long, “Tracking drifting concepts using random examples,” in Proceedings of the 33rd Annual Symposium on Computational Learning Theory. IEEE Computer Science Press, 1991, pp. 493–502. [29] D. Helmbold and P. Long, “Tracking drifting concepts by minimizing disagreements,” Machine Learning, vol. 14, pp. 27–45, 1994. [30] M. Herbster and M. Warmuth, “Tracking the best expert,” Machine Learning,


vol. 32, pp. 151–178, 1998. [31] A. Jain and R. Dubes, Algorithms for Clustering Data. Englewood Cliffs, New Jersey: Prentice Hall, 1988. [32] J. Jobson, Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods. Springer-Verlag, 1992. [33] R. Klinkenberg, “Using labeled and unlabeled data to learn drifting concepts,” in 7th International Joint Conference on AI – Workshop on Learning from Temporal and Spatial Data, M. Kubat and K. Morik, Eds., Seattle, Washington, 2001, pp. 16–24. [34] R. Klinkenberg, “Learning drifting concepts: Example selection vs. example weighting,” Intelligent Data Analysis, vol. 7, 2003. [35] M. Kubat, “Floating approximation in time-varying knowledge bases,” Pattern Recognition Letters, vol. 10, pp. 223–227, 1989. [36] M. Kubat, “A machine learning based approach to load balancing in computer networks,” Cybernetics and Systems, vol. 23, pp. 389–400, 1992. [37] M. Kubat, A. Hafez, V. Raghavan, J. Lekkala, and W. Chen, “Itemset trees for targeted association mining,” IEEE Transactions on Knowledge and Data Engineering, 2002. [Online]. Available: http://www.cacs.louisiana.edu/Publications/Raghavan/KHLR02.pdf [38] M. Kubat and G. Widmer, “Adapting to drift in continuous domains,” in Proceedings of the European Conference on Machine Learning, ECML 95, Heraklion, Crete, 1995, pp. 307–310. [39] A. Kuh, T. Petsche, and R. Rivest, “Learning time-varying concepts,” Advances in Neural Information Processing Systems, vol. 3, pp. 183–189, 1991. [40] A. Kuh, T. Petsche, and R. Rivest, “Incrementally learning time-varying half-planes,” Advances in Neural Information Processing Systems, vol. 4, pp. 920–927, 1992. [41] C. Lee and M. Chen, “Progressive partition miner: An efficient algorithm for mining general temporal association rules,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 4, pp. 1004–1016, July/August 2003.


[42] Y. Li, P. Ning, X. Wang, and S. Jajodia, “Discovering calendar-based temporal association rules,” in TIME, 2001, pp. 111–118. [Online]. Available: citeseer.nj.nec.com/article/li01discovering.html [43] B. Liu, W. Hsu, and Y. Ma, “Pruning and summarizing the discovered associations,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, 1999, pp. 125–134. [44] B. Liu, W. Hsu, and Y. Ma, “Discovering the set of fundamental rule changes,” in Knowledge Discovery and Data Mining, 2001, pp. 335–340. [Online]. Available: citeseer.nj.nec.com/hsu01discovering.html [45] M. Maloof, “Progressive partial memory learning,” Ph.D. dissertation, School of Information Technology and Engineering, George Mason University, Fairfax, Virginia, 1996. [46] M. Maloof and R. Michalski, “Selecting examples for partial memory learning,” Machine Learning, vol. 41, pp. 27–52, 2000. [47] B. Manly, Multivariate Statistical Methods. Chapman & Hall, 1994. [48] S. Matwin and M. Kubat, “The role of context in concept learning,” in Workshop Notes of the Workshop on Learning in Context-Sensitive Domains, Bari, Italy, July 1997, pp. 1–5. [49] U. Menzefricke, “A Bayesian analysis of a change in the precision of a sequence of independent normal random variables at an unknown time point,” Applied Statistics, vol. 30, no. 2, pp. 141–146, 1981. [50] T. Mitchell, R. Caruana, D. Freitag, J. McDermott, and D. Zabowski, “Experience with a learning personal assistant,” Communications of the ACM, vol. 37, pp. 81–91, 1994. [51] D. Moore and G. McCabe, Introduction to the Practice of Statistics. W. H. Freeman and Company, 1989. [52] B. Nag, P. Deshpande, and D. DeWitt, “Using a knowledge cache for interactive discovery of association rules,” in Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, California, 1999, pp. 244–253.


[53] B. Özden, S. Ramaswamy, and A. Silberschatz, “Cyclic association rules,” in ICDE, 1998, pp. 412–421. [Online]. Available: citeseer.nj.nec.com/ozden98cyclic.html [54] B. Padmanabhan and A. Tuzhilin, “Small is beautiful: Discovering the minimal set of unexpected patterns,” in Knowledge Discovery and Data Mining, 2000, pp. 54–63. [Online]. Available: citeseer.nj.nec.com/tuzhilin00small.html [55] J.-S. Park, M. Chen, and P. Yu, “An effective hash based algorithm for mining association rules,” in Proceedings of the ACM SIGMOD, 1995, pp. 175–186. [56] A. Pettitt, “A non-parametric approach to the change-point problem,” Applied Statistics, vol. 28, no. 2, pp. 126–135, 1979. [57] J. Quinlan, C4.5: Programs for Machine Learning. San Mateo, California: Morgan Kaufmann, 1993. [58] V. Raghavan and A. Hafez, “Dynamic data mining,” in Proceedings of the 13th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, IEA/AIE, New Orleans, Louisiana, June 2000, pp. 220–229. [59] C. Rainsford, M. Mohania, and J. Roddick, “A temporal windowing approach to the incremental maintenance of association rules,” in Eighth International Database Workshop, Data Mining, Data Warehousing and Client/Server Databases (IDW’97), J. Fong, Ed. Hong Kong: Springer Verlag, 1997, pp. 78–94. [60] S. Ramaswamy, S. Mahajan, and A. Silberschatz, “On the discovery of interesting patterns in association rules,” in The VLDB Journal, 1998, pp. 368–379. [61] A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases,” in The VLDB Journal, 1995, pp. 432–444. [Online]. Available: citeseer.nj.nec.com/sarasere95efficient.html [62] A. Silberschatz and A. Tuzhilin, “What makes patterns interesting in knowledge discovery systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 8, pp. 970–974, 1996. [Online]. Available: citeseer.nj.nec.com/silberschatz96what.html [63] M. Srivastava and K. Worsley, “Likelihood ratio tests for a change in the multivariate normal mean,” Journal of the American Statistical Association, vol. 81, no. 393, pp. 199–204, March 1986.

BIBLIOGRAPHY

115

[64] S. Thomas, S. Bodagala, K. Alsabti, and S. Ranka, “An efficient algorithm for the incremental updation of association rules in large databases,” in Knowledge Discovery and Data Mining, 1997, pp. 263–266. [Online]. Available: citeseer.nj.nec.com/thomas97efficient.html [65] H. Toivonen, “Sampling large databases for association rules,” in The VLDB Journal, 1996, pp. 134–145. [66] P. Turney, “The identification of context-sensitive features: A formal definition of context for concept learning,” in 13th International Conference on Machine Learning (ICML96) – Workshop on Learning in Context-Sensitive Domains, M. Kubat and W. Widmer, Eds., Bari, Italy, July 1996, pp. 53–59. [67] P. Turney, “The management of context-sensitive features: A review of strategies,” in 13th International Conference on Machine Learning (ICML96) – Workshop on Learning in Context-Sensitive Domains, M. Kubat and W. Widmer, Eds., Bari, Italy, July 1996, pp. 53–59. [68] G. Widmer, “Combining robustness and flexibility in learning drifting concepts,” in Proceedings of the 13th European Conference on Artificial Intelligence, Wiley, Chisester, UK, 1994, pp. 368–372. [69] G. Widmer, “Tracking concept drift through meta-learning,” Machine Learning, vol. 27, pp. 259–286, 1997. [70] G. Widmer and M. Kubat, “Learning flexible concepts from streams of data: Flora2,” in Proceedings of the 10th European Conference on Artificial Intelligence, Vienna, Austria, August 1992, pp. 463–467. [71] G. Widmer and M. Kubat, “Effective learning in dynamic environments by explicit context tracking,” in Proceedings of the European Conference on Machine Learning, Vienna, Austria, April 1993, pp. 227–243. [72] G. Widmer and M. Kubat, “Learning in the presence of concept drift and hidden contexts,” Machine Learning, vol. 23, pp. 69–101, 1996. [73] B. Yakir, “Dynamic sampling policy for detecting a change in distribution, with a probability bound on false alarm,” The Annals of Statistics, vol. 24, no. 5, pp. 2199–2214, October 1996. [74] M. Zaki, S. Parthasarathy, M. Ogihara, and W. Li, “New algorithms for fast

BIBLIOGRAPHY

discovery of association rules,” The University of Rochester, Tech. Rep. TR651, 1997. [Online]. Available: citeseer.nj.nec.com/zaki97new.html

116

APPENDIX A: HEURISTIC ASSUMPTIONS

This appendix summarizes the results of experiments aimed at verifying certain assumptions made by the change detection heuristics introduced in Section 3.3. The first experiment targeted the assumption that the four distance measures employed by the history-based heuristic (Section 3.3.2) are normally distributed around a constant mean. The inputs to each measure were the last block of the window and the increment. The histograms of the observed values are depicted in the first set of graphs (Figures A.1 and A.2). The second experiment focused on the underlying assumption of the multivariate heuristic (Section 3.3.3), namely that the observed frequency of every frequent itemset can be approximated by a normal distribution. The histograms of the observed values for five randomly selected itemsets are depicted in Figure A.3 (for the RK-domain) and Figure A.4 (for the IBM-domain).
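Verifying an assumption of this kind amounts to collecting one value per processed block (a distance value, or the frequency of a particular itemset) over a single stable context and inspecting the resulting empirical distribution. The following Python sketch shows one possible way to carry out such a check; the function name, the ten-bin histogram, and the use of SciPy's Shapiro-Wilk test are illustrative assumptions made here and are not part of the system described in this dissertation.

    # Minimal sketch of a normality check of the kind behind Figures A.1-A.4.
    # Assumption: values_per_block holds one observed value (a distance or an
    # itemset frequency) for every block of a single stable context.
    import numpy as np
    from scipy import stats

    def summarize_normality(values_per_block, alpha=0.05):
        """Histogram summary plus a Shapiro-Wilk test of normality."""
        values = np.asarray(values_per_block, dtype=float)
        counts, edges = np.histogram(values, bins=10)   # histogram, as in the figures
        _, p_value = stats.shapiro(values)              # null hypothesis: values are normal
        return {
            "mean": float(values.mean()),
            "std": float(values.std(ddof=1)),
            "histogram": list(zip(edges[:-1].tolist(), counts.tolist())),
            "shapiro_p": float(p_value),
            "consistent_with_normality": p_value > alpha,
        }

A histogram that is roughly bell-shaped around a constant mean, together with a Shapiro-Wilk p-value above the chosen significance level, is consistent with the assumption; it does not, of course, prove normality.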

Figure A.1: Histograms for the numbers of emerged and vanished itemsets, Jaccard distance measure, and frequency distance measure. The test domain was created by the RK-generator and consisted of one stable context. In every step each distance measure compared the increment with the last block in the window. The measured values over the whole domain are plotted in the graphs.

Figure A.2: Histograms for the numbers of emerged and vanished itemsets, Jaccard distance measure, and frequency distance measure. The test domain was created by the IBM-generator and consisted of one stable context. In every step each distance measure compared the increment with the last block in the window. The measured values over the whole domain are plotted in the graphs.

Figure A.3: Histograms of the observed distributions for five randomly selected itemsets ({553}, {3296}, {4750}, {7558}, and {9996}). The test dataset was created by the RK-generator and consisted of one stable context. The experiments collected itemset frequencies in each block over the entire dataset.

Figure A.4: Histograms of the observed distributions for five randomly selected itemsets ({18}, {510 2496}, {623 1895 11816}, {1058 13031 17587 17747}, and {1518 1871 3441 4431 8363}). The test dataset was created by the IBM-generator and consisted of one stable context. The experiments collected itemset frequencies in each block over the entire dataset.

APPENDIX B: EXPERIMENTS WITH ABRUPT CHANGE DOMAINS

The results in this appendix provide a detailed report of the proposed system's performance in domains where the change in context is abrupt. The experiments thoroughly tested the system for all possible combinations of the change detection heuristics and the window adjustment operators. The experiments worked with two test domains, the RK-domain and the IBM-domain, as introduced in Section 4.1. In each of the two domains, the system was tested on two types of datasets: a big-change type and a small-change type. Section B.1 evaluates the first objective of this dissertation – the accuracy of current context approximation. The second objective – the ability to detect changes in context – is examined in Section B.2.

B.1 Accuracy of Context Approximation

This section evaluates the course of the error for all the proposed heuristics. The experiments were run on four domains, each consisting of two alternating contexts of twenty blocks, with an abrupt change between the contexts. The error was calculated as the frequency distance (Section 4.2.1) between the window and the context at each moment. The upper graphs show the error for datasets created by the RK-generator and the lower graphs depict the results for the IBM-generator data. The graphs in the left column cover the case of a big difference between adjacent contexts; the graphs on the right deal with a small difference between every two contexts. Each graph, with the exception of the first two sets, contains three curves, one for each of the three window adjustment operators: harsh, reluctant, and opportunistic. The curves in the first set of graphs depict the three baseline cases: the window contains all transactions, the window has a fixed size of two blocks, and the window has a fixed size of ten blocks. The second set of graphs deals with the modified version of Ganti's change detection heuristic (Section 3.3.1). The following eight sets of graphs examine all the variations of the history-based heuristic (Section 3.3.2). The last set of graphs explores the behavior of the multivariate heuristic (Section 3.3.3).
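For a concrete picture of the set-based measures that recur in these plots, the sketch below computes a Jaccard-style distance and a simple frequency distance between two collections of frequent itemsets. This is only an illustration of the general idea: the exact measures used by the system are those defined in Sections 3.3.2 and 4.2.1, and the representation of a block as a dictionary mapping each frequent itemset to its relative support is an assumption made here.

    # Illustrative sketch only: two plausible distance measures between
    # collections of frequent itemsets.  Each block is assumed to be a dict
    # mapping a frozenset of items to that itemset's relative support.
    def jaccard_distance(a, b):
        """One minus the Jaccard coefficient of the two sets of frequent itemsets."""
        union = set(a) | set(b)
        if not union:
            return 0.0
        return 1.0 - len(set(a) & set(b)) / len(union)

    def frequency_distance(a, b):
        """Average absolute difference in support over the union of itemsets."""
        union = set(a) | set(b)
        if not union:
            return 0.0
        return sum(abs(a.get(s, 0.0) - b.get(s, 0.0)) for s in union) / len(union)

    # Example: compare the increment against the last block of the window.
    last_block = {frozenset({"milk", "bread"}): 0.12, frozenset({"beer"}): 0.07}
    increment = {frozenset({"milk", "bread"}): 0.10, frozenset({"diapers"}): 0.05}
    print(jaccard_distance(last_block, increment), frequency_distance(last_block, increment))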

Figure B.1: Baseline algorithms. Comparison of the error rate for a system without a window – all transactions are considered – and two fixed-size-window variations. The system was tested on two datasets – with big change and with small change between adjacent contexts – for the two data generators.

Figure B.2: The error rate for the Ganti's change detection heuristic. The system was tested on two datasets – with big change and with small change between adjacent contexts – for the two data generators. Each graph shows three curves, one for each window adjustment operator.

Figure B.3: The error rate for the history-based heuristic with the emerged/vanished conservative distance measure. The increment is compared with the last block only.

Figure B.4: The error rate for the history-based heuristic with the emerged/vanished progressive approach. The increment is compared with the last block only.

Figure B.5: The error rate for the history-based heuristic with the Jaccard distance measure approach. The increment is compared with the last block only.

Figure B.6: The error rate for the history-based heuristic with the frequency distance measure approach. The increment is compared with the last block only.

Figure B.7: The error rate for the history-based heuristic with the emerged/vanished conservative approach. The entire window is being compared with the increment.

Figure B.8: The error rate for the history-based heuristic with the emerged/vanished progressive approach. The entire window is being compared with the increment.

Figure B.9: The error rate for the history-based heuristic with the Jaccard distance measure approach. The entire window is being compared with the increment.

Figure B.10: The error rate for the history-based heuristic with the frequency distance measure approach. The entire window is being compared with the increment.

Figure B.11: The error rate for the multivariate heuristic.

B.2 Window Size

This section evaluates the course of the window size in abrupt-change domains. The desired behavior is for the system to grow the window during a stable context and to shrink it in reaction to a change in context. The experiments were run on four domains, each consisting of two alternating contexts of twenty blocks. The upper graphs show the window size for datasets created by the RK-generator and the lower graphs depict the results for the IBM-generator data. The graphs in the left column cover the case of a big difference between neighboring contexts; the graphs on the right deal with small-difference domains. Each graph, with the exception of the first two sets, contains three curves, one for each of the three window adjustment operators: harsh, reluctant, and opportunistic. The curves in the first set of graphs depict the three baseline cases: the window contains all transactions, the window has a fixed size of two blocks, and the window has a fixed size of ten blocks. The second set of graphs deals with Ganti's change detection heuristic. The following eight sets of graphs examine all the variations of the history-based heuristic (Section 3.3.2). The last set of graphs explores the behavior of the multivariate heuristic (Section 3.3.3).
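As a rough illustration of the behavior measured in this section, the loop below grows a block-based window while no change is signalled and shrinks it when the change detection heuristic fires. This is only a sketch: change_detected stands in for any of the heuristics, and the three shrinking policies attached to the harsh, reluctant, and opportunistic names are assumptions made for the sake of the example, not the operators' actual definitions.

    # Rough sketch of a variable-size-window control loop.  The shrinking
    # policies below are illustrative assumptions; the dissertation's own
    # window adjustment operators are defined in the main text.
    def process_stream(blocks, change_detected, operator="harsh"):
        window = []        # list of blocks, oldest first
        sizes = []         # number of transactions in the window after each step
        for increment in blocks:
            if window and change_detected(window, increment):
                if operator == "harsh":         # assumed: restart from the increment
                    window = []
                elif operator == "reluctant":   # assumed: drop only the oldest block
                    window = window[1:]
                else:                           # "opportunistic", assumed: keep the recent half
                    window = window[len(window) // 2:]
            window.append(increment)            # in a stable context the window keeps growing
            sizes.append(sum(len(block) for block in window))
        return sizes

    # Example use, with a trivial heuristic that never signals a change:
    # sizes = process_stream(blocks, change_detected=lambda win, inc: False)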

Figure B.12: Baseline algorithms. The changing size of the window for a system where all transactions are considered and two fixed-size-window variations.

Figure B.13: The changing size of the window for the Ganti's change detection heuristic in abrupt-change domains.

Figure B.14: The changing size of the window for the history-based heuristic with the emerged/vanished conservative approach in abrupt-change domains. The increment is compared with the last block only.

Figure B.15: The changing size of the window for the history-based heuristic with the emerged/vanished progressive approach in abrupt-change domains. The increment is compared with the last block only.

Figure B.16: The changing size of the window for the history-based heuristic with the Jaccard distance measure approach. The increment is compared with the last block only.

Figure B.17: The changing size of the window for the history-based heuristic with the frequency distance measure approach in abrupt-change domains. The increment is compared with the last block only.

Figure B.18: The changing size of the window for the history-based heuristic with the emerged/vanished conservative approach in abrupt-change domains. The entire window is being compared with the increment.

Figure B.19: The changing size of the window for the history-based heuristic with the emerged/vanished progressive approach in abrupt-change domains. The entire window is being compared with the increment.

Figure B.20: The changing size of the window for the history-based heuristic with the Jaccard distance measure approach in abrupt-change domains. The entire window is being compared with the increment.

Figure B.21: The changing size of the window for the history-based heuristic with the frequency distance measure approach in abrupt-change domains. The entire window is being compared with the increment.

Figure B.22: The changing size of the window for the multivariate heuristic in abrupt-change domains.

APPENDIX C: EXPERIMENTS WITH GRADUAL CHANGE DOMAINS

The results provided in this appendix give a detailed report of the proposed system's performance in domains where the change in context takes place gradually over several blocks. The experiments thoroughly tested the system for all possible combinations of the change detection heuristics and the window adjustment operators. The experiments worked with two test domains, the RK-domain and the IBM-domain, as introduced in Section 4.1. In each of the two domains, the system was tested on two types of datasets: a big-change type and a small-change type. Section C.1 evaluates the first objective of this dissertation – the accuracy of current context approximation. The second objective – the ability to detect changes in context – is examined in Section C.2.

C.1 Accuracy of Context Approximation

This section evaluates the course of the error in gradual-change domains. The experiments were run on four domains, each consisting of two alternating contexts with an intermediate transition period between every two contexts. Both the contexts and the transition period comprised twenty blocks. The error was calculated as the frequency distance (Section 4.2.1) between the window and the context at each moment. The upper graphs show the error for datasets created by the RK-generator and the lower graphs depict the results for the IBM-generator data. The graphs in the left column cover the case of a big difference between the contexts; the graphs on the right deal with small-difference domains. Each graph, with the exception of the first two sets, contains three curves, one for each of the three window adjustment operators: harsh, reluctant, and opportunistic. The curves in the first set of graphs depict the three baseline cases: the window contains all transactions, the window has a fixed size of two blocks, and the window has a fixed size of ten blocks. The second set of graphs deals with Ganti's change detection heuristic. The following eight sets of graphs examine all the variations of the history-based heuristic (Section 3.3.2). The last set of graphs explores the behavior of the multivariate heuristic (Section 3.3.3).

Figure C.1: Baseline algorithms in gradual-change domains. The error rate for a system without a window – all transactions are considered – and two fixed-size-window variations.

Figure C.2: The error rate for the Ganti's detection heuristic.

Figure C.3: The error rate for the history-based heuristic with the emerged/vanished conservative approach. The increment is compared with the last block only.

Figure C.4: The error rate for the history-based heuristic with the emerged/vanished progressive approach. The increment is compared with the last block only.

Figure C.5: The error rate for the history-based heuristic with the Jaccard distance measure approach. The increment is compared with the last block only.

Figure C.6: The error rate for the history-based heuristic with the frequency distance measure approach. The increment is compared with the last block only.

Figure C.7: The error rate for the history-based heuristic with the emerged/vanished conservative approach. The entire window is being compared with the increment.

Figure C.8: The error rate for the history-based heuristic with the emerged/vanished progressive approach. The entire window is being compared with the increment.

Figure C.9: The error rate for the history-based heuristic with the Jaccard distance measure approach. The entire window is being compared with the increment.

Figure C.10: The error rate for the history-based heuristic with the frequency distance measure approach. The entire window is being compared with the increment.

Figure C.11: The error rate for the multivariate heuristic.

C.2 Window Size

This section evaluates the course of the window size in gradual-change domains. The desired behavior is for the system to grow the window during a stable context and to keep the window short in the transitional periods between contexts. The experiments were run on four domains, each consisting of two alternating contexts with a gradual transition between every two contexts. Both the contexts and the transitional periods comprised twenty blocks. The upper graphs show the window size for datasets created by the RK-generator and the lower graphs depict the results for the IBM-generator data. The graphs in the left column cover the case of a big difference between neighboring contexts; the graphs on the right deal with small-difference domains. Each graph, with the exception of the first two sets, contains three curves, one for each of the three window adjustment operators: harsh, reluctant, and opportunistic. The curves in the first set of graphs depict the three baseline cases: the window contains all transactions, the window has a fixed size of two blocks, and the window has a fixed size of ten blocks. The second set of graphs deals with Ganti's change detection heuristic. The following eight sets of graphs examine all the variations of the history-based heuristic (Section 3.3.2). The last set of graphs explores the behavior of the multivariate heuristic (Section 3.3.3).

Figure C.12: Baseline algorithms in gradual-change domains. The changing size of the window for a system where all transactions are considered and two fixed-size-window variations.

Figure C.13: The changing size of the window in gradual-change domains for the Ganti's detection heuristic.

Figure C.14: The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished conservative approach.

Figure C.15: The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished progressive approach.

Figure C.16: The changing size of the window in gradual-change domains for the history-based heuristic with the Jaccard distance measure approach.

Figure C.17: The changing size of the window in gradual-change domains for the history-based heuristic with the frequency distance measure approach.

Figure C.18: The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished conservative approach. The entire window is being compared with the increment.

Figure C.19: The changing size of the window in gradual-change domains for the history-based heuristic with the emerged/vanished progressive approach. The entire window is being compared with the increment.

Figure C.20: The changing size of the window in gradual-change domains for the history-based heuristic with the Jaccard distance measure approach. The entire window is being compared with the increment.

Figure C.21: The changing size of the window in gradual-change domains for the history-based heuristic with the frequency distance measure approach. The entire window is being compared with the increment.

Figure C.22: The changing size of the window in gradual-change domains for the multivariate heuristic.

ABSTRACT

Data mining is a relatively new area of computer science that has received increased attention in the last decade. The basic goal of data mining is to extract new and useful knowledge from vast amounts of data. A popular subfield of data mining is association mining, which searches for frequently co-occurring items in a market-basket type of database. A market basket is the list of items a customer purchases at the register. Of course, this type of database is not limited to marketing; other possible application areas include the analysis of web data, astronomical data, stock market analysis, the medical field, and many others. The basic tenet of this work is that association patterns evolve over time due to fashion, season, or the introduction of new products. The dissertation proposes a novel window-based system to induce a model of the current state of the database. The main contributions of this work are three groups of heuristics for detecting a change in context, and three operators for adjusting the window after a change has been detected. The dissertation experimentally examines the system's behavior for all combinations of the change detection heuristics and the window adjustment operators in two kinds of domains. The domains of the first kind change their properties abruptly, while the domains of the second kind gradually evolve from one context to another. The dissertation shows that a system with a variable-size window achieves higher accuracy of context approximation than a system with no window or a fixed-size window. The change detection heuristics proposed in this work achieve the same accuracy as a heuristic from earlier work, but at one to two orders of magnitude lower computational cost.

BIOGRAPHICAL SKETCH

Antonin Rozsypal was born in Olomouc, Czechoslovakia, on April 25, 1976 to Antonin Rozsypal and Bozena Rozsypalova. He received the Bachelor of Science and Master of Science degrees in Computer Science from Masaryk University, Brno, Czech Republic, in 1997 and 1999, respectively. In the fall of 2000, Mr. Rozsypal joined the doctoral program in Computer Science at the University of Louisiana at Lafayette. His professional interests include Machine Learning and Data Mining.

