Data Mining – Challenges, Models, Methods and Algorithms

Markus Hegland

May 8, 2003


Contents

1 Introduction

2 Modelling: Objects, Features and Distributions
  2.1 Objects and Properties
  2.2 Data Sets, Distributions and Measures
  2.3 Sampling
  2.4 Errors
  2.5 Restricted Observations
  2.6 Properties and Patterns

3 Association rules
  3.1 Introduction
  3.2 Modelling for association rules
      3.2.1 Modelling the Data
      3.2.2 Modelling the Patterns in the Database
  3.3 Structure of X
  3.4 Structure of the Predicates on X
  3.5 Support and antimonotonicity
  3.6 Expectation and the size of itemsets
  3.7 Examples of distributions
  3.8 Rules, their support and their confidence
  3.9 Apriori Algorithm
      3.9.1 Time complexity – computing supports
      3.9.2 The algorithm
      3.9.3 Determination and size of the candidate itemsets
      3.9.4 Apriori Tid
      3.9.5 Constrained association rules
      3.9.6 Partitioned algorithms
      3.9.7 Finding strong rules
  3.10 Mining Sequences
  3.11 The FP tree algorithm

4 Predictive Models
  4.1 Introduction
  4.2 Mathematical Framework
      4.2.1 The data
      4.2.2 Support, misclassification rates, Bayes estimators and least squares
      4.2.3 Function spaces
  4.3 Constant Functions
      4.3.1 Constant classifiers
      4.3.2 The best constant real functions
  4.4 Tree models
      4.4.1 Splitting and impurity functions
      4.4.2 Growing a tree
      4.4.3 Tree pruning
      4.4.4 Trees and rules
      4.4.5 Regression Trees
  4.5 Linear and additive function spaces
      4.5.1 Additive models
      4.5.2 Regression Splines
  4.6 Local models
      4.6.1 Radial basis functions
  4.7 Models based on Density estimators
      4.7.1 Bayesian classifiers

5 Cluster Analysis
  5.1 Partitioning techniques
      5.1.1 k-means algorithm
      5.1.2 k-medoids algorithm
  5.2 Hierarchical Clustering

Chapter 1

Introduction

The following is an attempt at an introduction to and review of some current data mining techniques. The main focus is on the computational aspects of data mining and not on statistics, data management or data retrieval. However, the course will develop some of the mathematical foundations which underpin data mining algorithms.

Data mining is a new and rapidly growing field. It draws ideas and resources from multiple disciplines, including machine learning, statistics, database research, high performance computing and commerce. This explains the dynamic, multifaceted and rapidly evolving nature of the data mining discipline. While there is a broad consensus that the abstract goal of data mining is to discover new and useful information in data bases, this is where the consensus ends, and the means of achieving this goal are as diverse as the communities contributing. The foundations of all data mining methods, however, are in mathematics.

Any moderately sized treatment of data mining techniques necessarily has to be selective and maybe biased towards a particular approach. Despite this, we hope that the following discussion will provide useful information for readers wishing to get some understanding of the ideas and challenges underlying a selection of data mining techniques. This selection includes some of the most widely used data mining problems, namely association rule mining, predictive models, and clustering. The focus of this course is on the challenges posed by data size, dimensionality and data complexity, and on fundamental mathematical concepts and modelling. It is hoped that this overview gives the computational mathematician in particular some starting points for further exploration. We believe that data mining does provide many new challenging questions for


approximation theory, stochastic analysis, numerical analysis and parallel algorithm design, to name just a few mathematical disciplines relevant for data mining.

Data mining techniques are used to find patterns, structure or regularities and singularities in large and growing data sets. A necessary property of algorithms which are capable of handling large and growing datasets is their scalability, or linear complexity with respect to the data size. Scalability in the data mining literature means a time (and space) complexity which is proportional to the size of the data set, i.e., O(n) if n is the number of records of a data set. The proportionality "constant" may actually grow slightly as well, and complexities like O(n log(n)) are usually also acceptable.

Patterns in the database are described by relations between the attributes. In a sense, a relational database itself defines a pattern. However, the size of the relations of the data base makes it impossible to use them directly for further predictions or decisions. On the other hand, these relations only provide information about the available observations and cannot be directly applied to future observations. Methods which are able to generalise their results to future observations are investigated in statistics and machine learning.

The variables, or attributes, considered here are mainly assumed to be either continuous or categorical. However, more general data types are frequently analysed in data mining, see [15]. The techniques discussed here are not based on sampling and access every item in the full data set. However, this does not imply that sampling is unimportant in data mining; in fact, it is often the only way to deal with very large data sets.

The different disciplines are also reflected in different goals of data mining techniques [65]:

1. Induction. Find general patterns and equations which characterise the data.

2. Compression. Reduce the complexity of the data, replace it by simpler concepts.

3. Querying. Find better ways to query data, i.e., extract information and properties.

4. Approximation. Find models which approximate the observations well.

5. Search. Look for recurring patterns.

Furthermore, one may classify data mining algorithms based on three key elements [65]:

• Model representation. Decision trees, regression functions and associations.

• Data. Continuous, time series, discrete, labeled, multimedia or nominal data.

• Application domains. Finance and economy, biology, web logs, web text mining.

The discovered patterns can be characterised based on: accuracy and precision, expressiveness, interpretability, parsimony, surprisingness, interestingness, or actionability [65].

Assume that the initial raw data is stored in a relational data base, i.e., a collection of tables or relations. Each table contains a sequence of records, each of which consists of a number of attribute values. These tables are typically used for transactional purposes, i.e., for the management of a business. For example, a health insurance company would store a table containing information about all the doctors, another one with the patients and maybe a third one with claims which contains pointers into the other two tables. Each record in these tables contains information about the individuals; for example, the patient table would contain the age, sex and location of a patient. In order to guarantee consistency of the information and avoid redundancy, the tables are normalised [23].

Assume now that in a data mining project the claims of patients are to be further examined. In particular, one might be interested in differences of claims between different doctors of a particular specialisation. Thus a first step is to restructure the original data base into a data base where each new "record" now contains many records from several tables, and the data mining investigation is to compare these "records" in order to find regularities, patterns and outliers.

In a next step, major characteristics of these records are extracted. In our health insurance example this might be the number of patients, the total


claim or the type of service of a particular doctor. These characteristics may be symbolic objects [15] including sets, intervals and frequency distributions, but in this discussion we will mainly confine ourselves to features which are either categorical or continuous variables. The choice of these features is very important and often the main reason for failure or success of a data mining project. If, in our example, the features do not provide information about doctors and patients, the analysis of their interrelations will not provide any interesting insights. One data mining task is indeed the identification of features containing information which can contribute to a particular research question. We will assume that the features have been chosen and focus here instead on the algorithms which are used to analyse them.

Thus, after these preliminary steps, one is left with one table or relation which is a sequence of feature vectors, and the data mining task is to explore and describe how these feature vectors are interrelated. From a statistical point of view the data is a sequence of independent vectors which are described by a probability distribution. The goal of data mining is then to uncover interesting structure or aspects of the underlying probability distribution. Questions which are addressed are:

• Are there areas which have a higher probability? This is addressed by clustering techniques and association rules.

• Can some of the variables be explained by others and how well? This is addressed by classification and regression.

• Where are the areas where these functional relationships are better or worse? The analysis of areas of high misclassification rates and of the residuals of regression addresses this.

The sizes of the data bases analysed are growing exponentially. In fact, it has been suggested that data grows at the same rate as computational resources, which, according to Moore's law, double every 18 months [11]. Today data collections in the Megabyte range are very common, Gigabyte data collections are becoming widely available (including many business data collections, some of which reach well into the Terabyte range, where also the public Web is found), and data collections are now appearing in the Petabyte range. The largest challenge is not so much the absolute size of the data bases but their constant growth. Thus one of the largest challenges in computer science in general is the construction of systems which are capable of handling such growing

data sets, i.e., the systems need to be scalable [11]. The speed of hardware and software does scale at the same rate, but algorithms which do not have linear time complexity, i.e., for which the time spent grows faster than the data size, cannot lead to scalable solutions. (For example, an O(N^3) algorithm which requires five minutes today to analyse the most current data would require one year in ten years to analyse the then-current database, even if the most up-to-date hardware is used.) Thus it is essential that all data mining algorithms have at most linear time complexity in the data size.

The areas selected for the current discussion are: basic modelling (chapter 2), association rules (chapter 3), classification and regression (chapter 4) and clustering (chapter 5). The reader wishing to get further insights into these and other areas of data mining research is encouraged to search for relevant literature on the Web; a useful aide and guide in this search is the book on data mining by J. Han [39]. In addition to its discussion of a vast collection of data mining research it also provides a database perspective.

Association rule discovery is maybe data mining at its purest, and the techniques, while possibly motivated by earlier work in rule discovery, have evolved within the data mining community. As this area is the newest of the three covered, the mathematical foundations are less well studied. In contrast, there has been a lot of theoretical work in the area of classification in the machine learning community; in particular we would like to point to the book by Devroye/Györfi/Lugosi [25]. While linear regression is widely studied in statistics, nonparametric regression is still a relatively new field, in particular in the case of high-dimensional data. Clustering techniques have been used for many years and the amount of literature on clustering is quite formidable. We will only visit the tip of the iceberg here.


Chapter 2

Modelling: Objects, Features and Distributions

This chapter develops the basic modelling used in data mining. The aim of mathematical modelling is to map relationships between entities in the real world to relationships between mathematical abstractions, with a view to drawing conclusions about the nature of these relationships. For example, when mapping the system containing the sun and the earth onto two point masses interacting by gravitational forces, one can derive properties like Kepler's laws. Mathematical modelling is strictly speaking not a mathematical discipline, but nonetheless it is an essential step in the successful application of mathematical insights to our work and our life. In particular, putting a data mining problem into a mathematical framework enables the analyst to do the following:

• concisely define the data mining tasks and challenges,

• describe algorithms to solve these tasks,

• describe the properties of data and results,

• establish the computational properties of the algorithms,

• facilitate the transfer of technology between various application domains and

• guide the implementation of algorithms.

This chapter will focus on very basic modelling. In the remaining three chapters, the modelling which is pertinent to the particular problems is covered in the introduction to each chapter.

2.1 Objects and Properties

The first thing a data miner has to do is to determine the objects of the data mining study. The aim of the study is then to find out more about these objects, how they are related, and whether there are any unusual objects or objects of special interest. We model objects as abstract entities, in particular as elements of a set. However, this set is typically not further specified mathematically. We will denote objects by the symbol ω and the set of objects by Ω. What constitutes an object is not defined by the data but has to be decided before the study by the investigator.

For example, in the analysis of medical records one is interested in learning more about patients, in particular their medical history. Thus the objects here are patients of a hospital or, more generally, of a state. Sometimes one may be interested in comparing different conditions or episodes of medical care. In this case, an object is an episode of care. Finally, one may be interested in investigating single medical procedures received during episodes. Here the medical procedure is the object. So while all these investigations may use the same data set of medical records, the objects can be quite different. They are thus not the same as a record or transaction of the database (except when one is interested in the distribution of the records).

In retail one is interested in customer buying patterns. In this case, the object could be a customer. Often one may not have the necessary information to determine customers from the retail database and may wish to investigate particular shopping events instead. Then a market basket (or shopping trolley) would be the object. Astronomers are interested in the dynamics of stars. In this case the objects are the stars. Having recorded a brightness time series of one star, however, one might investigate the local dynamics, so that the object could be a time window into this time series. Finally, in bioinformatics one studies DNA and proteins. There the objects could be DNAs or proteins, or a window into the sequence of amino acids, i.e., a subsequence. The study of subsequences of sequences can reveal patterns which are frequent within and across sequences.


So in summary, the objects – while being important and concrete entities which define the scope and direction of a study – are, from a mathematical viewpoint, abstract entities, i.e., elements ω of a set. One might characterise the objects by a unique identifier (like a patient number). For many practical purposes one can think of the set Ω as a finite set, e.g., the set of patients of a hospital, but in other cases one might be better served by considering infinite sets, e.g., the set of potential patients of a hospital.

What we observe are properties of objects. It is important to separate the objects from the properties which are observed, even though we will only be able to draw conclusions about the distribution of the properties and not directly about the objects themselves. The values of the properties of objects are mathematical quantities. They can be numbers, tuples of numbers, classes, graphs (e.g., for the characterisation of molecules), sets, lists or any other mathematical entity. We will denote the set of potential properties of objects by X. A property of an object ω is then the value X(ω) of a mapping X : Ω → X. For example, the property could be the set of items contained in a market basket, the intensity time series of a star or the residue sequence of a protein. Properties are also called features in machine learning or random variables in statistics. Note that properties are not just real numbers or integers but include much more complex mathematical entities. The nature of a data mining exploration is largely influenced by the observed properties.

One may think that introducing the objects and properties in this way excludes randomness or observation errors. This is not the case. The errors themselves are again modelled by objects which relate to the observation. The observation itself is an object ω′ which is mapped onto an actual object ω. Thus we may have multiple records (observations) of the temperature at a certain time and place, but we are really interested in developing an understanding of the temperature distribution over time and place. In this case the object is the "real" temperature, which is thus observed with an error which has to be taken into account when one discusses methods to find patterns. The distinction between the objects and the observations is crucial: if one does not make it, one ends up with results about specific observations which, while depending on the objects, also include "errors" which typically provide no information about the objects.

While the observed features determine what we can discover about objects, they are not necessarily the features which we would like to investigate. For this one may wish to transform the observed features into features of interest. For example, in a market basket, one may be interested in the total value of the goods purchased. For this one needs to determine the prices of all the goods and add them up (a small sketch of such a feature derivation is given at the end of this section). This mapping of features into other features is often counted as part of the preprocessing, together with the finding and handling of missing data and data inconsistencies. While this is an important step, we will not discuss it further here; instead, we will often assume that the X(ω) are the features we are interested in.

Obviously, the "real objects" ω and their properties X(ω) are quite different things. The properties are observed and are mathematical items. We will mainly talk about properties of collections and distributions of properties; however, it is useful to keep in mind that these properties just open a small window into the real world and most of the properties are unknown. In addition, the observed properties typically only reflect the properties of our real objects approximately and may contain large errors. Drawing any conclusions about real objects needs to take this into account.

One may argue that the set Ω and the concept of objects are not part of the mathematical discussion and should be omitted here. This, however, misses an important point: in order to successfully apply mathematics one needs a good foundation for the mathematical model and a way to connect the mathematics to the real world. This is exactly what mathematical modelling does, and the objects are part of this modelling.
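As promised above, here is a minimal Python sketch of such a feature derivation (added for illustration only; the price table and item names are made up). It maps the observed feature, the set of items in a basket, onto a derived feature of interest, the total value.

```python
# Hypothetical price table mapping observed items to prices.
prices = {"bread": 2.50, "milk": 1.20, "juice": 3.00, "butter": 2.10}

def basket_value(basket):
    """Derived feature: total value of the goods in a market basket."""
    return sum(prices[item] for item in basket)

# Observed feature X(omega): the set of items in the basket.
basket = {"bread", "milk", "juice"}
print(basket_value(basket))  # 6.7
```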

2.2 Data Sets, Distributions and Measures

Data consists of collections of observed properties of objects. More specifically, the data set is a sequence x^(i) ∈ X. The data mining algorithms will find properties of these sequences which typically are independent of the order of the sequence, which is treated like a multiset. The mathematical structure which is exploited in the data mining analysis is a consequence of the mathematical structure of X.

A simple example of a data set is the set of wages of all employees of a company. Properties of this simple data set include average, variance, quantiles, regions of high density, and maxima. These properties have been widely discussed in the statistical literature. All these properties are actually properties of the probability distribution function.

In data mining we will consider properties of the distribution of observations or of underlying theoretical distributions. Both cases are modelled


by measures on the set X. In the case of finite X, the measure is a discrete probability distribution which is defined by a function p : X → [0, 1] for which

    ∑_{x∈X} p(x) = 1.

In this case the data mining algorithms aim to find properties of the distribution function p. Depending on the underlying set X, the types of properties of interest and the distribution, one will use different algorithms.

A special case of a discrete probability is the uniform probability, where p(x) = 1/|X| is the same for all values x ∈ X. This is an uninteresting case and does not have any patterns. The other extreme is the case of an atomic measure with only one value, say y; in this case we have p(x) = δ_y(x) where

    δ_y(x) = 1 if x = y,  and  δ_y(x) = 0 else.

This again has no special structure and is typically uninteresting. As is often the case, one knows exactly what is uninteresting; defining the interesting distributions, however, is much more challenging.

The distribution of the values of a data set can itself be encoded as a discrete probability distribution. This is the empirical distribution defined by

    p(x) = (1/n) ∑_{i=1}^n δ_{x^(i)}(x).

All the information of interest in data mining is encoded in this distribution. It is easy to see that this function allows us to decide if any element is in the data set or not and, in addition, which proportion of the data set has a given value.

Before we analyse this further we will review some more general basic probability theory. Probability theory introduces a measure on a collection of subsets of X. The elements of X, our observations, are called samples in probability theory and X is the sample space. An event in probabilistic terminology is a subset of X which corresponds to a selection of values taken. For example, an event could contain all observations of wages in a narrow range around the average national wage. In order to be of any use, the set of possible events, which we call S, needs to have some basic structure. In practice, one assumes that

• For every event A ∈ S the complement is also an event, i.e., A^C ∈ S.

• For every two events A, B ∈ S their intersection is also an event, i.e., A ∩ B ∈ S.

• For every (countable) collection of events A_1, A_2, . . . ∈ S their union is also an event, i.e., ∪_i A_i ∈ S.

In the case of finite sets one does not need to worry about infinite unions, but this becomes important, e.g., in the case where the samples are real numbers or vectors. If the set of events S satisfies the above properties it is called a σ-algebra. The choice of S is just as important as the choice of the set of objects. It will define the type of statements that we can make about the objects. The maximal sigma-algebra of a set X is the power set 2^X and the minimal sigma algebra (in terms of set inclusion) is S = {∅, X}. Note that any sigma algebra contains the set X and the empty set ∅. An example of a (sigma) algebra for X = {0, 1, 2, 3, 4} is

    S = {∅, {2}, {0, 1}, {3, 4}, {0, 1, 2}, {2, 3, 4}, {0, 1, 3, 4}, {0, 1, 2, 3, 4}}.

Note the limitation which comes with this choice: the set {3} does not appear in S, so, for example, if these sets are to represent sets of items or market baskets, a basket containing only item 3 is not possible. In most discrete cases one considers S to be the power set; however, constraints on the data may make it more efficient to consider a subset.

A data set is a collection of objects (where some objects may appear multiple times). This is described by a probability distribution which is a mapping P : S → [0, 1]. The mapping is σ-additive, and

• P(A^C) = 1 − P(A) for any set A ∈ S,

• P(A ∪ B) = P(A) + P(B) for A, B ∈ S with A ∩ B = ∅, or more generally

• P(∪_i A_i) = ∑_i P(A_i) for A_i ∈ S and A_i ∩ A_j = ∅,

• P(X) = 1, P(∅) = 0.

This is an axiomatic (Kolmogorov axioms) or measure theoretic characterisation of the distribution.
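The closure properties of the example sigma-algebra given above can be checked mechanically. The following Python sketch (added here as an illustration, not part of the original text) verifies closure under complement and under pairwise intersection and union for the eight listed sets.

```python
from itertools import combinations

X = frozenset({0, 1, 2, 3, 4})
S = [frozenset(s) for s in
     [set(), {2}, {0, 1}, {3, 4}, {0, 1, 2}, {2, 3, 4}, {0, 1, 3, 4}, {0, 1, 2, 3, 4}]]

# Closure under complement: A in S implies X \ A in S.
assert all(X - A in S for A in S)

# Closure under pairwise intersection and union.
assert all(A & B in S and A | B in S for A, B in combinations(S, 2))

print("S is closed under complement, intersection and union")
```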


Any discrete probability distribution p(x) defined on the range of a random variable X defines a measure by

    P(A) = ∑_{x∈A} p(x),   A ∈ S.

It is easy to show that this P has all the properties of a probability measure, and the sigma algebra in this case is just the power set 2^X.

The counting measure for a certain data set is defined as

    #(A) = ∑_{i=1}^n δ(x^(i) ∈ A)

where δ(x^(i) ∈ A) is one if x^(i) ∈ A and zero else. Thus #(A) is just the number of observed values x^(i) of the data base in the set A. Now if n_x = #({x}) is the number of data points with value x then one gets

    p(x) = n_x / n

and the measure P affiliated with this is the empirical measure:

    P(A) = ∑_{x∈A} p(x) = (1/n) ∑_{x∈A} n_x.
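As a concrete illustration (a sketch added here, not taken from the original text), the counting measure #(A) and the empirical measure P(A) can be computed directly from a data set; the toy data and function names are purely illustrative.

```python
from collections import Counter

def counting_measure(data, A):
    """#(A): number of observations x^(i) that fall into the event A."""
    return sum(1 for x in data if x in A)

def empirical_measure(data, A):
    """P(A) = #(A) / n, the empirical probability of the event A."""
    return counting_measure(data, A) / len(data)

data = [2, 0, 2, 4, 2, 1]                  # a toy data set of observed values
n_x = Counter(data)                        # n_x = #({x}) for each value x
p = {x: n_x[x] / len(data) for x in n_x}   # empirical distribution p(x) = n_x / n

print(counting_measure(data, {0, 1}))      # 2
print(empirical_measure(data, {0, 1, 2}))  # 5/6
print(p[2])                                # 0.5
```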

2.3 Sampling

In many cases one would like to be able to find properties of a distribution P from a sample of random variables which are independent and identically distributed with P. The motivation for this may be computational, as one may not be able to process the full data set. Taking a much smaller subset of data items will make many algorithms computationally feasible, at the cost of some approximation errors. In particular, unlikely events will tend to be excluded, so that the capability to find unusual points may be restricted.

So for an unknown measure P and known features defined by X we assume that we have a given sample x^(i), i = 1, . . . , n, of observations of the values of X(ω^(i)). These observations define an empirical distribution P_n with

    P_n(A) = (1/n) ∑_{i=1}^n δ(x^(i) ∈ A).

Very often the underlying distribution P may not be known at all and only a sample of data points is given which should provide insights into P. This, in fact, is a common scenario in statistics. Making statements about P based on P_n is in this case referred to as prediction, which means prediction of properties of future unseen objects drawn from the distribution P. In business, for example, one would like to know if any new customer is likely going to be profitable or not. The predictive capability of the methods is also referred to as generalisation in the machine learning literature. A common challenge is to avoid any idiosyncrasies of a particular P_n and extract only patterns which are common to all P_n and thus to P. In statistical terminology, one would like to avoid overfitting and reduce the variance of the estimator.

There are some results about the feasibility of finding patterns from samples, in particular in the context of specific features or random variables. Such results are asymptotic in n and include, e.g., the law of large numbers. In general, one may expect that the closer P_n is to P, the more one can expect to find patterns from P reflected in P_n. Thus one could investigate how close P_n is to P in order to get some general insights into the performance of the algorithms. For this one would be interested in bounds on the difference of the measures on collections of A ∈ S, i.e., in bounds of the form

    |P(A) − P_n(A)| ≤ e_n.

Alternatively, one could attempt to modify P_n to make it closer to P. This is particularly promising in the case of P being continuous: as P_n is not continuous, a smoothed version of P_n may give better approximation results. In the case of linear estimators one obtains kernel density estimators.
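The following short sketch (an added illustration, not part of the original text) shows the effect of sampling: the empirical measure of an event computed from a random sample approaches the measure computed from the full data set as the sample size grows.

```python
import random

random.seed(0)
full_data = [random.randint(0, 9) for _ in range(100_000)]  # stand-in for a large data set
A = {0, 1, 2}                                               # an event of interest

def empirical_measure(data, event):
    return sum(1 for x in data if x in event) / len(data)

p_full = empirical_measure(full_data, A)
for n in (100, 1_000, 10_000):
    sample = random.sample(full_data, n)
    p_n = empirical_measure(sample, A)
    print(n, round(abs(p_full - p_n), 4))   # |P(A) - P_n(A)| shrinks roughly like 1/sqrt(n)
```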

2.4 Errors

Another aspect, or challenge, is posed by the fact that often the x^(i) are not observations of values of X(ω) but are perturbed by errors or noise. The effect of this noise on the determination of the patterns of P has to be carefully examined. We refer the reader here to the statistical literature on this.

2.5 Restricted Observations

The distribution of the data will typically be influenced by other variables. For example, one may have a distribution p(x, z) for some z ∈ Z, but in the case of the observations the probability distribution takes the form

    p_1(x) = P({ω | X(ω) = x, Z(ω) = z, g(x, z) = 0}).

When one applies the findings based on p_1(x) one may be in a different situation, e.g., one may have a distribution

    p_2(x) = P({ω | X(ω) = x, Z(ω) = z, h(x, z) = 0})

for some h ≠ g. Thus the distribution p_2 may not have the properties which were discovered from p_1. Careful analysis of the particular situation has to take this into account.
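A small simulation (added here purely as an illustration; the selection rules standing in for g and h are made up) shows how two different observation restrictions can lead to different distributions of the same variable X.

```python
import random

random.seed(1)
population = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(100_000)]  # pairs (x, z)

# Two hypothetical selection mechanisms restricting which objects are observed.
g = lambda x, z: z > 0           # observed under condition g
h = lambda x, z: x + z > 1       # observed under condition h

mean_1 = sum(x for x, z in population if g(x, z)) / sum(1 for x, z in population if g(x, z))
mean_2 = sum(x for x, z in population if h(x, z)) / sum(1 for x, z in population if h(x, z))

print(round(mean_1, 3))  # close to 0: g does not restrict x
print(round(mean_2, 3))  # clearly positive: h induces a selection bias on x
```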

2.6 Properties and Patterns

Data mining aims to find properties and patterns in large data sets. Computationally this boils down to discovering subspaces, coherent subsets and manifolds of the set X which are supported by the measure P.


Chapter 3

Association rules

3.1 Introduction

Large amounts of data have been collected routinely in the course of day-to-day management in business, administration, banking, the delivery of social and health services, environmental protection, policing and in politics. This data is primarily used for accounting and the management of the customer base. However, it is also one of the major assets of its owner, as it contains a wealth of knowledge about the customers which can assist in the development of marketing strategies, political campaigns, policies and product quality control. Data mining techniques help process this data, which is often huge, constantly growing and complex. The discovered patterns point to underlying mechanisms which help understand the customers and can give leads to better customer satisfaction and relations.

As an illustrative (small) example consider the US Congress voting records from 1984 [44], see figure 3.1. The 16 columns of the displayed bit matrix correspond to the 16 votes and the 435 rows to the members of congress. We have simplified the data slightly so that a matrix element is one (pixel set) in the case of votes recorded as "voted for", "paired for" or "announced for", and the matrix element is zero in all other cases. Data mining aims to discover patterns in this bit matrix. In particular, we will find columns or items which display similar voting patterns, and we aim to discover rules relating the items which hold for a large proportion of members of congress. We will see how many of these rules can be explained by underlying mechanisms (in this case party membership).

[Figure 3.1: 1984 US House of Representatives Votes, with 16 items voted on. Panels: voting data, random data.]

The choice of the columns and the rows is in this example somewhat arbitrary. One might actually be interested in patterns which relate to the members of congress, for example in statements like "if member x and member y vote yes then member z votes yes as well". Statements like this may reveal some of the connections between the members of congress. For this the data matrix is thus "transposed". The duality of observations and objects occurs in other areas of data mining as well and illustrates that data size and data complexity are really two dual concepts which can be interchanged. This is in particular exploited in some newer association rule discovery algorithms which are related to formal concept analysis [35].

An example of this duality from bioinformatics is in figure 3.2, which is based on microarray data from [4]. The rows of this bit matrix correspond to the tissues tested and the columns correspond to the first 400 of 2000 genes. The bit is set when the expression level is over the 75 percent quantile for that gene, which we use here as an example of the gene being "switched on". One can now study rules like "if gene x and y are switched on then so is gene z" but also statements like "if a gene is switched on for samples x and y then it is also switched on for sample z", which will tell us something about the distribution of the genes in the sample space rather than the distribution of the samples in the gene space. Note that in this case the number of genes (2000) is much larger than the number of samples (60), so the dual association rule discovery problem is in some sense more natural than the original problem.

[Figure 3.2: Expressed genes for tumor data]

While data mining had been studied since before 1988, it was the introduction of association rule mining in 1993 by Agrawal, Imielinski and Swami [2] and the publication in 1995 of an efficient algorithm by Agrawal and Srikant [3] and, independently, by Mannila, Toivonen and Verkamo [55] which initiated a wealth of research and development activity. This research has been dealing with efficiency, applications, the interface with data access, and the relation with other concepts like prediction, and it has strengthened the young discipline and helped establish it as an important and exciting research area in computer science and data processing.

An association rule is an implication or if-then rule which is supported by data. The motivation given in [2] for the development of association rules is market basket analysis, which deals with the contents of point-of-sale transactions of large retailers. A typical association rule resulting from such a study could be "90 percent of all customers who buy bread and butter also buy milk". While such insights into customer behaviour may also be


obtained through customer surveys, the analysis of the transactional data has the advantage of being much cheaper and of covering all current customers. The disadvantage compared to customer surveys is the limitation of the given transactional data set. For example, point-of-sale data typically does not contain any information about personal interests, age and occupation of customers.

Understanding the customer is core to business and ultimately may lead to higher profits through better customer relations, customer retention, better product placements, product development, but also fraud detection. While originating from retail, association rule discovery has also been applied to other business data sets including

• credit card transactions,

• telecommunication service purchases,

• banking services,

• insurance claims and

• medical patient histories.

However, the usefulness of association rule mining is not limited to business applications. It has also been applied in genomics and text (web page) analysis. In these and many other areas, association rule mining has led to new insights and new business opportunities. Of course the concept of a market basket needs to be generalised for these applications. For example, a market basket is replaced by the collection of medical services received by a patient during an episode of care, the subsequence of a sequence of amino acids of a protein or the set of words or concepts used in a web page. Thus when applying association rule mining to new areas one faces two core questions:

• what are the "items" and

• what are the "market baskets"?

Answering these questions is facilitated if one has an abstract mathematical notion of items and market baskets. This will be developed in the next section.

The efficiency of the algorithms will depend on the particular characteristics of the data sets. An important feature of the retailer data sets is that they contain a very large number of items (tens of thousands) but every market basket typically contains only a small subset.

A simple example in Table 3.1 illustrates the type of results association rule mining produces.

    market basket id    market basket content
    1                   orange juice, soda water
    2                   milk, orange juice, bread
    3                   orange juice, butter
    4                   orange juice, bread, soda water
    5                   bread

    Table 3.1: Five grocery market baskets

From the table one can induce the following "rules":

• 80 % of all transactions contain orange juice (1, 2, 3, 4),

• 40 % of all transactions contain soda water (1, 4),

• 50 % of the transactions containing orange juice also contain soda water (1, 4), while

• all transactions which contain soda water also contain orange juice.

These rules are very simple, which is typical for association rule mining. Only simple rules will ultimately be understandable and useful. However, the determination of these rules is a major challenge due to the very large data sets and the large number of potential rules.

In the example of the congressional voters one finds that the items have received between 34 and 63.5 percent yes votes. Pairs of items have received between 4 and 49 percent yes votes. The pairs with the most yes votes (over 45 percent) are in the columns 2/6, 4/6, 13/15, 13/16 and 15/16. Some rules obtained for these pairs are: 92 percent of the yes votes in column 2 are also yes votes in column 6, 86 percent of the yes votes in column 4 are also yes votes in column 6 and, on the other side, 88 percent of the yes votes in column 13 are also yes votes in column 15 and 89 percent of the yes votes in column 16 are also yes votes in column 15. These figures suggest combinations of items which could be further investigated in terms of causal relationships between

the items. Only a careful statistical analysis may provide some certainty on this. This, however, is beyond the scope of these lectures.
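The simple percentages quoted above for Table 3.1 can be reproduced with a few lines of code. The following Python sketch (added as an illustration; it is not part of the original text) computes the support of an itemset and the confidence of a rule for the five toy baskets.

```python
baskets = [
    {"orange juice", "soda water"},
    {"milk", "orange juice", "bread"},
    {"orange juice", "butter"},
    {"orange juice", "bread", "soda water"},
    {"bread"},
]

def support(itemset):
    """Fraction of baskets containing every item of the itemset."""
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(antecedent, consequent):
    """Conditional support of the consequent given the antecedent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"orange juice"}))                      # 0.8
print(support({"soda water"}))                        # 0.4
print(confidence({"orange juice"}, {"soda water"}))   # 0.5
print(confidence({"soda water"}, {"orange juice"}))   # 1.0
```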

3.2 Modelling for association rules

Association rule discovery originated in market basket analysis. Here the object is a market basket of items purchased by a customer. While many features may be of interest in market basket analysis, the main features studied are the types of items in the market basket. The market basket example is just one instance where association rule discovery is used. In general, it is used whenever the objects are sets of items or, more generally, whenever one observes a collection of properties X_i(ω) of the objects, i.e., statements which are either true or false. In all these cases the features take the form of bitvectors.

3.2.1 Modelling the Data

Thus one has a set of possible items, which we number and represent by their number. For example, a "micromarket" may sell the following items, which are given together with a map into the integers:

    item       item number
    juice      0
    bread      1
    milk       2
    cheese     3
    potatos    4

    Table 3.2: Items of a micromarket

This allocation is somewhat arbitrary and for particular applications one may wish to choose a specific allocation; for example, one may order the items such that the item numbers indicate the rank of purchase frequency. In the case of the micromarket, juice would then be the most frequently purchased item, followed by bread, etc.

More generally, we map the items onto the integers 0, . . . , d − 1. Any market basket is a set of items and we will model it as a subset x of Z_d = {0, . . . , d − 1}. Thus the mathematical problem of finding association rules


from data consists of finding structure in a sequence of subsets of Z_d. So the set X of features of objects is the power set of Z_d.

Sets can be represented in various ways and all representations have their strengths. The representation as a collection of integers is very intuitive and compact. An alternative representation uses bitvectors. With any subset of Z_d one associates a vector of length d containing zeros except at the positions indicated by the elements of the subset, where the component of the vector is one. So, for example, in the case of the micromarket items one may have a market basket {bread, milk, potatos} which is mapped onto the set {1, 2, 4}, which corresponds to the bitvector (0, 1, 1, 0, 1). Thus from a bitvector one can easily extract not only which elements are contained in the set but also which elements are not. Using this representation we define our set of features to be X = {0, 1}^d, but occasionally we will refer to the vectors as itemsets. The underlying probability space is discrete.

The size of an itemset will be important as it gives a complexity measure. Using the bitvector representation, the size of x is

    |x| = ∑_{i=0}^{d−1} x_i.

In some cases one would like to enumerate the sets. The bitvector can be interpreted as the vector of bits of an integer and one thus has the following mapping of bitvectors (or itemsets) onto integers:

    φ(x) = ∑_{i=0}^{d−1} x_i 2^i.

Note that this actually gives a bijective map between the set of finite sets of integers and the set of integers. Table 3.3 shows a collection of micromarket itemsets, the corresponding set of integers, the representing bitvector and the number of this subset in the enumeration of all subsets. The data is thus represented by a sequence of bitvectors, or a Boolean matrix. For example, corresponding to Table 3.3 with the micromarket items one gets the matrix

    1 1 1 0 0
    0 0 0 0 1
    0 1 0 0 1

    itemset                  set of numbers   bitvector          number of subset
    {juice, bread, milk}     {0, 1, 2}        (1, 1, 1, 0, 0)    7
    {potatos}                {4}              (0, 0, 0, 0, 1)    16
    {bread, potatos}         {1, 4}           (0, 1, 0, 0, 1)    18

    Table 3.3: Collection of itemsets and their encodings

In the example of congressional voting the first few records are

    1 0 1 1 1 1 0 1 0 1 0 0 1 1 0 0
    1 1 1 0 0 1 0 1 1 1 1 0 1 1 0 0
    0 0 0 1 1 1 0 1 1 0 0 0 0 0 1 0
    0 1 0 1 0 1 0 1 1 0 1 1 0 0 1 0

or, the same records written as sets,

    1 3 4 5 6 8 10 13 14
    1 2 3 6 8 9 10 11 13 14
    4 5 6 8 9 15
    2 4 6 8 9 11 12 15
    1 3 7 9 10 11 13 16
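The correspondence between itemsets, bitvectors and the enumeration φ can be made concrete with a short sketch (added here for illustration; the helper names are not from the original text).

```python
items = {"juice": 0, "bread": 1, "milk": 2, "cheese": 3, "potatos": 4}  # Table 3.2
d = len(items)

def to_bitvector(itemset):
    """Map a set of items onto its bitvector x in {0,1}^d."""
    numbers = {items[i] for i in itemset}
    return tuple(1 if i in numbers else 0 for i in range(d))

def phi(x):
    """phi(x) = sum_i x_i 2^i, the number of the subset in the enumeration."""
    return sum(xi * 2**i for i, xi in enumerate(x))

x = to_bitvector({"juice", "bread", "milk"})
print(x, phi(x))                                 # (1, 1, 1, 0, 0) 7
print(phi(to_bitvector({"bread", "potatos"})))   # 18
```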

3.2.2 Modelling the Patterns in the Database

The structure which association rules model in these matrices stems from the similarity of the rows x^(i), which is based on some underlying mechanism. We will now illustrate this for the case of the congressional data.

If there is no underlying mechanism, i.e., the samples are generated by independent random components, then the data mining algorithm should not find any structure. (Any "discovered patterns" would purely reflect the sample and not the underlying distribution.) If we take two rows at random then they are modelled by two random vectors X^(i) and X^(j). Even when they are random, there will still be some components which these vectors share. The number of differing components is also a random variable. It is given by the Hamming distance:

    d_H(x, y) = ∑_{i=1}^d |x_i − y_i|.

If the probability that any component is one is p, then the probability that this component is the same for both vectors is p_2 := p^2 + (1 − p)^2. The number of


differing components is a sum of i.i.d. binary random variables and thus has a binomial distribution, with expectation d(1 − p_2) and variance d p_2(1 − p_2). Note that the ratio of the standard deviation (the square root of the variance) to the expectation goes to zero when d goes to infinity. This is an example of the concentration effect, or curse of dimensionality.

In figure 3.3 we display the distribution of the actual distances together with the binomial distribution of distances in the random case (upper two histograms). Clearly, the histogram in figure 3.3 shows that we do not have the random situation and thus we may expect to find meaningful patterns in the data. These patterns will often relate to the party membership, as is shown in the lower two histograms: in the case of the same party one gets a random-looking distribution with lower expectation and in the case of different parties one with higher expectation. In the distribution of all the differences in voting behaviour one has both these effects and thus a much broader distribution than in the random case. This example is not so exciting, as it is expected that differences across parties are larger than differences within a party. Such a "discovery" would be judged to be less than exciting. We have chosen this example, however, as it illustrates the case of an underlying "hidden" variable which may be able to explain the distribution. In practical cases such hidden mechanisms, both known and unknown, are common.

In the case of the biological data the same distribution together with the binomial one is displayed in figure 3.4. The data set has much more noise and only a few data points, which is a big problem; however, it appears that the distribution of the actual distances is still wider than in the random case and thus one may be able to find some patterns. The "hidden" variable in this case is the occurrence of tumor, and it is conjectured that different genes are active for tumor cells than for others; in fact, this is what some think could be used to detect tumors. We can expect from this first display that finding meaningful and correct patterns will be much more difficult for this data set.
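The comparison between observed distances and the purely random (binomial) case described above can be reproduced with a small sketch (added for illustration; the random bit matrix here merely stands in for real data).

```python
import random
from itertools import combinations

random.seed(2)
d, n, p = 16, 200, 0.5
data = [[1 if random.random() < p else 0 for _ in range(d)] for _ in range(n)]

def hamming(x, y):
    """d_H(x, y) = number of differing components."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

p2 = p**2 + (1 - p)**2                      # probability a component agrees
expected = d * (1 - p2)                     # expected Hamming distance in the random case

pairs = random.sample(list(combinations(range(n), 2)), 2000)
observed = sum(hamming(data[i], data[j]) for i, j in pairs) / len(pairs)

print(round(expected, 2), round(observed, 2))  # for independent random rows these agree closely
```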

[Figure 3.3: Distribution of distances of two randomly chosen records in the congress voting data, compared to the binomial distribution and the distribution of the distances in the case of the same party and different parties. Panels: number of different votes, random votes, same party, different party.]

[Figure 3.4: Distribution of distances of two randomly chosen records in the tumor data. Panels: number of different expressions, random expressions, same tissue class, different tissue class.]

In association rule discovery, one would like to make statements about the objects which are not necessarily universally true but which hold for a good proportion of the data. More specifically, one is interested in finding mappings a : X → {0, 1} which are true on a subset of X with large measure, or equivalently, one would like to find subsets of X which have a large probability.

The set S_a associated with a predicate a is its support, i.e., the subset of X where the predicate is one:

    S_a = supp(a) = {x | a(x) ≠ 0}.

Unfortunately, the probability P(S_a) of the support is also called support; we denote it by

    s(a) = P(S_a).

We will be looking for statements, i.e., predicates a, or alternatively for subsets S_a, for which the support is high, say more than a given lower bound σ which is dictated by the application. Thus we are looking for a with s(a) ≥ σ. There are trivial examples which satisfy this, including the case a = 1, where S_a = X, or the case where S_a = {x^(1), . . . , x^(N)} is the data set itself. In both cases one has a support s(a) = 1, or at least close to one. One thus needs to constrain the admissible predicates a to ones which are interesting, such that these cases are eliminated. Basically, one would like to have small and simple sets (à la Occam's razor) with a large support.

A problem related to the curse of dimensionality is that the total number of predicates is very large. As a predicate can be identified with its support, the total set of predicates is equivalent to the power set of X. As the size of X is 2^d, the size of the power set of X is 2^(2^d). Finding all the predicates which have at least support σ seems to be an insurmountable task. Thus, again, one needs to constrain the types of predicates. On the one side, many predicates will not be of interest; on the other side, many of the predicates found will be very similar to other ones found and may, say, differ for just one x. One could thus introduce similarity classes of predicates. First, however, we can identify a generic property which the predicates we are interested in should have.

For this we go back to the market basket example. A property of a market basket relates to its contents, its price, etc. If a market basket contains certain items then a larger market basket will contain the same items plus some more and thus, for the purposes of marketing, the larger market basket would be just as interesting as the smaller one. Thus we would like our predicates to be monotone in x, i.e., if x ≤ y then a(x) ≤ a(y). Of course this is not always the case, but this constraint does reduce the number of possible predicates a lot. So now we have the data mining problem: find all monotone predicates a which are frequent. In order to do this we will have to understand more about the structure of the monotone predicates, which we will do in a later section.
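A typical monotone predicate is "the basket contains all items of a given itemset y": if it holds for x and x ≤ x′, it also holds for x′. The following sketch (an added illustration; the names are made up) evaluates such a predicate on bitvectors and estimates its support on a toy data set.

```python
def contains(y):
    """Monotone predicate a_y(x) = 1 iff x contains the itemset y (componentwise x >= y)."""
    return lambda x: int(all(xi >= yi for xi, yi in zip(x, y)))

def support(a, data):
    """s(a): fraction of records on which the predicate a is one."""
    return sum(a(x) for x in data) / len(data)

# toy data set over d = 5 items, bitvector representation
data = [(1, 1, 1, 0, 0), (0, 0, 0, 0, 1), (0, 1, 0, 0, 1), (1, 1, 0, 0, 1)]

a = contains((0, 1, 0, 0, 0))      # "contains item 1"
b = contains((0, 1, 0, 0, 1))      # "contains items 1 and 4"
print(support(a, data), support(b, data))   # 0.75 0.5  (b implies a, so s(b) <= s(a))
```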


Note that this monotonicity constraint does depend on the application, and in other cases one might use different constraints. The simplest examples of such monotone predicates are given by the functions which extract the i-th component of the vector x, i.e., a_i(x) = x_i. For the US congress votes the support of all the single votes is displayed in figure 3.5. A different, slightly more complex case is the product (or logical and) of two components: a_{i,j}(x) = x_i x_j. For the voting example let us consider pairs of the type "V13 and Vx". Here the distributions do look quite different, see figure 3.6. In particular, while the distribution of the single votes did not show us much about the parties, the combination certainly does. We can see that there are votes which do have similar voting behaviours. These similarities seem to go beyond party boundaries, i.e., a high support of a combination of two votes often means high support from both parties. So maybe there is an underlying mechanism which is more fundamental than the party membership. In the random case there is only the case where the first vote is combined with itself (which actually is only one vote) and the case where two votes are combined. Finally, the confidences of the rules "if voted yes for V13 then voted yes for Vx" are displayed in figure 3.7.

So far we have only considered structure which can be described by one predicate. However, one is also interested in multiple predicates, in particular in pairs of predicates which interact. These are the association rules which initiated the work in this area. They are described by a pair of predicates a, b of which the conjunction has a good support, and for which the conjunction has a good conditional support within the support S_a. This conditional support is also called confidence. Again, one is interested in monotone predicates a and b. The association rules intuitively model "if-then rules", are denoted by a ⇒ b, and a is called the antecedent and b the consequent. Note that these "if-then rules" have nothing to do with implications in classical propositional logic, where an implication is equivalent to ¬a ∨ b. So an association rule is a pair of predicates with two numbers, the support s(a ⇒ b) = s(a ∧ b) and the confidence c(a ⇒ b) = s(a ∧ b)/s(a). Thus one has the following definition.

[Figure 3.5: Supports for all the votes in the US congress data (split by party). Panels: voting data, random data; horizontal axis: vote number; vertical axis: proportion of yes votes.]

[Figure 3.6: Supports for pairs of votes in the US congress data (split by party). Panels: voting data, random data; horizontal axis: vote number of second vote in pair; vertical axis: proportion of yes votes.]

[Figure 3.7: Confidence of the rule "if V13 then Vx" for the US congress data (split by party). Panels: voting data, random data; horizontal axis: vote number of consequent; vertical axis: proportion of yes votes.]

Definition 1 (Association Rule). An association rule, denoted by a ⇒ b, consists of a pair of monotone predicates a and b together with a measure of support s and a measure of confidence c such that s(a ⇒ b) = s(a ∧ b) and c(a ⇒ b) = s(a ∧ b)/s(a).

In the following we will discuss algorithms for two data mining problems, for any given probability space (X, 2^X, P):

• Frequent set mining problem. For given σ > 0 find all monotone predicates a such that s(a) ≥ σ.

• Association rule mining problem. For given σ > 0 and γ > 0 find all association rules a ⇒ b for which s(a ⇒ b) ≥ σ and c(a ⇒ b) ≥ γ.

In addition to these questions, other types of constraints for the association rules have also been considered. Now that we have a complete model of the data and the research questions, we can move on to discussing the structure of the data and pattern space, which will form the basis for the development of efficient algorithms.
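Before turning to efficient algorithms such as Apriori (discussed later in this chapter), the two problems can be stated as a brute-force sketch (added here for illustration only): enumerate all candidate itemsets, keep those with support at least σ, and derive the rules meeting the confidence threshold γ. Its exhaustive enumeration is exponential in d and is not the method developed in the text.

```python
from itertools import combinations

def frequent_itemsets(baskets, items, sigma):
    """All itemsets y with support s(y) >= sigma (exhaustive search, O(2^d))."""
    n = len(baskets)
    result = {}
    for k in range(1, len(items) + 1):
        for y in combinations(items, k):
            s = sum(1 for b in baskets if set(y) <= b) / n
            if s >= sigma:
                result[frozenset(y)] = s
    return result

def rules(frequent, gamma):
    """Association rules a => b with both sides drawn from one frequent itemset."""
    out = []
    for itemset, s in frequent.items():
        for k in range(1, len(itemset)):
            for a in combinations(itemset, k):
                a = frozenset(a)
                c = s / frequent[a]           # confidence = s(a and b) / s(a)
                if c >= gamma:
                    out.append((set(a), set(itemset - a), s, c))
    return out

baskets = [{"juice", "soda"}, {"milk", "juice", "bread"}, {"juice", "butter"},
           {"juice", "bread", "soda"}, {"bread"}]
items = sorted({i for b in baskets for i in b})
freq = frequent_itemsets(baskets, items, sigma=0.4)
for a, b, s, c in rules(freq, gamma=0.9):
    print(a, "=>", b, round(s, 2), round(c, 2))
```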

3.3 Structure of X

In order to systematically search through the set X for the frequent sets and predicates introduced above, we will need to exploit the structure of X. As X consists of sets, one has a natural partial ordering which is induced by the subset relation. In terms of the components one can define this componentwise as

    x ≤ y :⇔ x_i ≤ y_i for all i.

Thus if, for any i, x contains i, then x_i = 1 and, if x ≤ y, then y_i = 1 and so y contains i as well, which shows that x ≤ y means that x is a subset of y. We use the symbol ≤ here instead of ⊂ as we will use the ⊂ symbol for sets of itemsets. Of course, subsets have at most the same number of elements as their supersets, i.e., if x ≤ y then |x| ≤ |y|. Thus the size of a set is a monotone function of the set with respect to the partial order. One can also easily see that the function φ is monotone as well. Now, as φ is a bijection it induces a total order on X defined by x ≺ y iff φ(x) < φ(y). This is the colex order, and the colex order extends the partial order.
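A short sketch (added for illustration) of the subset partial order on bitvectors, the componentwise join and meet, and the total order induced by φ:

```python
def leq(x, y):
    """Partial order: x <= y componentwise, i.e. x is a subset of y."""
    return all(xi <= yi for xi, yi in zip(x, y))

def join(x, y):   # least upper bound, componentwise maximum (set union)
    return tuple(max(xi, yi) for xi, yi in zip(x, y))

def meet(x, y):   # greatest lower bound, componentwise minimum (set intersection)
    return tuple(min(xi, yi) for xi, yi in zip(x, y))

def phi(x):       # enumeration of itemsets; the induced total order is the colex order
    return sum(xi * 2**i for i, xi in enumerate(x))

x, y = (0, 1, 0, 0, 1), (0, 1, 1, 0, 1)
print(leq(x, y), join(x, y), meet(x, y))   # True (0, 1, 1, 0, 1) (0, 1, 0, 0, 1)
print(phi(x) < phi(y))                     # True: the colex order extends the partial order
```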

3.3. STRUCTURE OF X

39

The partial order (and the colex) order have a smallest element which consists of the empty set, as bitvector x = (0, . . . , 0) and a largest element wich is just the set of all items Zd , as bitvector (1, . . . , 1). Furthermore, for each pair x, y ∈ X there is a greatest lower bound and a least upper bound. These are just x∨y =z where zi = max{xi , yi } for i = 0, . . . , d−1 and similar for x∧y. Basically, the partial order forms a Boolean lattice. We denote the maximal and minimal elements by 1 and 0. More generally, a partially ordered set is defined as Definition 2 (Partially ordered set). A partially ordered set (X, ≤) consists of a set X with a binary relation ≤ such that for all x, x0 , x00 ∈ X: • x≤x

(reflexivity)

• If x ≤ x0 and x0 ≤ x then x = x0 • If x ≤ x0 and x0 ≤ x00 then x ≤ x00

(antisymmetry) (transitivity)

A lattice is a partially ordered set with glb and lub: Definition 3 (Lattice). A lattice (X, ≤) is a partially ordered set such that for each pair of of elements of X there is greatest lower bound and a least upper bound. We will call a lattice distributive if the distributive law holds: x ∧ (x0 ∨ x00 ) = (x ∧ x0 ) ∨ (x ∧ x00 ). where x∨y is the maximum of the two elements which contains ones whenever at least one of the two elements and x ∧ x0 contains ones where both elements contain a one. Then we can define: Definition 4 (Boolean Lattice). A lattice (X, ≤) is a Boolean lattice if 1. (X, ≤) is distributive 2. It has a maximal element 1 and a minimal element 0 such that for all x ∈ X: 0 ≤ x ≤ 1.

40

CHAPTER 3. ASSOCIATION RULES 3. Each element x as a (unique) complement x0 such that x ∧ x0 = 0 and x ∨ x0 = 1.

The maximal and minimal elements 0 and 1 satisfy 0∨x = x and 1∧x = x. In algebra one considers the properties of the conjunctives ∨ and ∧ and a set which has conjunctives which have the properties of a Boolean lattice is called a Boolean algebra. We will now consider some of the properties of Boolean algebras. The smallest nontrivial elements of X are the atoms: Definition 5 (Atoms). The set of atoms A of a lattice is defined by A = {x ∈ X|x 6= 0 and if x0 ≤ x then x0 = x}. The atoms generate the lattice, in particular, one has: Lemma 1. Let X be a finite Boolean lattice. Then, for each x ∈ X one has _ x = {z ∈ A(X) | z ≤ x}. Proof. Let Ax := {z ∈ A(B)|z ≤ x}. Thus x is an upper bound for Ax , i.e., W Ax ≤ x. W Now let y be any upper bound for Ax , i.e., Ax ≤ y. We need to show that x ≤ y. Consider x ∧ y 0 . If this is 0 then from distributivity one gets x = (x ∧ y) ∨ (x ∧ y 0 ) = x ∧ y ≤ y. Conversely, if it is not true that x ≤ y then x ∧ y 0 > 0. This happens if what we would like to show doesn’t hold. In this case there is an atom z ≤ x ∧ y 0 and it follows that z ∈ Ax . As y is an upper bound we have y ≥ z and so 0 = y ∧ y 0 ≥ x ∧ y 0 which is impossible as we assumed x ∧ y 0 > 0. Thus it follows that x ≤ y. The set atoms associated with any element is unique, and the Boolean lattice itself is isomorph to the set of sets of atoms. This is the key structural theorem of Boolean lattices and is the reason why we can talk about sets (itemsets) in general for association rule discovery. Theorem 1 (Representation Theorem for Finite Boolean Algebras). A finite Boolean algebra X is isomorphic to the power set 2A(X) of the set of atoms. The isomorphism is given by η : x ∈ X 7→ {z ∈ A(X) | z ≤ x}

3.3. STRUCTURE OF X

41 {milk, bread, coffee, juice}

{milk, bread, coffee} {milk, bread}

{milk, bread, juice}

{milk, coffee, juice}

{milk, coffee} {bread, coffee} {milk}

{bread, coffee, juice}

{milk, juice}

{bread. juice}

{coffee}

{juice}

{bread}

{coffee, juice}

{}

Figure 3.8: Lattice of breakfast itemsets 1111 0111 0011

1011 0101 0001

1110

1101

0110

1001

0010

0100

1010

1100

1000

0000

Figure 3.9: Lattice of bitvectors. and the inverse is η −1 : S 7→

_

S.

In our case the atoms are the d basis vectors e1 , . . . , ed and anyPelement of X can be represented as a set of basis vectors, in particular x = di=1 ξi ei where ξi ∈ {0, 1}. For the proof of the above theorem and further information on lattices and partially ordered sets see [24]. The significance of the lemma lays in the fact that if X is an arbitrary Boolean lattice it is equivalent to the powerset of atoms (which can be represented by bitvectors) and so one can find association rules on any Boolean lattice which conceptially generalises the association rule algorithms. In figure 3.8 we show the lattice of patterns for a simple market basket case which is just a power set. The corresponding lattice for the bitvectors is in figure 3.9. We represent the lattice using an undirected graph where the nodes are the elements of X and edges are introduced between any element

42

CHAPTER 3. ASSOCIATION RULES

Figure 3.10: The first Boolean Lattices and its covering elements. A covering element of x ∈ X is an x0 ≥ x such that no element is “in between” x and x0 , i.e., any element x00 with x00 ≥ x and x0 ≥ x00 is either equal to x or to x0 . In Figure 3.10 we display the graphs of the first few Boolean lattices. We will graph specific lattices with (Hasse) diagrams [24] and later use the positive plane R2+ to illustrate general aspects of the lattices. In the following we may sometimes also refer to the elements x of X as item sets, market baskets or even patterns depending on the context. As the data set is a finite collection of elements of a lattice the closure of this collection with respect to ∧ and ∨ (the inf and sup operators) in the lattice forms again a boolean lattice. The powerset of the set of elements of this lattice is then the sigma algebra which is fundamental to the measure which is defined by the data. The partial order in the set X allows us to introduce the cumulative distribution function as the probability that we observe a bitvector less than a given x ∈ X: F (x) = P ({x0 |x0 ≤ x}) . By definition, the cumulative distribution function is monotone, i.e. x ≤ x0 ⇒ F (x) ≤ F (x0 ). A second cumulative distribution function is obained from the dual order as: F ∂ (x) = P ({x0 |x0 ≤ x}) . It turns out that this dual cumulative distribution function is the one which is more useful in the discussion of association rules and frequent itemsets. Typically, the set X is very large, for example, with 10, 000 items one can in principle have 210,000 ≈ 103,000 possible market baskets. It can be said with

3.4. STRUCTURE OF THE PREDICATES ON X

43

x9

x2 x4

x5 x8

x3

x7

x1 x6 Figure 3.11: Data points certainty that not all possible market baskets will occur, and, in particular, one would only be interested in market baskets with a limited number of items, say, with less than 100 items. In summary, the data is a set of points, see 3.11. For any given number of items d and number of records or data points n the number of possible data 9 sets is very large, for example, there are 103·10 possible data sets with n = 106 market baskets containing a items from a collection of 10, 000 items. It is thus important to be very systematic when one would like to compare different data sets and determine their essential properties. Now this complexity is a bit misleading, as we have not taken into account that mostly simple market baskets will occur. Note, however, that 106 market baskets and 10, 000 items corresponds to a relatively “small scenario”.

3.4

Structure of the Predicates on X

We would like to make statements about the x which would hold in with some probability. These probable statements are one type of structure we would like to discover for a given distribution. The statements are formulated as predicates a : X → {0, 1}. Such a predicate can be interpreted as a characteristic function of a subset A ⊂ X, for which the predicate is true (or 1). We are looking for predicates

44

CHAPTER 3. ASSOCIATION RULES

which are true for a large proportion of the points, equivalently, we are looking for subsets which have a large probability. In extremely simple cases one can consider all possible predicates (or subsets) and determine the support s(a) = P ({x|a(x) = 1} and select the ones which have large enough support to be interesting. There are a total d of 2|X| = 22 predicates, which, in the case of market basket analysis with 3,000 d = 10, 000 items (which corresponds to a smaller supermarket) is 210 ≈ 3·102,999 10 . Thus it is impossible to consider all the subsets. One suggestion is to search through the predicates in a systematic way. The predicates themselves have an order which is given by the order on the range of a and is the subset relationship of the sets for which the predicates are the characteristic functions. We have a ≤ b ⇔ a(x) ≤ b(x) for all x so a is less than b if a is a consequence of b, i.e, if b is true whenever a is true. With respect to ∨ and ∧ (here in their original logical meaning) the predicates form a Boolean lattice. So we could systematically start a the bottom or the top of the lattice and work our way through the lattice following the edges of the Hasse diagram. Any predicate is defined by the set of elements on which it is true, i.e., it can be written as an indicator function of that set as a(x) = χ{z|a(z)=1} (x) and if the indicator function for the set containing just x is χx we have _ χy . a= a(y)=1

It follows directly that the atoms of the Boolean algebra of the predicates are the indicators χy . Unfortunately, at the bottom end of the lattice the support for any predicates will be very low. Thus we need to start at the top and use the dual ordering. With respect to this dual ordering the atoms are now the indicator functions of the sets containing everything except y. Thus we get the opposite situation as before and for a very long time all the predicates we get have a very large support. Thus it is not feasible to search just along the edges of the graph of the lattice of the predicates. The observations indicate that we need to look for a different ordering of the predicates. Here is where Occam’s razor does give us some intuition.

3.4. STRUCTURE OF THE PREDICATES ON X

45

We are interested in simple predicates or simple sets which we may be able to interprete. This does not necessarily mean sets with few elements but sets which have a simple structure. As the full powerset does not lead to anything usefull we should consider a different set of subsets. In order to do this we go back to the example of the market baskets. A property which will make a market basket interesting will typically also make a market basket which contains the original market basket interesting as well. This leads to properties which are monotone, i.e. for which we have x ≤ y ⇒ a(x) ≤ a(y). The set A for which a is an indicator function is an up-set, also called increasing set or order filter and it satisfies x ∈ A and x ≤ y ⇒ y ∈ A. (The dual of up-sets are down-sets, decreasing sets or order ideals.) The set of up-sets F(X) is a subset of the powerset but it is still quite large. The sizes of some of the smallest sets of up-sets are given in [24] for the cases of d = 1, . . . , 7. For example, in the case where d = 6 the set X has 64 elements and the size of the set of up-sets is 7, 828, 354. The simplest up-sets are the ones generated by one element: ↑ x := {z|z ≥ x}. In order to better understand the lattice of up-sets F(X) one considers join-irreducible sets in this lattice. They are defined as J (F(X)) = {A 6= 0| if A = B ∪ C for some B, C ∈ F(X) then A = B or A = C}. While join-irreducibility is defined for more general lattices, in our case we can characterise the join-irreducible up-sets exactly: Proposition 1. J (F(X)) = {↑ x | x ∈ X} Proof. As X is finite there are is forSeach U ∈ F(X) a sequence xi ∈ X, i = 1, . . . , k such that xi 6≤ xj and U = ki=1 ↑ xi . By definition, at most the U for which k = 1 can be join-irreducible. Let ↑ x = U ∪ V and, say, x ∈ U . It then follows that ↑ x ⊂ U . As U ⊂↑ x it follows ↑ x = U . Thus all ↑ x ∈ J (F(X)).

46

CHAPTER 3. ASSOCIATION RULES

Note that in our case the sets ↑ x {x} are the lower covers of ↑ x, i.e., the closest up-sets which are subsets. In fact, this is a different characterisation of the join-irreducible elements of finite lattices, see [24]. The join-irreducible elements are essential in Birkhoff’s representation theorem for finite distributive lattices [24] which states that the finite distributive lattices are isomorph to the lattice of down-sets generated by joinirreducible elements. This representation is also closely related to the disjunctive normal form in logic. We will now consider the join irreducible elements ↑ y further. Lets denote their characteristic function by ay := χ↑y such that ( 1 if x ≤ y ay (x) = 0 else. Note that ay∨z = ay ∧ az but ay ∨ az is in general not of the form az . We can systematically search through the space of all ay using the ordering induced by the ordering of the index y. We start with the ax where the x are the atoms of X. Note that the induced ordering is reversed, i.e., one has y ≤ z ⇔ ay ≥ az , and more generally one has Proposition 2. ay (x) is a monotone function of y and an anti-monotone function of x and furthermore: ^ ay (x) = (az (x) ∨ ¬az (y)) .

X

z∈A( )

Proof. The monotonicity/antimonotonicity properties follow directly from the definition. While the ↑ y are join-irreducible, they are not meet-irreducible. In particular, from the representation of Boolean algebras and the isomorphy of the lattice of join-irreducible elements of F(X) and X one gets \ ↑y= ↑ z.

X

z≤y, z∈A( )

If z 6≤ y it follows that ¬az (y) is true.

3.5. SUPPORT AND ANTIMONOTONICITY

47

Now we could use the logical theory of resolution to search for objects with particular properties or, alternatively, to find all possible predicates which either hold for all market baskets or for at least one market basket. These deductions are quite possible for smaller data sets but for larger data sets one can only hope to find a very limited subset of rules or predicates which occur very frequently. Finally, we are interested in predicates a which have a support s(a) ≥ σ, i.e., are frequent. Or more specifically, we are interested in frequent itemsets z such that s(az ) ≥ σ. By the apriori principle we have that if z is frequent and if y ≤ z then y is frequent as well. Thus the set of frequent itemsets forms a down-set, decreasing set or order ideal. It follows that we can represent the set of frequent itemsets F as [ ↓ z (i) F = i

for a set of maximal elements z (i) . The aim of the data mining for frequent itemsets is to find these maximal elements of F . The apriori algorithm searches for these maximal element by looking at the lattice of z and, by starting at 0 moves to ever larger z until the maxima are found. In summary, we now have an algebraic framework for the search for interesting predicates ay .

3.5

Support and antimonotonicity

After having explored the space of predicates we intend to search we now discuss effects of the criterium used to select predicates further. Intuitively, we are looking for all predicates which are supported by the data, i.e., for which the support s(a) is larger than a certain given value σ. Recall the definition of support s(a) = P (supp(a)). In case of the empirical distribution the relative frequency is the support, in the case where we have a sample (possibly from a larger population) the relative frequency is an estimator of the support and one has: n

1X sˆ(a) = a(x(i) ) n i=1

48

CHAPTER 3. ASSOCIATION RULES

x9

x2 x4

x5

supp(a) x8

x3

x7

x1 x6

X

Figure 3.12: Support of a general predicate In the literature the term support is used some times also for the estimate sˆ(a) and some times for the number of transactions for which a holds. In [1] support is reserved for the set of items in the data base which support the item, support count is used for the size of this set and support ratio for the estimate of s(a). In Figure 3.12 the support is illustrated together with some data points. From the properties of the probability one gets properties of the support function s, in particular, one has s(a ∧ b) ≤ s(a). We will now consider the support of the predicates ay which considered in the previous section. Their support s(ay ) is illustrated in figure 3.13, it is not the same as the support of the item set y, i.e., s(ay ) 6= P ({y}). Note that the support of the item set s(ax ) and the probability of the corresponding market basket x is not the same! The support of ax is illustrated in However, the support is the the cumulative distribution function based on the dual lattice of X. Proposition 3. Let s be the support and ax be the predicate which indicates the presence of the item set x. Then one has s(ax ) = F ∂ (x) where F ∂ is the cumulative distribution function which is based on the dual partial order.

3.5. SUPPORT AND ANTIMONOTONICITY

49

supp(ax )

x

Figure 3.13: Support of ax . Proof. s(ay ) =

X

p(y).

y≥x

While we have used probabilistic language, often, we are only interested in patterns which are frequent in the current data set and do not rely on any probabilistic interpretation of the data set or the generalisation of the rules. A consequence of this previous proposition is the main property of the support function. Recall that a function f (x) is monotone iff f (x) ≤ f (y) whenever x ≤ y and a function is anti-monotone iff f (x) ≥ f (y) whenever x ≤ y. We now have the core property used for the association rule discovery algorithms: Theorem 2 (Anti-monotonicity). The support s(ay ) is an anti-monotone function of y. Proof. The cumulative distribution function F δ is monotone wrt to the dual order and is thus anti-monotone wrt to the original order. The theorem then follows from the previous proposition.

50

CHAPTER 3. ASSOCIATION RULES

This simple anti-monotonicity property forms the basis of the apriori algorithm. Note that the cumulative distribution function uniquely defines the distribution and so either the probability distribution p or the support s(ax ) can be used to determine the relevant properties of the probability space. Association rule mining uses the apriori property to find all the values of the cumulative distribution function which have at least a given size. Note that from a practical point of view, the predicates with small support are uninteresting. A final comment on the relationship between the item sets and the relational data model. In the relational data model all the data is defined with tables or relations. In our case, this is based on two sets, one, the set of market baskets (stored as basket identifiers) which we call M and the set of items which we call I. The data base is then a relation R ⊂ M × I. Thus we have one market basket table with one attribute “contains item” and the domain of this attribute are the items. The relation forms a bipartite graph. The contents of the market baskets x are the adjacency sets corresponding to the elements of M . The supports of item sets are the adjacent sets of sets of items in this model.

3.6

Expectation and the size of itemsets

For the analysis of the complexity of the association rule discovery algorithms we will require more general functions than predicates. In particular, we will consider real functions f : X → R. Such a function is a random variable and it also denotes a property of the objects we are investigating. In our case of finite X the function has an expectation which is defined as X E(f ) = p(x)f (x).

X

x∈

It follows that the expectation is a linear functional defined on the space of real functions over X. The expectation is monotone in the sense that f ≥ g ⇒ E(f ) ≥ E(g) and the expectation of a constant function is the function value. The variance of the random variable corresponding to the function f is ¢ ¡ var(f ) = E (f − E(f ))2 . A first class of functions f are the predicates a with values of 0 and 1 only. For them we have:

3.6. EXPECTATION AND THE SIZE OF ITEMSETS

51

Proposition 4. A predicate a is a random variable with expectation E(a) = s(a) and variance var(a) = s(a)(1 − s(a)). Note that 1 − s(a) = s(1 − a) is the support of the negation of a. One has var(a) ≤ 0.25 and the maximum is reached for s(a) = 0.5. In practice one would not expect any item to be in half of all market baskets and typically, s(a) is often rather small so that the variance is close to the expectation. The previous proposition shows that the support can be defined as either a probability (of the supporting set) or an expectation (of the predicate). In fact, probability theory itself can be either based on the concept of probability or the concept of expectation [72]. A second class of examples is obtained when we consider a sample of independant observations as our objects. Thus the observed feature is the sequence x(1) , . . . , x(n) . We consider the relative frequency of a on the observations, i.e., n 1X sˆ = a(x(i) ) n i=1 as function defined on this object. The probability distribution of the observations is, and here we will assume that the samples are drawn at random independantly: m Y (1) (n) p(x , . . . , x ) = p(x(i) ). i=1

The expectation of the function f (x(1) , . . . , x(n) ) = ˆ(s) is E(ˆ s(a)) = s(a) and the variance is

s(a)(1 − s(a)) . n All these quantities are thus represented as a function of the support s(a). A third example of a random variable is the length of an itemset, var(ˆ s(a)) =

f (x) = |x| =

d X j=1

xj .

52

CHAPTER 3. ASSOCIATION RULES

The expectation of the length of an itemset is E(|x|) =

d X

pj

j=1

where pj is the probability that the bit j is set. Note that pj = s(aj ) where aj is the predicate defined by ( aj (x) =

1 if xj = 1 0 else

is the indicator function for the up-set ↑ ej of the j-th atom of X. For the variance of the length of an itemset we need to also consider ai aj with is the indicator function for sets of pairs of atoms or 2-itemsets and with pij = E(ai aj ) one gets var(|x|) =

d X

pi (1 − pi ) +

i=1

X

(pij − pi pj ).

i6=j

Note that the first sum is the sum of the variances of all the components. The second sum is a correction which originates from the fact that the components are not independant. So far we have considered positive functions f only. One can now give a simple upper bound for the probabibility that the function values get large using the expectation only. This is useful as it allows the estimation of a distribution from a simple feature, the mean. It is unlikely that the the random variable takes values which are very different from the expectation. A simple theorem gives an estimate of the probability for f to reach values larger than any given bound. It does not require any knowledge of the distribution. This is Markov’s inequality: Theorem 3 (Markov’s inequality). Let f be a non-negative function. Then, for x > 0: E(f ) p(f (x) ≥ c) ≤ c where p(f (x) ≥ c) := P ({x|f (x) ≥ c}).

3.6. EXPECTATION AND THE SIZE OF ITEMSETS

53

Proof. From the definition of the expectation one gets E(f ) =

X x∈

=

X

f (x)p(x)

X

f (x)p(x) +

f (x)≥c



X

X

f (x)p(x)

f (x) c is bounded from above by s(a)/c. This shows that we will not expect deviations which are too large but it does not tell us how this estimate improves with growing data size n. For example, we can see that the probability that the estimate is four times the size of s(a) is less or equal than 0.25. If we consider 1/ˆ(s) as our random variable we can get a similar lower bound, i.e., the probability that sˆ(a) is less than s(a)/4 is also 0.25. For the third example, the size of an itemset one gets similar bounds. We can again state that the probability that we get observations of the length of an itemset which is more than 4 times or less than 4 times the actual size are less or equal 0.25 each. It turns out, that we can further modify the bounds obtained from the Markov inequality to include the variance as well. This is due to the fact that the Markov inequality holds for any positiv function. The variance now gives us a measure for how much the function values will likely spread. This is stated in the next theorem: Theorem 4 (Chebyshev’s inequality). For any real function f and r > 0 on has: P ({x | E(f ) − r

p p 1 var(f ) < f (x) < E(f ) + r var(f )}) ≥ 1 − 2 . r

54

CHAPTER 3. ASSOCIATION RULES

p Consider again the examples 2 and 3 from above. Choose r = ρE(f )/ var(f ). Then we get P ({x | E(f )(1 − ρ) < f (x) < E(f )(1 + ρ)}) ≥ 1 −

var(f ) . ρ2 E(f )2

As a consequence, we can now estimate the probability that the function var(f ) value is between a bound of, say 25% around the exact value to be 1− 16E(f . )2 . This is small In the second example we get for this bound 1 − 16(1−s(a)) ns(a) for small supports and large for large supports. Note that ns(a) is just the number of positive cases. Thus for very small s(a) we will need at 32 positive cases in order to get a 50 percent chance that the estimated support is within a 25 percent bound. This is actually not bad and thus the estimator for the support is relatively good. In the third case the interpretation of the bound is a bit trickier but not much. We get the bound ³P ´ P d 16 j=1 pj (1 − pj ) + i6=j (pij − pi pj ) 1− ³P ´2 d j=1 pj which, in the case of small pj and for indpedendant variables xj is approximated by 16 ´. 1 − ³P d p j j=1 Consequently, in this case we can expect the length of an actual itemset to be at most 25 percent of its expectation with a probability of more than 0.5 if the expected size is larger than 32. Note that the dependance of these bounds as a function of the number of variables d and the number of observations n are manifestations of the concentration phenomena.

3.7

Examples of distributions

The simplest case is where all pj = q for some q > 0 and p(x) = q |x| (1 − q)d−|x|

3.7. EXAMPLES OF DISTRIBUTIONS

55

This is the case where every item is bought equally likely with probability q and independently. The length of the itemset has in this case a binomial distribution and the expected size of the itemset is E(|x|) = qd. The support of ax can be seen to be s(ax ) = q |x| . This models the observation that large data sets have low support (but not much else). The frequent itemsets correspond to itemsets which are limited in size by |x| ≤ log(σ0 )/ log(q). The distribution of the size is a binomial distribution µ ¶ d r p(|x| = r) = q (1 − q)d−r . r For very large numbers of items d and small q this is well approximated by the Poisson distribution with λ := dq: 1 −λ r e λ. r! The (dual) cumulative distribution function is p(|x| = r) =

∞ X 1 −λ r p(|x| ≥ s) = e λ. r! r=s

In Figure 3.14 this probability is plotted as a function of x for the case of λ = 4 together with the Markov bound. This example is not very exciting but may serve as a first test example for the algorithms. In addition, it is a model for “noise”. A realistic model for the data would include noise and a “signal”, where the signal models the fact that items are purchased in groups. Thus this example is an example with no structure and ideally we would expect a datamining algorithm not to suggest any structure either. A second example, with just slightly more structure has again independant random variables but with variable distributions such that m Y x p(x) = pj j (1 − pj )1−xj . j=1

CHAPTER 3. ASSOCIATION RULES

0.6 0.4 0.0

0.2

cumulative probability

0.8

1.0

56

0

5

10

15

20

x

Figure 3.14: Distribution of size of market baskets for the simplest case (“o”) and the Markov bound for λ = 4.

3.7. EXAMPLES OF DISTRIBUTIONS

57

The average market basket in this case is λ=

m X

pj

j=1

and the support is s(ax ) =

m Y

x

pj j .

j=1

In examples one can often find that the pj when sorted are approximated by Zipf ’s law, i.e, α pj = j for some constant λ α = Pn 1 . j=1 j

A third example is motivated by the voting data. In this case we assume that we have a “hidden variable” u which can take two values and which has a distribution of, say p(u = 0) = p(u = 1) Q = 0.5. In addition, we xj 1−xj have the conditional distributions p(x|u = 0) = m and j=1 pj (1 − pj ) Qm xj 1−xj p(x|u = 1) = j=1 qj (1 − qj ) . From this we get the distribution of the x to be Ãm ! m Y 1 Y xj xj 1−xj 1−xj p(x) = p (1 − pj ) + qj (1 − 1j ) . 2 j=1 j j=1 Try to determine for this case what the distribution of the size of the itemset, and the variance are. Choose specific distributions pj and qj . What about the expected values for the aj and the ajk . A forth example is again motivated by the voting data. If strickt party discipline holds for some items, everyone would be voting the same for these items. Thus we would have random variables x1 = · · · xk with probability p1 = · · · = pk and the other random variables could be random with probability pi , i > k. The probability distribution is then x

k+1 · · · pxd d . p(x) = δ(x2 = x1 ) · · · δ(xk = x1)px1 1 · pk+1

Again, we leave the further analysis of this example to the reader. Hint: the support for any az with not components larger than k set but at least one below is p1 , independant of how many components are set.

58

CHAPTER 3. ASSOCIATION RULES

The aim of association rule mining is to find itemsets which occur more frequently based on an underlying grouping of items into larger blocks of items which are frequently purchased together. In [3] the authors present a model for market basket data which is used to generate synthetic data. It is based on sets of items which are frequently purchased together. We will not discuss this further at this stage.

3.8

Rules, their support and their confidence

The aim of association rule discovery is the derivation of if-then-rules based on the predicates ax defined in the previous subsection. An example of such a rule is “if a market basket contains orange juice then it also contains bread”. One possible way to interpret such a rule is to identify it with the predicate “either the market basket contains bread or it doesn’t contain orange juice”. The analysis could then determine the support of this predicate. However, this is not a predicate in the class we have defined in the previous section. Of course it could be determined from the s(ax ). However, this measure also is not directly interpretable and thus not directly applicable in retail. Association rules do not make the above identification but treat the rule a⇒b as an (ordered) pair of predicates (a, b) and characterise this pair by its support s(a ⇒ b) := s(a ∧ b) and its confidence c(a ⇒ b) :=

s(a ∧ b) . s(a)

The support does not require further discussion but the confidence is a measure of how well the predicate a is able to predict the predicate b. More specifically, the confidence is the conditional probability that a market basket contains the items in the item set b given that it contains the item set a. While the disadvantage of this approach is that it needs two numbers for each rule, the determination of these numbers is straight-forward as the first number is just a support of an item set and the second number is the quotient of two supports. Thus one way to determine these numbers and

3.8. RULES, THEIR SUPPORT AND THEIR CONFIDENCE

59

all rules for which both numbers are larger than given bounds is to first determine all item sets with large enough supports, the frequent item sets and within these search for rules with high confidence. These rules are called the strong rules. This is indeed the approach taken by many algorithms and we will discuss this further in later sections. For now we note that in the case of the predicates defined by the market baskets ω one has c(ax ⇒ ay ) = F δ (y|x) where F δ (y|x) = F δ (y ∧ x)/F δ (x) is a conditional cumulative distribution function and ∧ denotes the glb. The relationship between conditional probability and confidence show one case where one can tell the confidence with having to compute a quotient, namely, when x and y are two independent samples such that F δ (y ∧ x) = F δ (y)F δ (x) and it follows that c(ax ⇒ ay ) = F δ (y). With other words, the proportion of the events containing x which also contain y is the same as the proportion of events which contain y overall. Thus the restriction to ω does not provide any advantage. A simple way to check for this situation is the lift coefficient c(a ⇒ b) γ(a ⇒ b) := s(b) which is one for independent a and b and less otherwise. Now there are several problems with strong association rules which have been addressed in the literature: • The straight-forward interpretation of the rules may lead to wrong inferences. • The number of strong rules found can be very small. • The number of strong rules can be very large. • Most of the strong rules found are expected and do not lead to new insights. • The strong rules found do not lend themselves to any actions and are hard to interpret. We will address various of these challenges in the following. At this stage the association rule mining problem consists of the following: Find all strong association rules in a given data set D.

60

CHAPTER 3. ASSOCIATION RULES

3.9

Apriori Algorithm

In this section we will discuss the classical Apriori algorithm as suggested by Agrawal et al. in [3] and some of the most important derived algorithms.

3.9.1

Time complexity – computing supports

The data is a sequence x(1) , . . . , x(n) of binary vectors. We can thus represent the Pn data(i) as a n by d binary matrix. The number of nonzero elements is i=1 |x |. This is approximated by the expected length E(|x|) times n. So the proportion of nonzero elements is E(|x|)/d. This can be very small, especially for the case of market baskets, where out of the 10,000 items usually less than 100 items are purchased. Thus less than one percent of all the items are nonzero. In this case it makes sense to store the matrix in a sparse format. Here we will consider two ways to store the matrix, either by rows or by columns. The matrix corresponding to an earlier example is   1 1 0 0 0 1 0 1 1 0    1 0 0 0 1  .   1 1 0 1 0  0 0 0 0 1 First we discuss the horizontal organisation. A row is represented simply by the indices of the nonzero elements. And the matrix is represented as a tuple of rows. For example, the above matrix is represented as £ ¤ (1, 2) (1, 3, 4) (1, 5) (1, 2, 4) (5) In practice we need in addition to the indices of the nonzero elements in each row we need pointers which tell us where the row starts if contiguous locations are used in memory. Now assume that we have any row x and a az and would like to find out if x supports az , i.e., if z ≤ x. If both the vectors are represented in the sparse format this means that we would like to find out if the indices of z are a subset of the indices of x. There are several different ways to do this and we will choose the one which uses an auxillary bitvector v ∈ X (in full format) which is initialised to zero. The proposed algorithm has 3 steps: 1. Extract x into v, i.e., v ← x

3.9. APRIORI ALGORITHM

61

2. Extract the value of v for the elements of z, i.e., v[z]. If they are all nonzero then z ≤ x. 3. Set v to zero again, i.e., v ← 0 We assume that the time per nonzero element for all the steps is the same τ and we get for the time: T = (2|x| + |z|)τ. Now in practice we will have to determine if az (x) holds for mk different vectors z which have all the same length k. Rather than doing the above algorithm mk times one only needs to extract x once and one thus gets the algorithm 1. Extract x into v, i.e., v ← x 2. Extract the values of v for the elements of z (j) , i.e., v[z (j) ]. If they are all nonzero then z (j) ≤ x. Do this step for j = 1, . . . , mk . 3. Set v to zero again, i.e., v ← 0 With the same assumptions as above we get (|z (j) | = k): T = (2|x| + mk k)τ. Finally, we need to run this algorithm for all the rows x(i) and we need to consider vectors z (j) of different lengths. Considering this one gets the total time n X X (2 |x(i) | + mk kn)τ T = i=1

k

and the expected time is E(T ) =

X (2E(|x|) + mk k)nτ. k

Note that the sum over k is for k between one and d but only the k for which mk > 0 need to be considered. The complexity has two parts. The first part is proportional to E(|x|)n which corresponds to the number of data points times the average complexity of each data point. This part thus encapsulates the data dependency. The second part is proportional to mk kn where the factor mk k refers to the complexity of the search space which has to be visited

62

CHAPTER 3. ASSOCIATION RULES

for each record n. For k = 1 we have m1 = 1 as we need to consider all the components. Thus the second part is larger than dnτ , in fact, we would probably have to consider all the pairs so that it would be larger than d2 nτ which is much larger than the first part as 2E(|x|) ≤ 2d. Thus the major cost is due to the search through the possible patterns and one typically has a good approximation X E(T ) ≈ mk knτ. k

An alternative is the vertical organisation where the binary matrix (or Boolean relational table) is stored column-wise. This may require slightly less storage as the row wise storage as we only needs pointers to each column and one typically has more rows than columns. In this vertical storage scheme the matrix considered earlier would be represented as £ ¤ (1, 2, 3, 4) (1, 4) (2) (2, 4) (3, 5) The savings in the vertical format however, are offset by extra costs for the algorithm as we will require an auxillary vector with n elements. For any az the algorithm considers only the columns for which the components zj are one. The algorithm determines the intersection (or elementwise product) of all the columns j with zj = 1. This is done by using the auxiliary array v which holds the current intersection. We initially set it to the first column j with zj = 1, later extract all the values at the points defined by the nonzero elements for the next column j 0 for which zj 0 = 1, then zero the original ones in v and finally set the extracted values into the v. More concisely, we have the algorithm, where xj stands for the whole column j in the data matrix. 1. Get j such that zj = 1, mark as visited 2. Extract xj into v, i.e., v ← xj 3. Repeat until no nonzero elements in z unvisited: (a) Get unvisited j such that zj = 1, mark as visited (b) Extract elements of v corresponding to xj , i.e., w ← v[xj ] (c) Set v to zero, v ← 0 (d) Set v to w, v ← w

3.9. APRIORI ALGORITHM

63

4. Get the support s(az ) = |w| So we access v three times for each column, once for the extraction of elements, once for setting it to zero and once for resetting the elements. Thus for the determination of the support of z in the data base we have the time complexity of d X n X (i) T = 3τ xj z j . j=1 i=1

A more careful analysis shows that this is actually an upper bound for the complexity. Now this is done for mk arrays z (s,k) of size k and for all k. Thus we get the total time for the determination of the support of all az to be T = 3τ

mk X d X n XX k

(i) (s,k)

xj zj

.

s=1 j=1 i=1 (i)

We can get a simple upper bound for this using xj ≤ 1 as T ≤3

X

mk knτ

k

because

Pd

(s,k) j=1 zj

= k. This is roughly 3 times what we got for the previous

(i) xj

(i)

algorithm. If the are random they all have an expectation E(xj ) which is typically much less than one and have an average expectation of E(|x|)/d. If we introduce this into the equation for T we get the approximation E(T ) ≈

3E(|x|) X mk knτ d k

which can be substantially smaller than the time for the previous algorithm. Finally, we should point out that there are many other possible algorithms and other possible data formats. Practical experience and more careful analysis shows that one method may be more suitable for one data set where the other is stronger for another data set. Thus one carefully has to consider the specifics of a data set. Another consideration is also the size k and number mk of the z considered. It is clear from the above that it is essential to carefully choose the “candidates” az for which the support will be determined. This will further be discussed in the next sections. There is one term which

64

CHAPTER 3. ASSOCIATION RULES

L4 L3 L2 L1 L0 Figure 3.15: Level sets of Boolean lattices occured in both algorithms above and which characterises the complexity of the search through multiple levels of az , it is: X C= mk k. k

We will use this constant in the discussion of the efficiency of the search procedures.

3.9.2

The algorithm

Some principles of the apriori algorithm are suggested in [2]. In particular, the authors suggest a breadth-first search algorithm and utilise the apriori principle to avoid unnecessary processing. However, the problem with this early algorithm is that it generates candidate itemsets for each record and also cannot make use of the vertical data organisation. Consequently, two groups of authors suggested at the same conference a new and faster algorithm which before each data base scan determines which candidate itemset to explore [3, 55]. This approach has substantially improved performance and is capable of utilising the vertical data organisation. We will discuss this algorithm using the currently accepted term frequent itemsets. (This was in the earlier literature called “large itemsets” or “covering itemsets”.) The apriori algorithm visits the lattice of itemsets in a level-wise fashion, see Figure 3.15 and Algorithm 1. Thus it is a breadth-first-search or BFS procedure. At each level the data base is scanned to determine the support of items in the candidate itemset Ck . Recall from the last section that the Pmajor determining parameter for the complexity of the algorithm is C = k mk k where mk = |Ck |. It is often pointed out that much of the time is spent in dealing with pairs of items. We know that m1 = d as one needs to consider all single items.

3.9. APRIORI ALGORITHM

65

Algorithm 1 Apriori C1 = A(X) is the set of all one-itemsets, k = 1 while Ck 6= ∅ do scan database to determine support of all ay with y ∈ Ck extract frequent itemsets from Ck into Lk generate Ck+1 k := k + 1. end while Furthermore, one would not have any items which alone are not frequent and so one has m2 = d(d − 1)/2. Thus we get the lower bound for C: C ≤ m1 + 2m2 = d2 . As one sees in practice that this is a large portion of the total computations one has a good approximation C ≈ d2 . Including also the dependence on the data size we get for the time complexity of apriori: T = O(d2 n). Thus we have scalability in the data size but quadratic dependence on the dimension or number of attributes. Consider the first (row-wise) storage where T ≈ d2 nτ . If we have d = 10, 000 items and n = 1, 000, 000 data records and the speed of the computations is such that τ = 10−9 the apriori algorithm would require 105 seconds which is around 30 hours, more than one day. Thus the time spent for the algorithm is clearly considerable. In case of the second (column-wise) storage scheme we have T ≈ 3E(|x|)dnτ . Note that in this case for fixed size of the market baskets the complexity is now T = O(dn). If we take the same data set as before and we assume that the average market basket contains 100 items (E(|x|) = 100) then the apriori algorithm would require only 300 seconds or five minutes, clearly a big improvement over the row-wise algorithm.

66

3.9.3

CHAPTER 3. ASSOCIATION RULES

Determination and size of the candidate itemsets

The computational complexity of the apriori algorithm is dominated by data scanning, i.e., evaluation of az (x) for data records x. WePfound earlier that the complexity can be described by the constant C = k mk k. As mk is the number ¡ ¢ of k itemsets, i.e., bitvectors where exactly k bits are one we get mk ≤ m . On the other hand the apriori algorithm will find the set Lk of k the actual frequent itemsets, thus Lk ⊂ Ck and so |Lk | ≤ mk . Thus one gets for the constant C the bounds d µ ¶ X X X d k|Lk | ≤ C = mk k ≤ k. k k k k=0 The upper bound is hopeless for any large size d and we need to get better bounds. This depends very much on how the candidate itemsets Ck are chosen. We choose C1 to be the set of all 1-itemsets, and C2 to be the set of all 2-itemsets so that we get m1 = 1 and m2 = d(d − 1)/2. The apriori algorithm determines alternatively Ck and Lk such that successively the following chain of sets is generated: C1 = L1 → C2 → L2 → C3 → L3 → C4 → · · · How should we now choose the Ck ? We know, that the sequence Lk satisfies the apriori property, which can be reformulated as Definition 6. If y is a frequent k-itemset (i.e., y ∈ Lk ) and if z ≤ y then z is a frequent |z|-itemset, i.e., z ∈ L|z| . Thus in extending a sequence L1 , L2 , . . . , Lk by a Ck+1 we can choose the candidate itemset such that the extended sequence still satisfies the apriori property. This still leaves a lot of freedom to choose the candidate itemset. In particular, the empty set would allways be admissible. We need to find a set which contains Lk+1 . The apriori algorithm chooses the largest set Ck+1 which satisfies the apriori condition. But is this really necessary? It is if we can find a data set for which the extended sequence is the set of frequent itemsets. This is shown in the next proposition: Proposition 5. Let L1 , . . . , Lm be any sequence of sets of k-itemsets which satisfies the apriori condition. Then there exists a dataset D and a σ > 0 such that the Lk are frequent itemsets for this dataset with minimal support σ.

3.9. APRIORI ALGORITHM

67

S Proof. Set x(i) ∈ S k Lk , i = 1, . . . , n to be sequence of all maximal itemsets, i.e., for any z ∈ k Lk there is an x(i) such that z ≤ x(i) and x(i) 6≤ x(j) for i 6= j. Choose σ = 1/n. Then the Lk are the sets of frequent itemsets for this data set. For any collection Lk there might be other data sets as well, the one chosen above is the minimal one. The sequence of the Ck is now characterised by: 1. C1 = L1 2. If y ∈ Ck and z ≤ y then z ∈ Ck0 where k 0 = |z|. In this case we will say that the sequence Ck satisfies the apriori condition. It turns out that this characterisation is strong enough to get good upper bounds for the size of mk = |Ck |. However, before we go any further in the study of bounds for |Ck | we provide a construction of a sequence Ck which satisfies the apriori condition. A first method uses Lk to construct Ck+1 which it chooses to be the maximal set such that the sequence L1 , . . . , Lk , Ck+1 satisfies the apriori property. One can see by induction that then the sequence C1 , . . . , Ck+1 will also satisfy the apriori property. A more general approach constructs Ck+1 , . . . , Ck+p sucht that L1 , . . . , Lk ,Ck+1 , . . . , Ck+p satisfies the apriori property. As p increases the granularity gets larger and this method may work well for larger itemsets. However, choosing larger p also amounts to larger Ck and thus some overhead. We will only discuss the case of p = 1 here. The generation of Ck+1 is done in two steps. First a slightly larger set is constructed and then all the elements which break the apriori property are removed. For the first step the join operation is used. To explain join let the elements of L1 (the atoms) be enumerated as e1 , . . . , ed . Any itemset can then be constructed as join of these atoms. We denote a general itemset by e(j1 , . . . , jk ) = ej1 ∨ · · · ∨ ejk where j1 < j2 < · · · < jk . The join of any k-itemset with itself is then defined as Lk o n Lk := {e(j1 , . . . , jk+1 ) | e(j1 , . . . , jk ) ∈ Lk ,

e(j1 , . . . , jk−1 , jk+1 ) ∈ Lk }.

Thus Lk o n Lk is the set of all k + 1 itemsets for which 2 subsets with k items each are frequent. As this condition also holds for all elements in Ck+1 one

68

CHAPTER 3. ASSOCIATION RULES

has Ck+1 ⊂ Lk o n Lk . The Ck+1 is then obtained by removing an elements which contain infrequent subsets. For example, if (1, 0, 1, 0, 0) ∈ L2 and (0, 1, 1, 0, 0) ∈ L2 then (1, 1, 1, 0, 0) ∈ L2 o n L2 . If we also know that (1, 1, 0, 0, 0) ∈ L2 then (1, 1, 1, 0, 0) ∈ C3 . Having developed a construction for Ck we can now determine the bounds for the size of the candidate itemsets based purely on combinatorial considerations. The main tool for our discussion is the the Kruskal-Katona theorem. The proof of this theorem and much of the discussion follows closely the exposition in Chapter 5 of [16]. We will denote the set of all possible k-itemsets or bitvectors with exactly k bits as Ik . Subsets of this set are sometimes also called hypergraphs in the literature. The set of candidate itemsets Ck ⊂ Ik . Given a set of k-itemsets Ck the lower shadow of Ck is the set of all k − 1 subsets of the elements of Ck : ∂(Ck ) := {y ∈ Ik−1 | y < z for some z ∈ Ck }. This is the set of bitvectors which have k − 1 bits set at places where some z ∈ Ck has them set. The shadow ∂Ck can be smaller or larger than the Ck . In general, one has for the size |∂Ck | ≥ k independant of the size of Ck . So, for example, if k = 20 then |∂Ck | ≥ 20 even if |Ck | = 1. (In this case we actually have |∂Ck | = 20.) For example, we have ∂C1 = ∅, and |∂C2 | ≤ d. It follows now that the sequence of sets of itemsets Ck satisfies the apriori condition iff ∂(Ck ) ⊂ Ck−1 . The Kruskal-Katona Theorem provides an estimate of the size of the shadow. For our discussion we will map the bitvectors in the usual way onto integers. This is done with the mapping defined by φ(x) :=

d−1 X

2ixi .

i=0

For example, φ({1, 3, 5}) = 42. The induced order on the itemsets is then defined by y ≺ z :⇔ φ(y) < φ(z). This is the colex (or colexicographic) order. In this order the itemset {3, 5, 6, 9} ≺ {3, 4, 7, 9} as the largest items determine the order. In the lexicographic ordering the order of these two sets would be reversed.

3.9. APRIORI ALGORITHM

69 (0,0,0,1,1) (0,0,1,0,1) (0,0,1,1,0) (0,1,0,0,1) (0,1,0,1,0) (0,1,1,0,0) (1,0,0,0,1) (1,0,0,1,0) (1,0,1,0,0) (1,1,0,0,0)

3 5 6 9 10 12 17 18 20 24

Table 3.4: All bitvectors with two bits out of five set and their integer. Let [m] := {0, . . . , m − 1} and [m](k) be the set of all k-itemsets where k bits are set in the first m positions. As in the colex order any z with bits m and beyond set¡ are ¢larger than any of the elements in [m](k) this is just the set of the first m−1 bitvectors with k bits set. k We will now construct the sequence of the first m bitvectors for any m. This corresponds to the first numbers, which, in the binary representation have m ones set. Consider, for example the case of d = 5 and k = 2. For this case all the bitvectors are in table 3.4. (Printed reverse for legibility.) As before, we denote by ej the j − th atom and by e(j1 , . . . , jk ) the bitvector with bits j1 , . . . , jk ) set to one. Furthermore, we introduce the element-wise join of a bitvector and a set C of bitvectors as: C ∨ y := {z ∨ y | z ∈ C}. For 0 ≤ s ≤ ms < ms+1 < · · · < mk we introduce the following set of k-itemsets: B

(k)

(mk , . . . , ms ) :=

k [ ¡

¢ [mj ](j) ∨ e(mj+1 , . . . , mk ) ⊂ Ik .

j=s

As only term j does not contain itemsets with item mj (all the others do) the terms are pairwise disjoint and so the union contains |B

(k)

(k)

(mk , . . . , ms )| = b (mk , . . . , ms ) :=

¶ k µ X mj j=s

j

70

CHAPTER 3. ASSOCIATION RULES

k-itemsets. This set contains the first (in colex order) bitvectors with k bits set. By splitting off the last term in the union one then sees: ¡ ¢ B (k) (mk , . . . , ms ) = B (k−1) (mk−1 , . . . , ms ) ∨ emk ∪ [mk ](k) (3.1) and consequently b(k) (mk , . . . , ms ) = b(k−1) (mk−1 , . . . , ms ) + b(k) (mk ). Consider the example of table 3.4 of all bitvectors up to (1, 0, 0, 1, 0). There are 8 bitvectors for which come earlier in the colex order. The highest bit set for the largest element is bit 5. As we consider all smaller elements we need to have all two-itemsets ¡ ¢ where the 2 bits are distributed between positions 1 to 4 and there are 42 = 6 such bitvectors. The other cases have the top bit fixed ¡2¢at position 5 and the other bit is either on position 1 or two thus there are 1 = 2 bitvectors for which the top bit is fixed. Thus we get a total of µ ¶ µ ¶ 4 2 + =8 2 1 bitvectors up to (and including) bitvector (1, 0, 0, 1, 0) for which 2 bits are set. This construction is generalised in the following. In the following we will show that b(k) (mk , . . . , ms ) provides a unique representation for the integers. We will make frequent use of the identity: ¶ ¶ µ r µ X t−r+l t+1 . (3.2) −1= l r l=1 Lemma 2. For every m, k ∈ N there are numbers ms < · · · < mk such that m=

¶ k µ X mj j=s

j

(3.3)

and the mj are uniquely determined by m. Proof. The proof is by induction over m. In the case of m = 1 one sees immediately that there can only be one in the sum (3.3), thus s = k and the only choice is mk = k. Now assume that equation 3.3 is true for some m0 = m − 1. We now show uniqueness for m. We only need to show that mk is uniquely determined as

3.9. APRIORI ALGORITHM

71

the uniqueness of the ¡mk ¢other mj follows from the induction hypothesis applied 0 to m = m − m − k . So we assume a decomposition of the form 3.3 is given. Using equation 3.2 one gets: µ ¶ µ ¶ µ ¶ mk mk−1 ms m= + + ··· + k k−1 s µ ¶ µ ¶ µ ¶ mk mk − 1 mk − k + 1 ≤ + + ··· + k k−1 1 ¶ µ mk + 1 = −1 k as mk−1 ≤ mk − 1 etc. Thus we get ¶ µ ¶ µ mk mk + 1 − 1. ≤m≤ k k ¡ ¢ With other words, mk is the largest integer such that mkk ≤ m. This provides a unique characterisation of mk which proves uniqueness. Assume that the mj be constructed according to the method outlined in the first part of this proof. One can check that equation 3.3 holds for these mj using the characterisation. What remains to be shown is mj+1 > mj and using inductions, it is enough to show that mk−1 < mk . If, on the contrary, this does not hold and mk−1 ≥ mk , then one gets from 3.2: µ ¶ µ ¶ mk mk−1 m≥ + k k−1 µ ¶ µ ¶ mk mk ≥ + k k−1 ¶ µ ¶ µ ¶ µ mk − k + 1 mk − 1 mk +1 + ··· + + = 1 k−1 k µ ¶ µ ¶ µ ¶ mk mk−1 ms ≥ + + ··· + + 1 by induction hyp. k k−1 s =m+1 which is not possible.

72

CHAPTER 3. ASSOCIATION RULES

Let N (k) be the set of all k-itemsets of integers. It turns out that the B (k) occur as natural subsets of N (k) : P ¡ ¢ Theorem 5. The set B (k) (mk , . . . , ms ) consists of the first m = kj=s mjj itemsets of N (k) (in colex order). ¡ ¢ Proof. The proof is by induction over k − s. If k = s and thus m¡ =¢ mkk then the first elements of N (k) are just [mk¡](k)¢. If k > s then the first mkk elements are still [mk ](k) . The remaining m − mkk elements all contain bit mk . By the induction hypothesis the first bk−1 (mk−1 , . . . , ms ) elements containing bit mk are B k−1 (mk−1 , . . . , ms ) ∨ emk and the rest follows from 3.1. The shadow of the first k-itemsets B (k) (mk , . . . , ms ) are the first k − 1itemsets, or more precisely: Lemma 3. ∂B (k) (mk , . . . , ms ) = B (k−1) (mk , . . . , ms ). Proof. First we observe that in the case of s = k the shadow is simply set of all k − 1 itemsets: ∂[mk ]k = [mk ](k−1) . This can be used as anchor for the induction over k − s. As was shown earlier, one has in general: ¡ ¢ B (k) (mk , . . . , ms ) = [mk ]k ∪ B (k−1) (mk−1 , . . . , ms ) ∨ emk and, as the shadow is additive, as ¡ ¢ ∂B (k) (mk , . . . , ms ) = [mk ](k−1) ∪ B (k−2) (mk−1 , . . . , ms ) ∨ emk = B (k−1) (mk , . . . , ms ). Note that B (k−1) (mk−1 , . . . , ms ) ⊂ [ml ](k−1) . Recall that the shadow is important for the apriori property and we would thus like to determine the shadow, or at least its size for more arbitray kitemsets as they occur in the apriori algorithm. Getting bounds is feasible but one requires special technology to do this. This is now what we are going to develop. We would like to reduce the case of general sets of k-itemsets to the case of the previous lemma, where we know the shadow. So we would like to find a mapping which maps the set of k-itemsets to the first k itemsets in colex order without changing the size of the shadow. We will see that

3.9. APRIORI ALGORITHM

73

this can almost be done in the following. The way to move the itemsets to earlier ones (or to “compress” them) is done by moving later bits to earlier positions. So we try to get the k itemsets close to B (k) (mk , . . . , ms ) in some sense, so that the size of the shadow can be estimated. In order to simplify notation we will introduce z + ej for z ∨ ej when ej 6≤ z and the reverse operation (removing the j-th bit) by z −ej when ej ≤ z. Now we introduce compression of a bitvector as ( z − ej + ei if ei 6≤ z and ej ≤ z Rij (z) = z else Thus we simply move the bit in position j to position i if there is a bit in position j and position i is empty. If not, then we don’t do anything. So we did not change the number of bits set. Also, if i < j then we move the bit to an earlier position so that Rij (z) ≤ z. For our earlier example, when we number the bits from the right, starting with 0 we get R1,3 ((0, 1, 1, 0, 0)) = (0, 0, 1, 1, 0) and R31 ((0, 0, 0, 1, 1)) = (0, 0, 0, 1, 1). This is a “compression” as it moves a collection of k-itemsets closer together and closer to the vector z = 0 in terms of the colex order. The mapping Rij is not injective as Rij (z) = Rij (y) when y = Rij (z) and this is the only case. Now consider for any set C of −1 bitvectors the set Rij (C) ∩ C. These are those elements of C which stay in C when compressed by Rij . The compression operator for bitsets is now defined as ˜ i,j (C) = Rij (C) ∪ (C ∩ R−1 (C)). R ij Thus the points which stay in C under Rij are retained and the points which are mapped outside C are added. Note that by this we have avoided the problem with the non-injectivity as only points which stay in C can be mapped onto each other. The size of the compressed set is thus the same. However, the elements in the first part have been mapped to earlier elements in the colex order. In our earlier example, for i, j = 1, 3 we get C = {(0, 0, 0, 1, 1), (0, 1, 1, 0, 0), (1, 1, 0, 0, 0), (0, 1, 0, 1, 0)} we get ˜ i,j (C) = {(0, 0, 0, 1, 1), (0, 0, 1, 1, 0), (1, 1, 0, 0, 0), (0, 1, 0, 1, 0)}. R


Corresponding to this compression of sets we introduce a mapping R̃_{i,j} (which depends on C) by R̃_{i,j}(y) = y if R_{i,j}(y) ∈ C, and R̃_{i,j}(y) = R_{i,j}(y) otherwise. In our example this maps corresponding elements of the two sets onto each other. In preparation for the next lemma we need the following simple result:

Lemma 4. Let C be any set of k-itemsets and z ∈ C, e_j ≤ z, e_i ≰ z. Then z − e_j + e_i ∈ R̃_{i,j}(C).

Proof. There are two cases to consider:

1. Either z − e_j + e_i ∉ C, in which case z − e_j + e_i = R̃_{i,j}(z).

2. Or z − e_j + e_i ∈ C, and as z − e_j + e_i = R̃_{i,j}(z − e_j + e_i) one gets again z − e_j + e_i ∈ R̃_{i,j}(C).

The next result shows that, in terms of the shadow, the "compression" R̃_{i,j} really is a compression: the shadow of a compressed set can never be larger than the shadow of the original set. We therefore suggest calling it the compression lemma.

Lemma 5 (Compression Lemma). Let C be a set of k-itemsets. Then one has

∂R̃_{i,j}(C) ⊂ R̃_{i,j}(∂C).

Proof. Let x ∈ ∂R̃_{i,j}(C). We need to show that x ∈ R̃_{i,j}(∂C), and we will enumerate all possible cases. First notice that there exists an e_k ≰ x such that x + e_k ∈ R̃_{i,j}(C), so there is a y ∈ C such that x + e_k = R̃_{i,j}(y).

1. In the first two cases R̃_{i,j}(y) ≠ y, and so, by the definition of R_{i,j}, one has x + e_k = y − e_j + e_i for some y ∈ C with e_j ≤ y and e_i ≰ y.


(a) First consider i ≠ k. Then there is a bitvector z such that y = z + e_k and z ∈ ∂C. Thus we get x = z − e_j + e_i ∈ R̃_{i,j}(∂C), as z ∈ ∂C and by Lemma 4.

(b) Now consider i = k. In this case x + e_i = y − e_j + e_i and so x = y − e_j ∈ ∂C. As e_j ≰ x one gets x = R̃_{i,j}(x) ∈ R̃_{i,j}(∂C).

2. In the remaining cases R̃_{i,j}(y) = y, i.e., x + e_k = R̃_{i,j}(x + e_k). Thus x + e_k = y ∈ C and so x ∈ ∂C. Note that R̃_{i,j} here actually depends on ∂C!

(a) In the case where e_j ≰ x one has x = R̃_{i,j}(x) ∈ R̃_{i,j}(∂C).

(b) In the other case e_j ≤ x. We will show that x = R̃_{i,j}(x) and, as x ∈ ∂C, one gets x ∈ R̃_{i,j}(∂C).

   i. If k ≠ i then one can only have x + e_k = R̃_{i,j}(x + e_k) if either e_i ≤ x, in which case x = R̃_{i,j}(x), or x − e_j + e_i + e_k ∈ C, in which case x − e_j + e_i ∈ ∂C and so x = R̃_{i,j}(x).

   ii. Finally, if k = i, then x + e_i ∈ C and so x − e_j + e_i ∈ ∂C, thus x = R̃_{i,j}(x).

The operator R̃_{i,j} maps sets of k-itemsets onto sets of k-itemsets and does not change the number of elements in a set of k-itemsets. One now says that a set C of k-itemsets is compressed if R̃_{i,j}(C) = C for all i < j. This means that for any z ∈ C one has again R_{ij}(z) ∈ C. Now we can prove the key theorem:

Theorem 6 (Kruskal/Katona). Let k ≥ 1, A ⊂ N^(k), s ≤ m_s < ··· < m_k and |A| ≥ b^(k)(m_k, ..., m_s). Then

|∂A| ≥ b^(k−1)(m_k, ..., m_s).



Figure 3.16: Double induction (induction over k and m = |A|)

Proof. First we note that the shadow is a monotone function of the underlying set, i.e., if A_1 ⊂ A_2 then ∂A_1 ⊂ ∂A_2. From this it follows that it is enough to show that the bound holds for |A| = b^(k)(m_k, ..., m_s). Furthermore, it is sufficient to show the bound for compressed A, as compression at most reduces the size of the shadow and we are looking for a lower bound. Thus we will assume A to be compressed in the following.

The proof uses double induction over k and m = |A|. First we show that the theorem holds for the cases k = 1 for any m, and m = 1 for any k. In the induction step we show that if the theorem holds for 1, ..., k−1 and any m, and for 1, ..., m−1 and k, then it also holds for k and m; see figure 3.16.

In the case of k = 1 (as A is compressed) one has A = B^(1)(m) = {e_0, ..., e_{m−1}} and so ∂A = ∂B^(1)(m) = {0}, hence |∂A| = 1 = b^(0)(m). In the case of m = |A| = 1 one has A = B^(k)(k) = {e_0 ∨ ··· ∨ e_{k−1}}


and so ∂A = ∂B^(k)(k) = [k]^(k−1), hence |∂A| = k = b^(k−1)(k).

The key step of the proof is a partition of A into the bitvectors with bit 0 set and those for which bit 0 is not set: A = A_0 ∪ A_1, where A_0 = {x ∈ A | x_0 = 0} and A_1 = {x ∈ A | x_0 = 1}.

1. If x ∈ ∂A_0 then x + e_j ∈ A_0 for some j > 0. As A is compressed it must also contain x + e_0 = R_{0j}(x + e_j) ∈ A_1, and so x ∈ A_1 − e_0, thus |∂A_0| ≤ |A_1 − e_0| = |A_1|.

2. A special case is A = B^(k)(m_k, ..., m_s), where one has |A_0| = b^(k)(m_k−1, ..., m_s−1) and |A_1| = b^(k−1)(m_k−1, ..., m_s−1), and thus

m = b^(k)(m_k, ..., m_s) = b^(k)(m_k−1, ..., m_s−1) + b^(k−1)(m_k−1, ..., m_s−1).

3. Now partition ∂A_1 into two parts: ∂A_1 = (A_1 − e_0) ∪ (∂(A_1 − e_0) + e_0). It follows from the previous inequalities and the induction hypothesis that

|∂A_1| = |A_1 − e_0| + |∂(A_1 − e_0) + e_0| = |A_1| + |∂(A_1 − e_0)| ≥ b^(k−1)(m_k−1, ..., m_s−1) + b^(k−2)(m_k−1, ..., m_s−1) = b^(k−1)(m_k, ..., m_s)

and hence |∂A| ≥ |∂A_1| ≥ b^(k−1)(m_k, ..., m_s).
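The theorem can be checked numerically by brute force on small examples. The following sketch (our own illustration; the cascade (m_3, m_2, m_1) = (4, 2, 1) is just one admissible choice for |A| = 6) computes the shadow of a family of 3-itemsets directly and compares its size with the Kruskal–Katona lower bound b^(k−1)(m_k, ..., m_s).

```python
from itertools import combinations
from math import comb

def shadow(A):
    """All (k-1)-subsets obtained by deleting one element of a k-set in A."""
    return {s - {x} for s in A for x in s}

def b(ms, k):
    """b^(k)(m_k, ..., m_s) = sum_j C(m_j, j), j running down from k."""
    return sum(comb(m, j) for m, j in zip(ms, range(k, k - len(ms), -1)))

A = [frozenset(s) for s in [(0,1,2), (0,1,3), (1,2,4), (2,3,5), (0,4,6), (1,3,5)]]
k = 3
ms = (4, 2, 1)                 # |A| = 6 = C(4,3) + C(2,2) + C(1,1)
assert b(ms, k) == len(A)
print("|shadow(A)| =", len(shadow(A)), ">= lower bound", b(ms, k - 1))
```

For this family the shadow has 14 elements, comfortably above the bound b^(2)(4, 2, 1) = 9; equality is attained only for the initial colex segments B^(k)(m_k, ..., m_s).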

This theorem is the tool used to derive bounds for the size of future candidate itemsets based on a current itemset collection and the apriori principle.

Theorem 7 (Geerts et al 2001). Let the sequence C_k satisfy the apriori property and let |C_k| = b^(k)(m_k, ..., m_r). Then

|C_{k+p}| ≤ b^(k+p)(m_k, ..., m_r)

for all p ≤ r.


Proof. The reason for the condition on p is that the shadows are well defined. First, we choose s such that m_r ≤ r + p − 1, m_{r+1} ≤ r + 1 + p − 1, ..., m_{s−1} ≤ s − 1 + p − 1 and m_s ≥ s + p − 1. Note that s = r and s = k + 1 may be possible. Now we get an upper bound for the size |C_k|:

|C_k| = b^(k)(m_k, ..., m_r) ≤ b^(k)(m_k, ..., m_s) + \sum_{j=1}^{s−1} \binom{j+p−1}{j} = b^(k)(m_k, ..., m_s) + \binom{s+p−1}{s−1} − 1

according to a previous lemma. If the theorem does not hold then |C_{k+p}| > b^(k+p)(m_k, ..., m_r) and thus

|C_{k+p}| ≥ b^(k+p)(m_k, ..., m_r) + 1 ≥ b^(k+p)(m_k, ..., m_s) + \binom{s+p−1}{s+p−1} = b^(k+p)(m_k, ..., m_s, s+p−1).

Here we can apply the previous theorem to get a lower bound for C_k:

|C_k| ≥ b^(k)(m_k, ..., m_s, s+p−1) = b^(k)(m_k, ..., m_s) + \binom{s+p−1}{s−1}.

This, however, contradicts the upper bound obtained above, and so we must have |C_{k+p}| ≤ b^(k+p)(m_k, ..., m_r).

As a simple consequence one also gets tightness:

Corollary 1 (Geerts et al 2001). For any m and k there exists a C_k with |C_k| = m = b^(k)(m_k, ..., m_{s+1}) such that |C_{k+p}| = b^(k+p)(m_k, ..., m_{s+1}).

Proof. The C_k consists of the first m k-itemsets in the colexicographic ordering.

In practice one knows not only the size but also the contents of any C_k, and from that one can get a much better bound than the one provided by the theory. A consequence of the theorem is that for L_k with |L_k| ≤ \binom{m_k}{k} one has |C_{k+p}| ≤ \binom{m_k}{k+p}. In particular, one has C_{k+p} = ∅ for k + p > m_k.
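The bound of Theorem 7 is easy to evaluate in practice. The following sketch (function names are ours) computes the cascade representation of |L_k| and the resulting bound on the number of candidate (k+p)-itemsets.

```python
from math import comb

def cascade(m, k):
    """Greedy k-cascade representation m = C(m_k,k) + C(m_{k-1},k-1) + ...
    Returns the list [m_k, m_{k-1}, ...]."""
    ms, j = [], k
    while m > 0 and j > 0:
        mj = j - 1
        while comb(mj + 1, j) <= m:
            mj += 1
        ms.append(mj)
        m -= comb(mj, j)
        j -= 1
    return ms

def bound(ms, k, p):
    """Upper bound b^(k+p)(m_k, ...) for |C_{k+p}| if |C_k| = b^(k)(m_k, ...)."""
    return sum(comb(m, j + p) for m, j in zip(ms, range(k, k - len(ms), -1)))

# e.g. 100 frequent 2-itemsets: how many candidate 4-itemsets can there be?
k, p, n_frequent = 2, 2, 100
ms = cascade(n_frequent, k)          # 100 = C(14,2) + C(9,1)  ->  [14, 9]
print(ms, "->", bound(ms, k, p))     # bound on |C_4|: C(14,4) + C(9,3) = 1085
```

This is the kind of calculation used to decide, after a level of the apriori algorithm, whether later levels can still produce a dangerous number of candidates.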

3.9.4  Apriori Tid

The apriori algorithm discussed above computes supports of itemsets by intersecting columns. Some of these intersections are repeated over time and, in particular, entries of the Boolean matrix are revisited which have no impact on the support. The Apriori TID algorithm provides a solution to some of these problems. For computing the supports of larger itemsets it does not revisit the original table but transforms the table as it goes along. The new columns correspond to the candidate itemsets. In this way each new candidate itemset only requires the intersection of two old columns.

The following shows an example of how this works; the example is adapted from [3]. In the first row the itemsets from C_k are depicted. The minimal support is 50 percent, or 2 rows. The initial data is:

    1  2  3  4  5
    1  0  1  1  0
    0  1  1  0  1
    1  1  1  0  1
    0  1  0  0  1

Note that column (item) four is not frequent and is thus not considered for C_k. Thus one gets in Apriori TID after one step the matrix:

    (1,2)  (1,3)  (1,5)  (2,3)  (2,5)  (3,5)
      0      1      0      0      0      0
      0      0      0      1      1      1
      1      1      1      1      1      1
      0      0      0      0      1      0

Here one can see directly that the itemsets (1,2) and (1,5) are not frequent. It turns out that there remains only one candidate, namely (2,3,5), and the matrix is:

    (2,3,5)
       0
       1
       1
       0

Let z(j_1, ..., j_k) denote the elements of C_k. Then the elements of the transformed Boolean matrix are a_{z(j_1,...,j_k)}(x_i).
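A minimal sketch of this table transformation in Python follows (our own illustration; items are indexed 0, ..., d−1 here, so the surviving candidate (1,2,4) corresponds to the itemset (2,3,5) in the 1-based numbering of the example, and the full apriori pruning of candidates is omitted for brevity).

```python
from itertools import combinations

def apriori_tid(rows, min_support):
    """rows: list of Boolean tuples, one column per item, as in the example.
    Returns frequent itemsets with their absolute supports; level k+1 is
    computed by intersecting two columns of level k only."""
    n, d = len(rows), len(rows[0])
    table = {(i,): [r[i] for r in rows] for i in range(d)}
    table = {z: c for z, c in table.items() if sum(c) >= min_support * n}
    frequent = dict(table)
    while table:
        items = sorted(table)
        new = {}
        for a, b in combinations(items, 2):
            if a[:-1] == b[:-1]:                        # join step
                z = a + (b[-1],)
                col = [x and y for x, y in zip(table[a], table[b])]
                if sum(col) >= min_support * n:
                    new[z] = col
        table = new
        frequent.update(table)
    return {z: sum(c) for z, c in frequent.items()}

rows = [(1,0,1,1,0), (0,1,1,0,1), (1,1,1,0,1), (0,1,0,0,1)]
print(apriori_tid(rows, 0.5))
```

Running it on the example data reproduces the supports of the tables above, with (1,2,4) (i.e. (2,3,5)) as the only frequent 3-itemset.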


We will again use an auxiliary array v ∈ {0,1}^n. The apriori tid algorithm uses the join operation from earlier in order to construct a matrix for the frequent itemsets L_{k+1} from L_k. (This, like the previous algorithms, is assumed to be in-memory, i.e., all data is stored in memory. The case of very large data sets which do not fit into memory will be discussed later.) The key part of the algorithm, i.e., the step from k to k+1, is then:

1. select a pair of frequent k-itemsets (y, z), mark it as read
2. expand x^y = ⋀_i x_i^{y_i}, i.e., v ← x^y
3. extract elements using x^z, i.e., w ← v[x^z]
4. compress the result and reset v to zero, v ← 0

There are three major steps where the auxiliary vector v is accessed. The time complexity for this is

T = \sum_{i=1}^{n} ( 2|(x^{(i)})^y| + |(x^{(i)})^z| ) τ.

This has to be done for all elements y ∨ z with y, z ∈ L_k. Thus the average complexity is

E(T) = \sum_k 3 n m_k E(x^y) τ

for an "average" y at level k. Now for all elements y ∈ L_k the support of x^y = ⋀_i x_i^{y_i} is larger than σ, thus E(x^y) ≥ σ. So we get a lower bound for the complexity:

E(T) ≥ \sum_k 3 n m_k σ τ.

We can also obtain a simple upper bound if we observe that E(x^y) ≤ E(|x|)/d, which is true "on average". From this we get

E(T) ≤ \sum_k 3 n m_k (E(|x|)/d) τ.

Another approximation (typically a lower bound) is obtained if we assume that the components of x are independent. In this case E(x^y) ≈ (E(|x|)/d)^k and thus

E(T) ≥ \sum_k 3 n m_k (E(|x|)/d)^k τ.

From this we would expect that for some r_k ∈ [1, k] we get the approximation

E(T) ≈ \sum_k 3 n m_k (E(|x|)/d)^{r_k} τ.

Now recall that the original column-wise apriori implementation required

E(T) ≈ \sum_k 3 n m_k k (E(|x|)/d) τ

and so the "speedup" we can achieve by using this new algorithm is around

S ≈ \frac{\sum_k k m_k}{\sum_k m_k (E(|x|)/d)^{r_k − 1}},

which can be substantial, as both k ≥ 1 and (E(|x|)/d)^{r_k − 1} < 1. We can see that there are two reasons for the decrease in work: first, we have reused earlier computations of x_{j_1} ∧ ··· ∧ x_{j_k}, and second, we are able to make use of the lower support of the k-itemsets for larger k. While this second effect strongly depends on r_k and thus on the data, the first effect always holds, so we get a speedup of at least

S ≥ \frac{\sum_k k m_k}{\sum_k m_k},

i.e., the average size of the k-itemsets. Note that the role of the number m_k of candidate itemsets may be slightly diminished, but it is still the core parameter which determines the complexity of the algorithm, and the need to reduce the number of frequent itemsets is not diminished.

3.9.5  Constrained association rules

The number of frequent itemsets found by the apriori algorithm will often be too large or too small. While the prime mechanism for controlling the number of itemsets found is the minimal support σ, this may often not be enough, as small collections of frequent itemsets may contain mostly well known associations whereas large collections may reflect mostly random fluctuations. There are other effective ways to control the number of itemsets obtained. First, in the case of too many itemsets one can use constraints to filter out trivial or otherwise uninteresting itemsets. In the case of too few


frequent itemsets one can also change the attributes or features which define the vector x. In particular, one can introduce new "more general" attributes. For example, one might find that rules with "ginger beer" are not frequent. However, rules with "soft drinks" will have much higher support and may lead to interesting new rules. Thus one introduces new, more general items. However, including more general items while maintaining the original special items leads to duplications in the itemsets; in our example the itemset containing ginger beer and soft drinks is identical to the set which only contains ginger beer. In order to avoid this one can again introduce constraints which, in our example, would identify the itemset containing ginger beer only with the one containing soft drink and ginger beer.

Constraints are conditions on the frequent itemsets of interest. These conditions take the form "predicate = true" with some predicates b_1(z), ..., b_s(z). Thus one is looking for the frequent k-itemsets L*_k for which the b_j are true, i.e.,

L*_k := {z ∈ L_k | b_j(z) = 1}.

These constraints reduce the number of frequent itemsets which need to be further processed, but can they also assist in making the algorithms more efficient? This will be discussed next, after we have considered some examples. Note that the constraints are not necessarily simple conjunctions!

Examples:

• We have mentioned the rule that a frequent itemset should not contain an item together with its generalisation, e.g., it should not contain both soft drinks and ginger beer, as this is identical to ginger beer alone. The constraint is of the form b(x) = ¬a_y(x), where y is the itemset in which the "soft drink" and "ginger beer" bits are set.

• In some cases frequent itemsets have been well established earlier. An example are crisps and soft drinks. There is no need to rediscover this association. Here the constraint is of the form b(x) = ¬δ_y(x), where y denotes the itemset "soft drinks and crisps".

• In some cases domain knowledge tells us that some itemsets are prescribed, as in the case of a medical schedule which prescribes certain procedures to be done jointly while others should not be done jointly.


Finding these rules is not interesting. Here the constraint would exclude certain z, i.e., b(z) = ¬δ_y(z), where y is the element to exclude.

• In some cases the itemsets are related by definition. For example, the predicate defined by |z| > 2 is a consequence of |z| > 4. Having discovered the second one relieves us of the need to discover the first one. This, however, is a different type of constraint which needs to be considered when defining the search space.

A general algorithm for the determination of the L*_k determines at every step the L_k (which are required for the continuation) and from those outputs the elements of L*_k. The algorithm is exactly the same as apriori or apriori tid except that not all frequent itemsets are output; see Algorithm 2.

Algorithm 2 Apriori with general constraints
  C_1 = A(X) is the set of all one-itemsets, k = 1
  while C_k ≠ ∅ do
    scan database to determine support of all z ∈ C_k
    extract frequent itemsets from C_k into L_k
    use the constraints to extract the constrained frequent itemsets in L*_k
    generate C_{k+1}
    k := k + 1
  end while

The work is almost exactly the same as for the original apriori algorithm. Now we would like to understand how the constraints can impact the computational performance; after all, fewer rules are required in the end and the discovery of fewer rules should be faster. This, however, is not straightforward, as the constrained frequent itemsets L*_k do not necessarily satisfy the apriori property. There is, however, an important class of constraints for which the apriori property holds:

Theorem 8. If the constraints b_j, j = 1, ..., m are anti-monotone then the set of constrained frequent itemsets {L*_k} satisfies the apriori condition.

Proof. Let y ∈ L*_k and z ≤ y. As L*_k ⊂ L_k and the (unconstrained) frequent itemsets L_k satisfy the apriori condition, one has z ∈ L_{size(z)}. As the b_j are antimonotone and y ∈ L*_k one has

b_j(z) ≥ b_j(y) = 1


and so b_j(z) = 1, from which it follows that z ∈ L*_{size(z)}.

When the apriori condition holds one can generate the candidate itemsets C_k in the (constrained) apriori algorithm from the sets L*_k instead of from the larger L_k. However, the constraints need to be anti-monotone. We know that constraints of the form a_{z^(j)} are monotone and thus constraints of the form b_j = ¬a_{z^(j)} are antimonotone. Such constraints say that a certain combination of items should not occur in the itemset. An example of this is the case of ginger beer and soft drinks. Thus we will in general obtain simpler frequent itemsets if we apply such a rule. Note that itemsets have played three different roles so far:

1. as data points x^(i),
2. as potentially frequent itemsets z, and
3. to define constraints ¬a_{z^(j)}.

The constraints of the kind b_j = ¬a_{z^(j)} are now used to reduce the candidate itemsets C_k prior to the data base scan (this is where we save). Even better, it turns out that the conditions only need to be checked at level k = |z^(j)|. (This gives a minor additional saving.) This is summarised in the next theorem:

Theorem 9. Let the constraints be b_j = ¬a_{z^(j)} for j = 1, ..., s. Furthermore let the candidate k-itemsets for L*_k be the sets of k-itemsets

C*_k = {y ∈ I_k | if z < y then z ∈ L*_{|z|}, and b_j(y) = 1}

and a further set defined by

C̃_k = {y ∈ I_k | if z < y then z ∈ L*_{|z|}, and if |z^(j)| = k then b_j(y) = 1}.

Then C̃_k = C*_k.

Proof. We need to show that every element y ∈ C̃_k satisfies the constraints b_j(y) = 1. Remember that |y| = k. There are three cases:

• If |z^(j)| = |y| then the constraint is satisfied by definition.
• If |z^(j)| > |y| then z^(j) ≰ y and so b_j(y) = 1.


• Consider the case |z^(j)| < |y|. If b_j(y) = 0 then a_{z^(j)}(y) = 1 and so z^(j) ≤ y. As |z^(j)| < |y| it follows that z^(j) < y. Thus z^(j) ∈ L*_{|z^(j)|} and, consequently, b_j(z^(j)) = 1, i.e., z^(j) ≰ z^(j), which is not true. It follows that in this case we have b_j(y) = 1.

From this it follows that C̃_k ⊂ C*_k. The converse is a direct consequence of the definitions of the sets.

Thus we get a variant of the apriori algorithm which checks the constraints only at one level and, moreover, does so in order to reduce the number of candidate itemsets. This is Algorithm 3; a small illustration in code follows the listing.

Algorithm 3 Apriori with antimonotone constraints
  C_1 = A(X) is the set of all one-itemsets, k = 1
  while C_k ≠ ∅ do
    extract the elements of C_k which satisfy the constraints a_{z^(j)}(x) = 0 for |z^(j)| = k and put them into C*_k
    scan database to determine support of all y ∈ C*_k
    extract frequent itemsets from C*_k into L*_k
    generate C_{k+1} (as per ordinary apriori)
    k := k + 1
  end while
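A minimal Python sketch of this constrained apriori, with constraints given as forbidden item combinations and checked only at the matching level, could look as follows (the data and function names are ours; the join step includes the usual apriori subset check).

```python
from itertools import combinations

def apriori_constrained(baskets, min_support, forbidden=()):
    """Frequent itemsets with support >= min_support which do not contain
    any of the 'forbidden' item combinations (constraints b = not a_{z^(j)}),
    each constraint being checked only at level k = |z^(j)|."""
    n = len(baskets)
    forbidden = [frozenset(f) for f in forbidden]
    items = sorted({i for b in baskets for i in b})
    candidates = [frozenset([i]) for i in items]
    frequent, k = [], 1
    while candidates:
        # apply only the constraints whose size equals the current level k
        candidates = [c for c in candidates
                      if not any(len(f) == k and f <= c for f in forbidden)]
        level = [c for c in candidates
                 if sum(c <= b for b in baskets) >= min_support * n]
        frequent += level
        level_set = set(level)
        candidates = list({a | b for a, b in combinations(level, 2)
                           if len(a | b) == k + 1
                           and all(frozenset(s) in level_set
                                   for s in combinations(a | b, k))})
        k += 1
    return frequent

baskets = [{"ginger beer", "soft drink", "crisps"},
           {"soft drink", "crisps"}, {"ginger beer", "soft drink"},
           {"crisps", "bread"}]
print(apriori_constrained(baskets, 0.5,
                          forbidden=[("ginger beer", "soft drink")]))
```

In the example the pair (ginger beer, soft drink) is pruned before its support is ever counted, exactly the saving Theorem 9 promises.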

3.9.6  Partitioned algorithms

The previous algorithms assumed that all the data was able to fit into main memory and was resident in one place. Also, the algorithm was for one processor. We will look here into partitioned algorithms which lead to parallel, distributed and out-of-core algorithms with few synchronisation points and disk scans. The algorithms have been suggested in [67]. We assume that the data is partitioned into equal parts as D = [D1 , D2 , . . . , Dp ] where D1 = (x(1) , . . . , x(n/p) ), D2 = (x(n/p+1) , . . . , x(2n/p) ), etc. While we assume equal distribution it is simple to generalise the discussions below to non-equal distributions.


In each partition D_j an estimate for the support s(a) of a predicate a can be determined; we will call this ŝ_j(a). If ŝ(a) is the estimate of the support in D then one has

ŝ(a) = (1/p) \sum_{j=1}^{p} ŝ_j(a).

This leads to a straightforward parallel implementation of the apriori algorithm, Algorithm 4.

Algorithm 4 Parallel Apriori
  C_1 = A(X) is the set of all one-itemsets, k = 1
  while C_k ≠ ∅ do
    scan database to determine support of all z ∈ C_k on each D_j and sum up the results
    extract frequent itemsets from C_k into L_k
    generate C_{k+1}
    k := k + 1
  end while

The extraction of the L_k can either be done redundantly on all the processors or on one master processor, and the result can then be communicated. The parallel algorithm also leads to an out-of-core algorithm which does the counting of the supports in blocks. One can equally develop an apriori-tid variant.

There is a disadvantage of this straightforward approach, however. It requires many synchronisation points, respectively many scans of the disk, one for each level. As disks are slow and synchronisation expensive this will cost some time. We will now discuss an algorithm suggested by [67] which substantially reduces disk scans and synchronisation points at the cost of some redundant computations.

First we observe that

min_j ŝ_j(a) ≤ ŝ(a) ≤ max_j ŝ_j(a),

which follows from the summation formula above. A consequence of this is:

Theorem 10. Each a which is frequent in D is frequent in at least one D_j.

Proof. If this did not hold for some frequent a then one would have max_j ŝ_j(a) < σ_0, where σ_0 is the threshold for frequent a. By the observation above ŝ(a) < σ_0, which contradicts the assumption that a is frequent.


Using this one gets an algorithm which, in a first step, generates the frequent k-itemsets L_{k,j} for each D_j and each k. This requires one scan of the data, or can be done on one processor per partition, respectively. The union of all these frequent itemsets is then used as the set of candidate itemsets, and the supports of all these candidates are found in a second scan of the data. The parallel variant of the algorithm is Algorithm 5; a sketch in code is given below.

Algorithm 5 Parallel Association Rules
  determine the frequent k-itemsets L_{k,j} for all D_j in parallel
  C_k^p := ⋃_{j=1}^{p} L_{k,j} and broadcast
  determine supports ŝ_k for all candidates and all partitions in parallel
  collect all the supports, sum up and extract the frequent elements from C_k^p

Note that the supports for all levels k are collected simultaneously, so only two synchronisation points are required. Also, the apriori property holds for the C_k^p:

Proposition 6. The sequence C_k^p satisfies the apriori property, i.e.,

z ∈ C_k^p and y ≤ z  ⇒  y ∈ C_{|y|}^p.

Proof. If z ∈ C_k^p and y ≤ z then there exists a j such that z ∈ L_{k,j}. By the apriori property on D_j one has y ∈ L_{|y|,j} and so y ∈ C_{|y|}^p.

In order to understand the efficiency of the algorithm one needs to estimate the size of the C_k^p. In the (computationally) best case all the frequent itemsets are identified on all the partitions and thus C_k^p = L_{k,j} = L_k. We can use any algorithm to determine the frequent itemsets on one partition and, if we assume that the algorithm is scalable in the data size, the time to determine the frequent itemsets on all processors is equal to 1/p of the time required on one processor, as each processor holds 1/p of the data. In addition we require two reads of the data base, which has an expected cost of nλτ_Disk/p, where λ is the average size of the market baskets and τ_Disk is the time for one disk access. There is also some time required for the communication, which is proportional to the size of the frequent itemsets. We leave the further analysis, which follows the same lines as our earlier analysis, to the reader at this stage.
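The two-scan structure of the partitioned algorithm can be sketched as follows (our own illustration; for brevity the locally frequent itemsets are found by enumerating subsets up to a small maximal size instead of by a full local apriori run, and max_size is an assumption of the sketch).

```python
from collections import Counter
from itertools import combinations

def local_frequent(partition, min_support, max_size=3):
    """All itemsets up to max_size that are frequent within one partition."""
    n = len(partition)
    counts = Counter(frozenset(c) for basket in partition
                     for k in range(1, max_size + 1)
                     for c in combinations(sorted(basket), k))
    return {z for z, c in counts.items() if c >= min_support * n}

def partitioned_frequent(partitions, min_support, max_size=3):
    # scan 1: locally frequent itemsets form the global candidate set C_k^p
    candidates = set().union(*(local_frequent(p, min_support, max_size)
                               for p in partitions))
    # scan 2: count the candidates over all partitions and sum up
    n = sum(len(p) for p in partitions)
    counts = Counter()
    for p in partitions:
        for basket in p:
            counts.update(z for z in candidates if z <= basket)
    return {z: c / n for z, c in counts.items() if c >= min_support * n}

D1 = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
D2 = [{"b", "c"}, {"a", "b", "c"}, {"c"}]
print(partitioned_frequent([D1, D2], min_support=0.5))
```

By Theorem 10 no globally frequent itemset can be missed in the first scan, which is what makes the second (verification) scan sufficient.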


As the partition is random, one can actually get away with determining the supports of only a small subset of C_k^p in the second scan, as one only needs the supports of those a_z for which the support has not already been determined in the first scan. One may also wish to choose the minimal support σ for the first scan slightly lower in order to further reduce the amount of recounting required in the second scan.

3.9.7  Finding strong rules

The determination of rules of the kind a_z ⇒ a_y is the last step in the apriori algorithm. As it does not require any scanning of the data it has been given less attention. However, in [3] the authors show how to apply the antimonotonicity again to obtain some gains. Remember that we are interested in strong association rules of the form a_z ⇒ a_y for which z ∨ y is a frequent itemset and for which the confidence is larger than a given threshold γ_0 > 0:

conf(a_z ⇒ a_y) = s(a_{z∨y}) / s(a_z) ≥ γ_0.

A simple procedure would now visit each frequent itemset z and look at all pairs z_1, z_2 such that z = z_1 ∨ z_2 and z_1 ∧ z_2 = 0, and consider all rules a_{z_1} ⇒ a_{z_2}. This procedure can be improved by taking the following into account:

Theorem 11. Let z = z_1 ∨ z_2 = z_3 ∨ z_4 and z_1 ∧ z_2 = z_3 ∧ z_4 = 0. If a_{z_1} ⇒ a_{z_2} is a strong association rule and z_3 ≥ z_1, then a_{z_3} ⇒ a_{z_4} is also a strong association rule.

This is basically "the apriori property for the rules" and allows substantial pruning of the tree of possible rules. The theorem is again used as a necessary condition. We start the algorithm by considering z = z_1 ∨ z_2 with 1-itemsets for z_2 and looking at all strong rules. Then, if we consider a 2-itemset for z_2, both subsets y < z_2 need to be consequents of strong rules in order for z_2 to be a candidate consequent. The consequents are thus constructed level-wise, using the fact that all their nearest neighbours (their cover, in lattice terminology) need to be consequents as well. As one is, due to the interpretability problem, mostly interested in small consequent itemsets, this is not really a big restriction. A sketch of the rule generation is given below.
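The following Python sketch (our own illustration, not the implementation of [3]) generates the strong rules from a dictionary of frequent-itemset supports, growing the consequents level-wise exactly as described above.

```python
from itertools import combinations

def strong_rules(supports, min_conf):
    """supports: dict mapping frequent itemsets (frozensets) to supports.
    A (c+1)-item consequent is only tried if all its c-subsets already
    appeared as consequents for the same frequent itemset (Theorem 11)."""
    rules = []
    for z, s_z in supports.items():
        if len(z) < 2:
            continue
        consequents = [frozenset([i]) for i in z
                       if s_z / supports[z - {i}] >= min_conf]
        rules += [(z - c, c) for c in consequents]
        c_size = 1
        while consequents and c_size < len(z) - 1:
            prev = set(consequents)
            candidates = {a | b for a, b in combinations(consequents, 2)
                          if len(a | b) == c_size + 1
                          and all(frozenset(s) in prev
                                  for s in combinations(a | b, c_size))}
            consequents = [c for c in candidates
                           if s_z / supports[z - c] >= min_conf]
            rules += [(z - c, c) for c in consequents]
            c_size += 1
    return rules

supports = {frozenset("a"): 0.6, frozenset("b"): 0.7, frozenset("c"): 0.7,
            frozenset("ab"): 0.4, frozenset("bc"): 0.5, frozenset("ac"): 0.4,
            frozenset("abc"): 0.4}
for ante, cons in strong_rules(supports, min_conf=0.7):
    print(set(ante), "->", set(cons))
```

Only the supports already collected by apriori are consulted, so no further data scan is needed, as stated in the text.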

3.10  Mining Sequences

The following is an example of how one may construct more complex structures from the simple market baskets. We consider here a special case of sequences, see [39, 1]. Let the data be of the form (x_1, ..., x_m) where each x_i is an itemset (not a component as in our earlier notation). Examples of sequences are the shopping behaviour of customers of retailers over time, or the sequence of services a patient receives over time. The focus is thus not on individual market baskets but on the customers. We do not discuss the temporal aspects, just the sequential ones.

In defining our space of features we include the empty sequence (), but no component of a sequence may be 0, i.e., x_i ≠ 0. The rationale for this is that sequences correspond to actions which occur in some order and 0 would correspond to a non-action. We are not interested in the times when a shopper went to the store and didn't buy anything at all. Any empty component itemsets in the data will also be removed.

The sequences also have an intrinsic partial ordering x ≤ y, which holds for x = (x_1, ..., x_m) and y = (y_1, ..., y_k) whenever there is a sequence 1 ≤ i_1 < i_2 < ··· < i_m ≤ k such that

x_s ≤ y_{i_s},   s = 1, ..., m.

One can now verify that this defines a partial order on the set of sequences introduced above. However, the set of sequences does not form a lattice, as there are not necessarily unique least upper or greatest lower bounds. For example, the two sequences ((0,1),(1,1),(1,0)) and ((0,1),(0,1)) have the two (joint) upper bounds ((0,1),(1,1),(1,0),(0,1)) and ((0,1),(0,1),(1,1),(1,0)), which have no common lower bound that is still an upper bound for both original sequences. This makes the search for frequent sequences somewhat harder. A small sketch of the containment test x ≤ y is given below.
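The containment test underlying this partial order (does the sequence x occur, itemset-wise, inside y?) is easy to implement greedily, as the following sketch shows (data and names are ours; itemsets are plain Python sets).

```python
def contains(y, x):
    """Is x <= y in the sequence order: the itemsets of x appear, as subsets,
    at strictly increasing positions of y?"""
    pos = 0
    for xs in x:
        while pos < len(y) and not xs <= y[pos]:
            pos += 1
        if pos == len(y):
            return False
        pos += 1
    return True

def sequence_support(data, x):
    """Fraction of data sequences that contain x."""
    return sum(contains(y, x) for y in data) / len(data)

data = [[{1}, {1, 2}, {2}],
        [{1}, {2}],
        [{2}, {1}, {2}]]
print(sequence_support(data, [{1}, {2}]))   # -> 1.0
print(sequence_support(data, [{1, 2}]))     # -> 0.333...
```

The greedy earliest-match strategy is sufficient here: if a matching of the itemsets of x into y exists at all, matching each itemset as early as possible also succeeds.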


Another difference is that the complexity of the mining task has grown considerably: with |I| items one has 2^|I| possible market baskets and thus 2^{|I|m} different sequences of length ≤ m. Thus it is essential to be able to deal with the computational complexity of this problem. Note in particular that the probability of any particular sequence is going to be extremely small. However, one will be able to make statements about the support of small subsequences, which correspond to shopping or treatment patterns.

Based on the ordering, the support set of a sequence x is the set of all sequences larger than x:

supp(x) = {y | x ≤ y}.

The probability of this set of sequences is again called the support: s(x) = P(supp(x)). It is estimated by the proportion of sequences in the data base which lie in the support set. Note that the itemsets now occur as length-1 sequences and thus the support of an itemset can be identified with the support of the corresponding 1-sequence. As our focus is now on sequences, this is different from the support we would get if we looked just at the distribution of the itemsets. The length of a sequence is the number of non-empty components.

Thus we can now define an apriori algorithm as before. It starts with the determination of all the frequent 1-sequences, which correspond to all the frequent itemsets; the first step of the sequence mining algorithm is thus just the ordinary apriori algorithm. Then the apriori algorithm continues as before, where the candidate generation step is similar but now we join any two sequences which have all components identical except for the last (non-empty) one. One gets a sequence of length m+1 from two such sequences of length m by concatenating the last component of the second sequence onto the first one. After that one still needs to check whether all subsequences are frequent in order to do some pruning.

There has been some arbitrariness in some of these choices. Alternatives choose the size of a sequence as the sum of the sizes of its itemsets. In this case the candidate generation procedure becomes slightly more complex, see [1].

items                      |  s > 0.5
f, a, c, d, g, i, m, p     |  f, c, a, m, p
a, b, c, f, l, m, o        |  f, c, a, b, m
b, f, h, j, o, w           |  f, b
b, c, k, s, p              |  c, b, p
a, f, c, e, l, p, m, n     |  f, c, a, m, p

Table 3.5: Simple data base and data base after removal of the items with less than 50% support

3.11  The FP tree algorithm

The Apriori algorithm is very effective for discovering a reasonable number of small frequent itemsets. However, it shows severe performance problems when discovering large numbers of frequent itemsets. If, for example, there are 10^6 frequent items then the set of candidate 2-itemsets contains 5·10^11 itemsets which all require testing. In addition, the Apriori algorithm has problems with the discovery of very long frequent itemsets: for the discovery of an itemset with 100 items the algorithm requires scanning the data for all of its 2^100 subsets, in 100 scans. The bottleneck of the algorithm is the creation of the candidate itemsets, more precisely the number of candidate itemsets which need to be created during the mining. The reason for this large number is that the candidate itemsets are visited in a breadth-first way.

The FP tree algorithm addresses these issues and scans the data in a depth-first way. The data is only scanned twice. In the first scan, the frequent items (or 1-itemsets) are determined. The data items are then ordered by their frequency and the infrequent items are removed. In the second scan, the data base is mapped onto a tree structure. Except for the root, all the nodes are labelled with items; each item can correspond to multiple nodes. We will explain the algorithm with the help of an example; see table 3.5 for the original data and the records restricted to the frequent items (here we look for support > 0.5).

Initially the tree consists only of the root. Then the first record is read and a path is attached to the root such that the node labelled with the first item of the record (items are ordered by their frequency) is adjacent to the root, the second item labels the next neighbour, and so on. In addition to the item, each label also contains the count 1; see Step 1 in figure 3.17. Then the second record is included such that any common prefix (in the example



Figure 3.17: Construction of the FP-Tree

the items f, c, a) is shared with the previous record and the remaining items are added in a split path. The counts in the labels of the shared prefix nodes are increased by one; see Step 2 in the figure. This is then done with all the other records until the whole data base is stored in the tree. As the most common items were ordered first, it is likely that many prefixes will be shared, which results in a substantial saving or compression of the data base. Note that no information is lost with respect to the supports. The FP tree structure is completed by adding a header table which contains all items together with pointers to their first occurrence in the tree. The other occurrences are then linked together so that all occurrences of an item can easily be retrieved; see figure 3.18.

Figure 3.18: Final FP-Tree

The FP tree never breaks a long pattern into smaller patterns, as happens when searching the candidate itemsets; long patterns can be retrieved directly from the FP tree. The FP tree also contains the full relevant information about the data base. It is compact, as all infrequent items are removed and the highly frequent items share nodes in the tree. The number of nodes is never more than the size of the data base measured as the sum of the sizes of the records, and there is anecdotal evidence that compression rates can be over 100.

The FP tree is used to find all association rules containing particular items. Starting with the least frequent items, all rules containing those items can be found simply by generating for each item the conditional data base, which consists, for each path which contains the item, of those items which


are between that item and the root. (The lower items don't need to be considered, as they are considered together with other items.) These conditional pattern bases can then again be put into FP-trees, the conditional FP-trees, and from those trees all the rules containing the previously selected item together with any other item can be extracted. If the conditional pattern base contains only one item, that item has to be the itemset. The frequencies of these itemsets can be obtained from the count labels. An additional speed-up is obtained by mining any long prefix path and the rest separately and combining the results at the end. Of course such a chain does not necessarily need to be broken into parts, as all its frequent subsets, together with their frequencies, are easily obtained directly. A sketch of the tree construction and of the extraction of conditional pattern bases in code follows.
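The following Python sketch (our own illustration, not the reference implementation from the literature) builds an FP-tree for the data of table 3.5, with counts and header links, and extracts the conditional pattern base of an item, i.e. the prefix paths leading to its nodes together with their counts. Ties in the frequency order are broken alphabetically here, so items of equal frequency may appear in a slightly different order than in figure 3.18.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count, self.children = item, parent, 0, {}

def build_fp_tree(records, min_support):
    n = len(records)
    freq = Counter(i for r in records for i in r)
    freq = {i: c for i, c in freq.items() if c > min_support * n}
    order = sorted(freq, key=lambda i: (-freq[i], i))      # most frequent first
    root, header = Node(None, None), defaultdict(list)     # header: item -> nodes
    for r in records:
        node = root
        for item in [i for i in order if i in r]:
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

def conditional_pattern_base(header, item):
    """Prefix paths from just below the root to just above 'item',
    each with the count of the item node at its end."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        base.append((list(reversed(path)), node.count))
    return base

records = ["facdgimp", "abcflmo", "bfhjow", "bcksp", "afcelpmn"]
root, header = build_fp_tree([set(r) for r in records], min_support=0.5)
print(conditional_pattern_base(header, "m"))
# -> [(['c', 'f', 'a'], 2), (['c', 'f', 'a', 'b'], 1)]
```

Up to the tie between f and c, the output reproduces the two prefix paths of the item m in figure 3.18; mining these conditional pattern bases recursively is exactly the FP-growth step described above.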


Chapter 4
Predictive Models

4.1  Introduction

The environment gives rise to some of the largest data collections available. This includes collections of topographical data for navigation, elevation data, data about mineral resources, about the biosphere, the atmosphere and the waterways, remotely sensed data, earthquake data, aeromagnetic and gravimetric data, and weather data. These collections are of major economic and social importance. In contrast to the commercial data discussed in the previous section, environmental data has often been collected with a view to prediction. This prediction includes future events like weather forecasts, the distribution of hidden quantities like mineral deposits, but also the understanding of the interrelation of various factors in the geological sciences. Typically many factors are closely interrelated and interacting, so that simple predictive procedures are seldom possible. While predictive modelling is one of the major tools in data mining, the distinction from statistics is here even less clear than in the previous section. Maybe the focus on computational aspects and on the consequences of the high dimensionality of the data, but also on a handful of effective procedures which can deal with these aspects, is a characteristic of data mining contributions to predictive modelling.

An example of a small environmental data set is given in figure 4.1. The data has been collected in order to better understand ozone levels in cities and how they develop. We have chosen to display the data as a matrix of scatterplots where the values of each variable are plotted against each other variable. From these scatterplots it becomes clear that there is some structure


Figure 4.1: Scatterplot of New York air quality data from 1973 and the same data where all the data columns have been randomly permuted


in this data. For the second part of the display in figure 4.1 we have randomly permuted the data. The "structure" we see in this permuted display is purely coincidental. One can see that, in particular, data-poor regions of the display can lead to fake structure. However, comparing the permuted data with the original data set one can still see that there is some clear signal in the data. Among other things there seem to be clear relations between ozone levels and several of the other features. These relations appear as ridges in the distribution of the data points. These ridges can often be described as functions between the observed variables. The detection of such functions is the aim of predictive modelling. As these functions are used for the prediction of "yet unseen cases" they are called predictive models.

While the choice of the variable which is predicted (the response variable) and of the variables used for prediction (the predictor variables) is often given by the application, it is important to understand other interactions between variables as well, as they can strongly influence the quality of any predictive model found. For example, one should be very careful when predicting cases where the interactions do not hold. In addition, one should also be very careful when relating predictive models to any causal relations. In fact, for any two variables there may well be "hidden causes", i.e. variables which control both. The choice of the predictor and response variables is relatively open in data mining applications.

Overall, predictive modelling is about finding predictive models which are supported by the data. We choose the supported models from a large collection of possible models, selecting the best ones and comparing several models. Data mining algorithms help unveil such regularities in the form of data concentrations or groupings which can be described more generally by functions of the form

y = f(x).

While the choice of the feature for the left-hand side may well be part of the data mining, we assume here that this so-called response variable has been determined, and we name it Y to distinguish it from the other features, called predictor variables x_i. The functions f are called predictive models as they allow the prediction of a value of y for given values of x. For example, one may be able to predict the most likely ozone concentration given the predictor variables temperature, month of the year and wind, say. However, the model only provides


a ridge of the distribution and thus the prediction necessarily has an error. The analysis of this prediction error is a core topic of both statistical analysis and machine learning.

The detection and analysis of predictive models is also part of statistics and machine learning. In statistics the aim is to find and validate such models in order to better understand the data, whereas in machine learning such models are used to imitate the mechanisms which underlie the generation of the data. In data base retrieval one is interested in quickly finding data with given characteristics; fast retrieval engines can also lead to predictive models. The efficient storage of data requires compression. Data mining contains aspects of all these disciplines: it aims at discovering very simple and fundamental models or regularities of the data, it compresses, and it provides tools for faster queries giving answers which are close to the data. The applications of predictive models are maybe even more widespread in data mining than association rules and include:

• Medicine – diagnosis, outcome analysis and best practice
• Control and model identification in engineering and manufacturing
• Public administration and actuarial studies
• Marketing and finance
• Environmental data mining
• Text and web mining
• Scientific data mining
• Fraud detection in insurance, and computer and data access

In fraud detection one is not interested in the actual functional relationships but in points which are "far away" from the functions, the outliers.

Regression refers in data mining to the determination of functions y = f(x) where the response variable y is real-valued. The predictor variables x are assumed given and can have any type. In logistic regression, such models are also used to model decision functions or data density distributions. Parametric models, in particular linear models, are immensely popular in data mining, but non-parametric models are also widely used and will be the focus of the following discussion. Data mining applications of regression pose some additional challenges when compared to smaller statistical applications:


• The data mining data sets are very large, with millions of transactions. Furthermore, they are high dimensional, including up to several thousand features.
• The models found should be understandable in the application domain.
• The data is usually collected for some other purpose, like the management of accounts. Thus carefully designed experiments are an exception in data mining studies.

This shows that data mining must pay close attention to computational efficiency, and simple models are often preferred. Furthermore, even though the data can be very large, the limitations of the information in the data have to be carefully studied, and the effect of skewed or long-tailed distributions has to be assessed.

4.2  Mathematical Framework

4.2.1  The data

The domain is split up into two parts, the domain X of the predictor and the domain Y of the response variable. Thus the variable is a pair (x, y) ∈ X × Y. The data now consists of two sequences x^(i), y^(i), i = 1, ..., n. The empirical distribution function is in this case

p(x, y) = (1/n) \sum_{i=1}^{n} δ_{x^{(i)}}(x) δ_{y^{(i)}}(y).

Either we will look for structure in this p or in a distribution of a population from which the data has been drawn.

The set of the predictor variables X is, in the simplest case, a discrete set X = {v_1, ..., v_m} or a subset of the set of real numbers, X ⊂ R. Here we will also consider tensor products of these two cases, so that x = (x_1, ..., x_d) where each of the components is either an element of a finite set (the case of categorical variables) or of a real set (the case of continuous variables). Typically, we will consider combinations of the two cases.


While the basic structure of X = X_1 × ··· × X_d is a tensor product, X will inherit structure from its components X_i. For example, if the components are all partially ordered sets, we can define a partial order on X as well. With a partial order we can, for example, introduce cumulative distribution functions. If the components are metric, the space X has an induced metric structure. This will allow us to introduce kernel density estimators for p, which are smoothed versions of the empirical distribution function. If the components are Euclidean vector spaces, so is the product. These properties are useful for least squares fitting of linear functions. The property of the tensor product to inherit structure from its components is an important tool; we will use it in the following. While tensor products, also called attribute-value pairs, are a very common case, in applications we will often see much more complex cases which include graphs and lists.

The set Y of the response variable typically has less structure. Two main cases will be considered. In the first case the set is finite, in particular Y = {0,1}; this is the case of classification. The second case is Y = R. These sets characterise some aspects of the function spaces, which lead to lattices in the first case and to vector spaces in the second. In their structure the spaces X and Y are quite different, and they also play very different roles. Predictive models consider functional dependencies of the form y = f(x), i.e., sets of the form {(x, f(x)) | x ∈ X}. Patterns of this form are the discovery target. Such a pattern is a hypersurface in the tensor product X × Y.

4.2.2  Support, misclassification rates, Bayes estimators and least squares

Similarly to the case of association rules one could consider the probability of the set defined by any function f as

s(f) = P({(x, f(x)) | x ∈ X}).

This is the probability that y = f(x) is true. Instead of this measure one uses L(f) = 1 − s(f), which is called the Bayes probability of error, Bayes error, Bayes risk or misclassification rate. Thus we have, for the probability of the complement of the set defined by any function f,

L(f) = P({(x, y) | y ≠ f(x), x ∈ X}).


We will follow this convention in the following. In the case of regression one would most likely get L(f) = 1, as the hypersurface defined by y = f(x) typically has measure 0 and thus s(f) = 0. One needs to redefine the support in this case. A natural way is to consider an ε-neighbourhood of the hypersurface, i.e., the set of (x, y) with |y − f(x)| ≤ ε for some given ε > 0. Note that this ε is another parameter (like the minimal support σ) which has to be chosen by the analyst and which can drastically influence the outcome of the analysis. Thus we get an ε-support as the measure of this set:

s_ε(f) = P({(x, y) | |y − f(x)| ≤ ε, x ∈ X}).

The classification problem in machine learning consists of finding a best possible f, i.e., one for which L(f) is minimal in a certain class of functions. In general one defines the function with maximal support and thus minimal error as

f_B = argmin_f L(f);

this is called the Bayes classifier. In the cases where P is unknown this is a purely theoretical concept. Even for many cases where P is given one needs to specify more about the set from which one chooses f in order to get an actual function. The data mining problem of finding all classifiers which have a given minimal support leads to a slightly different optimisation problem: one needs to characterise the functions f for which L(f) ≤ 1 − σ for some minimal support σ. The set of admissible f can be very large and a general characterisation can be difficult. This is why the data mining literature itself usually focusses more on the machine learning problem of finding the best, or at least one good, classifier.

In the case of the regression problem and the empirical distribution function we get the following estimate for the support:

s_ε(f) = (1/n) #{(x^(i), y^(i)) | |y^(i) − f(x^(i))| ≤ ε}.

We can also write this as a sum:

s_ε(f) = (1/n) \sum_{i=1}^{n} w((y^(i) − f(x^(i)))/ε)


where w is the characteristic function

w(z) = 1 if |z| ≤ 1,   w(z) = 0 otherwise.

Now we approximate this function by a truncated quadratic, i.e.,

w(z) ≈ (α − β z^2)_+,

where (·)_+ denotes the "positive part", which is zero if the argument is negative and equal to the argument otherwise. The constants α ≈ 0.58 and β ≈ 0.35 are chosen such that the 0th to 2nd moments are preserved. Then we get approximately

s_ε(f) ≈ (1/n) \sum_{i=1}^{n} (α − β((y^(i) − f(x^(i)))/ε)^2)_+.

We are now interested in functions for which this is large. If ε is large enough and we are close enough to the data, the maximiser of this is the least squares fit, for which

l(f) = (1/n) \sum_{i=1}^{n} (y^(i) − f(x^(i)))^2

is minimised. This is of course only well defined if the function space is "small enough" and we have enough data points. In order to avoid over-optimistic error rates, a different, independent part of the data is used for the evaluation of the error. Alternatively, in particular in the case of small data sets, one uses generalized cross validation (GCV) [36].

In many cases one can use the simple data model y = f_0(x) + e, where e is a random error which is independent of x. This models the case where the response is a function of the predictor variable perturbed by some (measurement or observational) noise. The support is in terms of e and x:

s_ε(f) = P({(x, e) | |f_0(x) + e − f(x)| ≤ ε}).

If the distribution density of x is p_x(x) and the distribution density of e is p_e(e) then

s_ε(f) = ∫ p_x(x) ∫_{f(x)−f_0(x)−ε}^{f(x)−f_0(x)+ε} p_e(e) de dx.

A simple midpoint approximation of this is

s_ε(f) ≈ 2ε ∫ p_x(x) p_e(f(x) − f_0(x)) dx.


Now, if p_e has its maximum at zero error, p_e(0), we get an upper bound for the support:

s_ε(f) ≤ 2ε p_e(0).

This bound is achieved if f(x) = f_0(x) almost everywhere. Thus the actual function f_0 has the highest support (at least in our approximation, or in the limit ε → 0). Alternatively, f_0 is also characterised as the minimiser of the expectation E((f(x) − y)^2). This is also written as the conditional expectation f_0(x) = E(y|x). Thus, in the case where the data has been generated from a function f_0, this function is both the function with the highest support and the function which minimises the expectation of the squared residual.

The straightforward determination of this function by averaging is usually not feasible, as it is unlikely that enough (or even one) data points are available for every possible value of x such that f(x) could be estimated by the mean of the corresponding values of Y. Assume for example m = 30 categorical features with 6 possible values each. Then the total number of possible values of the independent variable is 6^30 ≈ 2·10^23, which is way beyond the number of observations available. Thus smoothing, simplified and parametric models have to be used.

A seemingly simple approach to regression is to quantise or discretise Y and then use a classification technique to approximate f. This approach can make use of all the classification techniques; however, in order to get good performance one may have to use a very large number of classes. Consider for example just the approximation of a simple one-dimensional linear function. A linear model requires only 2 parameters, but a classification technique may require hundreds of parameters. Thus real models may be much simpler than their approximating (discretised) classification models. In data mining we would want to consider more complex cases, and so one really is interested in the support for families of functions rather than in the minimum of the sum of squares. This, however, has not been widely explored and we will in the following mainly consider the sum of squares as a measure.

Classification is treated in depth in the machine learning literature, see, e.g., [58]. Often classification is called pattern recognition. It is certainly one form of pattern discovery in data mining. A well-known example from


biology (again from the R package) is displayed in figure 4.2. It displays three types of Iris and some of their physical characteristics. It appears that from those characteristics it would be possible to predict the type of Iris. In particular, one can see that the predictor variable domain is partitioned by the class of the response variable.

Like associations A ⇒ B, the functions Y = f(X) provide relations between features of the data. The following is a comparison of some properties of associations and functions:

• The dependent variable Y is fixed for classification, while finding the antecedent is part of association rule mining.
• The features of association rules are predicates, which are Boolean, while in classification the features can have arbitrary types. The left-hand side Y of a classifier has values in a finite set {C_0, ..., C_m}; the most commonly discussed case is m = 2, and in this case Y is basically a predicate.
• The performance of a classifier is judged by its accuracy, rather than by support and confidence.
• One function f summarises all the data, compared to a set of association rules.
• Like in the case of association rule induction, the search for good classifiers often starts with a simple classifier, and successively more complex ones are considered until the desired performance is achieved.

Overall, there are close connections, say, between support and misclassification error, and between the features used for classification and association rules. Both association rules and classification models provide an approximation of the data: the association rules in terms of the cumulative or marginal distributions and the predictive models in terms of the ridges in the distributions. In the case where Y, and thus f(X), is Boolean, the misclassification rate can be determined from the support defined in the previous section as

L(f) = s(χ_y) + s(χ_{f(x)}) − 2 s(χ_y ∧ χ_{f(x)});

in other words, misclassification occurs if either y is true and f(x) is not, or f(x) is true and y is not. Association rule technology has been applied to classification; examples of this approach are in [28, 52, 51, 53, 57]. A small numerical illustration of the ε-support and the least squares criterion discussed above is given below.
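The following sketch (our own illustration with synthetic data drawn from the model y = f_0(x) + e) computes the empirical ε-support and the sum-of-squares criterion for a few candidate models, showing that the true f_0 has both the highest ε-support and the smallest sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)

f0 = lambda x: 2.0 * x + 1.0                 # true model y = f0(x) + e
x = rng.uniform(0.0, 1.0, size=200)
y = f0(x) + rng.normal(0.0, 0.2, size=200)

def eps_support(f, x, y, eps):
    """Empirical epsilon-support: fraction of points with |y - f(x)| <= eps."""
    return np.mean(np.abs(y - f(x)) <= eps)

def sum_of_squares(f, x, y):
    """Least squares criterion l(f)."""
    return np.mean((y - f(x)) ** 2)

candidates = {"true f0 ": f0,
              "too flat": lambda x: 1.0 * x + 1.5,
              "constant": lambda x: np.full_like(x, y.mean())}
for name, f in candidates.items():
    print(name, " s_0.3 =", round(eps_support(f, x, y, 0.3), 2),
          " l =", round(sum_of_squares(f, x, y), 3))
```

The choice of ε plays the role described in the text: a very small ε makes all models look equally unsupported, while a very large ε no longer discriminates between them.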

4.2.3  Function spaces

Let F denote the set of all admissible functions f: X → Y considered in the search. Due to the high dimension of X this space can be very large. For example, in the case where X = {0,1}^d the set X contains 2^d elements. If Y = {0,1} then there are a total of 2^{2^d} possible functions. And this is the simplest case. Often, one considers subspaces or families of subspaces of "simpler" functions. The simplest class of functions are the constant functions and we will assume that all our function spaces contain this set as a subspace. In the following we will consider several classes of function spaces.

1. A first class is obtained by partitioning the domain X as

   X = ⋃_{j=1}^{m} X_j.

   The partitions are assumed to be pairwise disjoint, X_i ∩ X_j = ∅ if i ≠ j. One then considers functions which are constant on each partition X_j, i.e., there are constants c_1, ..., c_m such that f(x) = c_j if x ∈ X_j. Thus the predictive model f can be formulated as a rule: if x ∈ X_j then f(x) = c_j. The popularity of tree-based models is not least due to the fact that the function can be described by such a simple rule. The partitions X_j themselves are based on partitions of the components X_1, ..., X_d of the tensor product X = X_1 × ··· × X_d. Thus the partitions take the form of "boxes" or tensor products X_j = X_{j1} × ··· × X_{jd}. The partitions of the components are determined during the fitting of the data, but the underlying class of potential partitions is given. In the case of categorical sets X_k they are subsets, and in the continuous case they are subintervals. Alternatively, the partitions can be based on a metric and a set of centres. In this case a partition X_j is given by the set of all points which are closer to the centre of X_j than to any other centre. This is a Voronoi partitioning (its dual is the Delaunay tessellation). This type of partitioning is used more commonly for clustering than for predictive modelling.

2. A second class of functions is characterised by a seminorm. This is given by a linear operator T and a norm ‖·‖ (typically L_2) as ‖Tf‖. One then considers functions for which the seminorm is bounded, i.e., ‖Tf‖ ≤ c for some constant c. The operator T is a differential operator and the meaning of this bound is a quantitative condition on the average smoothness of f. The operator also implies a certain differentiability, as f ∈ D(T), i.e., the function has to be in the domain of T. A consequence of this condition is that function values of neighbouring points are close. One then has to solve partial differential equations.

3. A third class of functions is defined by a set of generating functions B_i through linear combinations

   f(x) = \sum_{i=1}^{m} c_i B_i(x).

   The effect characterised by f is then written as the superposition of simpler effects. The number of basis functions can be enormous, and so the selection of the ones for which the coefficients are nonzero is the core problem. Thus even in this case the problems to be solved are typically of a nonlinear nature.

4. A final class of functions is defined directly via measures defined on x and y.

A small sketch of the first class (piecewise constant functions on a partition) is given below.
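The sketch below (our own illustration; the breakpoints defining the partition are an assumption of the example) fits a function of the first class, constant on each cell of a partition of the real line, by least squares: the best constant on a cell is simply the mean of the responses falling into it.

```python
import numpy as np

def fit_piecewise_constant(x, y, breakpoints):
    """Least squares fit of a function that is constant on each cell of the
    partition of the real line defined by the breakpoints."""
    cells = np.digitize(x, breakpoints)
    constants = {j: y[cells == j].mean() for j in np.unique(cells)}
    overall = y.mean()                       # fallback for empty cells
    def f(xnew):
        return np.array([constants.get(j, overall)
                         for j in np.digitize(xnew, breakpoints)])
    return f

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, 300)

f = fit_piecewise_constant(x, y, breakpoints=np.linspace(0, 1, 9)[1:-1])
print("residual sum of squares:", float(np.mean((y - f(x)) ** 2)))
```

The rule form of the fitted model ("if x falls in cell j then predict c_j") is exactly the interpretability advantage of partition-based models mentioned above.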

4.3  Constant Functions

4.3.1  Constant classifiers

Figure 4.2: Iris data (Edgar Anderson's Iris data): scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length and Petal.Width for the three species.

is the one which minimises the misclassification rate P(Y ≠ c). As we consider two classes 0 and 1 only, one has

   P(Y ≠ 0) = P(Y = 1) = 1 − P(Y = 0) = 1 − P(Y ≠ 1).

We now introduce p = P(Y = 1) and so the best constant classifier is

   f_B^0 = 0 if p < 0.5, and f_B^0 = 1 else.

The misclassification rate of this classifier is then L_B^0 = min(p, 1 − p). This classifier is the best overall if the data does not have any additional structure, i.e., if one has a uniform distribution. Now let us consider the constant random classifier f_R^0(x) = Z, where the random variable Z has the same probability distribution as Y but is independent of Y. That is, one has P(Z = 1) = p = 1 − P(Z = 0). The misclassification rate of this classifier is

   L_R^0 = P(Y ≠ Z) = P(Y = 1, Z = 0) + P(Y = 0, Z = 1) = p(1 − p) + (1 − p)p = 2p(1 − p).

Note that the Bayes classifier has a better rate than this classifier, i.e., L_B^0 ≤ L_R^0; however, the random classifier reproduces the population distribution, which the Bayes classifier does not. In particular, if we use the Bayes classifier to generate multiple values of Y then one always gets the majority class. In contrast, the random classifier will generate a population of points which has a similar distribution as the original points. The random classifier in particular allows one to recover the distribution parameter p.
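A minimal simulation sketch (in Python with numpy) of the two constant classifiers; the class probability p = 0.7 is assumed purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 0.7                      # assumed P(Y = 1), chosen for illustration
    n = 100_000
    y = rng.random(n) < p        # sample of the class label Y

    # Best constant (Bayes) classifier: always predict the majority class.
    f_bayes = int(p >= 0.5)
    err_bayes = np.mean(y != f_bayes)      # estimates min(p, 1 - p)

    # Constant random classifier: predict Z with the same distribution as Y,
    # drawn independently of Y.
    z = rng.random(n) < p
    err_random = np.mean(y != z)           # estimates 2 p (1 - p)

    print(err_bayes, min(p, 1 - p))        # roughly 0.3
    print(err_random, 2 * p * (1 - p))     # roughly 0.42
    print(z.mean())                        # the random classifier recovers p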

Figure 4.3: Ozone data, mean, median and best approximation (ozone concentration against temperature).

4.3.2 The best constant real functions

For a random variable Y the best constant model is the expectation E(Y) as it minimises the expected square deviation:

   µ = E(Y) = argmin_{t∈R} E((Y − t)^2).

In terms of observed data y^(1), . . . , y^(n) one could estimate µ as the minimiser of the least squares criterion

   µ̂ = argmin_{t∈R} (1/n) Σ_{i=1}^n (y^(i) − t)^2,

where the factor 1/n is not necessary but points to the interpretation of the average sum of squares as an estimate of the expected square deviation. This estimate can be seen to be the average of the observations:

   µ̂ = (1/n) Σ_{i=1}^n y^(i).

Figure 4.3 shows the example with the Ozone concentration and the mean. In order to understand how well the average approximates the random variable Y we assume that the observations are the observed values of independent, identically distributed random variables Y^(i). The mean is then also a random variable M where

   M := (1/n) Σ_{i=1}^n Y^(i).


This random variable M will be used as an approximation of the random variable Y. We will now look at some properties of this approximation. First, because of the linearity of E the expectation of M is the same as that of Y: E(M) = E(Y) = µ. As this is the case, one says that M is an unbiased estimator of Y. The variance of M is

   E(M − µ)^2 = (1/n^2) Σ_{i=1}^n Σ_{j=1}^n E((Y^(i) − µ)(Y^(j) − µ))

and as the Y^(i) were assumed to be i.i.d. one gets

   E(M − µ)^2 = (1/n) E(Y − µ)^2 = σ^2/n.

This is much smaller than the variance of Y. This does not by itself tell us whether M is a good approximation of Y; we measure the quality of this approximation with the mean squared deviation

   error = E(Y − M)^2.

Here Y corresponds to a future observation and is different and independent from all the Y^(i) but has the same distribution. Thus the two random variables Y and M have the same expectation and are independent, but one has variance σ^2 and the other variance σ^2/n. Because of the independence one gets

   E(Y − M)^2 = E(Y − µ)^2 + E(M − µ)^2.

The constant µ would be the best possible model; the first part is the deviation of the data from the best possible model and the second part is the deviation of the actual estimator from the best possible model. In our case of constant models the first part models the deviation of the data from the model and the second part models the variance of the estimator. More generally, let Ŷ be an estimator for the random variable Y and let the expectation be E(Ŷ) = µ̂. Then one gets, using only the independence of the random variables involved:

   E(Y − Ŷ)^2 = E(Y − µ)^2 + E(Ŷ − µ̂)^2 + (µ − µ̂)^2.


Here, on the right-hand side one first has the expected squared deviation of the best possible model, then the variation of the estimated model due to the variation of the data and finally the difference between the best possible model and the expected value of the estimated model. These errors are the model error, the variance and the bias. Only the second and third parts can be controlled by choosing a different estimator Ŷ. Finding the best constant approximation to Y is thus closely related to finding a good approximation to the mean µ which has both a small variance and an expectation close to the mean. We have seen that M gives an unbiased estimator; in fact, it is the best linear unbiased estimator. This is a special case of the Gauss-Markov theorem which states that the best linear unbiased estimator is the least squares fit. However, we are interested in more general estimators including biased ones. The question now is if we can reduce the variance of the estimator by introducing some bias. Obviously we can; take for example the estimator Ŷ = 0. The variance of this estimator is zero but the bias is µ. So the question is now what is the best estimator for Y given we know the observations Y^(i). Now we consider an estimator which is linear in the observations. Furthermore, at this stage there is no reason to discriminate between the various random variables, so we only need to consider an estimator which is a multiple of M, i.e., Ŷ = γM. In this case one gets for the error

   error = σ^2 + γ^2 σ^2/n + (1 − γ)^2 µ^2

and this is minimised for

   γ = µ^2 / (µ^2 + σ^2/n)

and in this case the error is

   error = σ^2 + (σ^2/n) · 1/(1 + σ^2/(nµ^2)).

So if there is a large variance compared to the mean then the second part can be reduced substantially, especially for small numbers of observations. This effect will be more important for the case of non-constant functions. While the average minimises the sum of squares, this corrected average minimises the penalised sum of squares:

   J(c) = (1/n) Σ_{i=1}^n (y^(i) − c)^2 + α c^2

where α = σ^2/(nµ^2). In practice one needs to estimate the parameter α, and one way would be to use a test data set and choose α such that the error on the test data set is minimised. In figure 4.3 the first 10 observations together with the mean and the best constant approximation are displayed. Note the difference between the mean and the best constant approximation. The mean is the constant which in a least squares sense fits the data best, whereas the best constant approximation is the constant, obtained from a function class and the data, which is expected to fit future data best. This difference is very important in all the discussion about regression models to follow. So far we have only considered estimators which are linear in the observations. An important generalisation considers estimators of the following kind:

   m + Σ_i wi y^(i)

where m is a constant which is a known approximation for µ. One can in this case again determine the coefficients wi such that the error is minimised. The mean may in many cases put too much weight on outliers and one might wish to consider taking medians. This, however, will suppress areas with large values, and one may wish to separately investigate the occurrence of such outliers, which are important for the detection of fraud, for example.
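A minimal simulation sketch of this shrinkage effect, assuming the population values µ and σ are known (in practice α would be estimated, e.g., on test data); the numbers are chosen for illustration only.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n = 10.0, 20.0, 10            # assumed population values, illustration only
    trials = 20_000
    gamma = mu**2 / (mu**2 + sigma**2 / n)   # optimal shrinkage factor derived above

    err_mean, err_shrunk = 0.0, 0.0
    for _ in range(trials):
        y = rng.normal(mu, sigma, n)         # training observations
        y_new = rng.normal(mu, sigma)        # an independent future observation
        m = y.mean()                         # ordinary average (unbiased)
        err_mean += (y_new - m) ** 2
        err_shrunk += (y_new - gamma * m) ** 2   # minimiser of the penalised sum of squares

    print(err_mean / trials)    # about sigma^2 + sigma^2/n
    print(err_shrunk / trials)  # about sigma^2 + (sigma^2/n) / (1 + sigma^2/(n*mu^2)), slightly smaller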

4.4 Tree models

As was mentioned earlier, tree models are based on a partitioning of the data set into subdomains, and constant values are chosen for the function on each subdomain. The reason why the models are called tree models is that they are generated by recursively partitioning subdomains, which leads to a hierarchy of subdomains for which the Hasse diagram of the subset relation has the form of a tree with the full domain X at the root and the final partitions Xj as leaves. Thus one can introduce a partial order on the function spaces which is induced by the partial order defined on


partitions in the sense that {X1, . . . , Xk} ≤ {Z1, . . . , Zm} if for each i there exists a j such that Xi ⊂ Zj. In this case the first partition is finer than the second partition. The search problem is now to find a partition for which the predictive accuracy is high. Note that in principle the finer the partition the better one can potentially make the fit. This does not necessarily mean, however, that any specific fitting procedure gives better results for finer partitions. This will be further discussed later on. Decision trees are used in machine learning and statistics [62, 64, 19].

4.4.1 Splitting and impurity functions

Here we consider the effect of one split into two subdomains, i.e., X = Xl ∪ Xr. This split is characterised by a predicate al : X → {0, 1} which is true on Xl and false on Xr; its complement ar = 1 − al is true on Xr. In terms of these predicates or basis functions one can write the function f defined by this split as f(x) = cl al(x) + cr ar(x). Consider the case of two classes for y and a density p(x, y). The proportion of class 1 in X is then

   p = ∫_X p(x, 1) dx,

and the proportions in the partitions are

   pl = P(Y = 1 | X ∈ Xl) = (1/P(Xl)) ∫_{Xl} p(x, 1) dx  and  pr = P(Y = 1 | X ∈ Xr) = (1/P(Xr)) ∫_{Xr} p(x, 1) dx.

Now if p > 0.5 then the optimal constant classifier has class 1 and the misclassification rate is 1 − p. The aim of using more complex classifiers is to get improved classification performance. Recall that the misclassification rate of the best constant classifier (choosing the majority class) is φ1(p) = min(p, 1 − p). After splitting we choose the best constant classifier on each of the partitions and thus the misclassification rate of this new classifier is

   s φ1(pl) + (1 − s) φ1(pr)

where s = P(Xl) is the proportion of points in the "left" partition. A splitting is of interest if it allows one to improve classification performance, i.e., if

   φ1(p) − s φ1(pl) − (1 − s) φ1(pr) > 0.


The best classifier based on the split domain cannot perform any worse than the constant classifier; however, if pl and pr are on the same side of 0.5 then the best classifier even after the split is the constant classifier, and thus the split does not improve classification performance and will be rejected. The best split will separate all the points for which the probability of being in class 1 is larger than 0.5 from the points for which the probability is less than 0.5, so, ideally:

   Xl = {x | P(Y = 1|X = x) > 0.5}, and Xr = {x | P(Y = 1|X = x) ≤ 0.5}.

Computationally, this is not tractable and one then successively approximates such an optimal split (which of course leads to nothing less than the Bayes classifier) by splits based on one variable:

   Xl = {x | P(Y = 1|Xk = xk) > 0.5}, and Xr = {x | P(Y = 1|Xk = xk) ≤ 0.5}.

The k is chosen such that the misclassification rate improvement using this split is optimised. So if, as usual, sk = P(Xl), pl,k = P(Y = 1|x ∈ Xl) and pr,k = P(Y = 1|x ∈ Xr), the best kB is

   kB = argmin_k  sk φ1(pl,k) + (1 − sk) φ1(pr,k).

This split is computationally feasible for categorical variables where for each value of xk one chooses the majority class. Applying this splitting procedure repeatedly gives ever better classification rates. In most cases the procedure will lead to good approximations of the Bayes classifier. However, it is a greedy algorithm and it can also fail. This happens for the simple case of x ∈ {0, 1}^2 and

   P(Y = 1|X = (0, 0)) = 0,  P(Y = 1|X = (1, 0)) = 1,

P (Y = 1|X = (0, 1)) = 1, P (Y = 1|X = (1, 1)) = 0.

In this case the constant classifier has a misclassification rate of 0.5 and no split along any of the axes will improve on this. However, the Bayes classifier

   f_B(x) = 0 for x = (0, 0) or x = (1, 1), and f_B(x) = 1 else

has a misclassification rate of 0. A way to avoid this situation is to allow splits which depend on two (or more) variables. Another way around this


would be to allow (at most a few) splits which do not improve classification performance. In the case of continuous variables, however, the best splits based on one variable are computationally not feasible. One then uses splits which depend on two parameters k and ξ of the form

   Xl = {x | xk > ξ}, and Xr = {x | xk ≤ ξ}.

The k and ξ are chosen such that the misclassification rate improvement using this split is optimised. So if, as usual, sk,ξ = P(Xl), pl,k,ξ = P(Y = 1|x ∈ Xl) and pr,k,ξ = P(Y = 1|x ∈ Xr), the best kB and ξB are

   (kB, ξB) = argmin_(k,ξ)  sk,ξ φ1(pl,k,ξ) + (1 − sk,ξ) φ1(pr,k,ξ).

The same remarks apply here as above; in addition there are other cases where this approach fails to lead to a good approximation of the Bayes classifier. An example is the one-dimensional case where x ∈ [0, 1] and

   P(Y = 1|x) = 0 for x < 1/3,  1 for x ∈ [1/3, 2/3],  0 for x > 2/3.

Here no single split will lead to any classifier other than the constant f_B^0(x) = 0 with misclassification rate 1/3, but the Bayes classifier

   f_B(x) = 0 for x < 1/3,  1 for x ∈ [1/3, 2/3],  0 for x > 2/3

has misclassification rate 0. In order to avoid this problem one can use other criteria than the misclassification rate to accept a splitting. In addition to φ1 other functions have been used, see figure 4.4:

• φ1(p) = min(p, 1 − p) (based on the misclassification rate)

• φ2(p) = 2p(1 − p) (Gini [20])


• φ3(p) = −p log(p) − (1 − p) log(1 − p) (entropy [62])

Figure 4.4: Three impurity functions (misclassification error, Gini, entropy).

Any split which decreases φ1 will also decrease φ2 and φ3. However, and these are the important cases, a decrease in φ2 or φ3 may occur where no decrease in φ1 happens. In the one-dimensional example considered above the value of s φ2(pl) + (1 − s) φ2(pr) can be seen as a function of ξ in figure 4.5. Clearly one sees that the function has minima for ξ = 1/3 and ξ = 2/3. Any of these choices will lead to the Bayes classifier after one more split. Other criteria have been used as well for splitting. The functions φi all measure the deviation from a "pure" one-class situation. We discussed the function φ1 which is the misclassification rate. The function φ2 is the misclassification rate of the random classifier which has the same distribution as the data, as was mentioned in the section on constant classifiers. Thus the method will find the best random classifier based on a split. The measure φ3 is the entropy. The difference between the entropy in X and in the split Xr, Xl provides information on how well the two classes are mixed in X, respectively how well they can be separated using a split. One can think of the case where points diffuse across a boundary between Xr and Xl.
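A sketch of this split search, assuming the one-dimensional example above with data simulated purely for illustration: it evaluates s φ(pl) + (1 − s) φ(pr) on a grid of thresholds for the three impurity functions. The misclassification impurity stays flat at 1/3 while Gini and entropy locate the useful splits near 1/3 and 2/3.

    import numpy as np

    def miscl(p):   return np.minimum(p, 1 - p)
    def gini(p):    return 2 * p * (1 - p)
    def entropy(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -p * np.log(p) - (1 - p) * np.log(1 - p)

    rng = np.random.default_rng(2)
    x = rng.random(50_000)                       # X uniform on [0, 1]
    y = (x >= 1/3) & (x <= 2/3)                  # P(Y=1|x) = 1 on [1/3, 2/3], 0 outside

    def split_score(xi, phi):
        # impurity after splitting at threshold xi: s*phi(p_l) + (1-s)*phi(p_r)
        left = x <= xi
        s = left.mean()
        return s * phi(y[left].mean()) + (1 - s) * phi(y[~left].mean())

    thresholds = np.linspace(0.05, 0.95, 91)
    for name, phi in [("miscl", miscl), ("gini", gini), ("entropy", entropy)]:
        scores = [split_score(t, phi) for t in thresholds]
        best = thresholds[int(np.argmin(scores))]
        print(name, "best threshold", round(best, 2), "min impurity", round(min(scores), 3))
    # miscl is (up to noise) constant at 1/3, while gini and entropy have their
    # minima near 1/3 or 2/3, as in figure 4.5.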


Figure 4.5: Value of s φ2(pl) + (1 − s) φ2(pr) as a function of the splitting parameter ξ.

As can be seen from the previous display, all the impurity functions φi are concave, i.e.,

   φ(s pl + (1 − s) pr) ≥ s φ(pl) + (1 − s) φ(pr),  for s ∈ [0, 1].

(Recall that in the examples above p = s pl + (1 − s) pr.) The functions φ2 and φ3 are in addition strictly concave, i.e.,

   φ(s pl + (1 − s) pr) > s φ(pl) + (1 − s) φ(pr),  for s ∈ (0, 1) and pl ≠ pr.

Consequently, one gets (see [66]):

Proposition 7. No splitting increases the impurity. Moreover, a splitting strictly decreases the impurities φ2 and φ3 if pl ≠ pr.

Proof. The non-increase property is due to the concavity and has been shown above. The strict decrease is a direct consequence of the strict concavity of φ2 and φ3.

4.4.2 Growing a tree

A tree is an undirected graph, i.e., a pair (V, E) where V is the set of vertices and E is the set of edges, such that there is exactly one path between any two vertices. One says that a tree is a connected acyclic undirected graph. One node of the graph is designated as the root. In our application the nodes correspond to subsets of X. These subsets form a partially ordered set and


the tree is the Hasse diagram of that partial order. We will denote trees relating to the Hasse diagram of partitionings of domains by t. We will only consider the case of binary trees where each non-leaf node has exactly two direct offspring. For this case we get a recursive definition of a tree:

Definition 7. A tree is either

• a single vertex, or

• a root vertex and two subtrees, with an edge between the root vertex and the root vertex of each subtree.

The root of the tree corresponds to the full set X and the sets Xj corresponding to the leaves of the tree form a partition of X. One obtains the partitions by repeatedly splitting subdomains into two parts using one variable at a time; the splits are chosen using an impurity measure as discussed in the previous section. An example of a partition obtained for the Iris data is shown in figure 4.6.

Figure 4.6: Domain partition for Iris data (petal length against petal width, with regions labelled setosa, versicolor and virginica).



Figure 4.7: Decision tree for Iris data.

The tree defines a piecewise constant function. The function is constant on subsets of X corresponding to the leaves of the tree. The function is evaluated recursively. If node i contains the set Xi then there is a piecewise constant function fi which is defined on Xi such that fi(x) = f(x) for x ∈ Xi. For the leaves these functions are constant and for the non-leaves they can be evaluated recursively by

   fi(x) = fl(x) if x ∈ Xl, and fi(x) = fr(x) else,

where Xl and Xr are the descendants of Xi, i.e., Xi = Xl ∪ Xr. We set X0 = X. An example of a tree for the partition in figure 4.6 is displayed in figure 4.7. The basic algorithm for the construction of the decision tree consists of repeated assessment of the purity of the nodes in each partition and decisions on which partition to split into subpartitions and how, see Algorithm 6.


Algorithm 6 Basic decision tree algorithm

   Partition(Ti)
   if all the points in Ti are of the same class then
      no further splitting
   else
      evaluate all splits of Ti
      find best split Ti = Tl ∪ Tr
      Partition(Tl)
      Partition(Tr)
   end if

For each split the data corresponding to the subdomain has to be visited. When the splitting algorithm is based on the empirical distribution of a data set with N data points the algorithm stops after at most N steps, after which no simple split will improve the impurity. (A runnable sketch of this recursive procedure is given at the end of this subsection.)

Lemma 6. The partition algorithm terminates after p steps and 0 ≤ p ≤ N.

Proof. After each splitting step the algorithm either cannot find a split to improve or it splits an earlier partition in two, thus increasing the number of partitions by one. As each partition needs to contain at least one data point there can be at most N partitions and thus at most N partitioning steps.

Often one requires that there is more than one data point per partition. In order to handle the large and growing data sizes, scalable parallel algorithms are required. An example is the SPRINT algorithm [68]. The computing platform is a "shared-nothing" parallel computer where the processors have their own memories and disks and the data is distributed evenly over these local disks. The data records are assumed to be sequences of features (x_1^(s), . . . , x_d^(s), y^(s)) for s = 1, . . . , N and are stored with some redundancy in sorted attribute lists which, for each attribute xi, consist of tuples (s, x_i^(s), y^(s)) sorted on x_i^(s). This allows the efficient evaluation of splits in the case of real attributes. This evaluation is done based on histograms of the values of y for each attribute and can be done in parallel. Hash tables are used to associate the record identifier s with both the processor and the partition, and the Gini index is used for splitting. Experiments demonstrate the scalability of this approach. In fact, SPRINT performs well with respect to three important criteria:


1. Scaleup: the data per processor is fixed and the time is studied as a function of the number of processors.

2. Speedup: for a constant data size the time is studied as a function of the number of processors.

3. Sizeup: the number of processors is fixed and the time is studied as a function of the data size N.

If the misclassification rate is determined from the data set which is used for training one gets an estimate which is systematically too low. Better estimates are obtained from the hold-out method where the data is partitioned into training data for the computation and test data for the estimation of the error. A popular alternative is generalised cross validation or GCV [36] which emulates the hold-out technique without the reduction of the size of the training data.
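The sketch announced above implements the recursive procedure of Algorithm 6 for binary classification with threshold splits and the Gini criterion. The nested-dictionary tree representation and the stopping parameter min_size are choices made here for illustration only; they are not part of SPRINT or of any particular implementation.

    import numpy as np

    def gini(y):
        # Gini impurity 2p(1-p) of a 0/1 label vector
        if len(y) == 0:
            return 0.0
        p = y.mean()
        return 2 * p * (1 - p)

    def best_split(X, y):
        # best (feature, threshold) minimising the weighted Gini impurity
        best = (None, None, gini(y))            # keep "no split" if nothing improves
        for k in range(X.shape[1]):
            for xi in np.unique(X[:, k])[:-1]:
                left = X[:, k] <= xi
                s = left.mean()
                score = s * gini(y[left]) + (1 - s) * gini(y[~left])
                if score < best[2] - 1e-12:
                    best = (k, xi, score)
        return best

    def grow(X, y, min_size=5):
        # recursive partitioning as in Algorithm 6; returns a nested dict
        if len(np.unique(y)) == 1 or len(y) <= min_size:
            return {"leaf": True, "class": int(round(y.mean()))}
        k, xi, _ = best_split(X, y)
        if k is None:
            return {"leaf": True, "class": int(round(y.mean()))}
        left = X[:, k] <= xi
        return {"leaf": False, "feature": k, "threshold": xi,
                "left": grow(X[left], y[left], min_size),
                "right": grow(X[~left], y[~left], min_size)}

    def predict(tree, x):
        while not tree["leaf"]:
            tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
        return tree["class"]

The exhaustive threshold search above is the naive variant; efficient implementations such as SPRINT replace it by the presorted attribute lists and histograms described in the text.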

4.4.3 Tree pruning

After the tree building process the tree typically contains a large number of leaves; in the extreme case one has N leaves. The corresponding model will overfit the data and needs to be simplified to reduce the variance. This can be done in the framework of linear models. Given a partition

   X = ∪_j Xj

corresponding to the leaves of a tree t, one defines the linear space of functions of the form

   f(x) = Σ_j cj I(x ∈ Xj).

One observes that the Bayes model for the empirical distribution isn't necessarily the Bayes model for the underlying population distribution. One could now attempt to find coefficients cj such that the misclassification rate defined on a test data set is minimised. Thus the majority class of the test set is determined for each partition Xj. However, if the final splitting is fine enough, many partitions will not contain any data points at all, or very few, and one again gets a high variance. A different approach is thus used. One considers models which are obtained by fusing adjacent subsets Xj, i.e., by setting coefficients cj equal for the subsets. This amounts to undoing


certain partitioning steps and is thus called pruning. Now if one looks at the misclassification error only, any pruned tree will perform less favourably than the full tree. One then introduces a penalty on the complexity of the tree as one typically would like to have simpler trees. If the misclassification error is L(f) one introduces

   Lα(f) = L(f) + α |f|

where |f| denotes the number of leaves of the tree. (Equivalently, one could use the number of nodes #f of the tree as #f = 2|f| − 1.) For any α one now considers the problem of minimising Lα, and subtrees are now competitive. This new optimisation problem is well-posed and one has:

Theorem 12. For each α > 0 there is exactly one fα such that Lα(fα) ≤ Lα(f) for all functions f.

......
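A simple bottom-up pruning sketch based on the penalised error Lα: a subtree is collapsed whenever the leaf replacing it does not increase Lα on the data passed down to it. It reuses the nested-dictionary trees and the predict function of the earlier tree-growing sketch, assumes pruning is done on the training data, and is only an illustration of the criterion, not the weakest-link algorithm of [20].

    import numpy as np

    def leaves(tree):
        # number of leaves |f| of a nested-dict tree
        return 1 if tree["leaf"] else leaves(tree["left"]) + leaves(tree["right"])

    def errors(tree, X, y, predict):
        # number of misclassified points of the sample (X, y)
        return sum(predict(tree, x) != t for x, t in zip(X, y))

    def prune(tree, X, y, alpha, predict):
        if tree["leaf"] or len(y) == 0:
            return tree
        left = X[:, tree["feature"]] <= tree["threshold"]
        tree = dict(tree,
                    left=prune(tree["left"], X[left], y[left], alpha, predict),
                    right=prune(tree["right"], X[~left], y[~left], alpha, predict))
        leaf = {"leaf": True, "class": int(round(y.mean()))}
        n = len(y)
        # collapse if the leaf's penalised error does not exceed the subtree's
        if errors(leaf, X, y, predict) / n + alpha <= \
           errors(tree, X, y, predict) / n + alpha * leaves(tree):
            return leaf
        return tree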

4.4.4 Trees and rules

The models from (small) decision trees are understandable and can even reasonably easily be cast into standard database queries (which may be expressed in SQL, the standard for relational data base queries). They can be generated fast and show good accuracy. These properties make them well suited for data mining. Related to the understandability is the fact that one can extract if-then rules from trees. In fact, one obtains one rule for each leaf node, where the conjunction of the predicates leading to the leaf forms the antecedent of the rule and the value in the leaf partition forms the consequent. The rules for the tree of figure 4.7 are:

• If Petal.Length < 2.45 then class = setosa

• If Petal.Length ∈ [2.45, 4.95) and Petal.Width < 1.75 then class = versicolor

• If Petal.Length ≥ 4.95 and Petal.Width < 1.75 then class = virginica

• If Petal.Length ≥ 2.45 and Petal.Width ≥ 1.75 then class = virginica


Thus the consequent of any rule for a leaf partition Ti is given by class(x) = ci, where ci is the class associated with Ti (typically the majority class), and the antecedent is

   ∏_{Ti ⊂ Tj} aj(x).

This is just the product of the predicates of all the domains which contain Ti. These initial rules, however, are often not very useful and may have low support, as the supports of the antecedents of the rules are mutually exclusive. Several steps are required to create rules which are useful for data mining from the initial rules [64]:

• Remove predicates in the rule which do not improve the rule. This pruning is often done based on a χ2 test.

• Rules with overall low quality are removed and similar rules are merged.

• A default rule is required which covers all the cases not covered by other rules.

These steps do require further assessment of the quality of the generated rules and thus need further scanning of the data. As the pruning step may need to be repeated several times, the time spent on this data scanning may be substantial. See [63] for an implementation of the rule induction from decision trees.
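A short sketch of the basic rule extraction step for the nested-dictionary trees used in the earlier sketches; the feature names are supplied by the caller and the output is plain text rather than SQL, and the pruning and merging steps above are not included.

    def rules(tree, names, path=()):
        # yield (antecedent, class) pairs, one per leaf; the antecedent is the
        # conjunction of the split predicates on the path from the root to the leaf
        if tree["leaf"]:
            yield (" and ".join(path) if path else "TRUE"), tree["class"]
            return
        name, xi = names[tree["feature"]], tree["threshold"]
        yield from rules(tree["left"], names, path + (f"{name} <= {xi:g}",))
        yield from rules(tree["right"], names, path + (f"{name} > {xi:g}",))

    # e.g. for antecedent, c in rules(tree, ["Petal.Length", "Petal.Width"]):
    #          print(f"If {antecedent} then class = {c}")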

4.4.5 Regression Trees

Regression trees [19] are based on the same ideas as decision trees. They have the same advantages and, in fact, they may be used with logistic regression for the determination of classification probabilities. In contrast to many other techniques based on tensor products of one-dimensional spaces, regression trees are extremely successful in dealing with high dimensions as their complexity is proportional to the dimension. They have two shortcomings, however: they do not represent linear functions well and, more generally, they cannot approximate smooth functions accurately as they are discontinuous, piecewise constant functions. The regression tree approach is described by Breiman et al. [20]. Consider the example in figure 4.8 which shows the regression tree as generated by the routine "rpart" of R. The y-axis relates to the error reduction

related to the splitting. The temperature appears to be the main variable; wind and solar radiation play a secondary role. In figure 4.9 we plot the observed against the predicted values at the data points. We can clearly see an outlier which does not seem to fit into the model.

Figure 4.8: Regression tree for Ozone data (temperature in Celsius and wind speed in km/h); the splits use Temp, Wind and Solar.R.

Figure 4.9: Estimated vs observed values of the regression tree for Ozone data.

The trees define, as in the case of classification, a partitioning of the space of the independent variable x, and for each partition a function estimate is provided. On each subdomain a constant function approximation is used. The comments about constant function approximations made earlier apply here as well. In the simplest case the means are used: the constant

   f̄j := (1/Nj) Σ_{x^(i) ∈ Tj} y^(i)

minimises the sum of squared residuals

   Rj(f) := Σ_{x^(i) ∈ Tj} (y^(i) − f(x^(i)))^2


between constants and the data over the subdomain Tj, and for all leaves Tj one has the least squares approximation of f which is constant on Tj:

   f(x) = f̄j,  x ∈ Tj.

Thus once the partitioning of the domain is done, the determination of the "best" function is straightforward. The splitting is selected based on the reduction of the sum of squares, or, in order to adapt the complexity of the model to the data, one uses a penalised sum of squared residuals

   Rα(f) = R(f) + α |Lf|

where |Lf| is the number of leaves of f, i.e., a complexity measure, possibly using a different data set for its evaluation. For every variable there are several splittings possible and the one is selected which maximises the decrease of error ∆R. This splitting is continued until each of the leaf nodes contains at most Nmin observations. The resulting tree typically overfits the data to quite a large extent and one would like to select a subtree which is less complex. Complexity is measured by the number of leaves of the tree, i.e., |Lf|. Thus instead of minimising the sum of squares R0(f) one attempts to balance complexity with fit on the training data set by minimising

   Rα(f) = R0(f) + α |Lf|.

Note that on the right-hand side an independent data set has to be used in practice for the evaluation of R0(f). For an n-dimensional space Ω, an approximation defined by a tree with M leaves and N data points, this algorithm requires O(nNM) operations [33]. Thus the procedure is scalable, i.e., linear in the number of observations N.
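A sketch of a single regression-tree splitting step: the node mean minimises the within-node sum of squares, and the split maximising the decrease ΔR is returned. The exhaustive threshold search is the naive variant, not the bookkeeping used by efficient implementations.

    import numpy as np

    def sse(y):
        # sum of squared residuals around the mean, R_j for a node
        return float(((y - y.mean()) ** 2).sum()) if len(y) else 0.0

    def best_regression_split(X, y):
        # split (feature k, threshold xi) maximising the decrease of R
        best_k, best_xi, best_drop = None, None, 0.0
        for k in range(X.shape[1]):
            for xi in np.unique(X[:, k])[:-1]:
                left = X[:, k] <= xi
                drop = sse(y) - sse(y[left]) - sse(y[~left])
                if drop > best_drop:
                    best_k, best_xi, best_drop = k, xi, drop
        return best_k, best_xi, best_drop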

4.5 Linear and additive functionspaces

A large class of regression functions have the following form:

   f(x) = f0 + Σ_i fi(xi) + Σ_{i,j} fi,j(xi, xj) + Σ_{i,j,k} fi,j,k(xi, xj, xk) + · · ·

where the level of interactions may be determined by the data but often an upper bound is supplied by the user. It is seen that often one or two


levels are enough and hardly any cases are found with interactions of order higher than 5. The decompositions are, due to the similarity with the interaction models in ANOVA, called ANOVA decompositions.

4.5.1 Additive models

In the case where the tree of basis functions generated by MARS contains two levels, namely the root B0 ≡ 1 and its children, a model of the following form is generated:

   f(x1, . . . , xd) = f0 + Σ_{i=1}^d fi(xi).

The univariate components fi of this additive model are piecewise linear in the case of MARS. Other commonly used additive models are based on smoothing splines and local parametric (polynomial) approximations [42]. A unified treatment of all these approximations reveals that in all cases additive models are scalable with respect to the data size and, like the multivariate regression splines, conquer the curse of dimensionality. While additive models are computationally very competitive they also show good performance in practical applications. Like other regression models they provide good classification methods, especially if logistic regression is used. In figure 4.10 the output of an additive model is displayed. It shows the four component functions. For this example the plots of the estimated values and the residuals are in figure 4.11. As before in the other models one point stands out again and one might wonder if this is an outlier. In figure 4.12 we have plotted the estimates and residuals of an estimator based on a spline using all the variables, and one sees that a good proportion of the variability is explained by interactions of the components. The question is now whether we could have done with lower order interactions. Note that MARS is able to detect interactions automatically. Maybe just a combination of two of the variables is enough to be able to predict this abnormally high ozone value? So in the next figure 4.13 various models have been tested and their residuals are displayed. In these plots one can see that the interaction of the wind with either temperature or solar radiation explains much of the high ozone concentration. Only a small improvement can be found by including higher order interactions, and the folklore is that one does not need any interactions of order higher than 5.

Figure 4.10: Additive model approximation for Ozone data (component functions s(Temp), s(Wind), s(Month) and s(Solar.R)).

Figure 4.11: Additive model approximation for Ozone data (observed values and residuals against predicted values).

Figure 4.12: Additive model approximation for Ozone data using highest order interactions.

Figure 4.13: Additive model approximation for Ozone data with various interaction terms (simple additive, temp/wind, temp/solar radiation and wind/solar radiation interactions).

If the probability distribution of the random variables (X1, . . . , Xd, Y) which model the observations is known, the best (in the sense of expected squared error) approximation for the additive model is obtained when

   fi(xi) = E(Y − f0 − Σ_{k=1, k≠i}^d fk(Xk) | Xi = xi)        (4.1)

and f0 = E(Y). These equations do not have a unique solution, but uniqueness can easily be obtained if one introduces constraints like E(fi(Xi)) = 0. In practical algorithms, the estimates are approximated by smooth functions. Let fi be the vector of function values (fi(xi^(k)))_{k=1}^n. Furthermore, let Si be the matrix representing the mapping between the data and the smooth fi. The matrix Si depends on the observations xi^(1), . . . , xi^(n). Replacement of the estimation operator by the matrix Si in equation (4.1) leads to

   [ I    S1   · · ·  S1 ] [ f1 ]   [ S1 y ]
   [ S2   I    · · ·  S2 ] [ f2 ] = [ S2 y ]
   [ ...                 ] [ ...]   [ ...  ]
   [ Sd   Sd   · · ·  I  ] [ fd ]   [ Sd y ]

If the eigenvalues of the Si are in (0, 1) and this linear system of equations is nonsingular, it can be seen that the Gauss-Seidel algorithm for this system converges. This method of determining the additive model is the backfitting algorithm [42]; it has complexity O(Nqd) where q denotes the number of iteration steps. The backfitting algorithm is very general and, in fact, is used even for nonlinear smoothers. For very large data sets, however, the algorithm becomes very costly – even though it is scalable in the data size and does not suffer from the curse of dimensionality – because it needs to revisit the data qd times. The high cost of the solution of the previous linear system of equations results from the large size of its matrix. Smaller alternative systems are available for particular cases of smoothers Si. For example, in the case of regression splines one can work with the system of normal equations. Consider the case of fitting with piecewise multilinear functions. These functions include the ones used in MARS and are, on each subdomain, linear combinations of products xj1 · · · xjk where every variable xj occurs at most once.
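Before turning to the structure of the full tensor product basis, here is a minimal backfitting sketch. The univariate smoothers are deliberately crude centred least-squares line fits standing in for the spline or local smoothers of [42]; the Gauss-Seidel sweep over the components is the point of the illustration.

    import numpy as np

    def line_smoother(xi, r):
        # a very simple univariate "smoother": centred least-squares line fit of the
        # partial residual r on the single variable xi (stand-in for a spline smoother)
        xc = xi - xi.mean()
        beta = (xc @ r) / (xc @ xc)
        fit = beta * xc
        return fit - fit.mean()            # enforce E(f_i) = 0

    def backfit(X, y, n_iter=20):
        n, d = X.shape
        f0 = y.mean()
        f = np.zeros((d, n))               # current estimates of f_i at the data points
        for _ in range(n_iter):            # Gauss-Seidel sweeps over the components
            for i in range(d):
                r = y - f0 - f.sum(axis=0) + f[i]   # partial residual for component i
                f[i] = line_smoother(X[:, i], r)
        return f0, f

    # prediction at the training points: f0 + f.sum(axis=0)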


Figure 4.14: Regression spline for Ozone data (component functions against Solar.R, Wind, Temp, Month and Day, and observed against predicted values).

The basis functions of the full space of piecewise multilinear functions are products of hat functions of the form b1(x1) · · · bd(xd). For the full space, the normal matrix of the least squares fitting problem becomes sparse with nonzeros on 3^d of the diagonals. But the matrix is huge, being of order m^d if each of the d dimensions is discretised by m hat functions. This shows the computational advantage of the additive models where the dimension of the approximating function space is only m·d.

4.5.2 Regression Splines

Better approximation properties are obtained by piecewise polynomial functions. They are used in the MARS algorithm (Multivariate Adaptive Regression Splines) [33]. Instead of explicitly generating a hierarchy of domains, MARS generates a hierarchy of basis functions which implicitly define a hierarchy of domains.


Figure 4.15: Hierarchy of MARS basis functions (the constant B0(x) = 1 at the root with children B1(x), . . . , B10(x)).

For the Ozone model the regression splines find a model of the form

   f(x) = f0 + Σ_i fi(xi).

We display the non-constant components and a plot of the predicted against the actual values in figure 4.14. An example of a basis function hierarchy is displayed in figure 4.15. At the root of the tree is the constant basis function B0 ≡ 1. At each "partitioning" step two new children are generated:

   B_child1(x) = B_parent(x) (xj − ξ)_+  and  B_child2(x) = B_parent(x) (−xj + ξ)_+

where (z)_+ denotes the usual truncated linear function: it is equal to z for z > 0 and equal to zero for z ≤ 0. The parent, the variable xj and the value ξ are all chosen such that the sum of squared residuals is minimised (a sketch of this forward step is given after the following list). While each node can have offspring several times, there are some rules for the generation of this tree:

• The depth of the tree is bounded, typically by a value of 5 or less. This is thought to be sufficient for practical purposes [33] and the bound is important to control the computational work required for the determination of the function values.

• A variable xj is allowed at most once in a factor of a basis function Bk. This guarantees that the function is piecewise multilinear.
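The sketch below performs one such forward step: for every (parent basis function, variable, knot) triple it forms the pair of truncated-linear children and keeps the pair giving the smallest residual sum of squares of the least squares fit. The depth bound, the restriction on repeated variables and the fast updating of [33] are omitted for brevity; this is an illustration, not the MARS algorithm itself.

    import numpy as np

    def hinge_pair(B_parent, xj, xi):
        # the two children generated from a parent basis function
        return B_parent * np.maximum(xj - xi, 0.0), B_parent * np.maximum(xi - xj, 0.0)

    def forward_step(B, X, y):
        # try all (parent, variable, knot) triples; return the basis matrix enlarged
        # by the best hinge pair (smallest residual sum of squares of the LS fit)
        best_rss, best_cols = np.inf, None
        for parent in B.T:
            for j in range(X.shape[1]):
                for xi in np.unique(X[:, j]):
                    c1, c2 = hinge_pair(parent, X[:, j], xi)
                    Bc = np.column_stack([B, c1, c2])
                    coef, *_ = np.linalg.lstsq(Bc, y, rcond=None)
                    rss = float(((y - Bc @ coef) ** 2).sum())
                    if rss < best_rss:
                        best_rss, best_cols = rss, (c1, c2)
        return np.column_stack([B, *best_cols])

    # start from the constant basis B0 = 1:  B = np.ones((len(y), 1))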


The partitioning of the domain is defined such that on each partition the function is multilinear. Thus this partitioning does not have the same interpretative value as in the case of classification and is not recovered by the algorithm. However, the MARS method generates an ANOVA decomposition. The computational complexity of the algorithm, after some clever updating ideas have been applied, is shown in [33] to be O(d N M_max^4 / L) where d is the dimension, N the number of data points, M_max the number of basis functions considered – some might not be used later because of pruning – and L is the number of levels of the tree. Thus the algorithm is scalable, i.e., the complexity is proportional to the data size. However, the proportionality constant can be very large due to the dependence on M_max^4, which limits the number of basis functions which can be used and thus limits the approximation power of this approach. Another limitation is due to the fact that a greedy algorithm is used and the choice of basis functions may not be a global optimum. One problem with the usage of truncated powers is stability. In [5] a variant of the MARS algorithm is suggested where the truncated powers are replaced by a hierarchy of B-splines. In addition, a parallel implementation is provided.

4.6 Local models

4.6.1 Radial basis functions

In order to explore the challenge posed by the curse of dimensionality further, we will investigate radial basis function approximation. In recent years, radial basis functions have received a lot of attention both theoretically and in applications. One of their outstanding features is that they are able to approximate high-dimensional functions very effectively. Thus they seem to be able to overcome the curse of dimensionality. In the case of real attributes x ∈ R^d a radial basis function is of the form

   f(x) = Σ_{i=1}^N ci ρ(‖x − x^(i)‖) + p(x),

where the x^(i) are the data points. Examples for the function ρ include Gaussians ρ(r) = exp(−αr^2), powers and thin plate splines ρ(r) = r^β and ρ(r) = r^β ln(r) (for even integers β only), multiquadrics ρ(r) = (r^2 + c^2)^(β/2) and others. The function p(x) is typically a polynomial of low degree and in many cases it is zero. The radial basis function approach may be generalised to metric spaces where the argument of ρ is replaced by the distance d(x, xi). Reviews on radial basis function research can be found in [30, 61, 21]. Existence, uniqueness and approximation properties have been well studied. An example of a thin plate spline where ρ(t) = t^m can be seen for the Ozone data in figure 4.16. The figure shows the residuals and the GCV function estimate which is used to determine the smoothing parameter. As independent variables the temperature, solar radiation, the wind speed and the month of the year (in order to include some general weather patterns) were used.

Figure 4.16: Thin plate spline approximation for Ozone data. The fit was obtained with Tps(x = cbind(Wind, Temp, Solar.R, Month), Y = Ozone, cov.function = "Thin plate spline radial basis functions (rad.cov)"); R**2 = 90.14%, RMSE = 13.78. The panels show predicted values, standardised residuals, the GCV function against the effective number of parameters, and a histogram of the standardised residuals.

The evaluation of f(x) requires the computation of the distances between x and all the data points x^(i). Thus the time required to compute one function value is O(dN); the complexity is linear in the number of attributes d and the curse of dimensionality has been overcome. However, if many function values


need to be evaluated, this is still very expensive. Fast methods for the evaluation of radial basis functions have been studied in [6, 7, 8, 9]. For example, a multipole method which reduces the complexity to O((m + N) log(N)) has been suggested in [8, 10] for the evaluation of f(x) for m values of x. For data mining applications, which have very large N, even this is still too expensive. What is required is an approximation for which the evaluation is independent of the data size N and does not suffer under the curse of dimensionality. In the following, we will revisit the determination of the function from the data points and see how the geometry of high-dimensional spaces influences the computational costs. The vector of coefficients of the radial basis functions c = (c1, . . . , cN) and the vector of coefficients of the polynomial term d = (d1, . . . , dm) are determined, in the case of smoothing, by a linear system of equations of the form

   [ A + αI   P ] [ c ]   [ y ]
   [ P^T      0 ] [ d ] = [ 0 ].        (4.2)

The matrix A = [ρ(‖x^(i) − x^(j)‖)]_{i,j=1,...,N} has very few zero elements for the case of the thin plate splines and has to be treated as a dense matrix. However, the influence of the data points is local and mainly the observed points close to x have an influence on the value of f(x). This locality is shared with the nearest neighbor approximation techniques. However, in higher dimensions points get more sparse. For example, it is observed in [32] that the expected distance of the nearest neighbor in a d-dimensional hypercube grows like O(N^(−1/d)) with the dimension and that large numbers of data points are required to maintain a uniform coverage of the domain. In particular, a constant distance between a point and its nearest neighbor is obtained if log(N) = O(d), i.e., the number of points has to grow exponentially with the dimension. This is just another aspect of the curse of dimensionality. If the function to be approximated is smooth enough the number of points available may be sufficient even in high dimensions. But a computational difficulty appears which is related to the concentration of measure [70]. The concentration of measure basically tells us that in high dimensions the neighbors of any point are concentrated close to a sphere around that point. The effect of this on the computation is severe as the determination of a good approximant will require visiting a large number of neighboring points for each evaluation point. In an attempt to decrease the computational work, a compactly supported radial basis function may be used. However,


the support will have to be chosen such that for a very large number of points the values of the radial basis function ρ(‖x^(i) − x^(j)‖) will be nonzero. Thus the linear system of equations (4.2) will have a substantial number of nonzeros which ultimately will render the solution computationally infeasible. This is one motivation to find alternatives to radial basis functions with similar properties.
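A sketch of radial basis function smoothing as in (4.2) with a Gaussian ρ, but with the polynomial part p(x) omitted (which is permissible for the positive definite Gaussian kernel), so that only the dense system (A + αI)c = y has to be solved; the scale and α values are placeholders.

    import numpy as np

    def gaussian_rbf_fit(X, y, alpha=1e-3, scale=1.0):
        # solve (A + alpha I) c = y with A_ij = exp(-||x_i - x_j||^2 / scale^2);
        # the polynomial term p(x) of the general ansatz is dropped here
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
        A = np.exp(-d2 / scale**2)
        return np.linalg.solve(A + alpha * np.eye(len(y)), y)

    def gaussian_rbf_eval(Xnew, X, c, scale=1.0):
        d2 = ((Xnew[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / scale**2) @ c      # every evaluation visits all N centres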

4.7 Models based on Density estimators

4.7.1 Bayesian classifiers

The best classifier minimises L(f) and can be characterised in terms of the probability distribution as follows (see [25, p.10]):

Theorem 13. Let Y ∈ {0, 1} and

   f*(x) = 1 if P(Y = 1|x) ≥ 0.5, and f*(x) = 0 else.

Then L(f*) ≤ L(f).

The classifier f* minimising L(f) is called the Bayes classifier. The minimum L* = L(f*) is the Bayes error and characterises how difficult a certain classification problem is. Other measures have been considered in the literature but are seen to be closely related to the Bayes error [25, p. 21ff]. In practice, the distribution of the random variables is unknown and so the classifier has to be determined from observations. The observations are modelled as a sequence (Xi, Yi)_{i=1,...,n} of identically distributed independent random variables. Practical classifiers are then functions fn(x; X1, Y1, . . . , Xn, Yn). An alternative, more general definition of the Bayesian classifier [29] is

   f*(x) = argmax_y p(y|x)

where, as usual, p(y|x) = P(Y = y|X = x) is the conditional probability of Y being in class y when X = x. Practical estimators for f* are obtained by the introduction of estimates for the conditional probabilities p(y|x) in the above formula. The classifiers obtained in this way are called plug-in decisions in the case of Boolean y and one has [25, p.16]:


Theorem 14. For the plug-in classifier f one has P(f(X) ≠ Y) − P(f*(X) ≠ Y) ≤ 2E|η(X) − η̃(X)|, where η(x) = P(Y = 1|x), η̃ is the estimate of η and E(·) denotes the expectation.

Many estimates of the conditional probability use Bayes' formula

   p(y|x) = p(x|y) p(y) / p(x)

where p(y) is the probability of class y and p(x) is the probability density function in the feature space, in order to apply standard density estimators rather than have to estimate a family of probabilities. In the case where the features are conditionally independent random variables one gets

   p(y|x) = p(y) ∏_{i=1}^d p(xi|y) / Σ_y p(y) ∏_{i=1}^d p(xi|y).

Techniques which are based on this assumption are called naive Bayes estimators. The effectiveness of naive Bayes in the case of independent Boolean features has been demonstrated in [48] and a newer analysis can be found in [50]. It is also claimed that this approach works well even in cases where the features are not independent. Note that f* does not depend on the denominator. One obtains a discriminant function by removing the denominator and taking the logarithm:

   g(y|x) = log p(y) + Σ_{i=1}^d log p(xi|y)

and because of the monotonicity of the logarithm one gets f^b(x) = argmax_y g(y|x). The practical determination of f^b then uses estimates of the conditional probability densities p(xi|y) for all the variables xi. In the case of categorical xi the estimate used for p(xi|y) is just the frequency, and for continuous features one often uses a normal distribution for p(xi|y). In this case one gets

   g(y|x) ≈ ky + Σ_{categorical xi} k(xi|y) − Σ_{continuous xi} (xi − µi)^2 / (2σi^2).
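A sketch of the naive Bayes classifier for continuous features with Gaussian class-conditional densities; the normalising constants are kept explicitly instead of being absorbed into ky, and a small variance offset is added for numerical stability.

    import numpy as np

    def fit_gaussian_nb(X, y):
        # per class: prior p(y), and mean/variance of every continuous feature
        model = {}
        for c in np.unique(y):
            Xc = X[y == c]
            model[c] = (len(Xc) / len(y), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
        return model

    def discriminant(model, x):
        # g(y|x) = log p(y) + sum_i log p(x_i|y) for the Gaussian case
        g = {}
        for c, (prior, mu, var) in model.items():
            g[c] = np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                                + (x - mu) ** 2 / var)
        return g

    def classify(model, x):
        g = discriminant(model, x)
        return max(g, key=g.get)     # argmax_y g(y|x)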

There are two steps in the determination of f(x):


1. The determination of the probability distributions. This step only has to be done once but requires the exploration of all the data points.

2. The determination of the maximum of the discriminant function. This step has to be done for each evaluation of f(x).

The determination of p(y) requires n additions in order to count the frequencies of all classes. Only the c values of p(y) need to be stored. In order to determine the distributions p(xi|y) for each feature xi, the frequencies in the case of categorical xi require n additions; in the case of continuous variables and the normal distribution one requires 2n additions and n multiplications to determine all the variances and means. Thus the total number of floating point operations required for the first step is n + d1·n + 3·d2·n where d1 is the number of categorical variables and d2 the number of continuous variables (the dimension is thus d = d1 + d2). The evaluation of f(x) requires the computation of the discriminant function for all the c possible values of y, which requires c·d additions for the sums and 5·c·d2 floating point operations to evaluate the terms of the second sum. Thus the naive Bayes classifier is computationally very efficient, as

• the data only has to be read once and does not need to be kept in memory,

• the algorithm is scalable in the number of observations,

• the algorithm is linear in the number of dimensions, and

• in addition, the main operations of the learning stage can be done in parallel both over the test data set and over the dimensions.

However, the naive Bayes classifier does rely on two assumptions: first, it is based on the probabilistic model of the observations and second, it requires that the features are independent for each class. In practice, however, features are mostly correlated. While it has been observed that even in examples with correlations naive Bayes is often as effective as tree-based classifiers [49],

it would appear that dealing with the correlations may improve the performance even further. In [27] the region where Bayesian classifiers are optimal was provided as well as further evidence for the good performance. In [49] the Selective Bayesian Classifier is introduced which is able to remove redundant or heavily correlated features. The algorithm starts with no attributes, then adds one at a time until no more improvement is found. As at each step the effect of all the remaining attributes has to be recomputed, the complexity of this method is O(d^2). In order to estimate the errors one could consider cross-validation techniques or hold-out, but the authors have simply used the accuracy on the training set and got good results with this. They included attributes as long as they did not degrade accuracy. In view of the complexity of the selective Bayesian classifier one loses the linear dependence on the dimension, and one has to read the data several times, once for each feature included. One limitation of the approach is the usage of normal distributions for the conditional probabilities in the case of numerical variables. In [45] an approach using kernel estimators for the densities is proposed:

   p(Xi = xi | Y = y) = (nh)^(−1) Σ_j K((xi − µj)/h)

where the sum ranges over the observed values µj of the feature in class y. This increases in particular the amount of operations for the evaluation stage. It is shown that in most cases a significant improvement in the classification performance is obtained. The time and space costs are given in table 4.1.

                          Naive Bayes          Flexible Bayes
                          time      space      time      space
   training on n cases    O(nk)     O(k)       O(nk)     O(nk)
   evaluation             O(k)                 O(nk)

Table 4.1: Costs in time and space of the naive and flexible Bayes classifier

A different way to include dependencies of the features is given by Bayesian networks. A Bayesian network is a directed acyclic graph (DAG) where all the vertices are random variables Xi such that Xi and all the variables which are neither descendants of Xi nor parents of Xi are conditionally independent given the parents of Xi. The main property of Bayesian networks is that the probabilities of all the random variables can be computed almost as if they were independent [12, p.308–313] and [43, 34]:


Theorem 15. If (X, E, P) is a Bayesian network then

   P(X) = ∏_{P(C(Xi)) ≠ 0} P(Xi | C(Xi)).

Conversely, if (X, E) is a DAG and if the fi are nonnegative functions such that Σ fi(Xi) = 1, then

   P(X) = ∏_i fi(Xi)

defines a probability space for which (X, E, P) is a Bayesian network. Furthermore, P(Xi | C(Xi)) is either 0 or fi(Xi). Here C(Xi) denotes the parents (or direct causes) of Xi.

Chapter 5

Cluster Analysis

Clustering refers to the task of grouping objects into classes of similar objects [47]. Cluster analysis is important in business, for the identification of customers with similar interests for which similar solutions may be found (market segmentation) [13, p.47], but also in science, where taxonomies of animals, stars or archaeological artifacts build the foundation for the understanding of evolution and development [37, p.1]. Consider a plot of the iris data using parallel coordinates, see figure 5.1 (this example has been taken from the R manual). One can clearly see that there are three types of flowers, which of course correspond to the known types setosa, versicolor and virginica. The problem of clustering is to determine these three classes, if they were not known, from the features of the flowers. Cluster analysis has a long history both in statistical analysis and machine learning [13, 14, 37, 40, 66]. Clusters of similar objects form one type of information which is discovered in data mining and can lead to new insights and suggest further actions. However, in order to be applicable for data mining, clustering algorithms need to have the following properties, see [39]:

• scalable in order to work on large and growing data sizes,

• capable of handling a large number of attributes of various types and complex data,

• capable of handling noisy data,

• insensitive to the ordering of the data, and


• able to represent clusters of arbitrary shapes.

Figure 5.1: Parallel coordinates plot of the iris data (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).

The result of a cluster analysis is one or several subsets T' ⊂ T of the set of all possible observations. Many clustering algorithms partition the domain T into (possibly disjoint) subdomains Ti such that

   T = ∪_i Ti.

In data mining explorations, clustering is combined with classification and rule detection in a three-step procedure:

1. Determination of clusters.

2. Induction of a decision tree to predict the cluster label.

3. Extraction of rules relating to the clusters.

In order to find clusters the dissimilarity or similarity of transactions or objects needs to be assessed and most commonly this is done with metrics on feature vectors. They include:

• Minkowski distances (including Euclidean and Manhattan) for real-valued features:
$$\rho(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^q \right)^{1/q}$$

• Simple matching for categorical (and binary) features:
$$\rho(x, y) = |\{\, i \mid x_i \neq y_i \,\}| .$$

Often the distances are normalised or weighted differences are used. The four main types of clustering algorithms used in data mining are:

• Partitioning methods. If the number of clusters is known in advance, the data is partitioned into sets of similar data records. Starting with some initial partitioning (often random), data points are moved between clusters until the differences between objects within the clusters are small and the differences between elements of different clusters are large.

• Hierarchical clustering methods. These provide a hierarchical partitioning of the data, from which the clusters are determined. There are two basic approaches. In the agglomerative approach an initial partitioning is given by identifying each data point with its own cluster; clusters are then repeatedly joined together to form successively larger clusters until only one cluster is left. In the divisive approach all points are first assumed to be part of one big cluster. The partitions are then successively divided into smaller clusters until every point forms a separate cluster.

• Density-based methods. Neighboring elements are joined together using a local density measure. Often only one scan through the database is required.

• Model-based methods. Each cluster is modelled, e.g., by a simple distribution function, and the best fit of these models to the data is determined.
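As a concrete reading of the two dissimilarity measures listed above, a minimal sketch (our own illustration; the function names are ours) might look as follows:

    import numpy as np

    def minkowski(x, y, q=2):
        # Minkowski distance; q = 2 gives Euclidean, q = 1 Manhattan
        return float(np.sum(np.abs(np.asarray(x) - np.asarray(y)) ** q) ** (1.0 / q))

    def simple_matching(x, y):
        # number of categorical features in which x and y differ
        return sum(1 for xi, yi in zip(x, y) if xi != yi)

    # Hypothetical usage
    print(minkowski([5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 6.0, 2.5], q=2))
    print(simple_matching(["red", "yes", "A"], ["red", "no", "B"]))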


Many clustering algorithms have an $O(N^2)$ complexity, which results from the fact that the distances between all pairs of points are computed. There are two major approximation techniques which address this computational problem: sampling [47], where a subset of the data is selected on which the algorithm is run, and binning, which is based on a discretisation of the domain [71]. In both approaches the full dataset may be used for the evaluation of the result, in particular as this evaluation often only requires $O(N)$ work. Both approaches can lead to scalable algorithms if, in the first case, only $O(\sqrt{N})$ elements are chosen for the sample and, in the second case, the bin sizes are not chosen too large and the dimension is low. Note that binning becomes less useful in higher dimensions as it suffers from the curse of dimensionality.

5.1 Partitioning techniques

The aim of clustering is to find subsets which are more pure than the original set, in the sense that their elements are on average much more similar than the elements of the original domain. A partition $T_1, \ldots, T_k$ is represented by centroids $z_1, \ldots, z_k$ such that
$$x \in T_i \iff \rho(x, z_i) \le \rho(x, z_j), \quad j = 1, \ldots, k.$$
The centroids thus define a partitioning which is just the Voronoi tessellation (or Dirichlet tessellation). Figure 5.2 shows the Voronoi tessellation for the iris data set restricted to two variables, generated by the centers which were determined by the k-means clustering algorithm discussed below. One can see that, even though no information about the classes has been used, the k-means algorithm is perfectly capable of finding the three main classes. The centroids are used for estimates of an impurity measure of the form
$$J(z_1, \ldots, z_k) = \frac{1}{N} \sum_{i=1}^{k} \sum_{x^{(j)} \in T_i} \rho(x^{(j)}, z_i) \qquad (5.1)$$
$$= \frac{1}{N} \sum_{j=1}^{N} \min_{1 \le i \le k} \rho(x^{(j)}, z_i). \qquad (5.2)$$
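The equality of (5.1) and (5.2) simply says that each point contributes the distance to its nearest centroid. A minimal sketch of this computation (our own, assuming Euclidean distances; all names are ours):

    import numpy as np

    def impurity(points, centroids):
        # J(z_1, ..., z_k): average distance of each point to its nearest
        # centroid, i.e. formula (5.2), which equals the double sum (5.1).
        points = np.asarray(points, dtype=float)
        centroids = np.asarray(centroids, dtype=float)
        # distances[j, i] = Euclidean distance from point j to centroid i
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        return distances.min(axis=1).mean()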

The two types of algorithms for partitioning differ in the way they estimate the centroid. In the k-means algorithm, the mean of the (real-valued) observations in $T_i$ is used:
$$z_i = \frac{1}{N_i} \sum_{x^{(j)} \in T_i} x^{(j)},$$


where $N_i$ denotes the number of data points in $T_i$. One disadvantage of the k-means approach is that the mean cannot be guaranteed to be close to any data point at all, and the data is limited to real vectors. An alternative is the k-medoid approach, where the centroid is chosen to be the most central element of the set, i.e., $z_i = x^{(s_i)}$ such that
$$\sum_{x^{(j)} \in T_i} \rho(x^{(j)}, x^{(s_i)}) \;\le\; \sum_{x^{(j)} \in T_i} \rho(x^{(j)}, x^{(m)}) \qquad \text{for all } x^{(m)} \in T_i.$$

Figure 5.2: Clustering of the iris data with k-means (petal length vs. petal width)

5.1.1 k-means algorithm

The k-means algorithm, see algorithm 7, has been discussed by MacQueen [54]. It can be seen that the k-means algorithm cannot increase the function J; in fact, if any clusters are changed, J is reduced. As J is bounded from below, it converges, and as a consequence the algorithm converges. It is also known that k-means always converges to a local minimum [17]. The k-means algorithm may be viewed as a variant of the EM algorithm [56]. In [26] a parallel k-means algorithm is proposed; the authors also provide a careful analysis of the algorithm's computational complexity.


Algorithm 7 k-means algorithm
  Select k arbitrary data points $z_1, \ldots, z_k$.
  repeat
    $T_i := \{\, x^{(j)} \mid \rho(x^{(j)}, z_i) \le \rho(x^{(j)}, z_s),\ s = 1, \ldots, k \,\}$
    $z_i := \frac{1}{|T_i|} \sum_{x^{(j)} \in T_i} x^{(j)}$
  until the $z_i$ do not change any more

There are two major steps in the algorithm: the determination of the distances between all the points and the cluster centers, and the recalculation of the centroids. The determination of all the Euclidean distances between the $N$ points and the $k$ cluster means requires $3Nkd$ floating point operations (one subtraction, one squaring and one addition per component of all pairs). Finding the minimum for each point requires a total of $kN$ comparisons; computing the new average for each cluster then requires $Nd$ additions and $kd$ divisions. In data mining the cost is usually dominated by the determination of all the distances and thus the time is [26]
$$T = O(NkdI)$$
where $I$ is the number of iterations. For the parallel algorithm (shared nothing), the data is initially distributed over the discs of all the processors. Each processor then computes the distances of its elements to all cluster centers. This is done in parallel, and so the most expensive computation gets a parallel speedup of a factor of $p$ (the number of processors). After the sums of all the elements have been computed on the processors, these sums are communicated, which requires an all-to-all communication with volume $dk$ per iteration per processor. If this can be done in parallel, the time required for it is $O(dkI\tau)$, where the time $\tau$ for the reduction is typically $O(\sqrt{p})$. Thus the total time is
$$O(NkdI/p) + O(dkI\sqrt{p}).$$
The communication time is small compared to the computation time for large problems, i.e., if $N \gg p^{3/2}\kappa$ where $\kappa$ is the ratio of communication time to computation time. Typically $p$ is of the order of 100 and $\kappa$ around 1000, so data sizes of a million do qualify. A scalable k-means method has been discussed in [18].
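The loop of algorithm 7 can be sketched as follows (our own minimal illustration with Euclidean distances; all names are ours, and no attempt is made at the parallel version):

    import numpy as np

    def k_means(points, k, n_iter=100, seed=0):
        # Lloyd-style k-means: assign each point to its nearest center,
        # then recompute each center as the mean of its cluster (algorithm 7).
        rng = np.random.default_rng(seed)
        points = np.asarray(points, dtype=float)
        centers = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(n_iter):
            # dist[j, i] = Euclidean distance of point j to center i
            dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
            labels = dist.argmin(axis=1)
            new_centers = centers.copy()
            for i in range(k):
                members = points[labels == i]
                if len(members) > 0:      # keep the old center if a cluster empties
                    new_centers[i] = members.mean(axis=0)
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels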

5.1.2 k-medoids algorithm

Determining a set of elements $x^{(j_s)}$ which minimise $J(x^{(j_1)}, \ldots, x^{(j_k)})$ defines a discrete optimisation problem. The search graph of this problem has as nodes all possible sets of centroids, i.e., all $\binom{N}{k}$ possible combinations of the points $x^{(i)}$. The edges of the search graph join any two sets of centroids $\{x^{(j_1)}, \ldots, x^{(j_k)}\}$ and $\{x^{(i_1)}, \ldots, x^{(i_k)}\}$ which differ in only one centroid. The k-medoids algorithm starts at a random node and moves to the adjacent node for which the reduction in impurity is maximal. This hill-climbing approach requires the determination of $J$ for all adjacent nodes in order to determine the best direction, and it finds a local minimum.

The algorithm stores all the distances between the data points and the centers. From these, the minimal distance between any data point and the centers can be determined in $O(k(N-k))$ comparisons, and thus the complexity of the algorithm is $O(k^2(N-k)^2)$. However, for many data points the minimum can actually be found without comparing the distance between the new centroid and the data point with all the earlier computed distances. This is based on the following simple observation:

Lemma 7. For numbers $\rho_1, \ldots, \rho_{k+1} \in \mathbb{R}$ satisfying $\rho_1 > \min(\rho_1, \ldots, \rho_k)$ or $\rho_{k+1} \le \rho_1$ one has
$$\min(\rho_2, \ldots, \rho_{k+1}) = \min(\min(\rho_1, \ldots, \rho_k), \rho_{k+1}).$$

Proof. In both cases the conditions on $\rho_1$ imply that including $\rho_1$ does not change the minimum, i.e., $\min(\rho_2, \ldots, \rho_{k+1}) = \min(\rho_1, \ldots, \rho_{k+1})$.

Applying this to the k-medoid algorithm, one sees that each term in the sum for $J$ is a minimum over all the centroids. If one centroid is replaced by another, say centroid 1 is replaced by centroid $k+1$, then one has to compute the minimum $\min(\rho(x^{(i)}, z_2), \ldots, \rho(x^{(i)}, z_{k+1}))$. This can be computed from the previously determined minimum over the original centroids using the lemma, except in the case not covered by the lemma, namely when the closest centroid is exactly the one which is replaced and the


distance between the new centroid and the data point is larger than the distance between the removed centroid and the data point. This algorithm is the PAM (partitioning around medoids) algorithm, see [47]. The complexity is further reduced by repeated clustering on a random sample of the data, always selecting the clustering with the lowest value of $J$. This is implemented in the CLARA (clustering large applications) algorithm and leads to an algorithm which is scalable in the number of data points $N$. However, there are still very many search directions, and a random selection of them is used in the CLARANS (clustering large applications based on randomised search) algorithm [59]. CLARANS was one of the early developments in clustering for data mining applications.
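A sketch of how Lemma 7 can be used when evaluating a candidate swap (our own illustration, assuming a precomputed N x N distance matrix; this is not the PAM implementation of [47], and all names are ours):

    import numpy as np

    def swap_cost(dist, medoids, current_min, out_idx, in_idx):
        # Impurity J after swapping medoid out_idx for point in_idx.
        # dist is the full N x N distance matrix, current_min holds each
        # point's distance to its nearest current medoid (a 1-D array).
        d_new = dist[:, in_idx]
        d_removed = dist[:, out_idx]
        remaining = [m for m in medoids if m != out_idx]   # assumes k >= 2
        # Lemma 7: if the removed medoid was not the nearest one, or the new
        # medoid is at least as close, the minimum over the new medoid set is
        # min(current_min, d_new); otherwise recompute over the remaining medoids.
        easy = (d_removed > current_min) | (d_new <= d_removed)
        new_min = np.where(easy, np.minimum(current_min, d_new),
                           np.minimum(dist[:, remaining].min(axis=1), d_new))
        # Note: a real implementation would only recompute the rows where easy
        # is False; np.where above evaluates both branches for simplicity.
        return new_min.mean()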

5.2 Hierarchical Clustering

Hierarchical clustering algorithms transform a set of points with an associated dissimilarity metric into a tree structure known as a dendrogram. The nodes of the dendrogram correspond to sets of data points; the leaves, for example, are single points. Like decision trees, dendrograms split the data sets at each node, and the edges of the dendrogram correspond to set inclusions. The two broad classes of hierarchical clustering algorithms are agglomerative algorithms, which start with single points and join them together whenever they are close, and divisive algorithms, which move from the top down and break the clusters up when they are dissimilar. In contrast to decision trees, where the impurity of the nodes is determined through the values of a class attribute, in clustering the dissimilarity measure provides this measure of impurity. An agglomerative hierarchical clustering algorithm consists of the following loop:

  Initialise all points as single clusters
  Determine all the dissimilarities between the clusters
  while there is more than one cluster do
    Merge the two clusters with the smallest dissimilarity
    Update the dissimilarities
  end while

It is important to assess the dissimilarity or impurity which is introduced through the joining of two partitions $T_1$ and $T_2$.
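The loop above might be sketched as follows for the single-link dissimilarity defined next (our own illustration, quadratic per merge and not the SLINK algorithm of [69]; all names are ours):

    import numpy as np

    def single_link_agglomerative(points):
        # Repeatedly merge the two clusters with the smallest single-link
        # dissimilarity; returns the merge history (a simple dendrogram).
        points = np.asarray(points, dtype=float)
        dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
        clusters = [[i] for i in range(len(points))]
        merges = []
        while len(clusters) > 1:
            # single link: smallest pointwise distance between two clusters
            best = min((dist[np.ix_(a, b)].min(), i, j)
                       for i, a in enumerate(clusters)
                       for j, b in enumerate(clusters) if i < j)
            height, i, j = best
            merges.append((clusters[i], clusters[j], height))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges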


The single-link (or nearest neighbor) technique defines the dissimilarity of two partitions $T_1$ and $T_2$ as
$$\rho(T_1, T_2) = \min\{\, \rho(x, y) \mid x \in T_1,\ y \in T_2 \,\}.$$

Figure 5.3: Clustering of the iris data with agglomerative hierarchical clustering

Other measures are used as well, and they have a big effect on the computational load. After the full generation of the dendrogram, it is pruned down to a desired level. The single-link algorithm computes all the distances but does not require storing them all; the classical SLINK algorithm [69] requires $O(N^2)$ time and $O(N)$ space. Figure 5.3 shows the dendrogram obtained using hierarchical agglomerative clustering for the iris data set. One can see that, again, the clusters obtained are almost identical to the classes of the iris.

For the case of a database distributed over $p$ processors, an algorithm for single-link clustering is presented in [46] which has $O(pN^2)$ time and $O(pN)$ space complexity. The communication costs of the algorithm are $O(N)$. The algorithm has the following three steps:

  Apply the hierarchical clustering algorithm at each site.
  Transmit the local dendrograms to the facilitator site.


  Generate the global dendrogram.

The local dendrograms are communicated together with the distances within each partition. An upper bound on the actual distance between two points is then obtained from the dendrogram distances in the individual partitions. One has
$$\rho(x_1, x_2) = \sqrt{\sum_{j=1}^{p} \sum_{i \in P_j} (x_{1,i,j} - x_{2,i,j})^2} \;\le\; \sqrt{\sum_{j=1}^{p} \bigl(\mathrm{dist}_{\mathrm{dendrogram}_j}(x_1, x_2)\bigr)^2},$$
where $P_j$ denotes the features held at site $j$.
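A minimal sketch of this bound (our own; the per-site dendrogram distances are assumed to be given, and the names are ours):

    import math

    def distance_upper_bound(dendrogram_dists):
        # Upper bound on the Euclidean distance between two points from the
        # dendrogram distances computed independently at each of the p sites.
        return math.sqrt(sum(d * d for d in dendrogram_dists))

    # Hypothetical usage with three sites
    print(distance_upper_bound([0.8, 1.2, 0.5]))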

Basically, the method uses the distances defined by the actual dendrograms. The merging algorithm is costly, but the main advantage is that not all the data needs to be communicated to one processor. For a parallel implementation of SLINK see [60]. There is a large variety of further data mining clustering techniques, including BIRCH [74] and CURE [38]. A further class of clustering algorithms is based on density considerations, using both parametric models [22] and nonparametric approaches such as DBSCAN [31].

Bibliography

[1] Jean-Marc Adamo. Data Mining for Association Rules and Sequential Patterns. Springer, New York, 2001.
[2] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. In Proc. ACM SIGMOD Conf. on Management of Data, pages 207–216, Washington, D.C., 1993.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In J. Bocca, M. Jarke, and C. Zaniolo, editors, Proc. Int. Conf. on Very Large Data Bases, pages 478–499, Santiago, Chile, 1994. Morgan Kaufmann, San Francisco.
[4] U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96:6745–6750, 1999.
[5] S. Bakin, M. Hegland, and M. Osborne. Can MARS be improved with B-splines? In B.J. Noye, M.D. Teubner, and A.W. Gill, editors, Computational Techniques and Applications: CTAC97, pages 75–82. World Scientific Publishing Co., 1998.
[6] R. K. Beatson, G. Goodsell, and M. J. D. Powell. On multigrid techniques for thin plate spline interpolation in two dimensions. In The Mathematics of Numerical Analysis (Park City, UT, 1995), volume 32 of Lectures in Appl. Math., pages 77–97. Amer. Math. Soc., Providence, RI, 1996.
[7] R. K. Beatson and W. A. Light. Fast evaluation of radial basis functions: methods for two-dimensional polyharmonic splines. IMA J. Numer. Anal., 17(3):343–372, 1997.


[8] R. K. Beatson and G. N. Newsam. Fast evaluation of radial basis functions. I. Comput. Math. Appl., 24(12):7–19, 1992. Advances in the theory and applications of radial basis functions.
[9] R. K. Beatson and M. J. D. Powell. An iterative method for thin plate spline interpolation that employs approximations to Lagrange functions. In Numerical Analysis 1993 (Dundee, 1993), volume 303 of Pitman Res. Notes Math. Ser., pages 17–39. Longman Sci. Tech., Harlow, 1994.
[10] R. K. Beatson and G. N. Newsam. A far field expansion for generalised multiquadrics in R^n. In Computational Techniques and Applications: CTAC97. World Scientific, 1997.
[11] Gordon Bell and James N. Gray. The revolution yet to happen. In P.J. Denning and R. M. Metcalfe, editors, Beyond Calculation, pages 5–32. Springer Verlag, 1997.
[12] E.A. Bender. Mathematical Methods in Artificial Intelligence. IEEE CS Press, Los Alamitos, California, 1996.
[13] Michael J.A. Berry and Gordon Linoff. Data Mining Techniques: For Marketing, Sales and Customer Support. John Wiley and Sons, 1997.
[14] Alex Berson and Stephen J. Smith. Data Warehousing, Data Mining, and OLAP. McGraw-Hill Series on Data Warehousing and Data Management. McGraw-Hill, New York, 1997.
[15] H.-H. Bock and E. Diday, editors. Analysis of Symbolic Data. Springer, 2000.
[16] Béla Bollobás. Combinatorics. Cambridge University Press, 1986.
[17] L. Bottou and Y. Bengio. Convergence properties of the k-means algorithm. In G. Tesauro and D. Touretzky, editors, Adv. in Neural Info. Proc. Systems, volume 7, pages 585–592. MIT Press, Cambridge, MA, 1995.
[18] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proc. of the Fourth Int. Conf. on KDD, pages 9–15, 1998.


[19] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Chapman and Hall/CRC, first reprint 1998.
[20] Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification and Regression Trees. Wadsworth International Group, Belmont, California, 1984.
[21] Martin D. Buhmann. New developments in the theory of radial basis function interpolation. In Multivariate Approximation: from CAGD to Wavelets (Santiago, 1992), volume 3 of Ser. Approx. Decompos., pages 35–75. World Sci. Publishing, River Edge, NJ, 1993.
[22] P. Cheeseman and J. Stutz. Bayesian classification (AutoClass): Theory and results. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 153–180. AAAI/MIT Press, Cambridge, MA, 1996.
[23] C.J. Date. An Introduction to Database Systems. Addison-Wesley, Reading, Mass., 1995.
[24] B. A. Davey and H. A. Priestley. Introduction to Lattices and Order. Cambridge University Press, New York, second edition, 2002.
[25] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition, volume 31 of Applications of Mathematics. Springer, 1996.
[26] I.S. Dhillon and D.S. Modha. A data-clustering algorithm on distributed memory multiprocessors. In Zaki and Ho [73].
[27] P. Domingos and M. Pazzani. On the optimality of the simple Bayesian classifier under zero–one loss. Machine Learning, 29:103–130, 1997.
[28] G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. In Proc. 1999 Int. Conf. Knowledge Discovery and Data Mining (KDD'99), pages 43–52, San Diego, 1999.
[29] R.O. Duda and P.E. Hart. Pattern Classification and Scene Analysis. John Wiley, New York, 1973.
[30] Nira Dyn. Interpolation and Approximation by Radial and Related Functions, volume 1, pages 211–234. Academic Press, 1989.


[31] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), pages 226–231, 1996.
[32] Jerome H. Friedman. Flexible metric nearest neighbor classification. Technical report, Department of Statistics, Stanford University, 1994.
[33] J.H. Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19:1–141, 1991.
[34] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131–163, 1997.
[35] Bernhard Ganter and Rudolf Wille. Formal Concept Analysis. Springer, 1999.
[36] G. Golub, M. Heath, and G. Wahba. Generalized cross validation as a method for choosing a good ridge parameter. Technometrics, 21:215–224, 1979.
[37] A. D. Gordon. Classification. Chapman & Hall, London, 1981. Methods for the exploratory analysis of multivariate data, Monographs on Applied Probability and Statistics.
[38] S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. ACM SIGMOD Int. Conf. on Management of Data, pages 73–84. ACM Press, 1998.
[39] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[40] John A. Hartigan. Clustering Algorithms. John Wiley & Sons, New York-London-Sydney, 1975. Wiley Series in Probability and Mathematical Statistics.
[41] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer-Verlag, New York, 2001.


[42] T.J. Hastie and R.J. Tibshirani. Generalized Additive Models, volume 43 of Monographs on Statistics and Applied Probability. Chapman and Hall, 1990.
[43] D. Heckerman, D. Geiger, and D.E. Chickering. Learning Bayesian networks. Machine Learning, 20(3):197–243, 1995.
[44] Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, volume XL. Congressional Quarterly Inc., 1985. (Data set donated by Jeff Schlimmer.)
[45] G.H. John and P. Langley. Estimating continuous distributions in Bayesian classifiers. In P. Besnard and S. Hanks, editors, Proc. Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338–345, Montreal, Canada, 1995. Morgan Kaufmann.
[46] E.L. Johnson and H. Kargupta. Collective, hierarchical clustering from distributed heterogeneous data. In Zaki and Ho [73], pages 221–244.
[47] Leonard Kaufman and Peter J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons Inc., New York, 1990. A Wiley-Interscience Publication.
[48] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In W. Swartout, editor, Proc. Tenth National Conference on Artificial Intelligence, pages 223–228, San Jose, CA, 1992. AAAI Press, Menlo Park.
[49] P. Langley and S. Sage. Induction of selective Bayesian classifiers. In R.L. Mantaras and D. Poole, editors, Proc. Tenth Conference on Uncertainty in Artificial Intelligence, pages 399–406, Seattle, WA, 1994. Morgan Kaufmann.
[50] P. Langley and S. Sage. Tractable average-case analysis of naive Bayesian classifiers. In Proc. Sixteenth International Conference on Machine Learning, pages 220–228, Bled, Slovenia, 1999. Morgan Kaufmann.
[51] B. Lent, A. Swami, and J. Widom. Clustering association rules. In Proc. 1997 Int. Conf. Data Engineering (ICDE'97), pages 220–231, Birmingham, England, 1997.
[52] J. Li, G. Dong, and K. Ramamohanarao. Making use of the most expressive jumping emerging patterns for classification. In


Proc. 2000 Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD'00), pages 220–232, Kyoto, Japan, 2000.

[53] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), pages 80–86, New York, 1998.
[54] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Sympos. Math. Statist. and Probability (Berkeley, Calif., 1965/66), volume I: Statistics, pages 281–297. Univ. California Press, Berkeley, Calif., 1967.
[55] H. Mannila, H. Toivonen, and A.I. Verkamo. Discovering frequent episodes in sequences. In Proc. 1st Int. Conf. Knowledge Discovery Databases and Data Mining, pages 210–215, Menlo Park, Calif., 1995. AAAI Press.
[56] G.J. McLachlan and T. Krishnan. The EM Algorithm and Extensions. Wiley, 1996.
[57] D. Meretakis and B. Wüthrich. Extending naive Bayes classifiers using long itemsets. In Proc. 1999 Int. Conf. Knowledge Discovery and Data Mining, pages 165–174, San Diego, 1999.
[58] T.M. Mitchell. Machine Learning. WCB McGraw-Hill, 1997.
[59] R. Ng and J. Han. Efficient and effective clustering methods for spatial data mining. In Proc. of the 20th Int. Conf. on Very Large Data Bases, pages 144–155. Morgan Kaufmann, 1994.
[60] C. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 8:1313–1325, 1995.
[61] M. J. D. Powell. The theory of radial basis function approximation in 1990. In W. Light, editor, Advances in Numerical Analysis, Vol. II (Lancaster, 1990), Oxford Sci. Publ., pages 105–210. Oxford Univ. Press, New York, 1992.
[62] J.R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.


[63] J.R. Quinlan. Generating production rules from decision trees. In Proc. of IJCAI-87, pages 304–307, 1987.
[64] J.R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[65] N. Ramakrishnan and A.Y. Grama. Data mining: From serendipity to science. Computer, 32(8):34–37, August 1999.
[66] B. D. Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, 1996.
[67] A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21st Int. Conf. Very Large Data Bases, pages 432–444. Morgan Kaufmann, San Francisco, 1995.
[68] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. 1996 Int. Conf. Very Large Data Bases (VLDB'96), pages 544–555, Bombay, 1996.
[69] R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. The Computer Journal, 16:30–34, 1973.
[70] M. Talagrand. A new look at independence. Ann. Prob., 23:1–37, 1996.
[71] W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), pages 186–195, 1997.
[72] Peter Whittle. Probability via Expectation. Springer Texts in Statistics. Springer-Verlag, New York, fourth edition, 2000.
[73] M.J. Zaki and C.-T. Ho, editors. Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Artificial Intelligence. Springer, 2000.
[74] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), pages 103–114, Montreal, Canada, 1996.