Wesley W. Chu T. Y. Lin Editors

Foundations and Advances in Data Mining

Springer-Verlag

Professor Dr. Wesley W. Chu
Computer Science Department
University of California, Los Angeles
Los Angeles, California 90095
[email protected]

Professor Dr. Tsau Young Lin
Department of Computer Science
San Jose State University
San Jose, CA 95192
[email protected]

Preface

With the growing use of information technology and the recent advances in web systems, the amount of data available to users has increased exponentially. Thus, there is a critical need to understand the content of this data. As a result, data mining has become a popular research topic in recent years as a treatment for the "data rich and information poor" syndrome. Currently, application-oriented engineers are concerned only with their immediate problems, which results in ad hoc methods of problem solving. Researchers, on the other hand, lack an understanding of the practical issues of data mining for real-world problems and often concentrate on issues (e.g., incremental performance improvements) that are of little significance to practitioners. In this volume, we hope to remedy these problems by (1) presenting a theoretical foundation of data mining, and (2) providing important new directions for data-mining research. We have invited a set of well-respected data mining theoreticians to present their views on the fundamental science of data mining. We have also called on researchers with practical data mining experience to present new important data-mining topics.

This book is organized into two parts. The first part consists of four chapters presenting the foundations of data mining, which describe the theoretical point of view and the capabilities and limits of currently available mining techniques. The second part consists of seven chapters which discuss new data mining topics.

The first chapter, authored by T. Poggio and S. Smale, is entitled "The Mathematics of Learning: Dealing with Data." The authors present the mathematical formulation of learning theory. In particular, they present an algorithm for supervised learning by training sets and show that the algorithm performs well in a number of applications involving regression as well as binary classification. The second chapter, by H. Tsukimoto, is entitled "Logical Regression Analysis: From Mathematical Formulas to Linguistic Rules." He presents a solution to the problem of obtaining both accurate predictions and comprehensible rules in supervised learning. The author has developed a data mining technique called Logical Regression Analysis, which consists of regression analysis and the Approximation Method, and which can provide comprehensible rules as well as accurate predictions. The paper also shows how to apply the technique to mining images. The third chapter, by T.Y. Lin, is entitled "A Feature/Attribute Theory for Association Mining and Constructing the Complete Feature Set." The author points out the importance of selecting correct attributes in data mining and develops a theory of features for association mining (AM). Based on the isomorphism theorem in AM, he concludes that it is sufficient to perform AM in canonical models, and constructs the complete feature set for every canonical model. Using the isomorphism theorem, the complete feature set can be derived for each relation. Though the number of possible features is enormous, it can be shown that the high-frequency pattern features can be derived within polynomial time. The fourth chapter is entitled "A New Theoretical Framework for K-means-type Clustering," and is authored by J.


Peng and Y. Xia. The authors present a generalized K-means-type clustering method formulated as 0-1 semi-definite programming (SDP). The classical K-means algorithm, based on the minimum sum of squares criterion (MSSC), can be interpreted as a special heuristic for this model. Moreover, the 0-1 SDP model can be further approximated by relaxed, polynomially solvable linear and semi-definite programs. The 0-1 SDP model can be applied to MSSC as well as to other clustering scenarios.

The second part of the book, Chapters 5 to 11, presents seven topics covering recent advances in data mining. Chapter 5, entitled "Clustering via Decision Tree Construction," is authored by B. Liu, Y. Xia, and P. Yu. They propose a novel clustering technique based on a supervised learning method, decision tree construction. The key idea is to use a decision tree to partition the data space into cluster (or dense) regions and empty (or sparse) regions (which produce outliers and anomalies). This technique is able to find "natural" clusters in large high-dimensional spaces efficiently. Experimental data show that this technique is effective and scales well for large high-dimensional datasets. Chapter 6, "Incremental Mining on Association Rules," is written by Wei-Guang Teng and Ming-Syan Chen. Due to the increasing use of record-based databases to which data are constantly being added, incremental mining is needed to keep the mined knowledge current. The authors propose incremental mining techniques for updating mined association rules. Chapter 7 is entitled "Mining Association Rules from Tabular Data Guided by Maximal Frequent Itemsets" and is authored by Q. Zou, Y. Chen, W. Chu, and X. Lu. Since many scientific applications store data in tabular format, the authors propose to use the maximal frequent itemset (MFI) as a road map to guide the generation of association rules from tabular datasets. They propose to use information from previous searches to generate MFIs, and experimental results show that such an approach yields significant improvements over conventional methods. Further, deriving MFIs from the tabular format rather than from a transactional dataset reduces the search space and the time needed for support counting. The authors use a spreadsheet to present rules and spreadsheet operations to sort and select rules, a convenient way to query and organize rules in a hierarchical fashion. An example is also given that illustrates the process of generating association rules from tabular data, using past medical surgery data to aid surgeons in their decision making. Chapter 8, entitled "Sequential Pattern Mining by Pattern-Growth: Principles and Extensions," presents the sequential pattern growth method and studies the principles and extensions of the method, such as (1) mining constraint-based sequential patterns, (2) mining multi-level, multi-dimensional sequential patterns, and (3) mining top-k closed sequential patterns. The authors also discuss applications in bio-sequence pattern analysis and sequence clustering. Chapter 9, entitled "Web Page Classification," is written by B. Choi and Z. Yao. It describes systems that automatically classify web pages into meaningful subject-based and genre-based categories. The authors describe tools for building automatic web page classification systems, which are essential for web mining and for constructing the semantic web. Chapter 10 is entitled "Web Mining – Concepts, Applications, and Research Directions," and was written by Jaideep Srivastava, Prasanna Desikan, and Vipin Kumar. The authors present the application of data mining techniques to extract knowledge from web content, structure, and usage. An overview of accomplishments in technology and applications in web mining is also included. Chapter 11, by Chris Clifton, Murat Kantarcioglu, and


Jaideep Vaidya, is entitled "Privacy-Preserving Data Mining." The goal of privacy-preserving data mining is to develop data mining models that do not increase the risk of misuse of the data used to generate those models. The authors present two classes of privacy-preserving data mining. The first is based on adding noise to the data before providing it to the data miner; since real data values are not revealed, individual privacy is preserved. The second class is derived from the cryptographic community: the data sources collaborate to obtain data mining results without revealing anything except those results.

Finally, we would like to thank the authors for contributing their work to this volume and the reviewers for commenting on the readability and accuracy of the work. We hope the theories presented in this volume will give data mining practitioners a scientific perspective on data mining and thus provide more insight into their problems. We also hope that the new data mining topics will stimulate further research in these important directions.

Wesley W. Chu and T. Y. Lin

Table of Contents

Part I Foundations of Data Mining ------------------------------------------ 1

The Mathematics of Learning ------------------------------------------------ 3
Tomaso Poggio, Steve Smale

Logical Regression Analysis: From Mathematical Formulas to
Linguistic Rules ----------------------------------------------------------- 19
Hiroshi Tsukimoto

A Feature/Attribute Theory for Association Mining and Constructing
the Complete Feature Set --------------------------------------------------- 61
Tsau-Young Lin

A New Theoretical Framework for K-means-type Clustering -------------------- 77
Jiming Peng, Yu Xia

Part II Recent Advances in Data Mining ------------------------------------- 95

Clustering via Decision Tree Construction ---------------------------------- 97
Bing Liu, Yiyuan Xia, Philip S. Yu

Incremental Mining on Association Rules ------------------------------------ 123
Wei-Guang Teng, Ming-Syan Chen

Mining Association Rules from Tabular Data Guided by Maximal
Frequent Itemsets ---------------------------------------------------------- 161
Qinghua Zou, Yu Chen, Wesley W. Chu, Xinchun Lu

Sequential Pattern Mining by Pattern-Growth: Principles and Extensions ----- 181
Jiawei Han, Jian Pei, Xifeng Yan

Web Page Classification ---------------------------------------------------- 217
Ben Choi, Zhongmei Yao

Web Mining – Concepts, Applications, and Research Directions --------------- 279
Jaideep Srivastava, Prasanna Desikan, Vipin Kumar

Privacy-Preserving Data Mining --------------------------------------------- 313
Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya

Part I

Foundations of Data Mining

The Mathematics of Learning: Dealing with Data*

Tomaso Poggio¹ and Steve Smale²

¹ CBCL, McGovern Institute, Artificial Intelligence Lab, BCS, MIT, [email protected]
² Toyota Technological Institute at Chicago and Professor in the Graduate School, University of California, Berkeley, [email protected]

Summary. Learning is key to developing systems tailored to a broad range of data analysis and information extraction tasks. We outline the mathematical foundations of learning theory and describe one of its key algorithms.

1 Introduction

The problem of understanding intelligence is said to be the greatest problem in science today and "the" problem for this century, as deciphering the genetic code was for the second half of the last one. Arguably, the problem of learning represents a gateway to understanding intelligence in brains and machines, to discovering how the human brain works, and to making intelligent machines that learn from experience and improve their competences as children do. In engineering, learning techniques would make it possible to develop software that can be quickly customized to deal with the increasing amount of information and the flood of data around us.

Examples abound. During the last decades, experiments in particle physics have produced a very large amount of data. Genome sequencing is doing the same in biology. The Internet is a vast repository of disparate information which changes rapidly and grows at an exponential rate: it is now significantly more than 100 Terabytes, while the Library of Congress is about 20 Terabytes. We believe that a set of techniques, based on a new area of science and engineering becoming known as "supervised learning", will become a key technology to extract information from the ocean of bits around us and make sense of it.

Supervised learning, or learning-from-examples, refers to systems that are trained, instead of programmed, with a set of examples, that is, a set of input-output pairs. Systems that could learn from examples to perform a specific task would have many applications. A bank may use a program to screen loan applications and approve the "good" ones. Such a system would be trained with a set of data from previous loan applications and the experience with their defaults. In this example, a loan application is a point in a multidimensional space of variables characterizing its properties; its associated output is a binary "good" or "bad" label.

In another example, a car manufacturer may want to have in its models a system to detect pedestrians that may be about to cross the road, to alert the driver of a possible danger while driving in downtown traffic. Such a system could be trained with positive and negative examples: images of pedestrians and images without people. In fact, software trained in this way with thousands of images has recently been tested in an experimental car of Daimler. It runs on a PC in the trunk and looks at the road in front of the car through a digital camera [36, 26, 43].

Algorithms have been developed that can produce a diagnosis of the type of cancer from a set of measurements of the expression levels of many thousands of human genes in a biopsy of the tumor, measured with a cDNA microarray containing probes for a number of genes [46]. Again, the software learns the classification rule from a set of examples, that is, from examples of expression patterns in a number of patients with known diagnoses. The challenge, in this case, is the high dimensionality of the input space – on the order of 20,000 genes – and the small number of examples available for training – around 50. In the future, similar learning techniques may be capable of some learning of a language and, in particular, of translating information from one language to another.

What we assume in the above examples is a machine that is trained, instead of programmed, to perform a task, given data of the form $(x_i, y_i)_{i=1}^m$. Training means synthesizing a function that best represents the relation between the inputs $x_i$ and the corresponding outputs $y_i$.

* Reprinted by permission from Notices of the AMS, 50(5), 2003, pp. 537–544.
The central question of learning theory is how well this function generalizes, that is, how well it estimates the outputs for previously unseen inputs. As we will see later more formally, learning techniques are similar to fitting a multivariate function to a certain number of measurement data. The key point, as we just mentioned, is that the fitting should be predictive, in the same way that fitting experimental data (see Figure 1) from an experiment in physics can in principle uncover the underlying physical law, which is then used in a predictive way. In this sense, learning is also a principled method for distilling predictive, and therefore scientific, "theories" from the data.

We begin by presenting a simple "regularization" algorithm which is important in learning theory and its applications. We then outline briefly some of its applications and its performance. Next we provide a compact derivation of it. We then provide general theoretical foundations of learning theory. In particular, we outline the key ideas of decomposing the generalization error of a solution of the learning problem into a sample error and an approximation error component. Thus both probability theory and approximation theory play key roles in learning theory. We apply the two theoretical bounds to the algorithm and describe for it the tradeoff – which is key in learning theory and its applications – between the number of examples and the complexity of the hypothesis


Fig. 1. How can we learn a function which is capable of generalization – among the many functions which fit the examples equally well (here m = 7)?

space. We conclude with several remarks, both with an eye to history and to open problems for the future.

2 A key algorithm

2.1 The algorithm

How can we fit the "training" set of data $S_m = (x_i, y_i)_{i=1}^m$ with a function $f : X \to Y$ – with $X$ a closed subset of $\mathbb{R}^n$ and $Y \subset \mathbb{R}$ – that generalizes, i.e. is predictive? Here is an algorithm which does just that and which is almost magical for its simplicity and effectiveness:

1. Start with data $(x_i, y_i)_{i=1}^m$.
2. Choose a symmetric, positive definite function $K_x(x') = K(x, x')$, continuous on $X \times X$. A kernel $K(t, s)$ is positive definite if $\sum_{i,j=1}^{n} c_i c_j K(t_i, t_j) \ge 0$ for any $n \in \mathbb{N}$ and any choice of $t_1, \dots, t_n \in X$ and $c_1, \dots, c_n \in \mathbb{R}$. An example of such a Mercer kernel is the Gaussian

$$K(x, x') = e^{-\frac{\|x - x'\|^2}{2\sigma^2}}, \qquad (1)$$

restricted to $X \times X$.
3. Set $f : X \to Y$ to

$$f(x) = \sum_{i=1}^{m} c_i K_{x_i}(x), \qquad (2)$$

where $c = (c_1, \dots, c_m)$ and

$$(m\gamma I + K)c = y \qquad (3)$$

where $I$ is the identity matrix, $K$ is the square positive definite matrix with elements $K_{i,j} = K(x_i, x_j)$, and $y$ is the vector with coordinates $y_i$. The parameter $\gamma$ is a positive real number.

The linear system of equations (3) in $m$ variables is well-posed, since $K$ is positive and $(m\gamma I + K)$ is strictly positive. The condition number is good if $m\gamma$ is large. This type of equation has been studied since Gauss, and the algorithms for solving it efficiently represent one of the most developed areas in numerical and computational analysis.

What does Equation 2 say? In the case of the Gaussian kernel, the equation approximates the unknown function by a weighted superposition of Gaussian "blobs", each centered at the location $x_i$ of one of the $m$ examples. The weight $c_i$ of each Gaussian is chosen so as to minimize a regularized empirical error, that is, the error on the training set. The $\sigma$ of the Gaussian (together with $\gamma$, see later) controls the degree of smoothing, of noise tolerance, and of generalization. Notice that for Gaussians with $\sigma \to 0$ the representation of Equation 2 effectively becomes a "look-up" table that cannot generalize (it provides the correct $y = y_i$ only when $x = x_i$ and otherwise outputs 0).

2.2 Performance and examples

The algorithm performs well in a number of applications involving regression as well as binary classification. In the latter case the $y_i$ of the training set $(x_i, y_i)_{i=1}^m$ take the values $\{-1, +1\}$; the predicted label is then $\{-1, +1\}$, depending on the sign of the function $f$ of Equation 2.

Regression applications are the oldest. Typically they involved fitting data in a small number of dimensions [53, 44, 45]. More recently, they have also included typical learning applications, sometimes with very high dimensionality. One example is the use of algorithms in computer graphics for synthesizing new images and videos [38, 5, 20]. The inverse problem of estimating facial expression and object pose from an image is another successful application [25].
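As a concrete illustration, steps 1–3 and Equation 3 fit in a few lines of numerical Python. This is a minimal sketch of the algorithm just described, not the authors' code; the function and parameter names are ours:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Equation 1: K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def fit(X, y, gamma, sigma):
    """Solve the well-posed linear system (m*gamma*I + K) c = y (Equation 3)."""
    m = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(m * gamma * np.eye(m) + K, y)

def predict(X_train, c, X_new, sigma):
    """Equation 2: f(x) = sum_i c_i K_{x_i}(x)."""
    return gaussian_kernel(X_new, X_train, sigma) @ c
```

Note that, at a training point, $f(x_i) = (Kc)_i = y_i - m\gamma c_i$ by Equation 3, so the fit interpolates the data exactly in the limit $\gamma \to 0$, consistent with the "look-up table" remark above.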
Still another case is the control of mechanical arms. There are also applications in finance, as, for instance, the estimation of the price of derivative securities, such as stock options. In this case, the algorithm replaces the classical Black-Scholes equation (derived from first principles) by learning the map from an input space (volatility, underlying stock price, time to expiration of the option, etc.) to the output space (the price of the option) from historical data [27].

Binary classification applications abound. The algorithm was used to perform binary classification on a number of problems [7, 34]. It was also used to perform visual object recognition in a view-independent way, and in particular face recognition and sex categorization from face images [39, 8]. Other applications span bioinformatics for classification of human cancer from microarray data, text summarization, and sound classification.³ Surprisingly, it has been realized quite recently that the same linear algorithm not only works well but is fully comparable in binary classification problems to the most popular classifiers of today (which turn out to be of the same family, see later).

³ The very closely related Support Vector Machine (SVM) classifier was used for the same family of applications, and in particular for bioinformatics and for face recognition and car and pedestrian detection [46, 25].

2.3 Derivation

The algorithm described can be derived from Tikhonov regularization. To find the minimizer of the error we may try to solve the problem – called Empirical Risk Minimization (ERM) – of finding the function in $\mathcal{H}$ which minimizes

$$\frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2,$$

which is in general ill-posed, depending on the choice of the hypothesis space $\mathcal{H}$. Following Tikhonov (see for instance [19]) we minimize, instead, over the hypothesis space $\mathcal{H}_K$, for a fixed positive parameter $\gamma$, the regularized functional

$$\frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma \|f\|_K^2, \qquad (4)$$

where $\|f\|_K^2$ is the norm in $\mathcal{H}_K$ – the Reproducing Kernel Hilbert Space (RKHS) defined by the kernel $K$. The last term in Equation 4 – called the regularizer – forces, as we will see, smoothness and uniqueness of the solution.

Let us first define the norm $\|f\|_K^2$. Consider the space of the linear span of $K_{x_j}$. We use $x_j$ to emphasize that the elements of $X$ used in this construction do not have anything to do, in general, with the training set $(x_i)_{i=1}^m$. Define an inner product in this space by setting $\langle K_x, K_{x_j} \rangle = K(x, x_j)$ and extend linearly to $\sum_{j=1}^{r} a_j K_{x_j}$. The completion of the space in the associated norm is the RKHS, that is, a Hilbert space $\mathcal{H}_K$ with the norm $\|f\|_K^2$ (see [10, 2]). Note that $\langle f, K_x \rangle = f(x)$ for $f \in \mathcal{H}_K$ (just let $f = K_{x_j}$ and extend linearly).

To minimize the functional in Equation 4 we take the functional derivative with respect to $f$, apply it to an element $\bar{f}$ of the RKHS, and set it equal to 0. We obtain

$$\frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))\,\bar{f}(x_i) - \gamma \langle f, \bar{f} \rangle = 0. \qquad (5)$$

Equation 5 must be valid for any $\bar{f}$. In particular, setting $\bar{f} = K_x$ gives

$$f(x) = \sum_{i=1}^{m} c_i K_{x_i}(x) \qquad (6)$$

where

$$c_i = \frac{y_i - f(x_i)}{m\gamma} \qquad (7)$$

since $\langle f, K_x \rangle = f(x)$. Equation 3 then follows by substituting Equation 6 into Equation 7. Notice also that essentially the same derivation for a generic loss function $V(y, f(x))$, instead of $(f(x) - y)^2$, yields the same Equation 6, but Equation 3 is now different and, in general, nonlinear, depending on the form of $V$. In particular, the popular Support Vector Machine (SVM) regression and SVM classification algorithms correspond to special choices of non-quadratic $V$, one to provide a "robust" measure of error and the other to approximate the ideal loss function corresponding to binary (mis)classification. In both cases, the solution is still of the same form of Equation 6 for any choice of the kernel $K$. The coefficients $c_i$ are not given anymore by Equation 7 but must be found by solving a quadratic programming problem.
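The representer identity of Equations 6–7 can be checked numerically: solving the linear system of Equation 3 yields coefficients that satisfy $c_i = (y_i - f(x_i))/(m\gamma)$ exactly. The sketch below is our own illustrative code (variable names and parameter values are ours), not part of the original derivation:

```python
import numpy as np

rng = np.random.default_rng(0)
m, gamma, sigma = 15, 0.05, 0.7

X = rng.uniform(0, 2, size=(m, 1))   # training inputs x_i
y = np.cos(X).ravel()                # training outputs y_i

# Gaussian Mercer kernel matrix, K_ij = K(x_i, x_j)  (Equation 1)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq / (2 * sigma ** 2))

# Coefficients from the linear system (m*gamma*I + K) c = y  (Equation 3)
c = np.linalg.solve(m * gamma * np.eye(m) + K, y)

# f evaluated at the training points (Equation 2)
f_at_train = K @ c

# Equation 7: c_i = (y_i - f(x_i)) / (m * gamma), up to float round-off
assert np.allclose(c, (y - f_at_train) / (m * gamma))
```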

3 Theory

We give some further justification of the algorithm by sketching very briefly its foundations in some basic ideas of learning theory. Here the data $(x_i, y_i)_{i=1}^m$ is supposed random, so that there is an unknown probability measure $\rho$ on the product space $X \times Y$ from which the data is drawn. This measure $\rho$ defines a function

$$f_\rho : X \to Y \qquad (8)$$

satisfying $f_\rho(x) = \int y \, d\rho_x$, where $\rho_x$ is the conditional measure on $x \times Y$. From this construction $f_\rho$ can be said to be the true input-output function reflecting the environment which produces the data. Thus a measurement of the error of $f$ is

$$\int_X (f - f_\rho)^2 \, d\rho_X \qquad (9)$$

where $\rho_X$ is the measure on $X$ induced by $\rho$ (sometimes called the marginal measure). The goal of learning theory might be said to be to "find" $f$ minimizing this error. Now to search for such an $f$, it is important to have a space $\mathcal{H}$ – the hypothesis space – in which to work ("learning does not take place in a vacuum"). Thus consider a convex space of continuous functions $f : X \to Y$ (remember $Y \subset \mathbb{R}$) which, as a subset of $C(X)$, is compact, where $C(X)$ is the Banach space of continuous functions with $\|f\| = \max_X |f(x)|$. A basic example is

$$\mathcal{H} = I_K(B_R) \qquad (10)$$

where $I_K : \mathcal{H}_K \to C(X)$ is the inclusion and $B_R$ is the ball of radius $R$ in $\mathcal{H}_K$.

Starting from the data $(x_i, y_i)_{i=1}^m = z$ one may minimize $\frac{1}{m} \sum_{i=1}^{m} (f(x_i) - y_i)^2$ over $f \in \mathcal{H}$ to obtain a unique hypothesis $f_z : X \to Y$. This $f_z$ is called the empirical optimum and we may focus on the problem of estimating

$$\int_X (f_z - f_\rho)^2 \, d\rho_X. \qquad (11)$$

It is useful towards this end to break the problem into steps by defining a "true optimum" $f_{\mathcal{H}}$ relative to $\mathcal{H}$, by taking the minimum over $\mathcal{H}$ of $\int_X (f - f_\rho)^2$. Thus we may exhibit

$$\int_X (f_z - f_\rho)^2 = S(z, \mathcal{H}) + \int_X (f_{\mathcal{H}} - f_\rho)^2 = S(z, \mathcal{H}) + A(\mathcal{H}) \qquad (12)$$

where

$$S(z, \mathcal{H}) = \int_X (f_z - f_\rho)^2 - \int_X (f_{\mathcal{H}} - f_\rho)^2. \qquad (13)$$

The first term ($S$) on the right in Equation 12 must be estimated in probability over $z$, and the estimate is called the sample error (sometimes also the estimation error). It is naturally studied in the theory of probability and of empirical processes [16, 30, 31]. The second term ($A$) is dealt with via approximation theory (see [15] and [12, 14, 13, 32, 33]) and is called the approximation error. The decomposition of Equation 12 is related, but not equivalent, to the well known bias ($A$) and variance ($S$) decomposition in statistics.

3.1 Sample Error

First consider an estimate for the sample error, which will have the form

$$S(z, \mathcal{H}) \le \epsilon \qquad (14)$$

with high confidence, this confidence depending on $\epsilon$ and on the sample size $m$. Let us be more precise. Recall that the covering number $\mathrm{Cov}\#(\mathcal{H}, \eta)$ is the number of balls in $\mathcal{H}$ of radius $\eta$ needed to cover $\mathcal{H}$.

Theorem 3.1 Suppose $|f(x) - y| \le M$ for all $f \in \mathcal{H}$, for almost all $(x, y) \in X \times Y$. Then

$$\mathrm{Prob}_{z \in (X \times Y)^m} \{ S(z, \mathcal{H}) \le \epsilon \} \ge 1 - \delta$$

where $\delta = \mathrm{Cov}\#\!\left(\mathcal{H}, \frac{\epsilon}{24M}\right) e^{-\frac{m\epsilon}{288 M^2}}$.
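The exponential dependence on $m$ in the tail of Theorem 3.1 comes from the concentration inequality underlying its proof. A small Monte Carlo check of the Hoeffding inequality (our own illustrative sketch, with parameter values chosen arbitrarily) makes this concrete:

```python
import numpy as np

# Hoeffding: for m i.i.d. samples of a [0, 1]-bounded variable with mean mu,
#   Prob(|mean_m - mu| >= eps) <= 2 * exp(-2 * m * eps**2)
rng = np.random.default_rng(42)
m, eps, trials = 100, 0.1, 5000

samples = rng.random((trials, m))          # uniform on [0, 1], so mu = 0.5
deviations = np.abs(samples.mean(axis=1) - 0.5)
empirical = (deviations >= eps).mean()     # observed tail probability
bound = 2 * np.exp(-2 * m * eps ** 2)      # Hoeffding bound: 2 e^{-2}

assert empirical <= bound
```

The observed tail frequency falls well below the bound here; Hoeffding is loose for any fixed distribution, which is the price of holding uniformly over all bounded variables (and, after the covering-number step, over all of $\mathcal{H}$).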

The result is Theorem C* of [10], but earlier versions (usually without a topology on $\mathcal{H}$) have been proved by others, especially Vapnik, who formulated the notion of VC dimension to measure the complexity of the hypothesis space for the case of $\{0, 1\}$ functions. In a typical situation of Theorem 3.1 the hypothesis space $\mathcal{H}$ is taken to be as in Equation 10, where $B_R$ is the ball of radius $R$ in a Reproducing Kernel Hilbert Space (RKHS) with a smooth $K$ (or in a Sobolev space). In this context, $R$ plays a role analogous to the VC dimension [50]. Estimates for the covering numbers in these cases were provided by Cucker, Smale and Zhou [10, 54, 55]. The proof of Theorem 3.1 starts from the Hoeffding inequality (which can be regarded as an exponential version of Chebyshev's inequality of probability theory). One applies this estimate to the function $X \times Y \to \mathbb{R}$ which takes $(x, y)$ to $(f(x) - y)^2$. Then extending the estimate to the set of $f \in \mathcal{H}$ introduces the covering number into the picture. With a little more work, Theorem 3.1 is obtained.

3.2 Approximation Error

The approximation error $\int_X (f_{\mathcal{H}} - f_\rho)^2$ may be studied as follows. Suppose $B : L^2 \to L^2$ is a compact, strictly positive (selfadjoint) operator. Then let $E$ be the Hilbert space $\{ g \in L^2 : \|B^{-s} g\| < \infty \}$ with inner product $\langle g, h \rangle_E = \langle B^{-s} g, B^{-s} h \rangle_{L^2}$. Suppose moreover that $E \to L^2$ factors as $E \to C(X) \to L^2$, with the inclusion $J_E : E \hookrightarrow C(X)$ well defined and compact. Let $\mathcal{H}$ be $J_E(B_R)$, where $B_R$ is the ball of radius $R$ in $E$. A theorem on the approximation error is

Theorem 3.2 Let $0 < r < s$ and $\mathcal{H}$ be as above. Then

$$\|f_\rho - f_{\mathcal{H}}\|^2 \le \left(\frac{1}{R}\right)^{\frac{2r}{s-r}} \|B^{-r} f_\rho\|^{\frac{2s}{s-r}}.$$

We now use $\|\cdot\|$ for the norm in the space of square integrable functions on $X$, with measure $\rho_X$. For our main example of RKHS, take $B = L_K^{1/2}$, where $K$ is a Mercer kernel and

$$L_K f(x) = \int_X f(x') K(x, x') \qquad (15)$$

and we have taken the square root of the operator $L_K$. In this case $E$ is $\mathcal{H}_K$ as above. Details and proofs may be found in [10] and in [48].

3.3 Sample and approximation error for the regularization algorithm

The previous discussion depends upon a compact hypothesis space $\mathcal{H}$ from which the experimental optimum $f_z$ and the true optimum $f_{\mathcal{H}}$ are taken. In the key algorithm of Section 2, the optimization is done over all $f \in \mathcal{H}_K$ with a regularized error function. The error analysis of Sections 3.1 and 3.2 must therefore be extended. Thus let $f_{\gamma,z}$ be the empirical optimum for the regularized problem as exhibited in Equation 4:

$$\frac{1}{m} \sum_{i=1}^{m} (y_i - f(x_i))^2 + \gamma \|f\|_K^2. \qquad (16)$$

Then

$$\int (f_{\gamma,z} - f_\rho)^2 \le S(\gamma) + A(\gamma) \qquad (17)$$

where $A(\gamma)$ (the approximation error in this context) is

$$A(\gamma) = \gamma^{1/2} \| L_K^{-1/4} f_\rho \|^2 \qquad (18)$$

and the sample error is

$$S(\gamma) = \frac{32 M^2 (\gamma + C)^2}{\gamma^2} \, v^*(m, \delta) \qquad (19)$$

where $v^*(m, \delta)$ is the unique solution of

$$\frac{m}{4} v^3 - \ln\!\left(\frac{4m}{\delta}\right) v - c_1 = 0. \qquad (20)$$

Here C, c1 > 0 depend only on X and K. For the proof one reduces to the case of compact H and applies theorems 3.1 and 3.2. Thus finding the optimal solution is equivalent to finding the best tradeoff between A and S for a given m. In our case, this bias-variance problem is to minimize S(γ) + A(γ) over γ > 0. There is a unique solution – a best γ – for the choice in Equation 4. For this result and its consequences see [11].
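The existence of a best $\gamma$ minimizing $S(\gamma) + A(\gamma)$ can be observed empirically. The sketch below (our own code with illustrative parameter values, not the construction of [11]) sweeps $\gamma$ for the Gaussian-kernel regularization algorithm of Section 2 and reports held-out error; typically the curve is U-shaped, with very small $\gamma$ (overfitting noise) and very large $\gamma$ (oversmoothing) both doing worse than an intermediate value:

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_K(A, B, sigma=0.4):
    # Gaussian kernel matrix (Equation 1)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def heldout_error(gamma, m=60):
    # Noisy training sample and clean held-out sample from y = sin(2x)
    X = rng.uniform(0, 3, (m, 1))
    y = np.sin(2 * X).ravel() + 0.3 * rng.standard_normal(m)
    Xt = rng.uniform(0, 3, (200, 1))
    yt = np.sin(2 * Xt).ravel()
    # Equation 3, then Equation 2 on the held-out points
    c = np.linalg.solve(m * gamma * np.eye(m) + gauss_K(X, X), y)
    return ((gauss_K(Xt, X) @ c - yt) ** 2).mean()

for g in [1e-6, 1e-3, 1e-1, 10.0]:
    print(f"gamma={g:g}  held-out MSE={heldout_error(g):.3f}")
```

In practice, as the text notes, this sweep is exactly what cross-validation automates.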

4 Remarks

The tradeoff between sample complexity and hypothesis space complexity

For a given, fixed hypothesis space $\mathcal{H}$, only the sample error component of the error of $f_z$ can be controlled (in Equation 12 only $S(z, \mathcal{H})$ depends on the data). In this view, convergence of $S$ to zero as the number of data increases (Theorem 3.1) is then the central problem in learning. Vapnik called consistency of ERM (i.e. convergence of the empirical error to the true error) the key problem in learning theory, and in fact much modern work has focused on refining the necessary and sufficient conditions for consistency of ERM (the uniform Glivenko-Cantelli property of $\mathcal{H}$, finite $V_\gamma$ dimension for $\gamma > 0$, etc.; see [19]). More generally, however, there is a tradeoff between minimizing the sample error and minimizing the approximation error – what we referred to as the bias-variance problem. Increasing the number of data points $m$ decreases the sample error. The effect of increasing the complexity of the hypothesis space is trickier. Usually the approximation error decreases but the sample error increases. This means that there is an optimal complexity of the hypothesis space for a given number of training data. In the case of the regularization algorithm described in this paper this tradeoff corresponds to an optimum value for $\gamma$, as studied by [11, 35, 3]. In empirical work, the optimum value is often found through cross-validation techniques [53].

This tradeoff between approximation error and sample error is probably the most critical issue in determining good performance on a given problem. The class of regularization algorithms, such as Equation 4, shows clearly that it is also a tradeoff – quoting Girosi – between the curse of dimensionality (not enough examples) and the blessing of smoothness (which decreases the effective "dimensionality", i.e. the complexity of the hypothesis space) through the parameter $\gamma$.

The regularization algorithm and Support Vector Machines

There is nothing to stop us from using the algorithm we described in this paper – that is, square loss regularization – for binary classification. Whereas SVM classification arose from using – with binary $y$ – the loss function $V(f(x), y) = (1 - y f(x))_+$, we can perform least-squares regularized classification via the loss function $V(f(x), y) = (f(x) - y)^2$. This classification scheme was used at least as early as 1989 (for reviews see [7, 40]) and was then rediscovered by many others (see [21, 49]), including Mangasarian (who refers to square loss regularization as "proximal vector machines") and Suykens (who uses the name "least square SVMs"). Rifkin [47] has confirmed the interesting empirical results by Mangasarian and Suykens: "classical" square loss regularization works well also for binary classification (examples are in Tables 1 and 2).

Documents/class:   800           250           100            30
                SVM   RLSC   SVM   RLSC   SVM   RLSC   SVM   RLSC
Error rate:    0.131  0.129  0.167  0.165  0.214  0.211  0.311  0.309

Table 1. A comparison of SVM and RLSC (Regularized Least Squares Classification) accuracy on a multiclass classification task (the 20newsgroups dataset with 20 classes and high dimensionality, around 50,000), performed using the standard "one vs. all" scheme based on the use of binary classifiers. The top row indicates the number of documents/class used for training. Entries in the table are the fraction of misclassified documents. From [47].

            52             20             10              3
      SVM    RLSC    SVM    RLSC    SVM    RLSC    SVM    RLSC
     0.072  0.066   0.176  0.169   0.341  0.335   0.650  0.648

Table 2. A comparison of SVM and RLSC accuracy on another multiclass classification task (the sector105 dataset, consisting of 105 classes with dimensionality about 50,000). The top row indicates the number of documents per class used for training. Entries in the table are the fraction of misclassified documents. From [47].
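The RLSC scheme compared in the tables reduces to solving a single linear system: by the representer theorem, the square-loss regularized minimizer is f(x) = Σᵢ cᵢ K(x, xᵢ) with (K + λmI)c = y, and classification takes the sign of f. A minimal sketch, assuming a Gaussian kernel and made-up toy data (not Rifkin's experimental setup):

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    # K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma**2))

def rlsc_fit(X, y, lam=0.1):
    # Minimize (1/m) sum_i (y_i - f(x_i))^2 + lam ||f||_K^2;
    # the representer theorem gives f(x) = sum_i c_i K(x, x_i)
    # with coefficients solving (K + lam m I) c = y.
    m = len(y)
    K = gaussian_kernel(X, X)
    return np.linalg.solve(K + lam * m * np.eye(m), y)

def rlsc_predict(X_train, c, X_test):
    # Classify by the sign of the real-valued regression output.
    return np.sign(gaussian_kernel(X_test, X_train) @ c)

# Toy binary problem with labels in {-1, +1} (illustrative values).
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
c = rlsc_fit(X, y)
print(rlsc_predict(X, c, X))
```

For multiclass problems like those in the tables, this binary classifier would be wrapped in the "one vs. all" scheme: one such system per class, predicting the class whose real-valued output is largest.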

In references to supervised learning, the Support Vector Machine method is often described (see for instance a recent issue of the Notices of the AMS [28]) according to the "traditional" approach, introduced by Vapnik and followed by almost everybody else. In this approach, one starts with the concepts of separating hyperplanes and margin. Given the data, one searches for the linear hyperplane that separates the positive and the negative examples, assumed to be linearly separable, with the largest margin (the margin is defined as the distance from the hyperplane to the nearest example). Most articles and books follow this approach, go from the separable to the nonseparable case, and use a so-called "kernel trick" (!) to extend it to the nonlinear case. SVM for classification was introduced by Vapnik in the linear, separable case in terms of maximizing the margin. In the non-separable case, the margin motivation loses most of its meaning. A more general and simpler framework for deriving SVM algorithms for classification and regression is to regard them as special cases of regularization and follow the treatment of Section 2. In the case of linear functions f(x) = w · x and separable data, maximizing the margin is exactly equivalent to maximizing 1/||w||, which is in turn equivalent to minimizing ||w||², which corresponds to minimizing the RKHS norm.

The regularization algorithm and learning theory

The Mercer theorem was introduced in learning theory by Vapnik, and RKHS by Girosi [22] and later by Vapnik [9, 50]. Poggio and Girosi [41, 40, 23] had introduced Tikhonov regularization in learning theory (the reformulation of Support Vector Machines as a special case of regularization can be found in [19]). Earlier, Gaussian Radial Basis Functions were proposed as an alternative to neural networks by Broomhead and Lowe. Of course, RKHS had been pioneered by Parzen and Wahba ([37, 53]) for applications closely related to learning, including data smoothing (for image processing and computer vision, see [4, 42]).
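The margin/norm equivalence is easy to verify numerically: once a separating hyperplane w · x + b is rescaled to canonical form, so that the closest examples satisfy yᵢ(w · xᵢ + b) = 1, the geometric margin is exactly 1/||w||. A small sketch with hypothetical 2-D data:

```python
import numpy as np

# Toy linearly separable data with labels y in {-1, +1} (made-up values).
X = np.array([[2.0, 2.0], [3.0, 1.0], [-2.0, -1.0], [-1.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Some separating hyperplane w.x + b, rescaled to "canonical form" so
# that the closest examples satisfy y (w.x + b) = 1.
w = np.array([1.0, 1.0])
b = 0.0
scale = np.min(y * (X @ w + b))   # smallest functional margin
w, b = w / scale, b / scale       # now min_i y_i (w.x_i + b) = 1

# Geometric margin = distance from the hyperplane to the nearest example.
margin = np.min(y * (X @ w + b)) / np.linalg.norm(w)

# Maximizing the margin is the same as minimizing ||w||: margin = 1/||w||.
assert np.isclose(margin, 1.0 / np.linalg.norm(w))
print(margin, np.linalg.norm(w))
```

This is why the "largest margin" objective of the traditional SVM derivation coincides with minimizing the (RKHS) norm ||w||² in the regularization view.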
A Bayesian interpretation

The learning algorithm Equation 4 has an interesting Bayesian interpretation [52, 53]: the data term – that is, the first term with the quadratic loss function – is a model of (Gaussian, additive) noise, and the RKHS norm (the stabilizer) corresponds to a prior probability on the hypothesis space H. Let us define P[f|Sm] as the conditional probability of the function f given the training examples Sm = (x_i, y_i), i = 1, ..., m, P[Sm|f] as the conditional probability of Sm given f, i.e. a model of the noise, and P[f] as the a priori probability of the random field f. Then Bayes theorem provides the posterior distribution as

P[f|Sm] = P[Sm|f] P[f] / P(Sm).

If the noise is normally distributed with variance σ, then the probability P[Sm|f] is

P[Sm|f] = (1/Z_L) exp(−(1/2σ²) Σ_{i=1}^m (y_i − f(x_i))²),

where Z_L is a normalization constant. If P[f] = (1/Z_r) exp(−||f||²_K), where Z_r is another normalization constant, then

P[f|Sm] = (1/(Z_D Z_L Z_r)) exp(−((1/2σ²) Σ_{i=1}^m (y_i − f(x_i))² + ||f||²_K)).

One of the several possible estimates of f from P[f|Sm] is the so-called Maximum A Posteriori (MAP) estimate, that is,

max P[f|Sm] = min Σ_{i=1}^m (y_i − f(x_i))² + 2σ² ||f||²_K,

which is the same as the regularization functional if λ = 2σ²/m (for details and extensions to models of non-Gaussian noise and different loss functions see [19]).

Necessary and sufficient conditions for learnability

Compactness of the hypothesis space H is sufficient for consistency of ERM, that is, for bounds of the type of Theorem 3.1 on the sample error. The necessary and sufficient condition is that H is a uniform Glivenko-Cantelli class of functions, in which case no specific topology is assumed for H⁴. There are several equivalent conditions on H, such as finiteness of the Vγ dimension for all positive γ (which reduces to finiteness of the VC dimension for {0, 1} functions)⁵. We saw earlier that the regularization algorithm Equation 4 ensures (through the resulting compactness of the "effective" hypothesis space) well-posedness of the problem. It also yields convergence of the empirical error to the true error (e.g. bounds such as Theorem 3.1). An open question is whether there is a connection between well-posedness and consistency. For well-posedness the critical condition is usually stability of the solution. In the learning problem, this condition refers to stability of the solution of ERM with respect to small changes of the training set Sm. In a similar way, the condition number (see [6] and especially [29]) characterizes the stability of the solution of Equation 3. Is it possible that some specific form of stability may be necessary and sufficient for consistency of ERM? Such a result would be surprising because, a priori, there is no reason why there should be a connection between well-posedness and consistency: they are both important requirements for ERM, but they seem quite different and independent of each other.

Learning theory, sample complexity and brains

The theory of supervised learning outlined in this paper and in the references has achieved a remarkable degree of completeness and of practical success in many applications.
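The MAP–regularization equivalence above can be checked numerically: by the representer theorem, both objectives are minimized by kernel coefficients c solving a linear system, and the two systems coincide exactly when λ = 2σ²/m. A minimal sketch (the Gaussian kernel and toy data are assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, sigma = 20, 0.3
x = rng.uniform(-1, 1, m)
y = np.sin(np.pi * x) + rng.normal(0, sigma, m)

# Gaussian kernel matrix on the training points.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.5)

# MAP: minimize sum_i (y_i - f(x_i))^2 + 2 sigma^2 ||f||_K^2
# -> coefficients solve (K + 2 sigma^2 I) c = y.
c_map = np.linalg.solve(K + 2 * sigma**2 * np.eye(m), y)

# Regularization functional: (1/m) sum_i (y_i - f(x_i))^2 + lam ||f||_K^2
# -> coefficients solve (K + lam m I) c = y.  With lam = 2 sigma^2 / m
# the two systems are term-by-term identical.
lam = 2 * sigma**2 / m
c_reg = np.linalg.solve(K + lam * m * np.eye(m), y)

assert np.allclose(c_map, c_reg)
```

The check is trivial once written down (the two objectives differ only by the constant factor 1/m), which is exactly the point of the equivalence.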

⁴ Definition: let F be a class of functions f. F is a uniform Glivenko-Cantelli class if for every ε > 0

lim_{m→∞} sup_ρ P{ sup_{f∈F} |E_{ρ_m} f − E_ρ f| > ε } = 0,   (21)

where ρ_m is the empirical measure supported on a set x_1, ..., x_m.

⁵ In [1] – following [51, 17] – a necessary and sufficient condition is proved for uniform convergence of |I_emp[f] − I_exp[f]|, in terms of the finiteness for all γ > 0 of a combinatorial quantity called the V_γ dimension of F (which here is the set of functions V(y, f(x)), f ∈ H), under some assumptions on V. The result is based on a necessary and sufficient (distribution-independent) condition proved by [51], which uses the metric entropy of F, defined as H_m(ε, F) = sup_{x^m ∈ X^m} log N(ε, F, x^m), where N(ε, F, x^m) is the ε-covering of F with respect to the l^∞_{x^m} distance (the l^∞ distance on the points x^m). Theorem (Dudley, see [18]): F is a uniform Glivenko-Cantelli class iff lim_{m→∞} H_m(ε, F)/m = 0 for all ε > 0.

Within it, many interesting problems remain open and are a fertile ground for interesting and useful mathematics. One may also take a broader view and ask: what next?

One could argue that the most important aspect of intelligence and of the amazing performance of real brains is the ability to learn. How then do the learning machines we have described in the theory compare with brains? There are of course many aspects of biological learning that are not captured by the theory, and several difficulties in making any comparison. One of the most obvious differences, however, is the ability of people and animals to learn from very few examples. The algorithms we have described can learn an object recognition task from a few thousand labeled images. This is a small number compared with the apparent dimensionality of the problem (thousands of pixels), but a child, or even a monkey, can learn the same task from just a few examples. Of course, evolution has probably done a part of the learning, but so have we, when we choose for any given task an appropriate input representation for our learning machine. From this point of view, as Donald Geman has argued, the interesting limit is not "m goes to infinity," but rather "m goes to zero." Thus an important area for future theoretical and experimental work is learning from partially labeled examples (and the related area of active learning). In the first case there are only a small number ℓ of labeled pairs (x_i, y_i), i = 1, ..., ℓ – for instance with y_i binary – and many unlabeled data (x_i), i = ℓ+1, ..., m, with m ≫ ℓ. Though interesting work has begun in this direction, a satisfactory theory that provides conditions under which unlabeled data can be used is still lacking.

A comparison with real brains offers another, and probably related, challenge to learning theory. The "learning algorithms" we have described in this paper correspond to one-layer architectures. Are hierarchical architectures with more layers justifiable in terms of learning theory?
It seems that the learning theory of the type we have outlined does not offer any general argument in favor of hierarchical learning machines for regression or classification. This is somewhat of a puzzle, since the organization of cortex – for instance visual cortex – is strongly hierarchical. At the same time, hierarchical learning systems show superior performance in several engineering applications. For instance, a face categorization system in which a single SVM classifier combines the real-valued output of a few classifiers, each trained on a different component of faces – such as the eyes and nose – outperforms a single classifier trained on full images of faces ([25]). The theoretical issues surrounding hierarchical systems of this type are wide open, and likely to be of paramount importance for the next major development of efficient classifiers in several application domains.

Why hierarchies? There may be reasons of efficiency – computational speed and use of computational resources. For instance, the lowest levels of the hierarchy may represent a dictionary of features that can be shared across multiple classification tasks (see [24]). Hierarchical systems usually decompose a task into a series of simple computations at each level – often an advantage for fast implementations. There may also be the more fundamental issue of sample complexity. We mentioned that an obvious difference between our best classifiers and human learning is the number of examples required in tasks such as object detection. The theory described in this paper shows that the difficulty of a learning task depends on the size of the required hypothesis space. This complexity determines in turn how many training examples are needed to achieve a given level of generalization error. Thus the complexity of the hypothesis space sets the speed limit and the sample complexity for learning. If a task – like a visual recognition task – can be decomposed into low-complexity learning tasks for each layer of a hierarchical learning machine, then each layer may require only a small number of training examples. Of course, not all classification tasks have a hierarchical representation. Roughly speaking, the issue is under which conditions a function of many variables can be approximated by a function of a small number of functions of subsets of the original variables. Neuroscience suggests that what humans can learn can be represented by hierarchies that are locally simple. Thus our ability to learn from just a few examples, and its limitations, may be related to the hierarchical architecture of cortex. This is just one of several possible connections, still to be characterized, between learning theory and the ultimate problem in natural science – the organization and the principles of higher brain functions.

Acknowledgments

Thanks to Felipe Cucker, Federico Girosi, Don Glaser, Sayan Mukherjee, Massimiliano Pontil, Martino Poggio and Ryan Rifkin.

References

1. N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
2. N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.
3. A.R. Barron. Approximation and estimation bounds for artificial neural networks. Machine Learning, 14:115–133, 1994.
4. M. Bertero, T. Poggio, and V. Torre. Ill-posed problems in early vision. Proceedings of the IEEE, 76:869–889, 1988.
5. D. Beymer and T. Poggio. Image representations for visual learning. Science, 272(5270):1905–1909, June 1996.
6. O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.
7. D.S. Broomhead and D. Lowe. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.
8. R. Brunelli and T. Poggio. HyperBF networks for real object recognition. In Proceedings IJCAI, Sydney, Australia, 1991.
9. C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.
10. F. Cucker and S. Smale. On the mathematical foundations of learning. Bulletin of the AMS, 39:1–49, 2001.
11. F. Cucker and S. Smale. Best choices for regularization parameters in learning theory: on the bias-variance problem. Foundations of Computational Mathematics, 2(4):413–428, 2002.


12. I. Daubechies. Ten Lectures on Wavelets. CBMS-NSF Regional Conference Series in Applied Mathematics. SIAM, Philadelphia, PA, 1992.
13. R. DeVore, R. Howard, and C. Micchelli. Optimal nonlinear approximation. Manuscripta Mathematica, 1989.
14. R.A. DeVore, D. Donoho, M. Vetterli, and I. Daubechies. Data compression and harmonic analysis. IEEE Transactions on Information Theory, 44:2435–2476, 1998.
15. R.A. DeVore. Nonlinear approximation. Acta Numerica, 7:51–150, 1998.
16. L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Number 31 in Applications of Mathematics. Springer, New York, 1996.
17. R.M. Dudley. Universal Donsker classes and metric entropy. Annals of Probability, 14(4):1306–1326, 1987.
18. R.M. Dudley, E. Giné, and J. Zinn. Uniform and universal Glivenko-Cantelli classes. Journal of Theoretical Probability, 4:485–510, 1991.
19. T. Evgeniou, M. Pontil, and T. Poggio. Regularization networks and support vector machines. Advances in Computational Mathematics, 13:1–50, 2000.
20. T. Ezzat, G. Geiger, and T. Poggio. Trainable videorealistic speech animation. In Proceedings of ACM SIGGRAPH 2002, San Antonio, TX, pages 388–398, 2002.
21. G. Fung and O.L. Mangasarian. Proximal support vector machine classifiers. In KDD 2001: Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, 2001.
22. F. Girosi. An equivalence between sparse approximation and Support Vector Machines. Neural Computation, 10(6):1455–1480, 1998.
23. F. Girosi, M. Jones, and T. Poggio. Regularization theory and neural networks architectures. Neural Computation, 7:219–269, 1995.
24. T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in Statistics. Springer Verlag, Basel, 2001.
25. B. Heisele, T. Serre, M. Pontil, T. Vetter, and T. Poggio. Categorization by learning and combining object parts. In Advances in Neural Information Processing Systems 14 (NIPS'01), pages 1239–1245. MIT Press, 2002.
26. B. Heisele, A. Verri, and T. Poggio. Learning and vision machines. Proceedings of the IEEE, 90:1164–1177, 2002.
27. J. Hutchinson, A. Lo, and T. Poggio. A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance, XLIX(3), July 1994.
28. R. Karp. Mathematical challenges from genomics and molecular biology. Notices of the AMS, 49:544–553, 2002.
29. S. Kutin and P. Niyogi. Almost-everywhere algorithmic stability and generalization error. Technical Report TR-2002-03, University of Chicago, 2002.
30. S. Mendelson. Improving the sample complexity using global data. IEEE Transactions on Information Theory, 48(7):1977–1991, 2002.
31. S. Mendelson. Geometric parameters in learning theory. Submitted for publication, 2003.
32. C.A. Micchelli. Interpolation of scattered data: distance matrices and conditionally positive definite functions. Constructive Approximation, 2:11–22, 1986.
33. C.A. Micchelli and T.J. Rivlin. A survey of optimal recovery. In C.A. Micchelli and T.J. Rivlin, editors, Optimal Estimation in Approximation Theory, pages 1–54. Plenum Press, New York, 1976.
34. J. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281–294, 1989.


35. P. Niyogi and F. Girosi. Generalization bounds for function approximation from scattered noisy data. Advances in Computational Mathematics, 10:51–80, 1999.
36. C. Papageorgiou, M. Oren, and T. Poggio. A general framework for object detection. In Proceedings of the International Conference on Computer Vision, Bombay, India, January 1998.
37. E. Parzen. An approach to time series analysis. Annals of Mathematical Statistics, 32:951–989, 1961.
38. T. Poggio and R. Brunelli. A novel approach to graphics. A.I. Memo No. 1354, Artificial Intelligence Laboratory, Massachusetts Institute of Technology, 1992.
39. T. Poggio and S. Edelman. A network that learns to recognize 3D objects. Nature, 343:263–266, 1990.
40. T. Poggio and F. Girosi. Networks for approximation and learning. Proceedings of the IEEE, 78(9), September 1990.
41. T. Poggio and F. Girosi. Regularization algorithms for learning that are equivalent to multilayer networks. Science, 247:978–982, 1990.
42. T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317:314–319, 1985.
43. T. Poggio and A. Verri. Introduction: Learning and vision at CBCL. International Journal of Computer Vision, 38(1), 2000.
44. M.J.D. Powell. Radial basis functions for multivariable interpolation: a review. In J.C. Mason and M.G. Cox, editors, Algorithms for Approximation. Clarendon Press, Oxford, 1987.
45. M.J.D. Powell. The theory of radial basis function approximation in 1990. In W.A. Light, editor, Advances in Numerical Analysis Volume II: Wavelets, Subdivision Algorithms and Radial Basis Functions, pages 105–210. Oxford University Press, 1992.
46. Ramaswamy, Tamayo, Rifkin, Mukherjee, Yeang, Angelo, Ladd, Reich, Latulippe, Mesirov, Poggio, Gerald, Loda, Lander, and Golub. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences, December 2001.
47. R.M. Rifkin. Everything Old Is New Again: A Fresh Look at Historical Approaches to Machine Learning. PhD thesis, Massachusetts Institute of Technology, 2002.
48. S. Smale and D. Zhou. Estimating the approximation error in learning theory. Analysis and Applications, 1:1–25, 2003.
49. J. Suykens, T. Van Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle. Least Squares Support Vector Machines. World Scientific, Singapore, 2002.
50. V.N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
51. V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 17(2):264–280, 1971.
52. G. Wahba. Smoothing and ill-posed problems. In M. Golberg, editor, Solution Methods for Integral Equations and Applications, pages 183–194. Plenum Press, New York, 1979.
53. G. Wahba. Spline Models for Observational Data. Series in Applied Mathematics, Vol. 59. SIAM, Philadelphia, 1990.
54. D. Zhou. The regularity of reproducing kernel Hilbert spaces in learning theory. Preprint, 2001.
55. D. Zhou. The covering number in learning theory. Journal of Complexity, 18:739–767, 2002.

Logical Regression Analysis: From mathematical formulas to linguistic rules

Hiroshi Tsukimoto
Tokyo Denki University
[email protected]

1 Introduction

Data mining means the discovery of knowledge from (a large amount of) data, and so data mining should provide not only predictions but also knowledge, such as rules, that is comprehensible to humans. Data mining techniques should satisfy two requirements: accurate predictions and comprehensible rules.

Data mining consists of several processes, such as preprocessing and learning. This paper deals with learning. The learning in data mining can be divided into supervised learning and unsupervised learning. This paper deals with supervised learning, that is, classification and regression.

The major data mining techniques are neural networks, statistics, decision trees, and association rules. When these techniques are applied to real data, which usually consist of discrete data and continuous data, they each have their own problems. In other words, there is no perfect technique: no single technique can satisfy the two requirements of accurate predictions and comprehensible rules. Neural networks are black boxes, that is, neural networks are incomprehensible. Multiple regression formulas, the typical statistical models, are black boxes too. Decision trees do not work well when classes are continuous [19]: if accurate predictions are desired, comprehensibility has to be sacrificed, and if comprehensibility is desired, accurate predictions have to be sacrificed. Association rules, which are unsupervised learning techniques, do not work well when the right-hand sides of rules, which can be regarded as classes, are continuous [21]; the reason is almost the same as that for decision trees.

Neural networks and multiple regression formulas are mathematical formulas. Decision trees and association rules are linguistic rules. Mathematical formulas can provide predictions but cannot provide comprehensible rules.
On the other hand, linguistic rules can provide comprehensible rules but cannot provide accurate predictions for continuous classes.


How can we solve the above problem? The solution is rule extraction from mathematical formulas. Rule extraction from mathematical formulas is needed to develop perfect data mining techniques satisfying the two data mining requirements, accurate predictions and comprehensible rules. Several researchers have been developing rule extraction techniques from neural networks, and therefore neural networks can be used in data mining [14], whereas few researchers have studied rule extraction from linear formulas.

We have developed a rule extraction technique from mathematical formulas such as neural networks and linear formulas. The technique is called the Approximation Method. We have implemented the Approximation Method for neural networks in a data mining tool, KINOsuite-PR [41]. Statistics has developed many techniques, and the outputs of several of them are linear formulas. The Approximation Method can extract rules from linear formulas, and so the combination of statistical techniques and the Approximation Method works well for data mining.

We have developed a data mining technique called Logical Regression Analysis (LRA). LRA consists of regression analysis¹ and the Approximation Method. Since there are several regression analyses, LRA has several versions. In other words, LRA can be applied to a variety of data, because a variety of regression analyses have been developed.

So far, in data mining, many researchers have been dealing with symbolic data or numerical data, and few researchers have been dealing with image data. One of LRA's merits is that LRA can deal with image data, that is, LRA can discover rules from image data. Therefore, this paper explains data mining from images by LRA. There are strong correlations between the pixels of images, and so the numbers of samples are small compared with the numbers of pixels, that is, the numbers of attributes (variables).
A regression analysis for images must therefore work well when the correlations among attributes are strong and the number of data is small. Nonparametric regression analysis works well under exactly these conditions. There are many kinds of images, such as remote-sensing images, industrial images and medical images. Within medical images, there are many types, such as brain images, lung images, and stomach images. Brain functions are the most complicated and much about them remains unknown, and consequently the discovery of rules between brain areas and brain functions is a significant subject. Therefore, we have been dealing with functional brain images. We have been applying LRA using nonparametric regression analysis to fMRI images to discover the rules of brain functions. In this paper, for simplification, "LRA using nonparametric regression analysis" is abbreviated to "LRA". LRA was applied to several experimental tasks. This paper reports the experimental results of finger tapping and calculation. The results of LRA include discoveries that have never been experimentally confirmed. Therefore, LRA has the possibility of providing new evidence in brain science.

¹ The regression analysis includes the nonlinear regression analysis using neural networks.
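To illustrate why this regime (many strongly correlated attributes, few samples) needs a regularized fit, here is a small sketch, not the chapter's actual nonparametric method: with p ≫ n the normal equations of ordinary least squares are rank-deficient, while a ridge-style penalty keeps the problem well-posed. All data below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                      # far fewer samples than attributes

# Build strongly correlated columns (each base column repeated 10 times
# plus tiny noise), mimicking neighboring pixels in an image.
base = rng.normal(size=(n, 5))
X = np.repeat(base, 10, axis=1) + 0.01 * rng.normal(size=(n, p))

w_true = np.zeros(p)
w_true[0] = 1.0
y = X @ w_true

# X^T X has rank <= n < p, so plain least squares is ill-posed here.
assert np.linalg.matrix_rank(X.T @ X) < p

# A ridge penalty alpha * ||w||^2 makes the normal equations invertible.
alpha = 1e-2
w = np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

print(np.linalg.norm(X @ w - y))   # small training residual despite p >> n
```

The specific penalty is an illustrative choice; the chapter's nonparametric regression analysis addresses the same ill-posedness in its own way.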

[Fig. 1. Logical Regression Analysis – a pipeline: Data → Regression Analysis → The Approximation Method → Rules (Predictions).]

Section 2 explains the problems in the major data mining techniques and how the problems can be solved by rule extraction from mathematical formulas. Section 3 surveys rule extraction from neural networks. Section 4 explains the Approximation Method, which has been developed by the author; the method is mathematically based on the multilinear function space, which is also explained. Section 5 explains the Approximation Method in the continuous domain. Section 6 discusses a few matters. Section 7 briefly explains data mining from fMRI images. Section 8 describes data mining from fMRI images by Logical Regression Analysis (LRA). Section 9 presents the calculation experiments. This paper is mainly based on [32], [33], [35], and [36].


2 The problems of major data mining techniques

The major data mining techniques, that is, decision trees, neural networks, statistics and association rules, are reviewed in terms of the two requirements of accurate predictions and comprehensible rules.

2.1 Neural networks

[Fig. 2. Neural network: inputs x and y, two hidden units S1 and S2, and output z; weights w1, ..., w6 connect the input layer, hidden layer, and output layer.]

Neural networks can provide accurate predictions in the discrete domain and the continuous domain. The problem is that the training results of neural networks are sets of mathematical formulas, that is, neural networks are incomprehensible black boxes. For example, Fig. 2 shows a trained neural network. In Fig. 2, x and y are inputs, z is the output, and the t_i are the outputs of the two hidden units. Each output is represented as follows:

t1 = S(w1 x + w2 y + h1),
t2 = S(w3 x + w4 y + h2),
z = S(w5 t1 + w6 t2 + h3),

where the w_i stand for weight parameters, S(x) stands for the sigmoid function, and the h_i stand for biases. The weight parameters are as follows:


w1 = 2.51, w2 = −4.80, w3 = −4.90, w4 = 2.83, w5 = 2.52, w6 = −4.81,
h1 = −0.83, h2 = −1.12, h3 = −0.82.

From the above weight parameters and the above formulas, we cannot understand what the neural network has learned. Therefore, the trained neural network is a black box. What the neural network has learned will be explained later using a rule extraction technique.

2.2 Multiple regression analysis

There are a lot of statistical methods. The most typical method is multiple regression analysis, and therefore only multiple regression analysis is discussed here. Multiple regression analysis usually uses linear formulas, and so only linear regression analysis is possible and nonlinear regression analysis is impossible, while neural networks can perform nonlinear regression analysis. However, linear regression analysis has the following advantages:

1. The optimal solution can be calculated.
2. Linear regression analysis is the most widely used statistical method in the world.
3. Several regression analysis techniques, such as nonparametric regression analysis and multivariate autoregression analysis, have been developed based on linear regression analysis.

Linear regression analysis can provide appropriate predictions in the continuous domain and the discrete domain. The problem is that multiple regression formulas are mathematical formulas, as shown below:

y = Σ_{i=1}^n a_i x_i + b,

where y is a dependent variable, the x_i are independent variables, b is a constant, the a_i are coefficients, and n is the number of independent variables. The mathematical formulas are incomprehensible black boxes. A mathematical formula consisting of a few independent variables may be understandable. However, a mathematical formula consisting of a lot of independent variables cannot be understood. Moreover, in multivariate autoregression analysis, there are several mathematical formulas, and so the set of mathematical formulas cannot be understood at all. Therefore, rule extraction from linear regression formulas is important.

2.3 Decision trees

When a class is continuous, the class is discretized into several intervals. When the number of the intervals, that is, the number of the discretized classes, is


small, comprehensible trees can be obtained, but the tree cannot provide accurate predictions. For example, Fig. 3 shows a tree where there are two continuous attributes and a continuous class. To improve the prediction accuracy, let the number of intervals be large. When the number of intervals is large, the trees obtained are too complicated to be comprehensible. From the above simple discussion, we can conclude that it is impossible to obtain trees which satisfy the two requirements for data mining techniques, accurate predictions and comprehensible rules, at the same time. Therefore, decision trees cannot work well when classes are continuous [19]. As a solution for continuous classes, for example, Quinlan presented Cubist [40]. Cubist generates piecewise-linear models, which are a kind of regression tree. Fig. 4 shows an example. As seen from this figure, the tree is an extension of linear formulas, and so the tree is a prediction model, that is, it is incomprehensible. As a result, this solution has solved the inaccurate prediction problem but has generated the incomprehensibility problem.

2.4 Association rules

Association rules are described as a ∧ b → c, where c can be regarded as a class. Association rules do not work well when "classes" like c in the above rule are continuous. When there are many intervals, the rules are too complicated to be comprehensible, whereas when there are few intervals, the rules cannot provide accurate predictions, and therefore some concessions are needed [21]. Some techniques such as fuzzy techniques can be applied to the above problem [11], but while fuzzy techniques can improve predictions, they degrade comprehensibility.
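The accuracy/comprehensibility tradeoff for a discretized continuous class can be seen in a toy computation (an illustrative sketch, not from the chapter): predicting a continuous target by the midpoint of its interval, more intervals mean lower error, but every interval is another rule or leaf the reader must digest.

```python
import numpy as np

def discretized_error(n_bins, n_points=1000):
    """RMSE of predicting a continuous class y in [0, 1) by the midpoint
    of its interval, using n_bins equal-width intervals."""
    y = np.linspace(0.0, 1.0, n_points, endpoint=False)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    midpoints = (edges[:-1] + edges[1:]) / 2
    return np.sqrt(np.mean((y - midpoints[idx]) ** 2))

for n_bins in (2, 5, 20):
    # Each bin corresponds to (at least) one rule/leaf in the tree.
    print(n_bins, "intervals -> RMSE", round(discretized_error(n_bins), 4))
```

The error shrinks roughly in proportion to the interval width (for a uniform target, RMSE ≈ width/√12), so accuracy can only be bought with more intervals, i.e. with a larger, less comprehensible rule set.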

σ_i > 0 for 1 ≤ i ≤ k and σ_i = 0 for i > k. The product of the resulting matrices is a matrix A_k which is an approximation to A with rank k, i.e.

A_k = U_k Σ_k V_k^T    (3.39)
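The rank-k approximation in (3.39) is straightforward to compute with any SVD routine. A generic numpy sketch, using made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

# Thin SVD: A = U diag(s) V^T with singular values s in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

def rank_k_approx(k):
    # Keep the k largest singular values, drop the rest:
    # A_k = U_k Sigma_k V_k^T
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A2 = rank_k_approx(2)
assert np.linalg.matrix_rank(A2) == 2

# Keeping all singular values reconstructs A exactly.
assert np.allclose(rank_k_approx(4), A)

# The approximation error never increases as k grows.
errs = [np.linalg.norm(A - rank_k_approx(k)) for k in range(1, 5)]
assert all(e1 >= e2 for e1, e2 in zip(errs, errs[1:]))
```

In Frobenius norm the error of A_k is the square root of the sum of the discarded squared singular values, which is why truncating at the first small σ_i loses little information.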

Privacy-Preserving Data Mining

Chris Clifton et al.

…if a rule has support > k% globally, it must have support > k% on at least one of the individual sites. A distributed algorithm based on this would work as follows: request that each site send all rules with support at least k. For each rule returned, request that all sites send the count of their transactions that support the rule, and the total count of all transactions at the site. From
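The accumulation behind the secure threshold test can be sketched in a few lines, using the values from Fig. 1 (the ring-passing and the cryptographic protection of the running value are omitted; only the arithmetic is shown):

```python
# Each site holds (local support count for itemset ABC, local DB size).
sites = [(5, 100), (6, 200), (20, 300)]   # values from Fig. 1
threshold_pct = 5                         # 5% global support threshold

R = 17            # random offset chosen by the initiating site
v = R
for count, dbsize in sites:
    # Each site adds its local "excess support" and passes v along.
    v += count - dbsize * threshold_pct / 100

# v is now 18, matching Fig. 1.  The initiating site only learns whether
# v >= R, i.e. whether total excess support is non-negative -- not any
# individual site's local count.
globally_large = v >= R
print(v, globally_large)
```

Because each site only sees the running total offset by the unknown random R, no site learns another site's local count; only the final comparison v ≥ R (here 18 ≥ 17) is revealed.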

[Fig. 1. Determining if itemset support exceeds the 5% threshold. Three sites hold itemset ABC locally: Site 1 with count 5 of DBSize 100, Site 2 with 6 of 200, Site 3 with 20 of 300. Starting from a random value R = 17, each site in turn computes R + count − 5% × DBSize (17 + 5 − 5 = 17; 17 + 6 − 10 = 13; 13 + 20 − 15 = 18); the final test 18 ≥ R shows that ABC is globally supported.]

this, we can compute the global support of each rule, and (from the lemma) be certain that all rules with support at least k have been found. This has been shown to be an effective pruning technique[7]. In order to use the above lemma, we need to compute the union of locally large sets. We then use the secure summation and comparison only on the candidate itemsets contained in the union. Revealing candidate itemsets means that the algorithm is no longer fully secure: itemsets that are large at one site, but not globally large, would not be disclosed by a fully secure algorithm. However, by computing the union securely, we prevent disclosure of which site, or even how many sites, support a particular itemset. This release of innocuous information (included in the final result) enables a completely secure algorithm that approaches the efficiency of insecure distributed association rule mining algorithms. The function now being computed reveals more information than the original association rule mining function. However, the key is that we have provable limits on what is disclosed. We now demonstrate how to securely compute a union. Secure union of locally large itemsets One way to securely compute a union is to directly apply secure circuit evaluation as follows: For each possible large k-itemset, each site can create a 0/1 vector such that if the ith itemset is locally supported at the site, it will set the ith bit of its vector to 1 otherwise it will set it to 0. Let’s denote this vector as vj for site j and let vj (i) be the ith bit of this vector. All the itemsets are arranged according to lexicographic order. Now for any given itemset, we

Privacy-Preserving Data Mining


can find its index i and evaluate ∨_{j=1}^{n} vj(i), where ∨ is the OR gate. Assuming we use a generic secure circuit evaluation for the OR gate (∨), the above protocol is secure and reveals nothing other than the set-union result. However, an expensive circuit evaluation is needed for each potential large k-itemset, and this secure method does not exploit the fact that local pruning eliminates some of the large itemsets. We now give a much more efficient method for this problem. Although the new method reveals a little more information than the above protocol, a precise description of what is revealed is given, and we prove that nothing else is revealed.

To obtain an efficient solution without revealing what each site supports, we instead exchange locally large itemsets in a way that obscures the source of each itemset. We assume a secure commutative encryption algorithm with negligible collision probability. Intuitively, under commutative encryption, the order of encryption does not matter: if a plaintext message is encrypted by two different keys in either order, it is mapped to the same ciphertext. Formally, commutativity ensures that E_{k1}(E_{k2}(x)) = E_{k2}(E_{k1}(x)). Several commutative encryption schemes exist; RSA is perhaps the best known.

The detailed algorithm is given in Algorithm 1; we now briefly explain its key phases. The main idea is that each site encrypts its locally supported itemsets, along with enough "fake" itemsets to hide the actual number supported. Each site then encrypts the itemsets from other sites. In Phases 2 and 3, the sets of encrypted itemsets are merged. Due to commutative encryption, duplicates among the locally supported itemsets remain duplicates among the encrypted itemsets, and can be deleted. The merge occurs in two phases because if a site knew which fully encrypted itemsets came from which sites, it could compute the size of the intersection between any set of sites. While this is generally innocuous, if a site had this information for itself, it could guess at the itemsets supported by other sites. Permuting the order after encryption in Phase 1 prevents knowing exactly which itemsets match; separately merging itemsets from odd and even sites in Phase 2 prevents any site from knowing the fully encrypted values of its own itemsets. Phase 4 decrypts the merged frequent itemsets. Commutativity of encryption allows all itemsets to be decrypted in the same order regardless of the order in which they were encrypted, preventing sites from tracking the source of each itemset.

The detailed algorithm assumes the following representations: F is data that can be used as fake itemsets; LLe_{i(k)} is the set of encrypted k-itemsets at site i; E_i and D_i are encryption and decryption by site i.

An illustration of the above protocol is given in Figure 2. Using commutative encryption, each party encrypts its own frequent itemsets (e.g., Site 1 encrypts itemset C). The encrypted itemsets are then passed to other parties, until all parties have encrypted all itemsets. These are passed to a common party to eliminate duplicates, and to begin decryption. (In the figure, the full


Algorithm 1 [20] Finding the secure union of large itemsets of size k
Require: N ≥ 3 sites numbered 0..N − 1, set F of non-itemsets.
Phase 0: Encryption of all the rules by all sites
  for each site i do
    generate LLi(k) (locally large k-itemsets)
    LLei(k) = ∅
    for each X ∈ LLi(k) do
      LLei(k) = LLei(k) ∪ {Ei(X)}
    end for
    for j = |LLei(k)| + 1 to |CG(k)| do
      LLei(k) = LLei(k) ∪ {Ei(random selection from F)}
    end for
  end for
Phase 1: Encryption by all sites
  for round j = 0 to N − 1 do
    if j = 0 then
      each site i sends permuted LLei(k) to site (i + 1) mod N
    else
      each site i encrypts all items in LLe((i−j) mod N)(k) with Ei, permutes, and sends it to site (i + 1) mod N
    end if
  end for
  {At the end of Phase 1, site i holds the itemsets of site (i + 1) mod N, encrypted by every site}
Phase 2: Merge odd/even itemsets
  Each site i sends LLe((i+1) mod N)(k) to site 1 − (((i + 1) mod N) mod 2)
  Site 0 sets RuleSet1 = ∪_{j=1}^{⌈(N−1)/2⌉} LLe(2j−1)(k)
  Site 1 sets RuleSet0 = ∪_{j=0}^{⌊(N−1)/2⌋} LLe(2j)(k)
Phase 3: Merge all itemsets
  Site 1 sends permuted RuleSet1 to site 0
  Site 0 sets RuleSet = RuleSet0 ∪ RuleSet1
Phase 4: Decryption
  for i = 0 to N − 2 do
    Site i decrypts items in RuleSet using Di
    Site i sends permuted RuleSet to site (i + 1) mod N
  end for
  Site N − 1 decrypts items in RuleSet using D_{N−1}
  RuleSet(k) = RuleSet − F
  Site N − 1 broadcasts RuleSet(k) to sites 0..N − 2
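The commutativity property that Algorithm 1 relies on is easy to check concretely. The sketch below uses Pohlig-Hellman-style exponentiation modulo a public prime; the prime and keys are illustrative choices of ours, not vetted parameters.

```python
p = 2**127 - 1          # a public Mersenne prime; exponents act as keys

def enc(x, key):
    # E_k(x) = x^k mod p; a key must be coprime to p - 1 to be invertible
    return pow(x, key, p)

k1, k2 = 65537, 5       # two sites' keys (both coprime to p - 1)
x = 123456789           # an itemset encoded as an integer

# The order of encryption does not matter:
assert enc(enc(x, k1), k2) == enc(enc(x, k2), k1)

# Decryption raises to the inverse exponent modulo p - 1:
d1 = pow(k1, -1, p - 1)
assert enc(enc(x, k1), d1) == x
```

RSA, mentioned above, has the same structural property; this toy cipher merely makes the commutativity visible with small code.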


set of itemsets is shown to the left of Site 1, after Site 1 decrypts.) This set is then passed to each party, and each party decrypts each itemset. The final result is the common itemsets (C and D in the figure).
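The flow of Algorithm 1 can be simulated end-to-end with the same toy commutative cipher. This is a deliberately unhardened sketch (no fake itemsets, no permutations, hypothetical keys): it shows only why duplicates collapse after full encryption and why decryption order does not matter.

```python
P = 2**127 - 1  # public prime for a Pohlig-Hellman-style commutative cipher

def enc(x, k):
    return pow(x, k, P)

def dec(y, k):
    # decrypt by raising to the inverse exponent modulo P - 1
    return pow(y, pow(k, -1, P - 1), P)

sites = {0: {11, 13}, 1: {13}, 2: {13, 17}}   # local itemsets encoded as ints
keys = {0: 65537, 1: 5, 2: 101}               # keys coprime to P - 1

# Every itemset ends up encrypted under all three keys; because the
# composite encryption is order-independent, duplicates across sites
# (itemset 13 here) map to the same ciphertext and merge in the union.
fully_enc = set()
for i, items in sites.items():
    vals = {enc(x, keys[i]) for x in items}
    for j in keys:
        if j != i:
            vals = {enc(v, keys[j]) for v in vals}
    fully_enc |= vals

# Decrypt under each key in turn to recover the union:
result = fully_enc
for k in keys.values():
    result = {dec(v, k) for v in result}
assert result == {11, 13, 17}
```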

[Figure 2 illustrates the protocol for three sites: Site 1 holds itemset C, Site 2 holds D, and Site 3 holds C. Each itemset is passed around the ring and encrypted by each site in turn (e.g., E1(C), then E2(E1(C)), then E3(E2(E1(C)))); the fully encrypted sets such as {E2(E3(C)), E2(E3(D))} are merged, duplicates are eliminated, and the result is decrypted by each site in turn, yielding the union {C, D}.]

Fig. 2. Determining global candidate itemsets

Clearly, Algorithm 1 finds the union without revealing which itemset belongs to which site. It is not, however, secure under the definitions of secure multi-party computation: it reveals the number of itemsets having common support between sites (e.g., that sites 3, 5, and 9 all support some itemset). It does not reveal which itemsets these are, but a truly secure computation (as good as giving all input to a "trusted party") could not reveal even this count. Allowing this innocuous information leakage (the number of itemsets having common support) permits an algorithm that is sufficiently secure at much lower cost than a fully secure approach.

If we deem leakage of the number of commonly supported itemsets acceptable, we can prove that this method is secure under the definitions of secure multi-party computation. The idea behind the proof is to show that, given the result, the leaked information, and a site's own input, a site can simulate everything else seen during the protocol. Since the simulation generates everything seen during execution of the protocol, the site clearly learns nothing new from the protocol beyond the input provided to the simulator. One key point is that the simulator need not generate exactly what is seen in any particular run of the protocol: the exact content of messages passed during the protocol depends on the random choice of keys, so the simulator must generate a distribution of messages, based on its own random choices, equivalent to the distribution seen in real executions. A formal proof that this proof technique shows that a protocol preserves privacy can be found in [17]. We use this approach to prove that Algorithm


1 reveals only the union of locally large itemsets and a clearly bounded set of innocuous information.

Theorem 2. [20] Algorithm 1 privately computes the union of the locally large itemsets, assuming no collusion, revealing at most the result ∪_{i=1}^{N} LLi(k) and:
1. the size of the intersection of locally supported itemsets between any subset of odd-numbered sites,
2. the size of the intersection of locally supported itemsets between any subset of even-numbered sites, and
3. the number of itemsets supported by at least one odd and one even site.

Proof. Phase 0: Since no communication occurs in Phase 0, each site can simulate its view by running the algorithm on its own input.

Phase 1: At the first step, each site sees LLe_{i−1(k)}. The size of this set is the size of the global candidate set CG(k), which is known to each site. Assuming the security of the encryption, each item in this set is computationally indistinguishable from a number chosen from a uniform distribution; a site can therefore simulate the set using a uniform random number generator. The same argument holds for each subsequent round.

Phase 2: In Phase 2, site 0 gets the fully encrypted sets of itemsets from the other even sites. Assuming that each site knows the source of a received message, site 0 will know which fully encrypted set LLe(k) contains encrypted itemsets from which (odd) site. Equal itemsets will now be equal in encrypted form; thus, site 0 learns whether any odd sites had locally supported itemsets in common. We can still build a simulator for this view, using the information in point 1 above: if there are k itemsets known to be common among all ⌊N/2⌋ odd sites (from point 1), generate k random numbers and put them into the simulated LLei(k); repeat for each subset of ⌊N/2⌋ − 1 odd sites, etc., down to 2-site subsets of the odd sites. Then fill each LLei(k) with randomly chosen values until it reaches size |CG(k)|. The generated sets will have exactly the same combinations of common items as the real sets, and since the values of the items in the real sets are computationally indistinguishable from a uniform distribution, the simulation matches the real values. The same argument holds for site 1, using the information in point 2 to generate the simulator.

Phase 3: Site 1 eliminates duplicates from the LLei(k) to generate RuleSet1. We now demonstrate that site 0 can simulate RuleSet1. First, the size of RuleSet1 can be simulated knowing point 2. There may be itemsets in common between RuleSet0 and RuleSet1; these can be simulated using point 3: if there are k items in common between even and odd sites, site 0 selects k random items from RuleSet0 and inserts them into RuleSet1. RuleSet1 is then filled with randomly generated values. Since the encryption guarantees that the values are computationally indistinguishable from a uniform distribution, and the set sizes |RuleSet0|, |RuleSet1|, and |RuleSet0 ∩ RuleSet1|


(and thus |RuleSet|) are identical in the simulation and the real execution, this phase is secure.

Phase 4: Each site sees only the encrypted items after decryption by the preceding site. Some of these may be identical to items seen in Phase 2, but since all items must be in the union, this reveals nothing. The simulator for site i is built as follows: take the values generated in Phase 2, step N − 1 − i, and place them in RuleSet; then insert random values into RuleSet up to the proper size (calculated as in the simulator for Phase 3). The values not seen before are computationally indistinguishable from data drawn from a uniform distribution, and the simulator includes the values already seen (and known to be there), so the simulated view is computationally indistinguishable from the real values.

The simulator for site N − 1 is different, since it learns RuleSet(k). To simulate what it sees in Phase 4, site N − 1 takes each item in the final result RuleSet(k) and encrypts it with E_{N−1}; these are placed in RuleSet. RuleSet is then filled with items chosen from F, also encrypted with E_{N−1}. Since the choice of items from F is random in both the real and simulated executions, and the real items match exactly between the real execution and the simulation, the RuleSet that site N − 1 receives in Phase 4 is computationally indistinguishable from the real execution. Therefore, we can conclude that the above protocol is privacy-preserving in the semi-honest model with the stated assumptions. ⊓⊔

The information disclosed by points 1-3 could be relaxed to the number of itemsets supported by 1 site, 2 sites, ..., N sites if we assume anonymous message transmission. The number of jointly supported itemsets can also be masked by allowing sites to inject itemsets that are not actually supported locally. These fake itemsets will simply fail to be globally supported, and will be filtered from the final result when global support is calculated, as shown in the next section. The jointly supported itemsets "leak" then becomes an upper bound rather than an exact count, at an increased cost in the number of candidates that must be checked for global support. While not truly zero-knowledge, this reduces the confidence (and usefulness) of the leaked knowledge of the number of jointly supported itemsets. In practical terms, revealing the size (but not the content) of intersections between sites is likely to be of little concern.

A complementary problem, mining association rules over vertically partitioned data, is addressed in [37, 34]. While we do not describe these techniques here, we emphasize that the different model of distribution requires very different solution techniques.

1.3 Decision Trees

The first paper discussing the use of secure multiparty computation for data mining gave a procedure for constructing decision trees [24], specifically running ID3 [31] between two parties, each holding a subset of the training


entities. Of particular interest is the ability to maintain "perfect" security in the SMC sense, while trading off efficiency against the quality of the resulting decision tree.

Building an ID3 decision tree is a recursive process, operating on the decision attributes R, class attribute C, and training entities T. At each stage, one of three things can happen:

1. R may be empty; i.e., the algorithm has no attributes on which to make a choice. In this case a leaf node is created, the class of the leaf being the majority class of the transactions in T.
2. All the transactions in T may have the same class c, in which case a leaf is created with class c.
3. Otherwise, we recurse:
   a) Find the attribute A that is the most effective classifier for the transactions in T, specifically the attribute that gives the highest information gain.
   b) Partition T based on the values a_i of A.
   c) Return a tree with root labeled A and edges a_i, with the node at the end of edge a_i constructed by calling ID3 with R − {A}, C, T(a_i).

In step 3a, information gain is defined as the change in the entropy relative to the class attribute. Specifically, the entropy is

$$H_C(T) = -\sum_{c \in C} \frac{|T(c)|}{|T|} \log \frac{|T(c)|}{|T|}.$$
Analogously, the entropy after classifying with A is

$$H_C(T|A) = \sum_{a \in A} \frac{|T(a)|}{|T|}\, H_C(T(a)).$$

Information gain is defined as

$$\mathrm{Gain}(A) \stackrel{\mathrm{def}}{=} H_C(T) - H_C(T|A).$$

The goal, then, is to find the A that maximizes Gain(A), or equivalently minimizes H_C(T|A). Expanding, we get:

$$
\begin{aligned}
H_C(T|A) &= \sum_{a \in A} \frac{|T(a)|}{|T|}\, H_C(T(a)) \\
&= \sum_{a \in A} \frac{|T(a)|}{|T|} \left( -\sum_{c \in C} \frac{|T(a,c)|}{|T(a)|} \log \frac{|T(a,c)|}{|T(a)|} \right) \\
&= \frac{1}{|T|} \left( -\sum_{a \in A} \sum_{c \in C} |T(a,c)| \log |T(a,c)| + \sum_{a \in A} |T(a)| \log |T(a)| \right) \qquad (1)
\end{aligned}
$$
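As a concrete (non-private) illustration of the quantities in Equation 1, the entropy and gain can be computed directly from class counts. The data and names below are our own toy example, not from the text.

```python
import math
from collections import Counter

def entropy(labels):
    # H_C(T) = -sum_c |T(c)|/|T| * log2(|T(c)|/|T|)
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(attr_values, labels):
    # Gain(A) = H_C(T) - sum_a |T(a)|/|T| * H_C(T(a))
    n = len(labels)
    cond = 0.0
    for a in set(attr_values):
        subset = [l for v, l in zip(attr_values, labels) if v == a]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

# A perfectly discriminating attribute recovers all of H_C(T) = 1 bit:
print(gain(["sunny", "sunny", "rainy", "rainy"], ["no", "no", "yes", "yes"]))
```

Step 3a of ID3 simply evaluates this gain for every remaining attribute and splits on the maximizer; the privacy-preserving protocol computes the same quantity as random shares.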


Looking at this from the point of view of privacy preservation, we can assume that R and C are known to both parties; T is divided between them.

In Step 1 we need only determine the class value of the majority of the transactions in T. This can be done using circuit evaluation.¹ Since each party can compute the count of its local items in each class, the input size of the circuit is fixed by the number of classes, rather than growing with the (much larger) training data set size. Step 2 requires only that we determine whether all of the items are of the same class. This can again be done with circuit evaluation, here testing for equality. Each party gives as input either the single class ci of all of its remaining items, or the special symbol ⊥ if its items are of multiple classes. The circuit returns the input if the two input values are equal, and ⊥ otherwise.

It is easy to prove that these two steps preserve privacy. Knowing the tree, we know the majority class for Step 1. As for Step 2, if we see a tree that has a "pruned" branch, we know that all items there must be of the same class, or else the branch would have continued. Interestingly, if we tested whether all items are in the same class before testing whether there are no more attributes (reversing Steps 1 and 2, as the original ID3 algorithm was written), the algorithm would not be private. The problem is that Step 2 reveals whether all of the items are of the same class, and the decision tree does not contain this information. However, if a branch is "pruned" (the tree outputs the class without looking at all the attributes), we know that all the training data at that point are of the same class – otherwise the tree would have another split/level. Thus Step 2 reveals no knowledge that cannot be inferred from the tree when the tree is pruned – the given order ensures that this step is only taken when pruning is possible.

This leaves Step 3. Note that once A is known, Steps 3b and 3c can be computed locally – no information exchange is required, so no privacy breach can occur. Since A can be determined by looking at the result tree, revealing A is not a problem, provided nothing but the proper choice of A is revealed. The hard part is Step 3a: computing the attribute that gives the highest information gain. This comes down to finding the A that minimizes Equation 1. Note that since the database is horizontally partitioned, |T(a)| is really |T1(a)| + |T2(a)|, where T1 and T2 are the two databases. The idea is that the parties will compute (random) shares of each (|T1(a,c)| + |T2(a,c)|) log(|T1(a,c)| + |T2(a,c)|) and (|T1(a)| + |T2(a)|) log(|T1(a)| + |T2(a)|). The parties can then locally add their shares to give each a random share of H_C(T|A). This is repeated for each attribute A, and a (small) circuit, of size linear in the number of attributes, is constructed to select the A that gives the largest value.

¹ The paper by Lindell and Pinkas gives other methods for computing this step; however, circuit evaluation is sufficient – readers are encouraged to see [24] for the details.


The problem, then, is to efficiently compute (x + y) log(x + y). Lindell and Pinkas actually give a protocol for computing (x + y) ln(x + y), giving shares of H_C(T|A) · |T| · ln 2; the constant factors are immaterial, since the goal is simply to find the A that minimizes the equation. In [24] three protocols are given: computing shares of ln(x + y), computing shares of x · y, and a protocol for combining these into the final result.

The last is straightforward. Given shares u1 and u2 of ln(x + y), the parties call the multiplication protocol twice to obtain shares of u1 · y and u2 · x. Each party then sums three products: the two secure multiplications, and the result of multiplying its input (x or y) by its own share of the logarithm. This gives each party a share of

$$u_1 y + u_2 x + u_1 x + u_2 y = (x + y)(u_1 + u_2) = (x + y)\ln(x + y).$$

The logarithm and multiplication protocols are based on oblivious polynomial evaluation [27]. The idea of oblivious polynomial evaluation is that one party has a polynomial P, the other has a value x, and the party holding x obtains P(x) without learning P or revealing x. Given this, the multiplication protocol is simple: the first party chooses a random r and generates the polynomial P(y) = xy − r. The result of evaluating this polynomial on y is the second party's share, xy − r; the first party's share is simply r.

The challenge is computing shares of ln(x + y). The trick is to approximate ln(x + y) with a polynomial, specifically the Taylor series

$$\ln(1 + \epsilon) = \sum_{i=1}^{k} \frac{(-1)^{i-1}\epsilon^i}{i}.$$

Let 2^n be the closest power of 2 to (x + y). Then (x + y) = 2^n(1 + ε) for some −1/2 ≤ ε ≤ 1/2, and

$$\ln(x + y) = \ln(2^n(1 + \epsilon)) = n \ln 2 + \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \dots$$

We determine shares of 2^N n ln 2 and 2^N ε (where N is an upper bound on n) using circuit evaluation. This is a simple circuit: ε · 2^n = (x + y) − 2^n, and n is obtained by inspecting the two most significant bits of (x + y). There are a small (logarithmic in the database size) number of possibilities for 2^N n ln 2, and ε · 2^N is obtained by left-shifting ε · 2^n. Assume the parties' shares of 2^N n ln 2 are α1 and α2, and their shares of 2^N ε are β1 and β2. The first party defines

$$P(x) = \sum_{i=1}^{k} \frac{(-1)^{i-1}(\alpha_1 + x)^i}{2^{N(i-1)}\, i} - r$$

and defines its share as u1 = β1 + r; the second party defines its share as u2 = β2 + P(α2). Note that P(α2) computes the Taylor series approximation times 2^N, minus the random r. Since 2^N is public, it is easily divided out later, so the parties do obtain random shares of an approximation of ln(x + y).


As discussed in Section 1.2, all arithmetic is actually done over a sufficiently large field, so that the random values (e.g., shares) can be chosen from a uniform distribution. In addition, the values in the Taylor series are multiplied by the least common multiple of 2, ..., k to eliminate fractions. The key points to remember are the use of oblivious polynomial evaluation, and the use of an efficiently computable (bounded) approximation when efficiently and privately computing the real value is difficult.

There has also been a solution for constructing ID3 on vertically partitioned data [12]. This work assumes the class of the training data is shared, but some of the attributes are private. Thus most steps can be evaluated locally. The main problem is computing which site has the best attribute to split on, since each site can compute the gain of its own attributes without reference to the other site.

1.4 Third party based solutions

The use of an outside party often enables more efficient solutions to secure computations. The key issues are what level of trust is placed in this third, outside party, and what level of effort is required of it. Generally the trust issue is rooted in collusion: what happens if parties collude to violate privacy? This gives us a hierarchy of types of protocols.

No trusted third party. The most general type of protocol meets the strong statement of Definition 1: no party learns anything beyond what it can infer from the results. If parties collude, they are treated as one from the point of view of the definition: what can they infer from their combined inputs and results?

Non-colluding untrusted third party protocol. These protocols allow all parties to utilize an untrusted third party to do part of the computation. The third party learns nothing by itself (it need not even see the results). Provided this third party does not collude with one or more of the other parties, the method preserves privacy as well as a fully secure protocol. Typically, data is sent to this party in some "encrypted" form such that it cannot make any sense of the data by itself; the third party performs some computations and replies to the local parties, which then remove the effect of the encryption to obtain the final result. The key is that the untrusted third party sees no "cleartext" data and is assumed not to collude with any of the other parties.

Commodity server protocol. Commodity server protocols also require a non-colluding third party. They differ from non-colluding untrusted third party protocols in that only one-way communication is allowed, from the commodity server to the other parties. Because of this, the commodity server clearly learns nothing (absent collusion). The general approach is to use the commodity server to generate complementary data (e.g., public-private encryption key pairs), each part of which is given to a different party.


The commodity server model has been proven to be powerful enough to perform all secure computation [5]. Thus, in terms of scope and power, commodity server protocols are equivalent to protocols with an untrusted third party. They are simpler to prove secure, though typically more complicated than untrusted third party protocols.

Trusted third party protocol. The gold standard for security is the assumption that we have a trusted third party to whom we can give all data. The third party performs the computation and delivers only the results – except for the third party, it is clear that nobody learns anything not inferable from its own input and the results. The goal of secure protocols is to reach this same level of privacy preservation, without the (potentially insoluble) problem of finding a third party whom everyone trusts.

The complexity of fully secure protocols generally increases with the number of parties; simple (completely secure) solutions for two parties do not extend easily to more than two parties. In such scenarios, it is often worthwhile to reduce the single complex problem to a series of two-party sub-problems. One approach is to make use of untrusted non-colluding third party protocols, using some of the participating parties as "untrusted" parties for other parties in the protocol. If the target function consists of additive sub-blocks, or can be reduced to a combination of associative functions, such an approach is possible. The key is to find an untrusted third party solution to the two-party problem, then securely combine the two-party results in a way that gives the desired final result. We now give a couple of examples typifying these cases, and show solutions that illuminate the basic concept.

First consider the following geometric problem. Consider an n-dimensional space split between r different parties P1, ..., Pr, where Pi owns a variable number n_i of the dimensions/axes such that $\sum_{i=1}^{r} n_i = n$. A point X in this n-dimensional space has its n coordinates split between the r parties; thus, party i knows n_i of the coordinates of X. We assume there is some way of linking the coordinates of the same point together across all the parties (i.e., a join key). Now, assume there are k points Y1, ..., Yk split between the r parties. The target function is to jointly compute the index i of the point Yi that is the "closest" to point X according to some distance metric D.

Why is this problem interesting? K-means clustering over vertically partitioned data can easily be reduced to this problem. K-means clustering is an iterative procedure that starts with K arbitrary cluster means. In each iteration, every point is assigned to the currently closest cluster (based on distance from the mean). Once all points are assigned, the cluster means are recomputed from the points assigned to each cluster. This procedure is repeated until the clusters converge; one easy convergence condition is to stop when the difference between the old means and the new means is sufficiently small. The key step, assigning a point to a cluster, is done by finding the closest cluster to the point – exactly our earlier "abstract" geometric problem.²

One possible distance metric is the Minkowski distance d_M, defined between two points X and Y as

$$d_M = \left( \sum_{i=1}^{n} (x_i - y_i)^m \right)^{1/m}.$$

Note that m = 2 gives the Euclidean distance, while m = 1 gives the Manhattan distance. Instead of comparing two distances, we get the same result by comparing the mth powers of the distances, and if we do not take the mth root, the target function is additive. We can exploit this additivity to obtain an efficient protocol.

The problem is formally defined as follows. Consider r parties P1, ..., Pr, each with its own k-element vector Xi:

$$P_1 \text{ has } X_1 = \begin{pmatrix} x_{11} \\ x_{21} \\ \vdots \\ x_{k1} \end{pmatrix}, \quad P_2 \text{ has } \begin{pmatrix} x_{12} \\ x_{22} \\ \vdots \\ x_{k2} \end{pmatrix}, \quad \dots, \quad P_r \text{ has } \begin{pmatrix} x_{1r} \\ x_{2r} \\ \vdots \\ x_{kr} \end{pmatrix}.$$

The goal is to compute the index l of the row with the minimum sum; formally, find

$$\operatorname*{argmin}_{i=1..k} \sum_{j=1..r} x_{ij}.$$

For use in k-means clustering, x_{ij} = |μ_{ij} − point_j|, i.e., site Pj's component of the distance between a point and the cluster i with mean μi.

The security of the algorithm is based on three key ideas:
1. Disguise the site components of the distance with random values that cancel out when combined.
2. Compare distances so only the comparison result is learned; no party knows the distances being compared.
3. Permute the order of clusters so the real meaning of the comparison results is unknown.

The algorithm also requires three non-colluding sites. These parties may be among the parties holding data, but could be external as well; they need only know the number of sites r and the number of clusters k. Assuming they do not collude with each other, they learn nothing from the algorithm. For simplicity of presentation, we assume the non-colluding sites are P1, P2, and Pr among the data holders.

² A solution for clustering horizontally partitioned data has also been developed [23]; it relies heavily on secure summation.
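Since only comparisons matter, each site can work with the mth power of the Minkowski distance, which is a plain sum of per-coordinate terms. A small sketch (our own toy data):

```python
def minkowski_m(x, y, m):
    # mth power of the Minkowski distance: sum_i |x_i - y_i|**m.
    # t -> t**(1/m) is monotone, so comparing these values ranks
    # candidates exactly as the true distances would.
    return sum(abs(a - b) ** m for a, b in zip(x, y))

point = (1.0, 2.0, 3.0)
means = [(0.0, 0.0, 0.0), (1.0, 2.0, 2.0), (5.0, 5.0, 5.0)]
closest = min(range(len(means)), key=lambda i: minkowski_m(point, means[i], 2))
assert closest == 1   # squared distances: 14, 1, 29
```

The additivity of the sum is what lets each party contribute only its own coordinates' terms.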


The algorithm proceeds as follows. Site P1 generates a length-k random vector Vi for each site i, such that $\sum_{i=1}^{r} V_i = 0$. P1 also chooses a permutation π of 1..k. P1 then engages each site Pi in the permutation algorithm (see Section 1.4) to generate the sum of the vector Vi and Pi's distances Xi. The resulting vector is known only to Pi, and is permuted by the π known only to P1; i.e., Pi has π(Vi + Xi), but knows neither π nor Vi. P1 and P3, ..., Pr−1 send their vectors to Pr.

Sites P2 and Pr now engage in a series of secure addition/comparisons to find the (permuted) index of the minimum distance. Specifically, they want to find whether $\sum_{i=1}^{r}(x_{li} + v_{li}) < \sum_{i=1}^{r}(x_{mi} + v_{mi})$. Since $\forall l,\ \sum_{i=1}^{r} v_{li} = 0$, the result is $\sum_{i=1}^{r} x_{li} < \sum_{i=1}^{r} x_{mi}$, showing which cluster (l or m) is closer to the point. Pr has all components of the sums except X2 + V2. Each comparison uses a secure circuit evaluation that calculates a2 + ar < b2 + br without disclosing anything but the comparison result. After k − 1 such comparisons, keeping the minimum each time, the minimum cluster is known.

P2 and Pr now know the minimum cluster under the permutation π. They do not know the real cluster it corresponds to (or the clusters corresponding to any of the other items in the comparisons). For this, they send the minimum index i back to site P1, and P1 broadcasts the result π^{-1}(i), the proper cluster for the point. Algorithm 2 reproduces the full algorithm from [36].

We now describe the two key building blocks borrowed from the secure multiparty computation literature. The secure addition and comparison consists of a circuit that takes two inputs from each party, sums the first inputs of both parties and the second inputs of both parties, and returns the result of comparing the two sums. This (simple) circuit is evaluated securely using the generic algorithm.
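The cancellation that makes these comparisons meaningful is easy to verify: because the per-site offsets for each cluster row sum to zero, the masked row sums equal the true row sums. A standalone check with toy numbers of our own:

```python
import random

r = 4
dist_l = [3, 1, 4, 1]   # per-site components of the distance to cluster l
dist_m = [2, 7, 1, 8]   # per-site components of the distance to cluster m

def masks(num_sites):
    # one offset per site, summing to zero (as P1's vectors do per row)
    v = [random.randrange(-50, 50) for _ in range(num_sites - 1)]
    return v + [-sum(v)]

v_l, v_m = masks(r), masks(r)
masked_l = sum(x + v for x, v in zip(dist_l, v_l))
masked_m = sum(x + v for x, v in zip(dist_m, v_m))

# The masks cancel, so the masked comparison equals the true comparison:
assert masked_l == sum(dist_l)
assert (masked_l < masked_m) == (sum(dist_l) < sum(dist_m))
```

In the real protocol the masked components are additionally permuted and split between P2 and Pr, so no single party ever sees a full sum in the clear.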
Though the generic algorithm is impractical for large inputs and many parties, it is quite efficient for a limited number invocations of the secure add and compare function. For two parties, the message cost is O(circuit size), and the number of rounds is constant. We can add and compare numbers with O(m = log(number of entities)) bits using an O(m) size circuit. A graphical depiction of stages 1 and 2 is given in Figures 3(a) and 3(b). We now give the permutation algorithm of [11], which simultaneously computes a vector sum and permutes the order of the elements in the vector. The permutation problem is an asymmetric two party algorithm, formally defined as follows. There exist 2 parties, A and B. B has an n-dimensional vector X = (x1 , . . . , xn ), and A has an n-dimensional vector V = (v1 , . . . , vn ). A also has a permutation π of the n numbers. The goal is to give B the result π(X + V), without disclosing anything else. In particular, neither A nor B can learn the other’s vector, and B does not learn π. For our purposes, the V is a vector of random numbers from a uniform random distribution, used to hide the permutation of the other vector. The solution makes use of a tool known as Homomorphic Encryption. An encryption function H : R → S is called additively homomorphic if there is


Algorithm 2 closest cluster [36]: Find minimum distance cluster
Require: r parties, each with a length-k vector X of distances. Three of these parties (trusted not to collude) are labeled P1, P2, and Pr.
1: {Stage 1: Between P1 and all other parties}
2: P1 generates r random vectors Vi summing to 0 (see Algorithm 3)
3: P1 generates a random permutation π over k elements
4: for all i = 2 ... r do
5:   Ti (at Pi) = add_and_permute(Vi, π (at P1), Xi (at Pi)) {This is the permutation algorithm described in Section 1.4}
6: end for
7: P1 computes T1 = π(X1 + V1)
8:
9: {Stage 2: Between all but P2 and Pr}
10: for all i = 1, 3 ... r − 1 do
11:   Pi sends Ti to Pr
12: end for
13: Pr computes Y = T1 + $\sum_{i=3}^{r}$ Ti
14:
15: {Stage 3: Involves only P2 and Pr}
16: minimal ← 1
17: for j = 2..k do
18:   if secure_add_and_compare(Yj + T2,j < Y_minimal + T2,minimal) then
19:     minimal ← j
20:   end if
21: end for
22:
23: {Stage 4: Between Pr (or P2) and P1}
24: Pr sends minimal to P1
25: P1 broadcasts the result π^{-1}(minimal)

Algorithm 3 genRandom [36]: Generates a (somewhat) random matrix V_{k×r}
Require: Random number generator rand producing values uniformly distributed over 0..n − 1, spanning (at least) the domain of the distance function D.
Ensure: The sum of the resulting vectors is 0.
1: for all i = 1 ... k do
2:   PartSum_i ← 0
3:   for j = 2 ... r do
4:     V_ij ← rand()
5:     PartSum_i ← PartSum_i + V_ij (mod n)
6:   end for
7:   V_i1 ← −PartSum_i (mod n)
8: end for
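Algorithm 3 is straightforward to sketch in code. The following is a minimal illustration (the function name and list-of-lists representation are ours, not from [36]): r − 1 vectors are drawn at random, and the first vector is set to the negated partial sums, so the componentwise sum is 0 mod n.

```python
import random

def gen_random(k, r, n):
    """Generate r length-k vectors over Z_n whose componentwise sum is 0 (mod n),
    mirroring Algorithm 3: draw vectors 2..r at random, then fix vector 1."""
    V = [[random.randrange(n) for _ in range(k)] for _ in range(r - 1)]  # V_2 .. V_r
    V1 = [(-sum(col)) % n for col in zip(*V)]                           # V_1 cancels the rest
    return [V1] + V

V = gen_random(k=5, r=4, n=1000)
print([sum(col) % 1000 for col in zip(*V)])  # [0, 0, 0, 0, 0]
```

Since only V_1 is determined by the others, any r − 1 of the vectors still look uniformly random in isolation.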


Chris Clifton et al.

[Figure: (a) Stage 1: P1 sends Vi and π to each party Pi, which uses its vector Xi and homomorphic encryption Ei to produce π(Ei(Xi + Vi)). (b) Stage 2: the vectors π(Xi + Vi) are sent to Pr.]

Fig. 3. Closest Cluster Computation

an efficient algorithm Plus to compute H(x + y) from H(x) and H(y) that does not reveal x or y. Many such systems exist; examples include those of Benaloh [6], Naccache and Stern [26], Okamoto and Uchiyama [28], and Paillier [30]. This allows us to perform addition on encrypted data without decrypting it. The permutation algorithm consists of the following steps:

1. B generates a public-private keypair (Ek, Dk) for a homomorphic encryption scheme.
2. B encrypts its vector X to generate the encrypted vector X' = (x'1, ..., x'n), x'i = Ek(xi).
3. B sends X' and the public key Ek to A.
4. A encrypts its vector V, generating the encrypted vector V' = (v'1, ..., v'n), v'i = Ek(vi).
5. A multiplies the components of the vectors X' and V' to get T' = (t'1, ..., t'n), t'i = x'i * v'i. Due to the homomorphic property of the encryption, x'i * v'i = Ek(xi) * Ek(vi) = Ek(xi + vi), so t'i = Ek(xi + vi).
6. A applies the permutation π to the vector T' to get T'p = π(T'), and sends T'p to B.
7. B decrypts the components of T'p, giving the final result Tp = (tp1, ..., tpn), tpi = x_{π(i)} + v_{π(i)}.
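To make the steps concrete, here is a minimal sketch using a toy Paillier-style additively homomorphic scheme (Paillier is one of the cited options [30]; the fixed tiny primes, helper names, and parameters are our own illustration choices, and the key size is far too small for real use):

```python
import math, random

def keygen(p=293, q=433):
    """Toy Paillier keypair: n = pq, g = n + 1; insecure key size, demo only."""
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)          # with g = n + 1, mu = lam^{-1} mod n
    return (n, n * n), (lam, mu)

def encrypt(pk, m):
    n, n2 = pk
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(pk, sk, c):
    n, n2 = pk
    lam, mu = sk
    return (((pow(c, lam, n2) - 1) // n) * mu) % n

def permute_and_add(pk, enc_X, V, pi):
    """A's side of the protocol: homomorphically add V to B's encrypted X
    (ciphertext multiplication = plaintext addition), then permute with pi."""
    n, n2 = pk
    T = [(enc_X[i] * encrypt(pk, V[i])) % n2 for i in range(len(enc_X))]
    return [T[pi[i]] for i in range(len(T))]

# B's side: encrypt X and send it (with the public key) to A
pk, sk = keygen()
X = [10, 20, 30, 40]
enc_X = [encrypt(pk, x) for x in X]
# A's side: random vector V and secret permutation pi
V = [5, 6, 7, 8]
pi = [2, 0, 3, 1]
Tp = permute_and_add(pk, enc_X, V, pi)
# B decrypts and obtains pi(X + V), learning neither V nor pi
print([decrypt(pk, sk, c) for c in Tp])  # [37, 15, 48, 26]
```

B sees only π(X + V) = π([15, 26, 37, 48]); because V is uniformly random, the permuted positions reveal nothing about π.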


Intersection

We now give an algorithm that demonstrates another way of using untrusted, non-colluding third parties. The specific algorithm computes the size of set intersection, which is useful for finding association rules in vertically partitioned data. Assume the database is a boolean matrix where a 1 in cell (i, j) means that the ith transaction contains feature j. The TID-list representation of the database associates a transaction identifier list with every feature/attribute; this TID-list contains the identifiers of the transactions that contain the attribute. To count the frequency of an itemset ⟨AB⟩ (where A and B are at different sites), it is necessary to count the number of transactions having both A and B. If we intersect the TID-lists for A and B, we get the transactions containing both A and B; the size of this intersection set gives the frequency of the itemset. Thus, the frequency of an itemset can be computed by securely computing the size of the intersection of the TID-lists.

In addition to being useful for association rule mining, the algorithm illuminates a general technique for extending untrusted-third-party solutions for two parties to multiple parties. This works whenever the target function is associative. As in Section 1.2, the approach leaks some innocuous information, but can be proven to leak no more than that. We start with the general technique, then discuss the specific application to set intersection.

Given k parties, the goal is to compute a function y = f(x1, ..., xk), where x1, ..., xk are the local inputs of the k parties. If the function can be decomposed into smaller invocations of an associative function ⊗, we can rewrite y = x1 ⊗ x2 ⊗ ... ⊗ xk. If we have a protocol fs to securely compute the two-input function ⊗, we can construct a protocol to compute y as follows. The key idea is to create two partitions P0 and P1.
Split the k parties (almost) equally into the two partitions. We can now use the parties in partition Pi as untrusted third parties to evaluate partition P1−i. To visualize this, construct a binary tree on the partition Pi with the leaves being the parties in Pi (Figure 4).³ There can be at most |Pi| − 1 interior nodes in the binary tree. Due to the (almost) equal partitioning, the invariant |P1−i| ≥ |Pi| − 1 always holds, for both values of i; thus, there are sufficient parties in the other partition to act as interior nodes. The role of the parties in partition P1−i is to act as the commodity server or untrusted third party for the parties in partition Pi. In the first round, k/4 of the parties from the other partition act as third parties for the k/2 parties in the first partition. For the remaining log(k/2) − 1 rounds, the other k/4 parties of the second partition act as third

³ While the example assumes k is a power of 2, a proper assignment of parties to partitions is also possible if the tree is not complete. This is described in [37].


parties upwards along the tree. Each third party receives some form of the intermediate result and uses it in the next round. It is important to analyze the amount of data revealed to the third party at this point, and to modify the protocol if necessary to limit the information disclosure. The entire process is illustrated in Figure 4, which shows the process for partition P0, consisting of the first k/2 parties. All of the parties Pk/2+1, ..., Pk act as third parties / commodity servers in a single call to the protocol fs applied to the parties at the leaf nodes. There are a total of log(k/2) rounds, in each of which several calls to the protocol fs are made in parallel.

[Figure: a binary tree whose leaves are the parties P1, ..., Pk/2 of partition P0. In Stage 1, adjacent pairs (P1, P2), (P3, P4), ... are combined using third parties Pk/2+1, Pk/2+2, ...; Stage 2 combines the results (e.g., P1,2 with P3,4 via P3k/4−1, P3k/4, ...), continuing up the tree until stage log(k/2) produces the combined result P1,..,k/2.]

Fig. 4. The general protocol process applied on partition P0
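The round structure of the tree evaluation can be sketched as follows. This is a plain simulation (no cryptography) in which a local combine function stands in for the secure pairwise protocol fs; the function name is ours:

```python
def tree_evaluate(inputs, combine):
    """Simulate binary-tree evaluation of an associative function: in each
    round, adjacent pairs are combined (in the real protocol, by a secure
    two-party step mediated by a third party from the other partition)."""
    level, rounds = list(inputs), 0
    while len(level) > 1:
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

# k/2 = 8 parties in one partition => log(k/2) = 3 rounds
total, rounds = tree_evaluate([3, 1, 4, 1, 5, 9, 2, 6], lambda a, b: a + b)
print(total, rounds)  # 31 3
```

The same driver works for any associative ⊗, e.g. `lambda a, b: a & b` on sets for intersection, which is how the set-intersection protocol below composes.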

Once a similar process is done for the other partition P1, the topmost representatives of the two partitions use a secure two-party protocol f' to compute the final result. Every party possibly acquires some information about a few of the other parties, which goes against the precept of secure multi-party computation. But as long as the information revealed is held within strict (and provable) bounds, it is often worthwhile to trade this limited disclosure for efficiency and practicality.

We summarize the use of this approach to solve secure set intersection [35, 37]. The problem is defined as follows. There are k parties, P1, ..., Pk, each with a local set Si drawn from a common (global) universe U. They wish to compute |∩_{j=1}^{k} Sj|, i.e., the cardinality of the common intersection set. This is useful for several applications, for example mining association rules (see [37] for details).


We now outline a two-party protocol f∩ that uses an untrusted third party U to compute |Sa ∩ Sb| for two parties A and B. The key idea behind protocol f∩ is to use commutative encryption (as described in Section 1.2) to allow comparing items without revealing them. Parties A and B generate encryption keys Ea and Eb respectively. A encrypts the items in its set Sa with Ea and sends them to B. Similarly, B encrypts the items in Sb with Eb and sends them to A. Each site now encrypts the received items with its own key, and sends the doubly-encrypted sets Sa' and Sb' to U. U finds the intersection of these two sets. Because the encryption is commutative, an item x ∈ Sa ∩ Sb corresponds to an item Ea(Eb(x)) = Eb(Ea(x)) that appears in both Sa' and Sb'. Therefore, |Sa' ∩ Sb'| = |Sa ∩ Sb|: U learns the size of the intersection, but learns nothing about the items in it.

Extending this to more than two parties is simple. We use the tree-based evaluation for each partition. The lowest layer (consisting of leaves) proceeds as above. At the higher layers, the parties encrypt with the keys of their sibling's children. Since a party never sees any of the values from the sibling's children (even after encryption), knowing the keys gives no information. More details are given in [37].
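As a concrete sketch, commutative encryption can be illustrated with Pohlig-Hellman-style exponentiation modulo a prime, E_k(x) = x^k mod P, for which E_a(E_b(x)) = E_b(E_a(x)). The prime, hash-to-group mapping, and example sets below are our own toy choices; a real deployment needs a much larger, properly chosen group:

```python
import math, random
from hashlib import sha256

P = 2**61 - 1  # a Mersenne prime; toy group size, illustration only

def keygen():
    # exponent keys must be invertible mod P-1 so encryption is a bijection
    while True:
        k = random.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k

def h(item):
    # map items into the multiplicative group mod P
    return int.from_bytes(sha256(item.encode()).digest(), 'big') % (P - 2) + 2

def enc(k, v):
    return pow(v, k, P)  # commutative: (v^ka)^kb == (v^kb)^ka

Sa = {"apple", "pear", "plum", "fig"}
Sb = {"pear", "fig", "cherry"}
ka, kb = keygen(), keygen()
# A and B each encrypt their own set, exchange, and add the other's key on top;
# the doubly-encrypted sets go to the untrusted party U
SaBB = {enc(kb, enc(ka, h(x))) for x in Sa}
SbAA = {enc(ka, enc(kb, h(x))) for x in Sb}
print(len(SaBB & SbAA))  # 2  (the common items "pear" and "fig")
```

U sees only random-looking group elements, so it learns the cardinality (here 2) but not which items matched.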

2 Privacy Preservation through Noise Addition

The other approach to privacy-preserving data mining is based on adding random noise to the data, then providing the noisy dataset as input to the data mining algorithm. The privacy-preserving properties are a result of the noise: data values for individual entities are distorted, so individually identifiable (private) values are not revealed. An example is a survey: a company wishes to mine data from a survey of private data values. While the respondents may be unwilling to provide those values directly, they would be willing to provide randomized/distorted results.

What makes this work interesting is how the mining of the noisy dataset is done. Naïvely running a data mining algorithm on the data may work (for example, adding noise from a gaussian distribution centered at 0 preserves averages), but does not always give good results. However, using knowledge of how the noise was generated enables us to do better. In particular, what is used is knowledge of the distribution the noise came from (e.g., uniform or gaussian, with the appropriate parameters). Knowing the distribution the random values came from does not reveal the specific values used to mask each entity, so privacy is still preserved. However, as we shall see, knowledge of the noise distribution does enable us to improve data mining results.

The problem addressed in [3] was building decision trees. If we return to the description of ID3 in Section 1.3, we see that Steps 1 and 3c do not reference the (noisy) data. Step 2 references only the class data, which is assumed to be known (for example, the survey may be demographics of existing customers


– the company already knows which are high-value customers, and wants to know what demographics correspond to high-value customers). This leaves Steps 3a and 3b: finding the attribute with the maximum information gain and partitioning the tree based on that attribute. Looking at Equation 1, the only things needed are |T(a, c)| and |T(a)|.⁴ Computing |T(a)| requires partitioning the entities based on the attribute value, exactly what is needed for Step 3b. The problem is that the attribute values are modified, so we don't know which entity really belongs in which partition. Figure 5 demonstrates this problem graphically: there are clearly peaks in the number of drivers under 25 and in the 25-35 age range, but these peaks do not appear in the noisy data. The ID3 partitioning should reflect the peaks in the data.

[Figure: overlaid histograms of Number of Drivers (0-100) vs. Age (0-100) for the original data and the data with noise added.]

Fig. 5. Original distribution vs. distribution after random noise addition.

A second problem comes from the fact that the data is assumed to be ordered (otherwise "adding" noise makes no sense). As a result, where to divide partitions is not obvious (as opposed to categorical data). Again, reconstructing the distribution can help. We can see in Figure 5 that partitioning the data at ages 30 and 50 would make sense: there is a natural "break" in the

⁴ [3] actually uses the gini coefficient rather than information gain. While this may affect the quality of the decision tree, it has no impact on the discussion here; we stay with information gain for simplicity.


data at those points anyway. However, we can only see this from the actual distribution; the split points are not obvious in the noisy data.

Both problems can be solved if we know the distribution of the original data, even if we do not know the original values. We may still not get the right entities in each partition, but we are likely to get enough that the statistics on the class of each partition will still hold (experimental results demonstrating this are given in [3]).

What remains is the problem of estimating the distribution of the real data X given the noisy data w and the distribution of the noise Y. This is accomplished through Bayes' rule:

F_X'(a) \equiv \int_{-\infty}^{a} f_X(z \mid X+Y=w)\,dz
       = \int_{-\infty}^{a} \frac{f_{X+Y}(w \mid X=z)\, f_X(z)}{f_{X+Y}(w)}\,dz
       = \int_{-\infty}^{a} \frac{f_{X+Y}(w \mid X=z)\, f_X(z)}{\int_{-\infty}^{\infty} f_{X+Y}(w \mid X=z')\, f_X(z')\,dz'}\,dz
       = \frac{\int_{-\infty}^{a} f_{X+Y}(w \mid X=z)\, f_X(z)\,dz}{\int_{-\infty}^{\infty} f_{X+Y}(w \mid X=z)\, f_X(z)\,dz}
       = \frac{\int_{-\infty}^{a} f_Y(w-z)\, f_X(z)\,dz}{\int_{-\infty}^{\infty} f_Y(w-z)\, f_X(z)\,dz}

Given the actual data values wi = xi + yi, we use this to estimate the distribution function as follows:

F_X'(a) = \frac{1}{n}\sum_{i=1}^{n} F_{X_i}'(a)
        = \frac{1}{n}\sum_{i=1}^{n} \frac{\int_{-\infty}^{a} f_Y(w_i-z)\, f_X(z)\,dz}{\int_{-\infty}^{\infty} f_Y(w_i-z)\, f_X(z)\,dz}

Differentiating gives us the posterior density function:

f_X'(a) = \frac{1}{n}\sum_{i=1}^{n} \frac{f_Y(w_i-a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i-z)\, f_X(z)\,dz}    (2)
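A discretized sketch of applying Equation 2 iteratively, with the integrals replaced by sums over a grid of candidate values. The toy "age" data, grid, and iteration count are our own illustration choices, in the spirit of the optimizations described in [3]:

```python
import random

def reconstruct(W, f_Y, points, iters=50):
    """Iteratively refine an estimate of the density of X from noisy samples
    W = X + Y, applying Equation 2 with integrals discretized into sums."""
    m = len(points)
    fx = [1.0 / m] * m                       # start from a uniform estimate
    for _ in range(iters):
        new = [0.0] * m
        for w in W:
            denom = sum(f_Y(w - points[z]) * fx[z] for z in range(m))
            if denom == 0.0:
                continue
            for a in range(m):
                new[a] += f_Y(w - points[a]) * fx[a] / denom
        total = sum(new)
        fx = [v / total for v in new]        # renormalize to a density
    return fx

random.seed(1)
# true "ages" cluster near 20 and 30; uniform noise on [-15, 15] is added
X = [random.choice([20, 30]) + random.randint(-2, 2) for _ in range(500)]
W = [x + random.uniform(-15.0, 15.0) for x in X]
f_Y = lambda y: (1.0 / 30.0) if -15.0 <= y <= 15.0 else 0.0
points = list(range(0, 61))
fx = reconstruct(W, f_Y, points)
# nearly all reconstructed mass should lie in the true range [18, 32]
print(round(sum(fx[18:33]), 2))
```

Although no individual xi is recovered, the reconstructed density concentrates around the true peaks, which is exactly what ID3 needs to pick split points.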

The only problem is that we don't know the real density function fX. However, starting with the assumption of a uniform distribution, we can use Equation 2 to iteratively refine the density estimate, converging on an estimate of the real distribution of X. Several optimizations are given in [3], for example partitioning the data to convert the integrals into sums. They also discuss tradeoffs in when to compute distributions: once for each attribute? separately for each class? for only the data that makes it to each split point? They found that reconstructing each attribute separately for each class gave the best performance/accuracy tradeoff, with classification accuracy substantially better than naïvely running


on the noisy data, and approaching that of building a classifier directly on the real data.

One question with this approach is how much privacy is provided. With the secure multiparty computation based approaches, the definition of privacy is clear. However, given a value that is based on the real value, how do we know how much noise is enough? Agrawal and Srikant proposed a metric based on the confidence of estimating a value within a specified width: if it can be estimated with c% confidence that a value x lies in the interval [xl, xh], then the privacy at the c% confidence level is |xh − xl|. They quantify this in terms of a percentage: the privacy metric for noise from a uniform distribution is the confidence times twice the interval width of the noise, so 100% privacy corresponds to 50% confidence that the value is within two distribution widths of the real value, or nearly 100% confidence that it is within one width. They have an equivalent definition for noise from a gaussian distribution.

Agrawal and Aggarwal (not the same Agrawal) pointed out problems with this definition of privacy [1]. The very ability to reconstruct distributions may give us less privacy than expected; Figure 5 demonstrates this. Assume the noise is known to come from a uniform distribution over [−15, 15], and the actual/reconstructed distribution is as shown by the bars. Since there are no drivers under age 16 (as determined from the reconstructed distribution), a driver whose age is given as 1 in the "privacy-preserving" dataset is known to be 16 years old: all privacy for this individual is lost. They instead give a definition based on entropy (discussed in Section 1.3). Specifically, if a random variable Y has entropy H(Y), the privacy is 2^H(Y). This has the nice property that for a uniform distribution, the privacy equals the width of the interval from which the random value is chosen. This gives a meaningful way to compare different sources of noise.
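For common noise distributions the quantity 2^H(Y) has a closed form, so different noise sources can be compared directly. The function names below are ours; h denotes differential entropy in bits:

```python
import math

def privacy_uniform(width):
    # h(U(0, w)) = log2(w), so 2^h recovers the interval width itself
    return 2 ** math.log2(width)

def privacy_gaussian(sigma):
    # h(N(0, s^2)) = 0.5 * log2(2*pi*e*s^2), so 2^h = sqrt(2*pi*e) * s
    return 2 ** (0.5 * math.log2(2 * math.pi * math.e * sigma ** 2))

print(round(privacy_uniform(30.0), 6))   # 30.0: uniform noise of width 30
print(round(privacy_gaussian(10.0), 1))  # 41.3: gaussian noise, sigma = 10
```

On this scale, gaussian noise with σ = 10 offers slightly more privacy than uniform noise of width 30, a comparison the confidence-interval metric cannot make directly.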
They also provide a solution to the loss of privacy caused by reconstructing the original data distribution. The idea is based on conditional entropy: given the reconstructed distribution X, the privacy is now 2^H(Y|X). This naturally captures the expected privacy in terms of the interval-width description: a reconstructed distribution that eliminates part of an interval (or makes it highly unlikely) gives a corresponding decrease in privacy.

There has been additional work in this area, such as techniques for association rules [14, 32]. Techniques from signal processing have also been applied to distribution reconstruction [21], generalizing much of this work. One problem is the gap between known abilities to reconstruct distributions and lower bounds on the ability to reconstruct actual data values: the jury is still out on how effective these techniques really are at preserving privacy.


3 Conclusions and Recommendations

While privacy-preserving data mining does have the potential to reconcile the concerns of data mining proponents and privacy advocates, it has not reached the level of an effective panacea. Two issues remain.

First, the rigor required of the cryptography and security protocols communities must be brought to this field. While some of the work in this field does approach this level of rigor, much of it does not. For some work, particularly with the noise addition approach, it is not clear whether a determined adversary could compromise privacy (and in some cases, it is clear that they can [21, 10]). The distributed approach has a clear set of standards borrowed from the cryptography community; it is important that work be judged against these standards. In particular, work must move beyond the semi-honest model. This could mean developing efficient solutions secure against malicious adversaries, or possibly new definitions, such as "proof against collusion," that meet practical needs and are defined with the rigor of the semi-honest and malicious definitions.

The second problem may be more challenging. Privacy-preserving data mining has operated under the assumption that data mining results do not of themselves compromise privacy. This is not necessarily true. While there has been some work on restricting data mining results [4, 18, 33, 29, 19], it has emphasized protection against revealing specific results; it does not address connections between the results and compromise of source data items. While work on limiting classification strength may address this issue [8], the proposed method also prevents the data from being useful for data mining in any form. Achieving a reasonable connection between individual privacy and data mining results is still an open problem. Until it is solved, the full concerns of privacy advocates will not have been addressed.
That said, privacy-preserving data mining in its current form does have practical applications. In some circumstances, the value of the data mining results may exceed the potential cost to individual privacy. This is most likely where the individual data items reflect commercial data (e.g., intellectual property) rather than personal information. For example, U.S. antitrust law frowns on the general sharing of information between competitors; however, if the shared information is limited to what is absolutely necessary to achieve some consumer benefit, the sharing is likely to pass legal muster. The concept that use of information is allowed when necessary also appears in the European Community privacy recommendations [13]. Reconciling these laws with the potential privacy breach of data mining results will be easier if privacy-preserving data mining techniques are used to ensure that only the results are disclosed.

In summary, while privacy-preserving data mining has achieved significant success, many challenges remain.


References

1. D. Agrawal and C. C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems. Santa Barbara, California, USA: ACM, May 21-23 2001, pp. 247-255. [Online]. Available: http://doi.acm.org/10.1145/375551.375602
2. R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proceedings of the 20th International Conference on Very Large Data Bases. Santiago, Chile: VLDB, Sept. 12-15 1994, pp. 487-499. [Online]. Available: http://www.vldb.org/dblp/db/conf/vldb/vldb94-487.html
3. ——, "Privacy-preserving data mining," in Proceedings of the 2000 ACM SIGMOD Conference on Management of Data. Dallas, TX: ACM, May 14-19 2000, pp. 439-450. [Online]. Available: http://doi.acm.org/10.1145/342009.335438
4. M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure limitation of sensitive rules," in Knowledge and Data Engineering Exchange Workshop (KDEX'99), Chicago, Illinois, Nov. 8 1999, pp. 25-32. [Online]. Available: http://ieeexplore.ieee.org/iel5/6764/18077/00836532.pdf?isNumber=18077&prod=CNF&arnumber=00836532
5. D. Beaver, "Commodity-based cryptography (extended abstract)," in Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing. El Paso, Texas, USA: ACM Press, 1997, pp. 446-455.
6. J. Benaloh, "Dense probabilistic encryption," in Proceedings of the Workshop on Selected Areas of Cryptography, Kingston, Ontario, May 1994, pp. 120-128. [Online]. Available: http://research.microsoft.com/crypto/papers/dpe.ps
7. D. W.-L. Cheung, V. Ng, A. W.-C. Fu, and Y. Fu, "Efficient mining of association rules in distributed databases," IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 911-922, Dec. 1996.
8. C. Clifton, "Using sample size to limit exposure to data mining," Journal of Computer Security, vol. 8, no. 4, pp. 281-307, Nov. 2000. [Online]. Available: http://iospress.metapress.com/openurl.asp?genre=article&issn=0926-227X&volume=8&issue=4&spage=281
9. "Total information awareness (TIA) system," Defense Advanced Research Projects Agency. [Online]. Available: http://www.darpa.mil/iao/TIASystems.htm
10. I. Dinur and K. Nissim, "Revealing information while preserving privacy," in Proceedings of the Twenty-Second ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM Press, 2003, pp. 202-210. [Online]. Available: http://doi.acm.org/10.1145/773153.773173
11. W. Du and M. J. Atallah, "Privacy-preserving statistical analysis," in Proceedings of the 17th Annual Computer Security Applications Conference, New Orleans, Louisiana, USA, December 10-14 2001. [Online]. Available: http://www.cerias.purdue.edu/homes/duw/research/paper/acsac2001.ps
12. W. Du and Z. Zhan, "Building decision tree classifier on private data," in IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, C. Clifton and V. Estivill-Castro, Eds., vol. 14. Maebashi City, Japan: Australian Computer Society, Dec. 9 2002, pp. 1-8. [Online]. Available: http://crpit.com/Vol14.html


13. "Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the protection of individuals with regard to the processing of personal data and on the free movement of such data," Official Journal of the European Communities, vol. No I., no. 281, pp. 31-50, Oct. 24 1995. [Online]. Available: http://europa.eu.int/comm/internal_market/privacy/
14. A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy preserving mining of association rules," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 217-228. [Online]. Available: http://doi.acm.org/10.1145/775047.775080
15. M. Feingold, M. Corzine, M. Wyden, and M. Nelson, "Data-mining moratorium act of 2003," U.S. Senate Bill (proposed), Jan. 16 2003. [Online]. Available: http://thomas.loc.gov/cgi-bin/query/z?c108:S.188:
16. O. Goldreich, S. Micali, and A. Wigderson, "How to play any mental game - a completeness theorem for protocols with honest majority," in 19th ACM Symposium on the Theory of Computing, 1987, pp. 218-229. [Online]. Available: http://doi.acm.org/10.1145/28395.28420
17. O. Goldreich, "Secure multi-party computation," Sept. 1998, working draft. [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/pp.html
18. T. Johnsten and V. Raghavan, "Impact of decision-region based classification algorithms on database security," in Proceedings of the Thirteenth Annual IFIP WG 11.3 Working Conference on Database Security, Seattle, Washington, July 26-28 1999. [Online]. Available: http://www.cacs.usl.edu/Publications/Raghavan/JR99.ps.Z
19. ——, "A methodology for hiding knowledge in databases," in Proceedings of the IEEE ICDM Workshop on Privacy, Security and Data Mining, ser. Conferences in Research and Practice in Information Technology, vol. 14. Maebashi City, Japan: Australian Computer Society, Dec. 9 2002, pp. 9-17. [Online]. Available: http://crpit.com/confpapers/CRPITV14Johnsten.pdf
20. M. Kantarcıoğlu and C. Clifton, "Privacy-preserving distributed mining of association rules on horizontally partitioned data," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 4, July 2004.
21. H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "On the privacy preserving properties of random data perturbation techniques," in Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida, Nov. 19-22 2003.
22. M. Lewis, "Department of Defense appropriations act, 2004," July 17 2003, Title VIII Section 8120. Enacted as Public Law 108-87. [Online]. Available: http://thomas.loc.gov/cgi-bin/bdquery/z?d108:h.r.02658:
23. X. Lin, C. Clifton, and M. Zhu, "Privacy preserving clustering with distributed EM mixture modeling," Knowledge and Information Systems, to appear 2004.
24. Y. Lindell and B. Pinkas, "Privacy preserving data mining," in Advances in Cryptology - CRYPTO 2000. Springer-Verlag, Aug. 20-24 2000, pp. 36-54. [Online]. Available: http://link.springer.de/link/service/series/0558/bibs/1880/18800036.htm
25. M. Murkowski and M. Wyden, "Protecting the rights of individuals act," U.S. Senate Bill (proposed), July 31 2003. [Online]. Available: http://thomas.loc.gov/cgi-bin/bdquery/z?d108:s.01552:


26. D. Naccache and J. Stern, "A new public key cryptosystem based on higher residues," in Proceedings of the 5th ACM Conference on Computer and Communications Security. San Francisco, California, USA: ACM Press, 1998, pp. 59-66.
27. M. Naor and B. Pinkas, "Oblivious transfer and polynomial evaluation," in Proceedings of the Thirty-First Annual ACM Symposium on Theory of Computing. Atlanta, Georgia, USA: ACM Press, 1999, pp. 245-254.
28. T. Okamoto and S. Uchiyama, "A new public-key cryptosystem as secure as factoring," in Advances in Cryptology - Eurocrypt '98, LNCS 1403. Springer-Verlag, 1998, pp. 308-318.
29. S. R. M. Oliveira and O. R. Zaïane, "Protecting sensitive knowledge by data sanitization," in Proceedings of the Third IEEE International Conference on Data Mining (ICDM'03), Melbourne, Florida, Nov. 19-22 2003.
30. P. Paillier, "Public key cryptosystems based on composite degree residuosity classes," in Advances in Cryptology - Eurocrypt '99, LNCS 1592. Springer-Verlag, 1999, pp. 223-238.
31. J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81-106, 1986. [Online]. Available: http://ipsapp008.kluweronline.com/content/getfile/4984/119/4/abstract.htm
32. S. J. Rizvi and J. R. Haritsa, "Maintaining data privacy in association rule mining," in Proceedings of the 28th International Conference on Very Large Data Bases. Hong Kong: VLDB, Aug. 20-23 2002, pp. 682-693. [Online]. Available: http://www.vldb.org/conf/2002/S19P03.pdf
33. Y. Saygin, V. S. Verykios, and C. Clifton, "Using unknowns to prevent discovery of association rules," SIGMOD Record, vol. 30, no. 4, pp. 45-54, Dec. 2001. [Online]. Available: http://www.acm.org/sigmod/record/issues/0112/SPECIAL/5.pdf
34. J. Vaidya and C. Clifton, "Privacy preserving association rule mining in vertically partitioned data," in The Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada, July 23-26 2002, pp. 639-644. [Online]. Available: http://doi.acm.org/10.1145/775047.775142
35. ——, "Leveraging the "multi" in secure multi-party computation," in Workshop on Privacy in the Electronic Society, held in association with the 10th ACM Conference on Computer and Communications Security, Washington, DC, Oct. 30 2003.
36. ——, "Privacy-preserving k-means clustering over vertically partitioned data," in The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, Aug. 24-27 2003.
37. ——, "Secure set intersection cardinality with application to association rule mining," Journal of Computer Security, submitted.
38. A. C. Yao, "How to generate and exchange secrets," in Proceedings of the 27th IEEE Symposium on Foundations of Computer Science. IEEE, 1986, pp. 162-167.