Institute for Brain and Neural Systems Brown University

IBNS Technical Report 2006-3 March 2006

A Minimum Sphere Covering Approach to Learning

by Jigang Wang, Predrag Neskovic, and Leon N. Cooper
Department of Physics and Institute for Brain and Neural Systems
Brown University, Providence, RI 02912

1 This work was supported in part by the Army Research Office under Grants DAAD19-01-1-0754 and W911NF-04-1-0357. Jigang Wang was supported in part by a Brown University dissertation fellowship.

© 2006 Brown University, Providence, RI 02912

ABSTRACT

In this paper we present an integer programming formulation of the minimum sphere covering problem that seeks to construct a minimum number of spheres to represent the training data. Using soft threshold functions, we further derive a linear programming problem whose solution gives rise to radial basis function classifiers and sigmoid function classifiers. In contrast to traditional RBF and sigmoid function networks, in which the number of units is specified a priori, our method provides a new way to construct RBF and sigmoid function networks that explicitly minimizes the number of base units in the resulting classifiers. Our approach is advantageous compared to SVMs with Gaussian kernels in that it provides a natural construction of kernel matrices and it directly minimizes the number of basis functions. Experiments using real-world datasets have demonstrated the competitiveness of our method in terms of classification performance and sparsity of the solution.

1 Introduction

Instance-based learning algorithms, such as the nearest neighbor rule and the RCE network, can easily construct arbitrarily complex decision boundaries. From the generalization point of view, however, the challenge is to control the complexity of the learning algorithm. Take the RCE network as an example. First proposed by Reilly et al. [1], the RCE network is a supervised learning algorithm that constructs a set of regions to represent each pattern class. With this incremental approach, it is easy to construct complex decision boundaries for problems requiring nonlinear class separation. However, in order to obtain good generalization performance, it is essential to control the capacity of the learning algorithm. It is therefore desirable for the learning algorithm to construct a minimum number of spheres to represent the class regions.

In this paper, we present a sphere-covering approach to learning by formulating the minimum sphere covering problem as a constrained optimization problem, more specifically an integer programming problem. To enable smoother decision boundaries, we further extend the integer programming problem to a linear programming problem using radial basis functions and/or sigmoid functions. The resulting linear program has a form similar to that of the linear programming boosting (LP boosting) problem and of support vector machines (SVMs) with the L1 norm. Depending on which function is used to replace the threshold function in the minimum sphere covering formulation, the resulting classifier is a linear combination of radial basis functions (RBFs) or of sigmoid functions.

In contrast to classical RBF networks and multi-layer perceptrons (MLPs), our approach, which is based on the theory relating data compression and generalization [2], provides a new way to construct RBF classifiers and sigmoid function classifiers. In addition, unlike RBF networks and MLPs, in which the number of units has to be specified a priori, the number of RBF or sigmoid units in our algorithm is determined automatically. The new approach is also advantageous compared to SVMs with Gaussian kernels in that it adapts the kernel width locally, even though a global smoothing parameter is still in place. Experimental results demonstrate competitive or better generalization ability and a substantially better compression rate.

The remainder of this paper is organized as follows. In Section 2 we briefly introduce the minimum sphere covering problem, establish the necessary notation, and derive the integer programming and linear programming formulations of the problem. In Section 3 we test the new algorithms on real-world datasets and compare them to SVM classifiers with Gaussian kernels. Concluding remarks are given in Section 4.

2 Minimum sphere covering

The minimum sphere covering approach we describe below is motivated by the RCE network. Given a set of training examples, it seeks to construct a set of spheres to cover the class regions in the input space. With this incremental approach, it is easy to construct a set of spheres such that each training example is covered by a sphere belonging to the right class. For example, a set of spheres centered on each training example, with radii smaller than the smallest distance between two training examples of different classes, will do. However, according to statistical learning theory, it is not sufficient for the set of spheres to cover all the training examples correctly. To guarantee good generalization performance, it is essential for the learning algorithm to have small capacity [3]. Roughly speaking, in our setting this means that it is desirable to construct a minimum number of spheres that represent the training examples correctly.

In the following, we formally introduce the minimum sphere covering problem. We denote the set of training examples by $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$. For notational simplicity, we only consider the binary classification problem. The task is to find a set of class-specific spheres $S = \{S_1, \dots, S_m\}$ such that

$$x_i \in \bigcup_{y(S_j) = y_i} S_j \quad \text{and} \quad x_i \notin \bigcup_{y(S_j) \neq y_i} S_j, \quad \forall i = 1, \dots, n, \tag{1}$$

where each sphere $S_i$ is characterized by its center $c(S_i)$, its radius $r(S_i)$, and its class $y(S_i)$. An example $x_i$ is covered by a sphere $S_j$, i.e., $x_i \in S_j$, if $d(x_i, c(S_j)) \le r(S_j)$. A set of spheres $S$ that satisfies the conditions (1) is called a consistent sphere cover of the data $D$. A sphere cover is minimal if there exists no other consistent sphere cover with a smaller number of spheres.

In this paper, we restrict ourselves to constructing a consistent sphere cover with spheres that are centered on training examples, although in general spheres do not have to be centered on the training examples. In order to minimize the number of spheres in the sphere cover $S$, each sphere in $S$ should cover as many training examples as possible without covering a training example belonging to a different class. For a sphere centered on $x_i$, this can be done by setting the radius to

$$r_i = \min_{j: y_j \neq y_i} d(x_i, x_j) - \epsilon, \tag{2}$$

where $\epsilon > 0$ is a small real-valued constant. Therefore, for each training example $x_i$ in $D$, we have a sphere $S_i$ with

$$c(S_i) = x_i, \quad r(S_i) = r_i, \quad \text{and} \quad y(S_i) = y_i. \tag{3}$$

The goal is then to select a minimum-size subset of $\{S_i, i = 1, \dots, n\}$ that forms a consistent sphere cover.

2.1 The greedy sphere covering algorithm

The above minimum sphere covering problem is NP-complete [4]; therefore, it is hard to find the exact solution. However, the following algorithm, which is adapted from the greedy set covering algorithm, is known to have a good worst-case bound [5].

Greedy sphere covering algorithm
Input: $D = \{(x_1, y_1), \dots, (x_n, y_n)\}$, $S = \emptyset$
for each $(x_i, y_i) \in D$ do
    create a sphere $S_i$ according to (3)
    count the number $n_i$ of training examples that it covers
end for
repeat
    find the sphere $S_i$ that covers the largest number of remaining training examples and add it to $S$
until all the training examples are covered
Output: sphere cover $S$

The greedy sphere covering algorithm consists of two loops. In the first loop, it constructs a set of spheres $\{S_i, i = 1, \dots, n\}$ according to (3). In the second loop, it greedily adds to $S$ the sphere that covers the largest number of remaining training examples, i.e., training examples that have not been covered by any previously chosen sphere in $S$, in the hope of minimizing the total number of spheres. Initially, no training example is covered, so the first sphere selected into $S$ is the one that covers the largest number of examples overall.
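For concreteness, the following is a minimal NumPy sketch of the greedy procedure; the function name, the use of the Euclidean distance, and the default value of epsilon are our own illustrative choices rather than part of the original algorithm.

import numpy as np

def greedy_sphere_cover(X, y, eps=1e-6):
    """Greedy sphere covering. X is (n, d), y is (n,) with labels in {-1, +1}.
    Returns the indices of the centers whose spheres form the cover, and the radii."""
    n = len(y)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # pairwise distances
    # Radius of the sphere centered on x_j: distance to the nearest
    # opposite-class example minus a small epsilon, as in Eq. (2).
    radii = np.array([dist[j, y != y[j]].min() - eps for j in range(n)])
    # Binary covering matrix of Eq. (4): K[i, j] = 1 iff sphere j covers example i.
    K = (dist <= radii[None, :]).astype(int)
    uncovered = np.ones(n, dtype=bool)
    cover = []
    while uncovered.any():
        counts = K[uncovered].sum(axis=0)        # remaining examples covered by each sphere
        j = int(np.argmax(counts))               # greedily pick the most useful sphere
        cover.append(j)
        uncovered &= K[:, j] == 0                # mark the newly covered examples
    return cover, radii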

2.2 Minimum sphere covering by integer programming

Given a training dataset $D$, the set of spheres $\{S_i, i = 1, \dots, n\}$ is naturally defined. Subsequently, we can define the following covering matrix:

$$K_{ij} = \begin{cases} 1, & \text{if } d(x_i, x_j) \le r_j \\ 0, & \text{otherwise.} \end{cases} \tag{4}$$

That is, $K_{ij} \equiv K(x_i, x_j) = H(r_j - d(x_i, x_j))$, where $d(x_i, x_j)$ is the distance between $x_i$ and $x_j$, $r_j$ is the radius of the sphere $S_j$ that is centered on $x_j$, and $H$ is the Heaviside step function. Therefore, an entry $K_{ij}$ is 1 only if the sphere centered on $x_j$ covers $x_i$. With the covering matrix $K$ defined, we can formulate the minimum sphere covering problem as follows:

$$\min \sum_{i=1}^{n} a_i$$

subject to

$$\sum_{j=1}^{n} K_{ij} a_j \ge 1, \quad i = 1, \dots, n, \tag{5}$$
$$a_i \in \{0, 1\}, \quad i = 1, \dots, n,$$

where $a_i$ is a binary-valued variable indicating whether or not the sphere $S_i$ is included in the sphere cover $S$. The sum $\sum_{i=1}^{n} a_i$ is the number of spheres in the sphere cover $S$, and $\sum_{j=1}^{n} K_{ij} a_j$ is the number of spheres that cover the training example $x_i$. Therefore, the objective function minimizes the number of spheres included in the sphere cover $S$, subject to the constraints that every training example is covered by at least one sphere belonging to the right class.

For some problems, it might not be the best solution to require that every training example be covered by a sphere. To allow for some errors in the training data, we can introduce slack variables to relax the constraints (5). The relaxation of the constraints leads to the following reformulation of the integer program:

$$\min \sum_{i=1}^{n} a_i + C \sum_{i=1}^{n} \xi_i \tag{6}$$

subject to

$$\sum_{j=1}^{n} K_{ij} a_j \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{7}$$
$$a_i \in \{0, 1\} \text{ and } \xi_i \in \{0, 1\}, \quad i = 1, \dots, n,$$

where the $\xi_i$ are slack variables that relax the constraints in (5) and $C > 0$ is a constant that controls the trade-off between the training error and the number of spheres. The integer programming problem can be solved using a linear programming-based branch-and-bound algorithm. The resulting classification function is

$$f(x) = \mathrm{sgn}\left(\sum_{j=1}^{n} K(x, x_j)\, y_j\, a_j\right). \tag{8}$$
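As an illustration, the relaxed integer program (6)-(7) can be handed to a generic mixed-integer solver. The sketch below assumes SciPy 1.9 or later (for scipy.optimize.milp) and reuses the binary covering matrix K, the labels y, and the radii computed in the greedy sketch above; it is one possible realization, not necessarily the branch-and-bound code used by the authors.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_ip_cover(K, y, C=1.0):
    """Solve the integer program (6)-(7); variables are stacked as [a_1..a_n, xi_1..xi_n]."""
    n = K.shape[0]
    c = np.concatenate([np.ones(n), C * np.ones(n)])    # objective (6): sum a_i + C sum xi_i
    A = np.hstack([K, np.eye(n)])                       # constraint (7): K a + xi >= 1
    cons = LinearConstraint(A, lb=np.ones(n), ub=np.inf)
    res = milp(c=c, constraints=cons,
               integrality=np.ones(2 * n),              # all variables are integers ...
               bounds=Bounds(0, 1))                     # ... restricted to {0, 1}
    return np.round(res.x[:n]).astype(int)              # sphere indicators a_i

def predict(x, X, y, a, radii):
    """Classification function (8): sign of the vote of the covering spheres."""
    K_x = (np.linalg.norm(X - x, axis=1) <= radii).astype(int)   # spheres that cover x
    return np.sign(np.sum(K_x * y * a))

With C sufficiently large, all slack variables are driven to zero and the solution also satisfies the hard covering constraints (5).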

2.3 Minimum sphere covering by linear programming

In the above integer programming formulation of the minimum sphere covering problem, the covering matrix $K$ is binary by definition. As a consequence, the resulting decision boundary is piecewise spherical and may therefore be non-smooth at some points, e.g., at junctions of spheres. To obtain smoother decision boundaries, we can replace the Heaviside step function in the definition of the covering matrix $K$ by a soft threshold function. There are many smooth functions that can be used to approximate the hard threshold function. For example, using the exponential function, we can define a soft version of the covering matrix as

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{s\, r_j^2}\right), \tag{9}$$

where $s > 0$ is a parameter that controls the speed of decay of the exponential function. Another smooth function that can be used to approximate the hard threshold function is the sigmoid function, which has been widely used in neural networks. Using the sigmoid function, we can define a soft version of the covering matrix as

$$K(x_i, x_j) = \frac{1}{1 + \exp\left(\dfrac{\|x_i - x_j\|^2 - r_j^2}{\sigma}\right)}, \tag{10}$$

where $\sigma > 0$ is a parameter that controls the slope of the sigmoid function. The smaller $\sigma$ is, the steeper the slope and the more closely the sigmoid function approximates the hard threshold function.
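For concreteness, both soft covering matrices can be computed directly from the pairwise squared distances; the sketch below uses the radii of Eq. (2), and the parameter names s and sigma mirror Eqs. (9) and (10).

import numpy as np

def soft_covering_matrices(X, radii, s=1.0, sigma=1.0):
    """Gaussian (Eq. (9)) and sigmoid (Eq. (10)) soft covering matrices.
    Both are non-symmetric: column j is scaled by the radius r_j of sphere S_j."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)     # ||x_i - x_j||^2
    K_gauss = np.exp(-sq / (s * radii[None, :] ** 2))
    K_sigmoid = 1.0 / (1.0 + np.exp((sq - radii[None, :] ** 2) / sigma))
    return K_gauss, K_sigmoid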

With the soft versions of the covering matrix $K$, it is also possible to relax the constraints that require $a_i$ and $\xi_i$ to be integers. Taking into account the class labels of the data, we arrive at the following linear programming formulation:

$$\min \sum_{i=1}^{n} a_i + C \sum_{i=1}^{n} \xi_i \tag{11}$$

subject to

$$\sum_{j=1}^{n} y_i y_j K_{ij} a_j \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{12}$$
$$a_i \ge 0 \text{ and } \xi_i \ge 0, \quad i = 1, \dots, n.$$

Depending on which version of the covering matrix is used, the resulting classifier will be a linear combination of radial basis functions or sigmoid functions, resembling RBF networks and sigmoid neural networks.
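The linear program (11)-(12) can be solved with any off-the-shelf LP solver; the sketch below uses scipy.optimize.linprog as an illustration (our choice, not necessarily the solver used in the experiments). The variables are stacked as [a; xi] and constraint (12) is rewritten in the <= form expected by the solver.

import numpy as np
from scipy.optimize import linprog

def solve_lp_cover(K, y, C=1.0):
    """Solve the linear program (11)-(12); returns the nonnegative sphere weights a."""
    n = K.shape[0]
    c = np.concatenate([np.ones(n), C * np.ones(n)])     # objective (11)
    YKY = y[:, None] * K * y[None, :]                    # matrix with entries y_i y_j K_ij
    A_ub = np.hstack([-YKY, -np.eye(n)])                 # -(YKY) a - xi <= -1  <=>  (12)
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    return res.x[:n]

The nonzero entries of a identify the spheres (equivalently, the RBF or sigmoid units) retained in the classifier; with the Gaussian covering matrix, the decision function f(x) = sgn(sum_j K(x, x_j) y_j a_j) is a sparse linear combination of radial basis functions.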

2.4 Connections to LP boosting, SCM, and SVMs

The above linear programming formulation of the minimum sphere covering problem has a form similar to that of the linear programming boosting (LP boosting) problem [6], which is defined as

$$\min \sum_{i=1}^{m} a_i + C \sum_{i=1}^{n} \xi_i \tag{13}$$

subject to

$$\sum_{j=1}^{m} y_i H_{ij} a_j \ge 1 - \xi_i, \quad i = 1, \dots, n,$$
$$\xi_i \ge 0, \quad i = 1, \dots, n, \quad \text{and} \quad a_i \ge 0, \quad i = 1, \dots, m,$$

where $H_{ij} = h_j(x_i)$ is the label (+1 or -1) given by weak hypothesis $h_j$ on the training example $x_i$. More specifically, each column $H(\cdot, j)$ of the matrix $H$ contains the outputs of weak hypothesis $h_j$ on the training data, while each row $H(i, \cdot)$ contains the outputs of all the weak hypotheses on the example $x_i$. $C > 0$ is a constant that controls the trade-off between classification error and margin.

Although the two forms appear similar, there are several major differences. First, in the sphere covering formulation, the covering matrix $K$ is a square matrix, i.e., the number of columns and the number of rows are both equal to the number of training examples. In the LP boosting setting, however, the number of columns of the matrix $H$ is determined by the number of weak hypotheses, which is in general not the same as the number of rows. Second, in the sphere covering problem, each column of the covering matrix $K$ is associated with a particular training example. Specifically, the column vector $K(\cdot, j)$ indicates which training examples are covered by the sphere $S_j$ associated with the training example $x_j$. In the LP boosting formulation, however, weak hypotheses may not be associated with particular training examples. In fact, it is more common to associate a weak hypothesis with an attribute. For example, when the weak learners are decision stumps, which perform classification based on single attributes, each column is associated with an attribute, and consequently the number of columns equals the number of attributes.

It might be tempting to view the spheres we construct as data-dependent weak hypotheses. For instance, if we identify $K_{ij} y_j$, the output of the function $K(\cdot, x_j)$ on training example $x_i$, with $H_{ij}$, the two linear programs have the same form. The major difference, however, is that in the sphere covering approach the influence of each data-dependent sphere is limited by its radius: each data-dependent sphere $S_j$ of class $y_j$ assigns examples within its influence region to $y_j$, but outputs zero instead of $\bar{y}_j$ for examples that lie outside of its region of influence:

$$K(x_i, x_j)\, y_j = \begin{cases} y_j, & \text{if } d(x_i, x_j) \le r_j \\ 0, & \text{otherwise.} \end{cases} \tag{14}$$

In other words, the influence of each data-dependent sphere is localized. This is also the main difference between our sphere covering approach and the set covering machine (SCM) proposed by Marchand and Shawe-Taylor [7]-[8]. Indeed, the SCM can also be formulated as a linear programming boosting algorithm using data-dependent spheres as weak hypotheses, with the additional constraint that the weight vector over the training data is kept uniform. In the set covering machine, each data-dependent sphere of class $y_j$ classifies examples within the sphere as $y_j$ and examples outside the sphere as $\bar{y}_j$:

$$h_j(x_i) = \begin{cases} y_j, & \text{if } d(x_i, x_j) \le r_j \\ \bar{y}_j, & \text{otherwise.} \end{cases} \tag{15}$$

In addition, the radii $r_j$ of the spheres are determined differently in the two methods.
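To make the contrast explicit, the two weak-hypothesis outputs (14) and (15) differ only in their behavior outside the sphere (with labels in {-1, +1}, the opposite class of y_j is simply -y_j); the function names below are ours.

def sphere_output(d, r, y_j):
    # Eq. (14): vote y_j inside the sphere, abstain (output 0) outside.
    return y_j if d <= r else 0

def scm_hypothesis(d, r, y_j):
    # Eq. (15): classify the whole input space, voting -y_j outside the sphere.
    return y_j if d <= r else -y_j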

Traditionally, support vector machines seek the maximum separation margin between the two classes measured in the L2 norm, which results in a quadratic programming problem [9][10]. When the L∞ norm is adopted to measure the separation margin, however, the resulting problem can be formulated as a linear program:

$$\min \sum_{i=1}^{n} a_i + C \sum_{i=1}^{n} \xi_i \tag{16}$$

subject to

$$\sum_{j=1}^{n} y_i y_j K_{ij} a_j - b \ge 1 - \xi_i, \quad i = 1, \dots, n, \tag{17}$$
$$\xi_i \ge 0 \text{ and } a_i \ge 0, \quad i = 1, \dots, n.$$

Like the LP boosting problem, the linear programming SVM also has a form similar to that of the linear programming sphere covering problem. Linear programming SVMs equipped with Gaussian kernels $k(x_1, x_2) = \exp(-\|x_1 - x_2\|^2 / 2\sigma^2)$ are especially close to our approach with the Gaussian extension. The main difference lies in the definition of the kernel functions. In SVMs, the kernel function is given a priori, and the kernel matrix is symmetric and controlled by a single parameter $\sigma$, whereas in our approach the kernel matrix is non-symmetric (see Eq. (9)): the influence of a training example $x_j$ on another training example $x_i$ is modulated by the size $r_j$ of the sphere associated with $x_j$, so each training example has a different size of influence depending on its position relative to the other training data in the input space. A previous study indicates that modifying kernel matrices in a similar spirit improves the performance of SVMs [11]. The sphere covering interpretation underlying our linear programming approach leads naturally to the definition and construction of appropriate covering matrices, whereas kernels were originally introduced into SVMs simply as a computational trick for constructing more flexible decision boundaries. In addition, instead of maximizing the separation margin, the sphere covering approach minimizes the number of spheres used to represent the training data. The number of spheres is closely related to the generalization error through the data compression scheme of Littlestone and Warmuth [2].

3 Results and discussion

We have tested the linear programming sphere covering algorithm on real-world datasets from the UCI machine learning repository [12]. The sphere covering algorithm could be compared with many other learning algorithms, including LP boosting, the LP SCM, and SVMs with the L1 norm. We have chosen to compare it with standard SVMs with Gaussian kernels because, among the many alternatives, they have demonstrated superb performance in many real-world applications. In this paper, we therefore focus on comparing the linear programming sphere covering (LP SC) algorithm with Gaussian covering matrices against SVMs with Gaussian kernels in terms of generalization error and compression rate.

We used 5-fold cross-validation to estimate the generalization error of the classifiers and ensured that each training set and each testing set were the same for both the sphere covering and the SVM algorithms. The performance of the SVM classifier depends on the choice of the kernel parameter $\sigma$ and the margin parameter $C$; similarly, the results of the sphere covering algorithm depend on the choice of the smoothness parameter $s$ and the parameter $C$. For each algorithm, the parameter values were chosen from a set of candidate values as those giving the smallest 5-fold cross-validation error. The datasets used and the results obtained by the two algorithms are summarized in Table 1.

Table 1: Generalization error estimates and confidence intervals.

    DATASET          SVM              LP SC
    Breast Cancer    3.68 (±1.29)     3.24 (±1.41)
    Ionosphere       4.86 (±2.05)     3.71 (±1.43)
    Liver            31.47 (±5.16)    30.14 (±2.75)
    Pima             27.50 (±3.29)    26.67 (±2.38)
    Sonar            11.00 (±4.57)    11.22 (±3.58)

From Table 1, we see that, for all 5 datasets, the results obtained by the sphere covering algorithm are either better than or comparable to the best results obtained by the SVMs. The sphere covering algorithm also tends to display less variance in the error estimate.

For both the SVM classifiers with Gaussian kernels and the sphere covering algorithm with the Gaussian extension, the decision function is a linear combination of radial basis functions. Therefore, in addition to comparing the performance of the two algorithms, we have also investigated the number of radial basis functions created by each algorithm in each task. For each dataset, the average number of RBFs created by each algorithm over the 5 rounds of the cross-validation process is given in Table 2.

Table 2: Comparison of the number of radial basis functions.

    DATASET          SVM      LP SC
    Breast Cancer    231      145.4
    Ionosphere       75.3     68.2
    Liver            236.3    33.2
    Pima             506.3    58.4
    Sonar            102      79.4

It should be noted that the number of RBFs created by each algorithm depends on the values of the parameters. In Table 2, the numbers of RBF terms reported are the values that correspond to the smallest estimated generalization errors. We can see that for all 5 datasets, the sphere covering algorithm creates far fewer RBF terms than the SVM algorithm. Therefore, in terms of the sparsity of the solution measured by the number of RBF terms in the final decision function, the minimum sphere covering algorithm has a clear advantage over the SVM algorithm based on our experimental results.

4 Conclusion

In this paper we have presented an integer programming formulation of the minimum sphere covering problem that aims to construct a minimum number of spheres to represent the class regions. By replacing the hard threshold function in the integer program with a soft threshold based on radial basis functions or sigmoid functions, we have derived a linear programming problem that gives rise to radial basis function classifiers and sigmoid function classifiers. In contrast to traditional RBF networks and sigmoid function networks, in which the number of units is specified a priori, the new method explicitly minimizes the number of base units in the resulting classifiers. From this perspective, our method can be viewed as a new way to construct RBF networks and sigmoid function networks based on the structural risk minimization principle.

The linear programming problem we formulated has a form similar to that of the LP boosting method. However, the two differ significantly in the way the kernel matrix is defined. The fundamental difference is that each class-specific sphere in our approach, which can be roughly interpreted as a weak learner in the boosting framework, has only limited influence in the input space: it only classifies data that falls inside its influence region, in contrast with weak learners in the boosting framework, which dichotomize the whole input space.

We believe that our approach with the radial basis function extension is advantageous compared to support vector machines with Gaussian kernels. For instance, in our method, training examples that are close to decision boundaries have smaller regions of influence than training examples that are far from decision boundaries. In addition, the linear program explicitly minimizes the number of radial basis function terms in the resulting decision functions. Our experiments on real-world datasets have demonstrated that our method is competitive in classification performance and clearly superior in the sparsity of the solution.

References

[1] Reilly, D. L., Cooper, L. N., & Elbaum, C. (1982). A neural model for category learning. Biological Cybernetics, 45, 35–41.
[2] Littlestone, N., & Warmuth, M. (1986). Relating data compression and learnability. Unpublished manuscript, University of California, Santa Cruz.
[3] Vapnik, V. N. (1998). Statistical Learning Theory. New York, NY: Wiley.
[4] Garey, M. R., & Johnson, D. S. (1979). Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY: Freeman.
[5] Chvátal, V. (1979). A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4, 233–235.
[6] Demiriz, A., Bennett, K. P., & Shawe-Taylor, J. (2002). Linear programming boosting via column generation. Machine Learning, 46, 225–254.
[7] Marchand, M., & Shawe-Taylor, J. (2002). The set covering machine. Journal of Machine Learning Research, 3, 723–746.
[8] Hussain, Z., Szedmak, S., & Shawe-Taylor, J. (2004). The linear programming set covering machine.
[9] Boser, B. E., Guyon, I., & Vapnik, V. (1992). A training algorithm for optimal margin classifiers. Computational Learning Theory, 144–152.
[10] Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297.
[11] Amari, S., & Wu, S. (1999). Improving support vector machine classifiers by modifying kernel functions. Neural Networks, 12, 783–789.
[12] Blake, C. L., & Merz, C. J. (1998). UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html
