Minimax Probability Machine

Gert R.G. Lanckriet (http://robotics.eecs.berkeley.edu/~gert/)
Department of EECS, University of California, Berkeley, Berkeley, CA 94720-1770
[email protected]

Laurent El Ghaoui
Department of EECS, University of California, Berkeley, Berkeley, CA 94720-1770
[email protected]

Chiranjib Bhattacharyya
Department of EECS, University of California, Berkeley, Berkeley, CA 94720-1776
[email protected]

Michael I. Jordan
Computer Science and Statistics, University of California, Berkeley, Berkeley, CA 94720-1776
[email protected]
Abstract

When constructing a classifier, the probability of correct classification of future data points should be maximized. In the current paper this desideratum is translated in a very direct way into an optimization problem, which is solved using methods from convex optimization. We also show how to exploit Mercer kernels in this setting to obtain nonlinear decision boundaries. A worst-case bound on the probability of misclassification of future data is obtained explicitly.
1 Introduction

Consider the problem of choosing a linear discriminant by minimizing the probabilities that data vectors fall on the wrong side of the boundary. One way to attempt to achieve this is via a generative approach in which one makes distributional assumptions about the class-conditional densities and thereby estimates and controls the relevant probabilities. The need to make distributional assumptions, however, casts doubt on the generality and validity of such an approach, and in discriminative solutions to classification problems it is common to attempt to dispense with class-conditional densities entirely. Rather than avoiding any reference to class-conditional densities, it might be useful to attempt to control misclassification probabilities in a worst-case setting; that is, under all possible choices of class-conditional densities. Such a minimax approach could be viewed as providing an alternative justification for discriminative approaches. In this paper we show how such a minimax programme can be carried out in the setting of binary classification. Our approach involves exploiting the following powerful theorem due to Isii [6], as extended in recent work by Bertsimas
and Sethuraman [2]:
$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \ge b\} = \frac{1}{1+d^2}, \quad \text{with} \quad d^2 = \inf_{a^T y \ge b} \, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}), \qquad (1)$$
where $y$ is a random vector, where $a$ and $b$ are constants, and where the supremum is taken over all distributions having mean $\bar{y}$ and covariance matrix $\Sigma_y$. This theorem provides us with the ability to bound the probability of misclassifying a point, without making Gaussian or other specific distributional assumptions. We will show how to exploit this ability in the design of linear classifiers. One of the appealing features of this formulation is that one obtains an explicit upper bound on the probability of misclassification of future data: $1/(1+d^2)$. A second appealing feature of this approach is that, like linear discriminant analysis [7], it is possible to generalize the basic methodology, utilizing Mercer kernels and thereby forming nonlinear decision boundaries. We show how to do this in Section 3. The paper is organized as follows: in Section 2 we present the minimax formulation for linear classifiers, while in Section 3 we deal with kernelizing the method. We present empirical results in Section 4.
2 Maximum probabilistic decision hyperplane

In this section we present our minimax formulation for linear decision boundaries. Let $x$ and $y$ denote random vectors in a binary classification problem, with means and covariance matrices given by $x \sim (\bar{x}, \Sigma_x)$ and $y \sim (\bar{y}, \Sigma_y)$, where "$\sim$" means that the random variable has the specified mean and covariance matrix but that the distribution is otherwise unconstrained. Note that $\bar{x}, \bar{y} \in \mathbf{R}^n$ and $\Sigma_x, \Sigma_y \in \mathbf{R}^{n \times n}$. We want to determine the hyperplane $a^T z = b$ ($a, z \in \mathbf{R}^n$ and $b \in \mathbf{R}$) that separates the two classes of points with maximal probability with respect to all distributions having these means and covariance matrices. This boils down to:

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad \inf_{x \sim (\bar{x}, \Sigma_x)} \Pr\{a^T x \ge b\} \ge \alpha, \quad \inf_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \le b\} \ge \alpha \qquad (2)$$

or,

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad 1 - \sup \Pr\{a^T x \le b\} \ge \alpha, \quad 1 - \sup \Pr\{a^T y \ge b\} \ge \alpha. \qquad (3)$$

Consider the second constraint in (3). Recall the result of Bertsimas and Sethuraman [2]:

$$\sup_{y \sim (\bar{y}, \Sigma_y)} \Pr\{a^T y \ge b\} = \frac{1}{1+d^2}, \quad \text{with} \quad d^2 = \inf_{a^T y \ge b} \, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}). \qquad (4)$$

We can write this as $d^2 = \inf_{c^T w \ge d} w^T w$, where $w = \Sigma_y^{-1/2}(y - \bar{y})$, $c^T = a^T \Sigma_y^{1/2}$ and $d = b - a^T \bar{y}$. To solve this, first notice that we can assume that $a^T \bar{y} \le b$ (i.e. $\bar{y}$ is classified correctly by the decision hyperplane $a^T z = b$): indeed, otherwise we would find $d^2 = 0$ and thus $\alpha = 0$ for that particular $a$ and $b$, which can never be an optimal value. So, $d > 0$. We then form the Lagrangian:

$$\mathcal{L}(w, \lambda) = w^T w + \lambda (d - c^T w), \qquad (5)$$
which is to be maximized with respect to $\lambda \ge 0$ and minimized with respect to $w$. At the optimum, $2w = \lambda c$ and $d = c^T w$, so $\lambda = \frac{2d}{c^T c}$ and $w = \frac{d\,c}{c^T c}$. This yields:

$$d^2 = \inf_{a^T y \ge b} \, (y - \bar{y})^T \Sigma_y^{-1} (y - \bar{y}) = \frac{(b - a^T \bar{y})^2}{a^T \Sigma_y a}. \qquad (6)$$
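As a numerical sanity check of (6) and of the worst-case bound $1/(1+d^2)$ from (1), one can compare the closed-form $d^2$ against a Monte Carlo estimate for one particular distribution with the prescribed moments (here a Gaussian, used purely as an example; all variable names below are our own). This is an illustrative sketch, not part of the authors' method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Example moments and hyperplane (arbitrary, for illustration).
y_bar = np.array([0.0, 0.0])
sigma_y = np.array([[2.0, 0.5], [0.5, 1.0]])
a = np.array([1.0, 1.0])
b = 4.0  # a^T y_bar <= b, so y_bar lies on the correct side

# Closed-form d^2 from equation (6).
d2 = (b - a @ y_bar) ** 2 / (a @ sigma_y @ a)
bound = 1.0 / (1.0 + d2)  # worst-case Pr{a^T y >= b} from (1)

# Monte Carlo check with one distribution (Gaussian) having
# exactly the prescribed mean and covariance.
samples = rng.multivariate_normal(y_bar, sigma_y, size=200_000)
empirical = np.mean(samples @ a >= b)

assert empirical <= bound
```

The empirical misclassification probability of any such distribution stays below the distribution-free bound; the bound is attained only by an extremal (two-point) distribution, not by the Gaussian.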
Using (4), the second constraint in (3) becomes $1 - 1/(1+d^2) \ge \alpha$, or equivalently $d^2 \ge \alpha/(1-\alpha)$. Taking (6) into account, this boils down to:

$$b - a^T \bar{y} \ge \kappa(\alpha) \sqrt{a^T \Sigma_y a}, \quad \text{where} \quad \kappa(\alpha) = \sqrt{\frac{\alpha}{1-\alpha}}. \qquad (7)$$

We can handle the first constraint in (3) in a similar way (just write $a^T x \le b$ as $-a^T x \ge -b$ and apply the result (7) for the second constraint). The optimization problem (3) then becomes:

$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad -b + a^T \bar{x} \ge \kappa(\alpha) \sqrt{a^T \Sigma_x a}, \quad b - a^T \bar{y} \ge \kappa(\alpha) \sqrt{a^T \Sigma_y a}. \qquad (8)$$

Because $\kappa(\alpha)$ is a monotone increasing function of $\alpha$, we can write this as:

$$\max_{\kappa, a, b} \kappa \quad \text{s.t.} \quad -b + a^T \bar{x} \ge \kappa \sqrt{a^T \Sigma_x a}, \quad b - a^T \bar{y} \ge \kappa \sqrt{a^T \Sigma_y a}. \qquad (9)$$

From both constraints in (9), we get

$$a^T \bar{y} + \kappa \sqrt{a^T \Sigma_y a} \;\le\; b \;\le\; a^T \bar{x} - \kappa \sqrt{a^T \Sigma_x a}, \qquad (10)$$

which allows us to eliminate $b$ from (9):

$$\max_{\kappa, a} \kappa \quad \text{s.t.} \quad a^T \bar{y} + \kappa \sqrt{a^T \Sigma_y a} \le a^T \bar{x} - \kappa \sqrt{a^T \Sigma_x a}. \qquad (11)$$
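The change of variables from $\alpha$ to $\kappa$ is licensed by the fact that $\kappa(\alpha)$ is strictly increasing with inverse $\alpha = \kappa^2/(1+\kappa^2)$, which is how the worst-case probability is recovered after solving in $\kappa$. A quick check of this inverse map (helper names are our own, for illustration only):

```python
import math

def kappa(alpha):
    # kappa(alpha) = sqrt(alpha / (1 - alpha)), from equation (7)
    return math.sqrt(alpha / (1.0 - alpha))

def alpha_of(kappa_):
    # inverse map: alpha = kappa^2 / (1 + kappa^2)
    return kappa_ ** 2 / (1.0 + kappa_ ** 2)

# The two maps invert each other on (0, 1).
for alpha in (0.1, 0.5, 0.9, 0.99):
    assert math.isclose(alpha_of(kappa(alpha)), alpha)
```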
Because we want to maximize $\kappa$, it is obvious that the inequalities in (10) will become equalities at the optimum. The optimal value of $b$ will thus be given by

$$b_* = a_*^T \bar{x} - \kappa_* \sqrt{a_*^T \Sigma_x a_*} = a_*^T \bar{y} + \kappa_* \sqrt{a_*^T \Sigma_y a_*}, \qquad (12)$$

where $a_*$ and $\kappa_*$ are the optimal values of $a$ and $\kappa$ respectively. Rearranging the constraint in (11), we get:

$$a^T (\bar{x} - \bar{y}) \ge \kappa \left( \sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a} \right). \qquad (13)$$

The above is positively homogeneous in $a$: if $a$ satisfies (13), then so does $sa$ for any $s \in \mathbf{R}_+$. Furthermore, (13) implies $a^T (\bar{x} - \bar{y}) \ge 0$. Thus, we can restrict $a$ to be such that $a^T (\bar{x} - \bar{y}) = 1$. The optimization problem (11) then becomes

$$\max_{\kappa, a} \kappa \quad \text{s.t.} \quad \kappa \le \frac{1}{\sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a}}, \quad a^T (\bar{x} - \bar{y}) = 1, \qquad (14)$$

which allows us to eliminate $\kappa$:

$$\min_a \sqrt{a^T \Sigma_x a} + \sqrt{a^T \Sigma_y a} \quad \text{s.t.} \quad a^T (\bar{x} - \bar{y}) = 1, \qquad (15)$$

or, equivalently,

$$\min_a \|\Sigma_x^{1/2} a\|_2 + \|\Sigma_y^{1/2} a\|_2 \quad \text{s.t.} \quad a^T (\bar{x} - \bar{y}) = 1. \qquad (16)$$
This is a convex optimization problem, more precisely a second order cone program (SOCP) [8,5]. Furthermore, notice that we can write $a = a_0 + F u$, where $u \in \mathbf{R}^{n-1}$, $a_0 = (\bar{x} - \bar{y})/\|\bar{x} - \bar{y}\|_2^2$, and $F \in \mathbf{R}^{n \times (n-1)}$ is an orthogonal matrix whose columns span the subspace of vectors orthogonal to $\bar{x} - \bar{y}$. Using this we can write (16) as an unconstrained SOCP:

$$\min_u \|\Sigma_x^{1/2} (a_0 + F u)\|_2 + \|\Sigma_y^{1/2} (a_0 + F u)\|_2. \qquad (17)$$
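The pair $(a_0, F)$ used in (17) is easy to construct numerically, e.g. from an SVD-based orthonormal basis of the subspace orthogonal to $\bar{x} - \bar{y}$. A small NumPy sketch (illustrative only; the helper name is our own):

```python
import numpy as np

def affine_parametrization(x_bar, y_bar):
    """Return a0 and F such that every a = a0 + F @ u
    satisfies the constraint a^T (x_bar - y_bar) = 1."""
    delta = x_bar - y_bar
    a0 = delta / (delta @ delta)  # particular solution: a0^T delta = 1
    # Rows of vt form an orthonormal basis of R^n; the first row spans
    # delta, so the remaining rows span the orthogonal subspace.
    _, _, vt = np.linalg.svd(delta.reshape(1, -1))
    F = vt[1:].T
    return a0, F

x_bar = np.array([1.0, 2.0, 3.0])
y_bar = np.array([0.0, 0.0, 1.0])
a0, F = affine_parametrization(x_bar, y_bar)

u = np.array([0.7, -1.3])  # any u keeps the constraint satisfied
a = a0 + F @ u
assert abs(a @ (x_bar - y_bar) - 1.0) < 1e-12
```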
We can solve this problem in various ways, for example using interior-point methods for SOCP [8], which yield a worst-case complexity of $O(n^3)$. Of course, the first and second moments of $x$ and $y$ must be estimated beforehand, using for example the plug-in estimates $\hat{x}, \hat{y}, \hat{\Sigma}_x, \hat{\Sigma}_y$ for $\bar{x}, \bar{y}, \Sigma_x, \Sigma_y$ respectively. This brings the total complexity to $O(l n^3)$, where $l$ is the number of data points. This is the same complexity as the quadratic programs one has to solve in support vector machines. In our implementations, we took an iterative least-squares approach, which is based on the following form, equivalent to (17):

$$\min_{u, \beta, \eta} \; \frac{\beta}{2} + \frac{1}{2\beta} \|\Sigma_x^{1/2} (a_0 + F u)\|_2^2 + \frac{\eta}{2} + \frac{1}{2\eta} \|\Sigma_y^{1/2} (a_0 + F u)\|_2^2. \qquad (18)$$

At iteration $k$, we first minimize with respect to $\beta$ and $\eta$ by setting $\beta_k = \|\Sigma_x^{1/2} (a_0 + F u_{k-1})\|_2$ and $\eta_k = \|\Sigma_y^{1/2} (a_0 + F u_{k-1})\|_2$. Then we minimize with respect to $u$ by solving a least-squares problem in $u$ for $\beta = \beta_k$ and $\eta = \eta_k$, which gives us $u_k$. Because neither update step can increase the objective of this convex optimization problem, the iteration converges to the global minimum $\|\Sigma_x^{1/2} (a_0 + F u_*)\|_2 + \|\Sigma_y^{1/2} (a_0 + F u_*)\|_2$, with $u_*$ an optimal value of $u$. We then obtain $a_* = a_0 + F u_*$ and $b_*$ from (12), with $\kappa_* = 1 \big/ \left( \sqrt{a_*^T \Sigma_x a_*} + \sqrt{a_*^T \Sigma_y a_*} \right)$. Classification of a new data point $z_{\text{new}}$ is done by evaluating $\operatorname{sign}(a_*^T z_{\text{new}} - b_*)$: if this is $+1$, $z_{\text{new}}$ is classified as from class $x$, otherwise $z_{\text{new}}$ is classified as from class $y$.

It is interesting to see what happens if we make distributional assumptions; in particular, let us assume that $x \sim \mathcal{N}(\bar{x}, \Sigma_x)$ and $y \sim \mathcal{N}(\bar{y}, \Sigma_y)$. This leads to the following optimization problem:
$$\max_{\alpha, a, b} \alpha \quad \text{s.t.} \quad -b + a^T \bar{x} \ge \Phi^{-1}(\alpha) \sqrt{a^T \Sigma_x a}, \quad b - a^T \bar{y} \ge \Phi^{-1}(\alpha) \sqrt{a^T \Sigma_y a}, \qquad (19)$$
where $\Phi(z)$ is the cumulative distribution function of a standard normal distribution. This has the same form as (8), but now with $\kappa(\alpha) = \Phi^{-1}(\alpha)$ instead of $\kappa(\alpha) = \sqrt{\alpha/(1-\alpha)}$ (cf. a result by Chernoff [4]). We thus solve the same optimization problem ($\alpha$ disappears from the optimization problem because $\kappa(\alpha)$ is monotone increasing) and find the same decision hyperplane $a_*^T z = b_*$. The difference lies in the value of $\alpha$ associated with $\kappa_*$: $\alpha$ will be higher in this case, so the hyperplane will have a higher predicted probability of classifying future data correctly.
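The full procedure of this section — plug-in moment estimates, the parametrization $a = a_0 + Fu$, the iterative least-squares updates for (18), and the recovery of $b_*$, $\kappa_*$ and $\alpha$ — can be sketched compactly in NumPy. This is our own illustrative implementation under stated assumptions (a small ridge for numerical stability, and `np.cov`'s $1/(N-1)$ convention for the covariance estimates), not the authors' code; all names are invented for the example:

```python
import numpy as np
from statistics import NormalDist

def train_mpm(X, Y, n_iter=50):
    """Linear minimax probability machine via the iterative
    least-squares scheme for (18). X: (N_x, n), Y: (N_y, n).
    Returns the hyperplane (a, b), alpha, and kappa."""
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    # Cholesky factors S with S^T S = Sigma, so ||S a||_2 = sqrt(a^T Sigma a);
    # the small ridge keeps the factorization well defined.
    Sx = np.linalg.cholesky(np.cov(X, rowvar=False) + 1e-8 * np.eye(X.shape[1])).T
    Sy = np.linalg.cholesky(np.cov(Y, rowvar=False) + 1e-8 * np.eye(Y.shape[1])).T

    # Parametrize the constraint a^T (x_bar - y_bar) = 1 as a = a0 + F u.
    delta = x_bar - y_bar
    a0 = delta / (delta @ delta)
    _, _, vt = np.linalg.svd(delta.reshape(1, -1))
    F = vt[1:].T  # orthonormal basis of the subspace orthogonal to delta

    u = np.zeros(F.shape[1])
    for _ in range(n_iter):
        a = a0 + F @ u
        beta = np.linalg.norm(Sx @ a) + 1e-12  # beta_k update
        eta = np.linalg.norm(Sy @ a) + 1e-12   # eta_k update
        # Least-squares step in u for fixed beta, eta.
        A = np.vstack([Sx @ F / np.sqrt(beta), Sy @ F / np.sqrt(eta)])
        r = -np.concatenate([Sx @ a0 / np.sqrt(beta), Sy @ a0 / np.sqrt(eta)])
        u = np.linalg.lstsq(A, r, rcond=None)[0]

    a = a0 + F @ u
    sx, sy = np.linalg.norm(Sx @ a), np.linalg.norm(Sy @ a)
    kappa = 1.0 / (sx + sy)
    b = a @ x_bar - kappa * sx           # equation (12)
    alpha = kappa**2 / (1.0 + kappa**2)  # distribution-free guarantee
    return a, b, alpha, kappa

# Toy usage: two well-separated clouds.
rng = np.random.default_rng(1)
X = rng.normal([3.0, 3.0], 0.5, size=(200, 2))
Y = rng.normal([0.0, 0.0], 0.5, size=(200, 2))
a, b, alpha, kappa = train_mpm(X, Y)

# Under Gaussian assumptions, the same hyperplane carries the larger
# guarantee Phi(kappa), as discussed above.
alpha_gauss = NormalDist().cdf(kappa)
assert 0.0 < alpha < alpha_gauss < 1.0
assert (np.sign(X @ a - b) == 1).mean() > 0.9
```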
3 Kernelization

In this section we describe the "kernelization" of the minimax approach described in the previous section. We seek to map the problem to a higher dimensional feature space $\mathbf{R}^f$ via a mapping $\varphi : \mathbf{R}^n \mapsto \mathbf{R}^f$, such that a linear discriminant in the feature space corresponds to a nonlinear discriminant in the original space. To carry out this programme, we need to reformulate the minimax problem in terms of a kernel function $K(z_1, z_2) = \varphi(z_1)^T \varphi(z_2)$ satisfying Mercer's condition. Let the data be mapped as $x \mapsto \varphi(x) \sim (\overline{\varphi(x)}, \Sigma_{\varphi(x)})$ and $y \mapsto \varphi(y) \sim (\overline{\varphi(y)}, \Sigma_{\varphi(y)})$, where $\{x_i\}_{i=1}^{N_x}$ and $\{y_i\}_{i=1}^{N_y}$ are the training data points in the classes corresponding to $x$ and $y$ respectively. The decision hyperplane in $\mathbf{R}^f$ is then given by $a^T \varphi(z) = b$, with $a, \varphi(z) \in \mathbf{R}^f$ and $b \in \mathbf{R}$. In $\mathbf{R}^f$, we need to solve the following optimization problem:

$$\min_a \sqrt{a^T \Sigma_{\varphi(x)} a} + \sqrt{a^T \Sigma_{\varphi(y)} a} \quad \text{s.t.} \quad a^T \left( \overline{\varphi(x)} - \overline{\varphi(y)} \right) = 1, \qquad (20)$$

where, as in (12), the optimal value of $b$ will be given by

$$b_* = a_*^T \overline{\varphi(x)} - \kappa_* \sqrt{a_*^T \Sigma_{\varphi(x)} a_*} = a_*^T \overline{\varphi(y)} + \kappa_* \sqrt{a_*^T \Sigma_{\varphi(y)} a_*}, \qquad (21)$$

where $a_*$ and $\kappa_*$ are the optimal values of $a$ and $\kappa$ respectively. However, we do not wish to solve the convex optimization problem in this form, because we want to avoid using $f$ or $\varphi$ explicitly. If $a$ has a component in $\mathbf{R}^f$ which is orthogonal to the subspace spanned by $\varphi(x_i)$, $i = 1, 2, \ldots, N_x$, and $\varphi(y_i)$, $i = 1, 2, \ldots, N_y$, then that component does not affect the objective or the constraint in (20). This implies that we can write $a$ as

$$a = \sum_{i=1}^{N_x} \alpha_i \varphi(x_i) + \sum_{j=1}^{N_y} \beta_j \varphi(y_j). \qquad (22)$$
Substituting expression (22) for $a$ and the estimates

$$\overline{\varphi(x)} = \frac{1}{N_x} \sum_{i=1}^{N_x} \varphi(x_i), \quad \overline{\varphi(y)} = \frac{1}{N_y} \sum_{i=1}^{N_y} \varphi(y_i),$$

$$\hat{\Sigma}_{\varphi(x)} = \frac{1}{N_x} \sum_{i=1}^{N_x} \left( \varphi(x_i) - \overline{\varphi(x)} \right) \left( \varphi(x_i) - \overline{\varphi(x)} \right)^T, \quad \hat{\Sigma}_{\varphi(y)} = \frac{1}{N_y} \sum_{i=1}^{N_y} \left( \varphi(y_i) - \overline{\varphi(y)} \right) \left( \varphi(y_i) - \overline{\varphi(y)} \right)^T$$

for the means and the covariance matrices in the objective and the constraint of the optimization problem (20), we see that both the objective and the constraint can be written in terms of the kernel function $K(z_1, z_2) = \varphi(z_1)^T \varphi(z_2)$. We obtain:

$$\min_\gamma \sqrt{\frac{1}{N_x} \gamma^T \tilde{K}_x^T \tilde{K}_x \gamma} + \sqrt{\frac{1}{N_y} \gamma^T \tilde{K}_y^T \tilde{K}_y \gamma} \quad \text{s.t.} \quad \gamma^T (\tilde{k}_x - \tilde{k}_y) = 1, \qquad (23)$$

where $\gamma = [\alpha_1 \; \alpha_2 \; \cdots \; \alpha_{N_x} \; \beta_1 \; \beta_2 \; \cdots \; \beta_{N_y}]^T$; $\tilde{k}_x \in \mathbf{R}^{N_x+N_y}$ with $[\tilde{k}_x]_i = \frac{1}{N_x} \sum_{j=1}^{N_x} K(x_j, z_i)$; $\tilde{k}_y \in \mathbf{R}^{N_x+N_y}$ with $[\tilde{k}_y]_i = \frac{1}{N_y} \sum_{j=1}^{N_y} K(y_j, z_i)$; and $z_i = x_i$ for $i = 1, 2, \ldots, N_x$, $z_i = y_{i-N_x}$ for $i = N_x+1, N_x+2, \ldots, N_x+N_y$. $\tilde{K}$ is defined as:

$$\tilde{K} = \begin{pmatrix} K_x - \mathbf{1}_{N_x} \tilde{k}_x^T \\ K_y - \mathbf{1}_{N_y} \tilde{k}_y^T \end{pmatrix} = \begin{pmatrix} \tilde{K}_x \\ \tilde{K}_y \end{pmatrix}, \qquad (24)$$

where $\mathbf{1}_m$ is a column vector with ones of dimension $m$, and $K_x$ and $K_y$ contain respectively the first $N_x$ rows and the last $N_y$ rows of the Gram matrix $K$ (defined as $K_{ij} = \varphi(z_i)^T \varphi(z_j) = K(z_i, z_j)$). We can also write (23) as

$$\min_\gamma \frac{1}{\sqrt{N_x}} \|\tilde{K}_x \gamma\|_2 + \frac{1}{\sqrt{N_y}} \|\tilde{K}_y \gamma\|_2 \quad \text{s.t.} \quad \gamma^T (\tilde{k}_x - \tilde{k}_y) = 1, \qquad (25)$$

which is a second order cone program (SOCP) [5] that has the same form as the SOCP in (16) and can thus be solved in a similar way. Notice that, in this case, the optimizing variable is $\gamma \in \mathbf{R}^{N_x+N_y}$ instead of $a \in \mathbf{R}^n$. Thus the dimension of the optimization problem increases, but the solution is more powerful because the kernelization corresponds to a more complex decision boundary in $\mathbf{R}^n$. Similarly, the optimal value $b_*$ of $b$ in (21) then becomes

$$b_* = \gamma_*^T \tilde{k}_x - \kappa_* \sqrt{\frac{1}{N_x} \gamma_*^T \tilde{K}_x^T \tilde{K}_x \gamma_*} = \gamma_*^T \tilde{k}_y + \kappa_* \sqrt{\frac{1}{N_y} \gamma_*^T \tilde{K}_y^T \tilde{K}_y \gamma_*}, \qquad (26)$$

where $\gamma_*$ and $\kappa_*$ are the optimal values of $\gamma$ and $\kappa$ respectively. Once $\gamma_*$ is known, we get $\kappa_* = 1 \Big/ \left( \sqrt{\frac{1}{N_x} \gamma_*^T \tilde{K}_x^T \tilde{K}_x \gamma_*} + \sqrt{\frac{1}{N_y} \gamma_*^T \tilde{K}_y^T \tilde{K}_y \gamma_*} \right)$ and then $b_*$ from (26). Classification of a new data point $z_{\text{new}}$ is then done by evaluating $\operatorname{sign}(a_*^T \varphi(z_{\text{new}}) - b_*) = \operatorname{sign}\left( \sum_{i=1}^{N_x+N_y} [\gamma_*]_i K(z_i, z_{\text{new}}) - b_* \right)$ (again only in terms of the kernel function): if this is $+1$, $z_{\text{new}}$ is classified as from class $x$, otherwise $z_{\text{new}}$ is classified as from class $y$.
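The block matrices in (24) are straightforward to assemble from the Gram matrix. The NumPy sketch below (our own notation, with a Gaussian kernel chosen purely as an example) also checks the property that makes (23) work: the columns of $\tilde{K}_x$ and $\tilde{K}_y$ have zero mean, which is how the centered covariance terms arise:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    # Gaussian kernel K(a, b) = exp(-||a - b||^2 / sigma)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma)

rng = np.random.default_rng(2)
X = rng.normal(0.0, 1.0, size=(5, 3))   # N_x = 5 points in R^3
Y = rng.normal(1.0, 1.0, size=(4, 3))   # N_y = 4 points
Z = np.vstack([X, Y])                   # z_1, ..., z_{N_x + N_y}

K = rbf(Z, Z)                  # Gram matrix, (N_x+N_y) x (N_x+N_y)
Kx, Ky = K[:5], K[5:]          # first N_x rows, last N_y rows
k_x = Kx.mean(axis=0)          # [k~_x]_i = (1/N_x) sum_j K(x_j, z_i)
k_y = Ky.mean(axis=0)
Kx_t = Kx - k_x                # K~_x = K_x - 1_{N_x} k~_x^T (broadcast)
Ky_t = Ky - k_y

# Columns of the centered blocks have zero mean.
assert np.allclose(Kx_t.mean(axis=0), 0.0)
assert np.allclose(Ky_t.mean(axis=0), 0.0)
```

With these blocks in hand, (25) has exactly the shape of (16) with $\gamma$ in place of $a$, so the same iterative least-squares solver applies unchanged.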
4 Experiments

In this section we report the results of experiments that we carried out to test our algorithmic approach. The validity of $1 - \alpha$ as a worst-case bound on the probability of misclassification of future data is checked, and we also assess the usefulness of the kernel trick in this setting. We compare linear kernels and Gaussian kernels.

Experimental results on standard benchmark problems are summarized in Table 1. The Wisconsin breast cancer dataset contained 16 examples with missing values, which were not used. The breast cancer, pima diabetes, and ionosphere data were obtained from the UCI repository, while the heart data were obtained from STATLOG. Data for the twonorm problem were generated as specified in [3]. Each dataset was randomly partitioned into 90% training and 10% test sets. The kernel parameter $\sigma$ for the Gaussian kernel ($e^{-\|x-y\|^2/\sigma}$) was tuned using cross-validation over 20 random partitions. The reported results are the averages over 50 random partitions for both the linear kernel and the Gaussian kernel with $\sigma$ chosen as above. The results are comparable with those in the existing literature [3]. Also, we notice that $\alpha$ is indeed smaller than the test-set accuracy in all but one case. Furthermore, $\alpha$ is smaller for a linear decision boundary than for the nonlinear decision boundary obtained via the Gaussian kernel. This clearly shows that kernelizing the method leads to more powerful decision boundaries.

Table 1: $\alpha$ and test-set accuracy (TSA) compared to BPB (best performance in [3])

                  Linear kernel       Gaussian kernel
  Dataset         alpha     TSA       alpha     TSA       BPB
  Twonorm         80.2 %    95.2 %    86.2 %    95.8 %    96.3 %
  Breast cancer   84.4 %    97.2 %    92.7 %    97.3 %    96.8 %
  Ionosphere      63.3 %    84.7 %    94.9 %    93.4 %    93.7 %
  Pima diabetes   31.2 %    73.8 %    33.0 %    74.6 %    76.1 %
  Heart           42.8 %    81.4 %    50.5 %    83.6 %    unknown
5 Conclusions

The problem of linear discrimination has a long and distinguished history. Many results on misclassification rates have been obtained by making distributional assumptions (e.g., Anderson and Bahadur [1]). Our results, on the other hand, make use of recent work on moment problems and semidefinite optimization to obtain distribution-free results for linear discriminants. We have also shown how to exploit Mercer kernels to generalize our algorithm to nonlinear classification. The computational complexity of our method is comparable to the quadratic program that one has to solve for the support vector machine (SVM). While we have used a simple iterative least-squares approach, we believe that there is much to gain from exploiting analogies to the SVM and developing specialized optimization procedures for our algorithm, in particular tools that break the data into subsets. This is a current focus of our research, as is the problem of developing a variant of our algorithm for multiway classification.
Acknowledgements
We would like to acknowledge support from ONR MURI N00014-00-1-0637, from NSF grants IIS-9988642 and ECS-9983874 and from the Belgian American Educational Foundation.
References

[1] Anderson, T. W. and Bahadur, R. R. (1962) Classification into two multivariate normal distributions with different covariance matrices. Annals of Mathematical Statistics 33(2): 420-431.

[2] Bertsimas, D. and Sethuraman, J. (2000) Moment problems and semidefinite optimization. In Handbook of Semidefinite Optimization, 469-509. Kluwer Academic Publishers.

[3] Breiman, L. (1997) Arcing classifiers. Technical Report 460, Statistics Department, University of California, Berkeley.

[4] Chernoff, H. (1972) The selection of effective attributes for deciding between hypotheses using linear discriminant functions. In Frontiers of Pattern Recognition (S. Watanabe, ed.), 55-60. New York: Academic Press.

[5] Boyd, S. and Vandenberghe, L. (2001) Convex Optimization. Course notes for EE364, Stanford University. Available at http://www.stanford.edu/class/ee364.

[6] Isii, K. (1963) On the sharpness of Chebyshev-type inequalities. Annals of the Institute of Statistical Mathematics 14: 185-197.

[7] Mika, S., Rätsch, G., Weston, J., Schölkopf, B., and Müller, K.-R. (1999) Fisher discriminant analysis with kernels. In Neural Networks for Signal Processing IX, 41-48. New York: IEEE Press.

[8] Nesterov, Y. and Nemirovsky, A. (1994) Interior-Point Polynomial Methods in Convex Programming: Theory and Applications. Philadelphia, PA: SIAM.