Optimizing Area Under the ROC Curve using Ranking SVMs

Kaan Ataman
Department of Management Sciences
The University of Iowa
[email protected]

W. Nick Street
Department of Management Sciences
The University of Iowa
[email protected]

ABSTRACT

Area Under the ROC Curve (AUC), often used for comparing classifiers, is a widely accepted performance measure for ranking instances. Many researchers have studied the optimization of AUC, usually via optimizing some approximation of a ranking function. Ranking SVMs are among the better performers, but their usage in the literature is typically limited to learning a total ranking from partial rankings. In this paper, we observe that a ranking SVM is in fact a direct optimization of AUC via optimization of the Wilcoxon-Mann-Whitney statistic. We compare a linear ranking SVM with some well-known linear classifiers such as the linear SVM, the perceptron and logit in the context of binary classification. We show that an SVM optimized for ranking not only achieves better AUC than the other linear classifiers on average, but also performs comparably in accuracy.

Keywords

AUC, ranking, SVM, optimization

1. INTRODUCTION

Many real world classification problems may require an ordering instead of a simple classification. For example, in direct mail marketing, the marketer may want to know which potential customers to target with new catalogs so as to maximize revenue. If the number of catalogs the marketer can send is limited, then it is not enough to know who is likely to buy a product after receiving the catalog. In this case it is more useful to distinguish the top n percent of potential customers with the highest likelihood of buying a product. By doing so, the marketer can more intelligently target a subset of her customer base, increasing the expected profit. A slightly different example is collaborative book recommendation. To intelligently recommend a book, an algorithm must identify similarities among the book preferences of people who have read the book and submitted a score for it.

Combining these scores, it is possible to return a recommendation list to the user. Ideally, this list would be ranked so that the book at the top of the list is expected to appeal most to the user. It is not hard to see that there is a slight difference between these two ranking problems. In the marketing problem the ranking is based on binary input (purchase, non-purchase), whereas in the book recommendation problem the ranking is based on a collection of partial rankings. In this paper we consider the problem of ranking based on binary input.

Ranking is a popular machine learning problem that has been addressed extensively in the literature [3, 4, 19]. The most commonly used performance measure for evaluating a classifier's ability to rank instances in binary classification problems is the area under the ROC curve. ROC curves are widely used for visualizing and comparing the performance of binary classifiers. ROC curves were originally used in signal detection theory and were introduced to the machine learning community by Spackman in 1989, who showed that they can be used for evaluation and comparison of algorithms [17]. ROC curves plot the true positive rate against the false positive rate as the decision threshold (the probability of membership in a class, or a score) is varied. In ROC space, the upper left corner represents perfect classification, while the diagonal represents random classification. A point in ROC space that lies to the upper left of another point represents a better classifier. Some classifiers, such as neural networks or naive Bayes, naturally provide these probabilities, while others, such as decision trees, do not; it is still possible to produce ROC curves for any type of classifier with minor adjustments [5].

Area Under the ROC Curve (AUC) is a single scalar value for classifier comparison [2, 9]. Statistically speaking, the AUC of a classifier is the probability that the classifier will rank a randomly chosen positive instance higher than a randomly chosen negative instance. Since AUC is a probability, its value varies between 0 and 1, where 1 represents all positives being ranked higher than all negatives. Larger AUC values indicate better classifier performance across the full range of possible thresholds. Even though a classifier with high AUC can be outperformed by a lower-AUC classifier in some region of the ROC space, in general the high-AUC classifier is better.
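To make the threshold-sweeping construction of ROC points concrete, the following is a minimal NumPy sketch (ours, not part of the original paper; names are illustrative):

import numpy as np

def roc_points(scores, labels):
    # labels: 1 for positive, 0 for negative; larger score = more positive
    order = np.argsort(-np.asarray(scores, dtype=float))   # sort by descending score
    labels = np.asarray(labels)[order]
    p = labels.sum()                                        # number of positives
    n = len(labels) - p                                     # number of negatives
    tpr = np.cumsum(labels) / p                             # true positive rate at each cutoff
    fpr = np.cumsum(1 - labels) / n                         # false positive rate at each cutoff
    return fpr, tpr

Each (fpr, tpr) pair is one point on the ROC curve; the area under the resulting curve is the AUC discussed below.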

AUC can also be expressed as the ratio of the number of correctly ranked positive-negative pairs to the number of all possible pairs [8]. This quantity is called the Wilcoxon-Mann-Whitney (WMW) statistic [13, 20].

W = ( Σ_{i=0}^{p−1} Σ_{j=0}^{n−1} I(x_i, y_j) ) / (pn)          (1)

where I(x_i, y_j) = 1 if x_i > y_j and 0 otherwise, p is the number of positive points, n is the number of negative points, x_i is the i-th positive point and y_j is the j-th negative point. Note that in the binary classification problem we have no prior knowledge of intra-class rankings.
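As an illustration of Eq. (1), the WMW statistic can be computed directly from the scores a model assigns to positive and negative points. A small NumPy sketch (ours, not from the paper; O(pn), intended only to make the definition concrete):

import numpy as np

def wmw_statistic(pos_scores, neg_scores):
    # Fraction of positive-negative pairs in which the positive point
    # receives the higher score, i.e., the WMW statistic of Eq. (1).
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    correct_pairs = (pos[:, None] > neg[None, :]).sum()   # I(x_i, y_j) summed over all pairs
    return correct_pairs / (len(pos) * len(neg))

# Example: wmw_statistic([0.9, 0.7], [0.8, 0.4]) returns 0.75.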

Mainstream classifiers are usually designed to minimize classification error, so they are not specifically designed to rank well. Support Vector Machines (SVMs) are an exception, since by nature (margin maximization) they turn out to be good rankers. It has been shown that accuracy is not always the best performance measure, especially for datasets with skewed class or cost distributions, as many real world problems have. The machine learning community has recently explored the question of which performance measure is better in general. Ling and Huang showed in their recent study that AUC is a statistically consistent and more discriminating measure than accuracy [12]. Recent research indicates an increase in the popularity of AUC as a performance measure, especially in cases where class priors and/or misclassification costs are unknown.

In the previous literature, researchers have introduced several ways to optimize rankings. Cohen, Schapire and Singer proposed a two-stage approach in which a preference function is first learned and new instances are then ordered to maximize agreement with the learned function [4]. Caruana, Baluja and Mitchell introduced Rankprop, an algorithm that improves rankings by backpropagation with a sum-of-squares error on estimated ranks [3]. Vogt and Cottrell introduced d′, an optimization measure for ranking problems [19]; they claim their method works as well as optimizing exact precision, a popular IR performance measure. Recently, optimizing the area under the ROC curve has become a focal point for researchers. The problem can be framed as optimizing a total ranking either from a given collection of rankings or from 0-1 input, depending on the available input. Freund et al. introduced RankBoost, an algorithm for combining preferences [6]; in the context of collaborative filtering, the algorithm combines users' ranked preferences to create a recommendation list optimally ranked for a similar user. Yan et al. took a direct approach to optimizing the AUC: they use an alternative continuous function to approximate the WMW statistic, which is itself not continuous, and optimize this function directly [22]. Another recent paper, by Herschtal and Raskutti, introduces an algorithm that optimizes AUC using gradient descent [10].

In that study, the authors approximated the WMW statistic with a sigmoid function, making use of its differentiability. They also introduced an efficient way to reduce the number of constraints in the optimization problem, which reduces the overall complexity. Ranking SVMs were introduced by Herbrich et al. [15] and Joachims [11]. To our knowledge, a comparison of ranking SVMs to other baseline classifiers in the binary classification context has not been studied. The closest article we were able to find is the optimization of AUC using SVMs by Rakotomamonjy [16], which shows an application of a quadratic-programming-based algorithm to optimize AUC. Joachims used ranking SVMs in the context of optimizing search engines [11]. Even though he does not experiment with binary classification in his paper, here we show that his approach can be extended to binary classification problems.

In this paper we focus on maximizing the number of correct pairwise rankings, which in turn optimizes the Wilcoxon-Mann-Whitney statistic. In the following section we explain the optimization problem and our experimental setup in detail. Section 3 covers the experimental results and discussion. Finally, we finish the paper with conclusions.

2. AUC OPTIMIZATION

2.1 Primal and Dual Formulations

To directly optimize the WMW statistic, we would like all positive points to be ranked higher than all negative points. We therefore construct a linear program with soft constraints penalizing negatives that are ranked above positives. This requires p (the number of positive points) times n (the number of negative points) constraints in total. A perfect linear separation of classes is impossible for most real world datasets, so the LP has to be constructed to minimize an error term corresponding to the number of incorrect orderings. As in the classification case, our formulation avoids combinatorial complexity by minimizing the distance by which such pairs are mis-ordered. The primal form of the LP is given below.

minimize_{w,v,z}   e^T z + e^T v

subject to         −Cw − z ≤ −1
                   w − v ≤ 0
                   −w − v ≤ 0
                   v, z ≥ 0                    (2)

where e is a vector of ones in the appropriate dimension, w represents the attribute weights, v minimizes the magnitude of the weights and z is a vector of error terms corresponding to each pairwise misranking. C represents a matrix of misrankings, more specifically:

Ci = Ai − Bi                    (3)

where Ai represents a collection of positive points, Bi represents a collection of negative points, and Ci constrains a positive point Ai to be ranked higher than a negative point Bi. In the ideal case the C matrix has pn rows. The result of this LP is a vector w such that the total distance by which pairs are mis-ranked is minimized. By finding such a w without an intercept, we are in fact learning an optimal family of parallel planes; this is equivalent to scanning a single separating plane across the dataset parallel to its normal vector and constructing an optimal ROC curve from the series of intercept values.

At this point our LP looks similar to a soft-margin linear SVM [18]. Like an SVM, our LP maximizes the margin by minimizing the magnitude of the weights. The parameter that controls the balance between error and margin is set to one in our case, so that the margin term and the sum of errors are balanced; we observed that on average, using one was as good as any other value for this coefficient. An SVM optimizes the margin of the relevant points (support vectors) relative to a single separating plane, while the ranking SVM optimizes the margin between each inter-class pair of points along the normal direction of the plane. As such, the number of points influencing the ranking solution is potentially much larger than the number affecting the separating plane.
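To make the construction concrete, the following sketch builds C from all positive-negative pairs and solves LP (2) with an off-the-shelf solver. It is an illustrative reimplementation (the paper's own implementation used Matlab with GLPK); the function and variable names are ours:

import numpy as np
from scipy.optimize import linprog

def rank_lp_weights(X_pos, X_neg):
    # Solve LP (2); the variable vector is stacked as [w (free), v >= 0, z >= 0].
    p, d = X_pos.shape
    n = X_neg.shape[0]
    m = p * n
    # One row per positive-negative pair: C_i = A_i - B_i  (Eq. 3)
    C = (X_pos[:, None, :] - X_neg[None, :, :]).reshape(m, d)
    c = np.concatenate([np.zeros(d), np.ones(d), np.ones(m)])       # objective e^T v + e^T z
    A_ub = np.block([
        [-C,          np.zeros((m, d)), -np.eye(m)       ],         # -Cw - z <= -1
        [ np.eye(d), -np.eye(d),         np.zeros((d, m))],         #  w - v  <=  0
        [-np.eye(d), -np.eye(d),         np.zeros((d, m))],         # -w - v  <=  0
    ])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * d)])
    bounds = [(None, None)] * d + [(0, None)] * (d + m)             # w free; v, z nonnegative
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]                                                # the learned weight vector w

Ranking scores for new points are then obtained as X @ w, and sweeping a threshold over these scores traces out the ROC curve.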

We use the dual of the LP to reduce the number of constraints, as the soft constraints on w become simple bounds on the dual variables. The dual of the above LP can be arranged as:

minimize_{α,β,γ}   e^T α

subject to         −C^T α + β − γ = 0
                   −α ≤ 1
                   −β − γ ≤ 1
                   α, β, γ ≤ 0                    (4)

2.2 The Algorithm

Previously we mentioned that, ideally, we would like to include all possible constraints so that each positive point is ranked above each negative point. This, however, is computationally expensive, especially for large datasets. For instance, for a balanced dataset with 1000 points the number of constraints required is 250,000. This is the worst case for a 1000-point dataset, since the class ratio is 1:1; for a skewed dataset with a class ratio of 1:100 the required number of constraints would be roughly 10,000.

Initializing the weights randomly increases the run time of the algorithm by delaying the convergence of the weights. Instead, we initialize the weights with the solution of a linear SVM. This speeds up the algorithm dramatically since we start with good initial weights; having good weights at each iteration yields fewer violations. Figure 1 outlines the algorithm. Our approach is to adjust the weight vector incrementally. To do this, we create constraints only for the misranked positives. This is a variation of the well-known column generation technique [7]. At each iteration, each newly misranked pair of points creates a new constraint that is added to the LP. Eventually, all non-slack constraints from the complete problem will be contained in our constraint set, resulting in an optimal solution. In practice, we stop the iterations early if only a few new constraints were added; in the experiments that follow, the parameter k in the algorithm is set to 10. In addition to these violated constraints we also add constraints to prevent the weights from flipping sign (− to + or + to −). Flipping of the sign of the weight vector occurs when we create constraints that correspond only to the violations: with a constraint set consisting of only those pairs of points that were misranked in the previous iteration, an optimal solution can be achieved simply by reversing the signs of the coefficients. However, this does not guarantee minimization for the whole problem, since it ranks most of the original points incorrectly. Adding violated constraints together with some balancing, non-violated constraints to avoid this reversal phenomenon dramatically reduces the total number of constraints compared to an LP that lists all possible p-by-n constraints. It takes only a few iterations for the objective function to converge, and the overall time to run the algorithm is usually much less than that of an LP which solves for all possible constraints.

Initialize the weight vector
While the number of added violations is greater than k do
    For each positive point ranked below a negative point
        Create a constraint for each misranked positive point
        Create balancing, non-violated constraints
    End for
    Solve the LP and update the weight vector
End while

Figure 1: Algorithm Outline
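A rough Python sketch of the loop in Figure 1, for illustration only (the paper's implementation used Matlab with GLPK; solve_restricted_lp and the sampling of balancing rows are our assumptions):

import numpy as np

def column_generation_rank(X_pos, X_neg, w0, solve_restricted_lp, k=10, max_iter=20):
    # w0: initial weights (the paper starts from a linear SVM solution).
    # solve_restricted_lp(C_rows): assumed helper that solves LP (2) restricted
    # to the given constraint rows and returns an updated weight vector.
    w = np.asarray(w0, dtype=float)
    C_rows = []                                    # growing constraint set
    for _ in range(max_iter):
        pos_scores = X_pos @ w
        neg_scores = X_neg @ w
        # misranked pairs: a positive point scored no higher than a negative point
        viol_i, viol_j = np.where(pos_scores[:, None] <= neg_scores[None, :])
        if len(viol_i) < k:                        # stop when few violations remain
            break
        new_rows = X_pos[viol_i] - X_neg[viol_j]   # violated constraints, C_i = A_i - B_i
        # add balancing, non-violated constraints to discourage sign flips of w
        sat_i, sat_j = np.where(pos_scores[:, None] > neg_scores[None, :])
        if len(sat_i) > 0:
            keep = np.random.choice(len(sat_i), size=min(len(sat_i), len(viol_i)), replace=False)
            new_rows = np.vstack([new_rows, X_pos[sat_i[keep]] - X_neg[sat_j[keep]]])
        C_rows.append(new_rows)
        w = solve_restricted_lp(np.vstack(C_rows))
    return w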

3. EXPERIMENTS AND RESULTS

In this study we used Matlab as the programming environment, with the GLPK LP solver [14]. We also used WEKA [21] for the comparison tests. We used 17 datasets from the UCI repository [1]. Some of the datasets were multi-class problems; those were converted to binary classification problems with a one-class-versus-the-rest split. An overview of the datasets and the splitting criteria for the multi-class datasets are given in Table 1. For our experiments we used 10-fold cross validation. For training we limited the number of data points to 500 to keep the number of constraints manageable for the GLPK solver. For the comparisons we ran WEKA with exactly the same training and test sets in each fold, which makes the significance tests reliable. We present our results in Table 2, which provides averages and standard deviations over five 10-fold cross validations.
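The dataset preparation described above amounts to a one-class-versus-rest relabeling; a trivial sketch (ours, not the authors' code):

import numpy as np

def one_vs_rest_labels(y, positive_class):
    # The chosen class becomes the positive class (1); all other classes become 0.
    return (np.asarray(y) == positive_class).astype(int)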

Table 1: Overview of the datasets and modification details

Datasets        # of points   # of attributes   % rare class   Comments
abalone         4177          9                 31             "female" = 1, rest = 0
blocks          5473          11                10             "text" = 1, rest = 0
boston          506           14                9              (MEDV
cancer(wbc)     699           10                34
cancer(wpbc)    194           34                24
diabetes        768           9                 35
ecoli           336           8                 15
glass           214           10                14
heart           270           14                44
ionosphere      351           35                36
liver(bupa)     345           7                 42
pendigits       14490         17                10
sonar           208           61                47
spambase        4601          58                39
spectf          351           45                28
survival        306           4                 26
vehicle         846           19                25

Table 3: Summary of best-worst comparison results for AUC analysis

          RankerSVM   SVM   Logit   Perc
Best      10          5     1       1
Worst     1           5     2       9

Table 4: Summary of pairwise significance (t-test) results for AUC

                          W-L (significant)
Ranking vs. SVM           7-4
Ranking vs. Logit         4-2
Ranking vs. Perceptron    8-1

Table 3 summarizes the number of data sets on which the various algorithms performed best and worst, and the significance test results (at 95%) are provided in Table 4. The results in Table 3 show that the ranker SVM returns the best AUC among all the tested algorithms on 10 of the 17 data sets, while it is the worst in only one case. The significance test analysis in Table 4 shows that in general the ranker SVM ranks better than the other classifiers. The linear SVM tended to perform either very well or very poorly, while the perceptron classifier did a generally poor job of ranking the cases.

To test the impact of ranking optimization on classification accuracy, we also collected accuracy results from the above experiment. It is fairly simple to obtain classifications from the ranker: we simply find the threshold value that yields the best accuracy on the training data, and then use this threshold with the separating surface on the test set to obtain accuracy. The accuracy of the other linear classifiers was obtained from WEKA using the same data sets in each fold. The results are provided in Table 5. Table 6 summarizes overall performance, and the 95% pairwise significance tests for the accuracy comparisons are presented in Table 7.
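The thresholding step can be sketched as follows (illustrative only; train_scores would be X_train @ w from the learned ranker):

import numpy as np

def best_accuracy_threshold(train_scores, train_labels):
    # Pick the cutoff on the training scores that maximizes accuracy;
    # labels are 1 (positive) and 0 (negative).
    scores = np.asarray(train_scores, dtype=float)
    labels = np.asarray(train_labels)
    best_t, best_acc = scores[0], -1.0
    for t in np.unique(scores):
        acc = np.mean((scores >= t).astype(int) == labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# At test time: predictions = (X_test @ w >= best_t).astype(int)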

