Recent Researches in Computer Science

Support Vector Machine Classification of Uncertain and Imbalanced Data using Robust Optimization

RAGHAV PANT, THEODORE B. TRAFALIS, KASH BARKER
School of Industrial Engineering, University of Oklahoma
202 W. Boyd Street, Room 124, Norman, Oklahoma 73019
UNITED STATES
[email protected], [email protected], [email protected]

Abstract: In this paper we develop a robust Support Vector Machine (SVM) scheme for classifying imbalanced and noisy data using the principles of Robust Optimization. Uncertainty is prevalent in almost all datasets and is not addressed efficiently by most data mining techniques, as these are based on deterministic mathematical tools. Imbalanced datasets arise in the analysis of rare events, where the elements of the minority class become critical. Our method addresses both of these issues, which traditional SVM classification handles poorly. At present, we provide solutions for the linear classification of data with bounded uncertainties; the approach can be extended to non-linear classification schemes for any type of convex uncertainty. Our results in predicting the minority class are better than those of traditional SVM soft-margin classification. Preliminary computational results are presented.

Key-Words: Support Vector Machines, Robust Classification, Imbalance, Uncertainty, Noise

1 Introduction

Data classification is an important problem in the field of data mining. Classification refers to dividing data into classes, where each class signifies certain properties common to a set of data. The simplest classification is the perfect linear separation of data into two classes. In practical problems, data is not perfectly separable, and so there are errors in classification. Further complications arise from the presence of uncertain and imbalanced datasets. Uncertainty is prevalent in almost all datasets and is not addressed efficiently by most data mining techniques, as these are based on deterministic mathematical tools [8]. Imbalanced datasets arise in the analysis of rare events, for which the elements of the minority class become critical. Most data mining techniques perform poorly in predicting the minority class of imbalanced data [3].

The solution of classification problems using Support Vector Machines (SVMs) [9,10] is widely prevalent in data mining. Soft-margin SVM classification methods can provide efficient solutions for non-separable data, but in the presence of uncertainty the traditional soft-margin SVM classification might not provide the optimal classification. In general, the term "noisy or uncertain data" in SVM classification is meant to represent those examples that do not lie on the intended side of the separation margin. In this paper we extend this term to include the uncertainty that manifests itself in every data point. Hence, our interpretation of noisy data includes an error in the measurement of each data point that has to be considered during classification, in addition to some data points that will not be classified correctly. As stated earlier, traditional SVM methods adjust for the error relative to the maximum margin of classification but do not consider individual data uncertainties. Bhattacharyya et al. [2] addressed such issues by developing Second Order Cone Programming SVM formulations for Gaussian uncertainty in data, which bear a resemblance to the Total SVM methods of Bi and Zhang [4], which provide SVM formulations for bounded uncertainties. While these methods were developed separately, they fall under the scheme of Robust SVM approaches detailed in the works of Trafalis et al. [5,6,7,8], which use concepts developed in the Robust Optimization (RO) literature [1]. RO schemes have been applied to SVMs to explore both data uncertainties and classification errors, and the robust SVMs are found to perform better than the traditional SVM methods.

The RO techniques for SVM can be extended to the study of imbalanced datasets. Examples of such datasets are tornado datasets, where out of a very large database only a few events are catastrophic and hence important for decision-making. RO methods can be used to control the perturbations in data points, which in turn controls the misclassification errors that affect the ability to successfully predict the minority class. Such methods are explored in this study. We look at linear-separation SVM classification for uncertain and imbalanced datasets. Using RO, we propose an optimization problem that provides a better solution to imbalanced, noisy data classification than the classical soft-margin SVM classification. The main contribution of this work is the formulation of a classification problem that handles imbalanced and noisy data. We present the formulation and results for datasets with convex bounded uncertainty.

This paper is organized as follows. Section 2 explains the problem statement for the classification of uncertain data and presents the soft-margin SVM classification problem. Section 3 discusses the development of the robust SVM classification problem for noisy data, wherein the robust counterpart of the classical SVM problem is developed and the final optimization problem for noisy data is presented. In Section 4, we extend the robust SVM formulation to include imbalanced data analysis and present the final optimization problem, which handles data with Euclidean-norm-bounded uncertainty. In Section 5 we perform preliminary numerical analyses on a few toy datasets and compare the performance of our method with the classical SVM-based svmtrain function of MATLAB. Section 6 discusses the conclusions and future development of this work.

2 Problem Formulation

For classification, we are given an input matrix $\mathbf{X} \in \mathbb{R}^{m \times n}$ of training samples. Each data point $x_i \in \mathbf{X}$ is an $n$-element row vector, and there are $m$ such data points. Further, each data point belongs to one of two classes, indicated by $y_i \in \{+1, -1\}$. The pair $\{\mathbf{X}, y\}$ is referred to as the training dataset. Uncertainty is incorporated into the analysis by assuming that the input matrix is given in terms of a nominal value and a perturbation, that is,

$$\mathbf{X} = \mathbf{X}^N + \mathbf{\Delta}, \quad \mathbf{\Delta} = (\delta_1, \delta_2, \ldots, \delta_m), \qquad (1)$$

where $\mathbf{X}^N$ is the nominal value of the data, free from uncertainties, and $\mathbf{\Delta}$ is the uncertainty set defining the perturbations in the data, in which $\delta_i$ is the $n$-element row vector of uncertainty associated with data point $x_i$. Suitable definitions of $\mathbf{\Delta}$ have to be provided to obtain feasible solutions to the classification problems. As a rule, it is assumed that $\mathbf{\Delta}$ belongs to a convex set that yields computationally tractable solutions.

The problem we aim to solve is the two-class SVM classification problem in which the number of data points in one class ($y_i = +1$) is very small compared to the other class ($y_i = -1$). Further, the data has uncertainties, due to which there will be errors in classification. In the subsections below we state our SVM classification problem and suggest the formulation for handling uncertainties. We then develop the classification scheme further to address both data imbalance and noise.

2.1 Soft-margin SVM Classification for Non-separable Data

For perfect linear separation of data, the SVM classification rule is defined in terms of a separating hyperplane and is given as

$$y_i(\langle w, x_i \rangle + b) \geq 1, \quad i = 1, 2, \ldots, m, \qquad (2)$$

where $w \in \mathbb{R}^n$ is a weight vector perpendicular to the hyperplane and $b$ is a scalar determining the offset of the hyperplane from the origin. The term $\langle w, x_i \rangle$ denotes the vector dot product between the elements of $w$ and $x_i$. The aim of SVM classification is to find the maximum margin of separation between data points, which is stated through the optimization problem

$$\min_{w,b} \ \frac{1}{2} \|w\|_2^2 \quad \text{s.t.} \quad y_i(\langle w, x_i \rangle + b) \geq 1, \ i = 1, 2, \ldots, m. \qquad (3)$$

The classification of data is generally not exact because in actual situations the data is too complex to be perfectly linearly separable. Hence, a term for the error in classification has to be incorporated into the analysis. Due to errors in classification, the traditional SVM hard-margin classification constraints are modified as

$$y_i(\langle w, x_i \rangle + b) \geq 1 - \xi_i, \quad \xi_i \geq 0, \ i = 1, 2, \ldots, m, \qquad (4)$$

where $\xi_i$ is the scalar error in classification of the $i$th data point $x_i$. The aim of an efficient classifier is to minimize the errors in classification. This is accomplished by minimizing the sum of all the errors in classification, which is referred to as the realized hinge loss function [1]. Hence, the optimization problem to be solved for controlling errors becomes


$$\min_{w,b} \ \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad \xi_i \geq 1 - y_i(\langle w, x_i \rangle + b), \ \xi_i \geq 0, \ i = 1, 2, \ldots, m. \qquad (5)$$

The classical SVM soft-margin classification problem combines the objectives of the optimization problems given by equations (3) to (5). Hence, the classical SVM problem formulation becomes

$$\min_{w,b} \ \lambda \|w\|_2^2 + \sum_{i=1}^{m} \xi_i \quad \text{s.t.} \quad \xi_i \geq 1 - y_i(\langle w, x_i \rangle + b), \ \xi_i \geq 0, \ i = 1, 2, \ldots, m, \qquad (6)$$

where $\lambda$ is the regularization parameter.

3 Robust Formulation of the Soft-margin SVM Classification for Noisy Data

From the SVM formulation (6) it can be seen that the maximal margin of separation is influenced by the realizations of the data points. In particular, a few data points, called the support vectors, determine the separation margin. If the data are uncertain, the region of influence of the support vectors varies and we obtain multiple solutions for the maximal margin. Hence, it is intuitive to look at the worst-case realizations of the data points, as these give the extreme separation margin. RO methods are therefore useful tools for finding the best separation margin under uncertainty. Solving the optimization problem (6) can become computationally intensive, especially if each data point has a unique uncertainty. Moreover, it is not possible to obtain computationally tractable solutions unless certain rules for the uncertainty set are specified. The inequalities of (6) allow us to rewrite the hinge loss function in terms of the sampling data, labels, weight vector and offset as

$$\sum_{i=1}^{m} \xi_i = \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i \rangle + b)\right]_+, \qquad (7)$$

where $\left[1 - y_i(\langle w, x_i \rangle + b)\right]_+ = \max\left(0, 1 - y_i(\langle w, x_i \rangle + b)\right)$. This formulation is similar to an indicator function and has a convex upper bound. Hence, we can restate the soft-margin SVM classification problem (6) as the unconstrained optimization problem

$$\min_{w,b} \ \lambda \|w\|_2^2 + \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i \rangle + b)\right]_+. \qquad (8)$$

This optimization problem motivates the formulations that lead to a robust analysis of data with noise. From (8) we can see that the new soft-margin SVM classification formulation is easier to solve for noisy data, as we get a computationally tractable formulation. For a robust analysis, we will minimize the worst-case hinge loss function due to uncertain data. The robust counterpart of (8) becomes

$$\min_{w,b} \ \lambda \|w\|_2^2 + \max_{x_i \in \mathbf{X}} \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i \rangle + b)\right]_+. \qquad (9)$$

Figure 1. Comparison of classical SVM with robust SVM

The significance of robust optimization principles in solving SVM classification problems lies in the fact that they solve for the extreme case of the data uncertainty. The geometrical representation of data points with spherical uncertainty is shown in Fig. 1, where for each data point the centre of the sphere represents its nominal value and the radius of the sphere represents the uncertainty. In the classical SVM formulation the support vectors correspond to the centres of the data points, which does not provide much scope for change. As seen in Fig. 1, using the robust optimization methods the support vectors become tangential to the spherical boundary of the perturbed data. Thus the solutions become sensitive to the radius of each sphere, and more points can become support vectors. For imbalanced data sets this is important, as it allows us to have more support vectors from the minority class, including some points that would otherwise be treated as outliers.
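To make the baseline concrete, here is a minimal sketch (ours, not from the paper) of the unconstrained hinge-loss formulation (8) in Python, using the cvxpy modeling library; the function name soft_margin_svm and the parameter lam (the regularization parameter Ξ») are our own choices.

```python
# Minimal sketch of the classical soft-margin SVM in its unconstrained
# hinge-loss form (8). X is an m x n array of data points, y an m-vector
# of +/-1 labels, lam the regularization parameter lambda.
import cvxpy as cp

def soft_margin_svm(X, y, lam=0.1):
    m, n = X.shape
    w = cp.Variable(n)                    # weight vector w
    b = cp.Variable()                     # hyperplane offset b
    margins = cp.multiply(y, X @ w + b)   # y_i(<w, x_i> + b)
    hinge = cp.sum(cp.pos(1 - margins))   # realized hinge loss, eq. (7)
    cp.Problem(cp.Minimize(lam * cp.sum_squares(w) + hinge)).solve()
    return w.value, b.value
```

A new point $x$ is then classified by the sign of $\langle w, x \rangle + b$.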

3.1 Incorporating Uncertainties in the Soft-margin SVM Classification

We revisit the formulation of the realized hinge loss function in (7) for incorporating uncertainties into the SVM formulation. Dividing the $x_i$'s into their nominal values, $x_i^N$, and uncertainties, $\delta_i$, the new formulation for (7) becomes

$$\sum_{i=1}^{m} \xi_i = \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i^N \rangle + b) - y_i \langle w, \delta_i \rangle\right]_+. \qquad (10)$$

Hence, in the robust counterpart of the realized hinge loss function we are concerned with finding the worst-case realization of the uncertainty, which does not involve the nominal data. In our robust formulation (9) the worst-case hinge loss function therefore depends on the worst-case realizations of the data perturbations, which means the second term of (9) can be expressed as

$$\max_{\delta_i \in \mathbf{\Delta}} \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i^N \rangle + b) - y_i \langle w, \delta_i \rangle\right]_+ = \sum_{i=1}^{m} \max_{\delta_i \in \mathbf{\Delta}} \left[1 - y_i(\langle w, x_i^N \rangle + b) - y_i \langle w, \delta_i \rangle\right]_+. \qquad (11)$$

In (11) above, we can take the maximum inside the summation because of the convexity of the hinge loss function. One way of specifying the worst-case realization of the uncertainty is through the Cauchy-Schwarz inequality, which provides norm bounds on the data perturbations. For most data, assuming norm upper bounds on the perturbations is a justifiable assumption, and it leads to convex formulations. The Cauchy-Schwarz bounds for $y_i \langle w, \delta_i \rangle$ in (11) are given as

$$\left|y_i \langle w, \delta_i \rangle\right| \leq \|\delta_i\|_p \|w\|_q, \quad \frac{1}{p} + \frac{1}{q} = 1. \qquad (12)$$

Hence, from (12) we obtain the condition

$$-\|\delta_i\|_p \|w\|_q \leq y_i \langle w, \delta_i \rangle \leq \|\delta_i\|_p \|w\|_q, \qquad (13)$$

which leads to the following worst-case robust formulation for (11):

$$\max_{\delta_i \in \mathbf{\Delta}} \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i^N \rangle + b) - y_i \langle w, \delta_i \rangle\right]_+ = \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \|\delta_i\|_p \|w\|_q\right]_+. \qquad (14)$$

Combining (9) and (14) gives us the robust SVM for solving the classification problem when we have data uncertainty or noise. We restate our final robust SVM problem, which we will develop further to handle imbalanced data:

$$\min_{w,b} \ \lambda \|w\|_2^2 + \sum_{i=1}^{m} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \|\delta_i\|_p \|w\|_q\right]_+. \qquad (15)$$
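As an illustration, the following sketch (our code, under the assumption of a common spherical bound $\|\delta_i\|_2 \leq \rho$, i.e. $p = q = 2$) solves formulation (15), where the worst-case perturbation simply adds $\rho \|w\|_2$ inside each hinge term:

```python
# Sketch of the robust soft-margin SVM (15) under spherical uncertainty
# ||delta_i||_2 <= rho, so that ||delta_i||_p * ||w||_q = rho * ||w||_2.
# Xn holds the nominal data points x_i^N.
import cvxpy as cp

def robust_svm(Xn, y, lam=0.1, rho=0.01):
    m, n = Xn.shape
    w, b = cp.Variable(n), cp.Variable()
    margins = cp.multiply(y, Xn @ w + b)               # y_i(<w, x_i^N> + b)
    worst_hinge = cp.sum(cp.pos(1 - margins + rho * cp.norm(w, 2)))
    cp.Problem(cp.Minimize(lam * cp.sum_squares(w) + worst_hinge)).solve()
    return w.value, b.value
```

Setting rho = 0 recovers the classical formulation (8), which makes the effect of the robustness term easy to isolate.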

4 Handling Imbalanced Data using Robust SVM Classification

For handling data imbalance, the training data sample can be partitioned into examples with positive and negative labels, respectively. We are interested in solving the following robust SVM optimization problem:

$$\min_{w,b} \ \lambda \|w\|_2^2 + \sum_{y_i = +1} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \|\delta_i\|_p \|w\|_q\right]_+ + \sum_{y_i = -1} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \|\delta_i\|_p \|w\|_q\right]_+. \qquad (16)$$

Allowing this separation of the data into positive and negative samples helps us control the perturbation on the samples, which can be critical for including the important minority-class samples in the classification. The unconstrained optimization problem can be converted into a constrained optimization problem by assuming that the hinge loss functions are bounded above by some maximum values. Mathematically this is expressed as

$$\begin{aligned} \min_{w,b} \ & \lambda \|w\|_2^2 + \tau_1 + \tau_{-1} \\ \text{s.t.} \ & \sum_{y_i = +1} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \|\delta_i\|_p \|w\|_q\right]_+ \leq \tau_1, \\ & \sum_{y_i = -1} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \|\delta_i\|_p \|w\|_q\right]_+ \leq \tau_{-1}, \end{aligned} \qquad (17)$$

where $\tau_1$ and $\tau_{-1}$ are the values that bound the sums of errors in the positive and negative samples, respectively. A common form of uncertainty bound used for data perturbations is the 2-norm or Euclidean uncertainty bound. For most kinds of data it is assumed that either the entire uncertainty in the dataset has a Euclidean bound ($\|\mathbf{\Delta}\|_2 \leq r$, $r \geq 0$), or each data point is contained in an uncertainty sphere of fixed radius $\rho$ ($\|\delta_i\|_2 \leq \rho$). Depending upon the data imbalance, we can impose different bounds on the positive and negative samples in order to improve the classification. The robust SVM formulation, when such uncertainty exists in each data point, becomes a conic programming problem, stated as

$$\begin{aligned} \min_{w,b} \ & \lambda \|w\|_2^2 + \tau_1 + \tau_{-1} \\ \text{s.t.} \ & \sum_{y_i = +1} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \rho_1 \|w\|_2\right]_+ \leq \tau_1, \\ & \sum_{y_i = -1} \left[1 - y_i(\langle w, x_i^N \rangle + b) + \rho_{-1} \|w\|_2\right]_+ \leq \tau_{-1}, \end{aligned} \qquad (18)$$

where $\rho_1$ and $\rho_{-1}$ are the uncertainty radii imposed on the positive and negative samples, respectively.
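A sketch of the class-wise conic formulation (18) follows; it is our own illustration, with hypothetical names rho_pos/rho_neg for the per-class radii $\rho_1$, $\rho_{-1}$ and tau_p/tau_m for the bounds $\tau_1$, $\tau_{-1}$.

```python
# Sketch of the imbalance-aware robust SVM, following formulation (18):
# separate uncertainty radii and error bounds for the positive (minority)
# and negative (majority) samples.
import cvxpy as cp
import numpy as np

def imbalanced_robust_svm(Xn, y, lam=0.1, rho_pos=0.01, rho_neg=0.01):
    m, n = Xn.shape
    pos, neg = np.where(y == +1)[0], np.where(y == -1)[0]
    w, b = cp.Variable(n), cp.Variable()
    tau_p, tau_m = cp.Variable(), cp.Variable()        # tau_1, tau_-1
    margins = cp.multiply(y, Xn @ w + b)
    cons = [
        cp.sum(cp.pos(1 - margins[pos] + rho_pos * cp.norm(w, 2))) <= tau_p,
        cp.sum(cp.pos(1 - margins[neg] + rho_neg * cp.norm(w, 2))) <= tau_m,
    ]
    cp.Problem(cp.Minimize(lam * cp.sum_squares(w) + tau_p + tau_m), cons).solve()
    return w.value, b.value
```

Enlarging rho_pos relative to rho_neg inflates the uncertainty spheres of the minority class, which is what pulls otherwise-outlying minority points into the set of support vectors, as discussed for Fig. 1.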

5 Data Analysis

For checking the effectiveness of the robust analysis scheme developed in this study, we generate imbalanced data and compare the performance of our formulation with the soft-margin SVM classification of MATLAB. We generate a 400x2 matrix of random numbers, where each entry is $x_{ij} = 0.5 + \mathrm{rand}(0,1)$. Each data point in the analysis is the 2x1 vector $x_i = [x_{i1}; x_{i2}]$. The Euclidean norm of each data point is calculated; if it is greater than one, the label $y_i = -1$ is assigned, otherwise the point belongs to the class $y_i = +1$. This creates an imbalanced data set in which the positive labels form the minority class. Next we add an uncertainty $\delta_i$ to each data point such that the Euclidean norm of that uncertainty is less than $\rho = 0.01$. For the resulting data, we perform the analysis using the svmtrain function of MATLAB and the code for our robust SVM formulation, also written in MATLAB. By controlling the separation criterion for the norm, $y_i = -1: \|x_i\|_2 > 1 + \alpha \cdot \mathrm{rand}(0,1)$, we can generate data sets with different degrees of imbalance. These are shown in Figs. 2 to 4.
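The data-generation step can be sketched as follows (our Python rendering of the MATLAB procedure described above; the noise construction, drawing a random direction and scaling it to a length below $\rho$, is one way to satisfy $\|\delta_i\|_2 < \rho$ and is an assumption):

```python
# Sketch of the synthetic imbalanced dataset of Section 5: 400 points
# with x_ij = 0.5 + U(0,1), labeled -1 when ||x_i||_2 exceeds the
# (alpha-randomized) threshold, +1 otherwise, then perturbed by noise
# with ||delta_i||_2 < rho. alpha controls the degree of imbalance.
import numpy as np

def make_imbalanced_data(m=400, alpha=0.0, rho=0.01, seed=0):
    rng = np.random.default_rng(seed)
    Xn = 0.5 + rng.random((m, 2))                        # nominal points
    thresh = 1.0 + alpha * rng.random(m)                 # labeling threshold
    y = np.where(np.linalg.norm(Xn, axis=1) > thresh, -1, +1)
    d = rng.normal(size=(m, 2))                          # random directions
    delta = (d / np.linalg.norm(d, axis=1, keepdims=True)
             * rho * rng.random((m, 1)))                 # lengths below rho
    return Xn + delta, y

X, y = make_imbalanced_data(alpha=0.0)   # alpha = 0: roughly 10% minority
```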

Figure 2. Very high imbalance in data, with 10% of the data in the minority class

Figure 3. High imbalance in data, with 25% of the data in the minority class

Figure 4. Moderate imbalance in data, with 40% of the data in the minority class

It can be observed from the generated data that as we decrease the imbalance, more data points are misclassified on either side of the separation margin, which decreases our prediction accuracy. Fixing $\lambda = 0.1$, we use 50% of the data for training, 20% for tuning and the remaining 30% for testing the classification results of the two schemes. In Tables 1, 2 and 3 we show confusion matrices of the classifications for the different degrees of imbalance. It can be seen that in each case the minority class is predicted with greater accuracy using the robust SVM scheme. Hence, the results show that the methods developed can improve the performance in predicting the minority class.

Table 1. Comparison of % accuracy between MATLAB SVM and robust SVM in predicting class for data with very high imbalance

                                  True Label
  Method        Predicted Label      -1        +1
  MATLAB SVM          -1            100%      18.2%
                      +1              0%      81.2%
  Robust SVM          -1           90.4%       9.1%
                      +1            9.6%      90.9%

Table 2. Comparison of % accuracy between MATLAB SVM and robust SVM in predicting class for data with high imbalance

                                  True Label
  Method        Predicted Label      -1        +1
  MATLAB SVM          -1           97.6%      25.8%
                      +1            3.4%      74.2%
  Robust SVM          -1           79.2%      22.6%
                      +1           20.8%      77.4%

Table 3. Comparison of % accuracy between MATLAB SVM and robust SVM in predicting class for data with moderate imbalance

                                  True Label
  Method        Predicted Label      -1        +1
  MATLAB SVM          -1           91.6%      35.7%
                      +1            8.4%      64.3%
  Robust SVM          -1           75.5%      30.1%
                      +1           24.5%      69.6%

6 Conclusion

In this paper we have developed an RO-based scheme for robust SVM classification of imbalanced and noisy data. The methods have been developed for the linear classification of data into two classes. They work on the assumption that the uncertainties are convex and bounded. This is a reasonable assumption, as it applies to most practical data. Our method is shown to perform better than the classical SVM soft-margin classification in predicting the minority class. The method has to be developed further to improve its overall accuracy. It can also be extended to include any type of uncertainty measure that is convex. A non-linear robust classification procedure can also be developed using the principles presented in this work. Further development of these methods would increase their applicability to the classification and prediction of various types of datasets.

References:
[1] A. Ben-Tal, L. El Ghaoui and A. Nemirovski, Robust Optimization, Princeton University Press, Princeton, NJ, 2009.
[2] C. Bhattacharyya, K.S. Pannagadatta and A.J. Smola, A second order cone programming formulation for classifying missing data, in Advances in Neural Information Processing Systems, Vol. 17, 2004.
[3] G.M. Weiss and F. Provost, Learning when training data are costly: the effect of class distribution on tree induction, Journal of Artificial Intelligence Research, Vol. 19, No. 1, 2003, pp. 315-354.
[4] J. Bi and T. Zhang, Support vector classification with input data uncertainty, in Advances in Neural Information Processing Systems, Vol. 17, 2004.
[5] T.B. Trafalis and R.C. Gilbert, Maximum margin classifiers with noisy data: a robust optimization approach, in Proceedings of the International Joint Conference on Neural Networks (IJCNN) 2005, Piscataway, NJ, USA, 2005, pp. 2826-2830.
[6] T.B. Trafalis and R.C. Gilbert, Robust classification and regression using support vector machines, European Journal of Operational Research, Vol. 173, No. 3, 2006, pp. 893-909.
[7] T.B. Trafalis and R.C. Gilbert, Robust support vector machines for classification and computational issues, Optimization Methods and Software, Vol. 22, No. 1, 2007, pp. 187-198.
[8] T.B. Trafalis and S.A. Alwazzi, Support vector regression with noisy data: a second order cone programming approach, International Journal of General Systems, Vol. 36, No. 2, 2007, pp. 237-250.
[9] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
[10] V.N. Vapnik, Statistical Learning Theory, Wiley, 1998.
