Ann Oper Res (2009) 166: 339–353 DOI 10.1007/s10479-008-0412-4
Quadratic mixed integer programming and support vectors for deleting outliers in robust regression

G. Zioutas · L. Pitsoulis · A. Avramidis
Published online: 12 August 2008 © Springer Science+Business Media, LLC 2008
Abstract  We consider the problem of deleting bad influential observations (outliers) in linear regression models. The problem is formulated as a Quadratic Mixed Integer Programming (QMIP) problem, where penalty costs for discarding outliers are incorporated into the objective function. The optimum solution defines a robust regression estimator called penalized trimmed squares (PTS). Due to the high computational complexity of the resulting QMIP problem, the proposed robust procedure is computationally suitable for small sample data. The computational performance and the effectiveness of the new procedure are improved significantly by using the idea of the ε-insensitive loss function from support vector machine regression. Small errors are ignored, and the mathematical formulation gains the sparseness property. The good performance of the ε-insensitive PTS (IPTS) estimator allows identification of multiple outliers, avoiding masking and swamping effects. The computational effectiveness and successful outlier detection of the proposed method are demonstrated via simulated experiments.

Keywords  Robust regression · Mixed integer programming · Penalty method · Least trimmed squares · Identifying outliers · Support vector machine

This research has been partially funded by the Greek Ministry of Education under the program Pythagoras II.

G. Zioutas · L. Pitsoulis (✉) · A. Avramidis
Department of Mathematical and Physical Sciences, School of Engineering, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece
L. Pitsoulis, e-mail: [email protected]
G. Zioutas, e-mail: [email protected]
A. Avramidis, e-mail: [email protected]

1 Introduction

In linear regression models the data often contain outliers and bad influential observations; therefore it is important to identify these observations and either bound their influence or
eliminate them from the data set. If the data are contaminated with a single outlier or a few outliers, the problem of identifying such observations is not difficult. However, in most cases data sets contain more outliers or groups of masking outliers, and the problem of identifying them becomes more difficult due to masking and swamping effects.

An indirect approach to outlier identification is through a robust regression estimate. A number of robust methods have been developed in which the regression estimators are less sensitive to outliers. If a robust estimate is relatively unaffected by outliers, then the size of the residuals from the robust fit can be used to identify the outliers. The class of generalized M-estimators (so-called GM estimators) introduced by Huber (1981) and Hampel (1978), and other robust estimators, were intended to protect the estimator against outliers. In particular, they bound the influence of outliers by pulling them towards the regression line (or hyperplane) through an iteratively reweighted procedure. Unfortunately these bounded influence estimators have a small breakdown point, that is, they can handle only a few outliers. The low breakdown point of the GM estimators is a deficiency.

Alternative robust regression techniques have been developed which, instead of bounding the influence of outliers as in the case of GM estimators, improve the least squares estimation by identifying and deleting the outliers in the data. Such an estimator which preserves a high breakdown point is the so-called Least Trimmed Squares (LTS) estimator, introduced by Rousseeuw and Leroy (1987). In the LTS estimator it is assumed a priori that only k of the n observations in the data set are not outliers; the parameter k is called the coverage. Therefore, in order to find the n − k outliers, LTS finds the subset of k observations among the n with the minimum sum of squared residuals. The disadvantage of the LTS estimator is that the coverage k is typically unknown.

A different approach called Penalized Trimmed Squares (PTS) has been proposed in Zioutas and Avramidis (2005), which does not require presetting the number n − k of outliers to be deleted from the data, but instead presets penalties associated with each observation, reflecting the cost of deleting it as an outlier. The PTS estimator is defined by minimizing a convex objective function (loss function), which is the ordinary sum of squared residuals plus penalty costs for discarding bad observations. The PTS estimator is very sensitive to the penalties defined a priori. In fact, these penalties in the loss function regulate the robustness and the efficiency of the estimator.

Mathematical programming techniques have been used in regression models for various purposes. In particular, convex optimization theory has been used for solving restricted least squares problems (Arthanari and Dodge 1993). Quadratic programming models have been employed by Mangasarian and Musicant (2000) for Huber-type regression estimation, resulting in faster computation of the estimator. In Zioutas and Avramidis (2005) the PTS estimator defined above is formulated as a Quadratic Mixed Integer Programming (QMIP) problem, and the robust estimate is obtained from the optimum solution of the resulting integer program.
The main contributions of this paper are, first, the introduction of new penalties for the aforementioned PTS estimator, and second, the improvement of the computation time for solving PTS by combining the PTS loss function with the idea of the ε-insensitive loss function from support vector machines (Vapnik 1998). The support vector machine is a mathematical programming based technique which has the advantage of reducing the computational difficulty by ignoring small residuals. Specifically, residuals within an interval (−ε, ε) are not included in the loss function, and the points outside the so-called ε-tube, which are called support vectors, define the regression line. As a result, a certain "sparsity" is achieved in the objective function of the optimization problem, which in turn makes it less difficult to solve. Moreover, the effectiveness of the robust regression method is improved, since noisy training data are ignored.
The organization of the paper is as follows. In Sect. 2 we briefly describe the Huber-type robust regression and its formulation as a convex quadratic program. Quadratic mixed integer programming formulations for high breakdown point regression estimators such as LTS and PTS are also presented. Section 3 presents the proposed robust regression estimator, which utilizes both the QMIP formulation and the ε-insensitive support vector machine in order to compute the PTS estimator faster. The performance of the new ε-insensitive PTS estimator is tested on some benchmark instances and Monte Carlo simulation experiments in Sect. 4. Finally, conclusions and future research are addressed in Sect. 4.1.
2 Review of mathematical programming in robust regression

2.1 Regression problem

Regression analysis is one of the most widely used techniques in engineering, management, the sciences, and many other fields. Consider the linear regression model

$$y = \mathbf{x}^{T}\boldsymbol{\beta} + u, \qquad (1)$$

where:

• y is the response variable
• x = (1, x_1, ..., x_p)^T is a vector of explanatory variables
• β = (β_0, β_1, ..., β_p)^T is a parameter vector
• u is a random error, normally distributed, u ~ N(0, σ*²).

We observe a sample (x_1, y_1), ..., (x_n, y_n) and we wish to construct an estimator for the parameter β. Nearly all regression computer software relies on the method of ordinary least squares (OLS) for estimation of the parameters in the model. The OLS estimator is obtained by solving the following optimization problem

$$\min_{\beta}\ \sum_{i=1}^{n} u_i^{2} \qquad (2)$$
$$\text{s.t.}\quad u_i = y_i - \mathbf{x}_i^{T}\boldsymbol{\beta}, \quad i = 1,\dots,n. \qquad (3)$$
It is well known that OLS can be solved efficiently by constructing the normal equations and solving the resulting full rank linear system of equations. Unfortunately, points that are far from the predicted line (outliers) are overemphasized, thereby making the OLS estimator very sensitive to outliers (Huber 1981; Hampel 1978). We wish to construct a robust estimator for the parameter β, in the sense that the influence of any outlying observation (x_i, y_i) on the sample estimate is bounded, either by pulling the observation towards the regression line or by deleting it from the data set.

2.2 Huber-type estimator as a convex quadratic program

A group of robust methods has been developed in which the estimators are less sensitive to outliers. Huber-type estimators are among the most popular robust techniques, where the influence of outlying points is reduced by applying the following robust procedure:

• modify y_i by pulling its observed value towards the fitted line
Fig. 1 Graphical interpretation of the variables ui , si , ci in the QMIP formulation, where ci = c for illustrative purposes
• apply least squares to the modified data

For a known residual scale parameter σ, a Huber-type estimator (Huber 1981), or robust min-max estimator as it is otherwise called, can be defined by minimizing a sum of less rapidly increasing functions of the residuals, known as the loss function

$$\min_{\beta}\ \sum_{i=1}^{n}\left(\bar{u}_i^{2} + 2 c_i \sigma s_i\right) \qquad (4)$$

where the $\bar{u}_i$ are defined as follows:

$$\bar{u}_i = \begin{cases} u_i, & \text{if } |u_i| \le c_i\sigma \\ |u_i| - s_i, & \text{otherwise} \end{cases}$$
for u_i = y_i − x_i^T β as in (3). The quantity 2c_iσs_i can be considered as the penalty cost of pulling y_i towards the fitted line, where the constant c_i regulates the degree of robustness and depends inversely on the distance d(x_i) from the center of the X-space. The variable s_i indicates the degree of "shortening" of the big residuals u_i, or equivalently the distance of pulling y_i towards its fitted value. An important interpretation of the problem in (4) is that the estimator is defined by modifying the observations y_i with the smallest possible amount of overall pulling distance, and this leads to a min-max robust estimator (min-max theory, see Huber 1981). The Huber-type estimate of β is obtained by minimizing the loss function of (4), which switches at u_i = c_iσ from being quadratic to linear. Special methods have been developed for the associated minimization problem, which use the well known iteratively reweighted least squares (IRLS) procedure. This is an iterative procedure which starts with an initial estimate of β, reduces the big residuals by pulling their y_i values, applies least squares on the modified data to obtain a new estimate of β, and continues until convergence. Recently, Wright (2000) developed a min-max formulation as a mixed complementarity problem, which was subsequently reduced to a convex quadratic program. Mangasarian and Musicant (2000) propose convex quadratic programming formulations for the Huber-type estimator. A quadratic programming formulation is also given by Camarinopoulos and Zioutas (2002) and is presented below. The explicit convex quadratic programming formulation in
the original primal space of the problem is rather simple and easily interpretable. The Huber-type estimator given in (4) can be shown to be equivalent to the following quadratic program:

$$\begin{aligned}
\min_{\beta}\quad & \sum_{i=1}^{n}\left(\bar{u}_i^{2} + 2 c_i \sigma s_i\right) \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i \le \bar{u}_i + s_i, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i \le \bar{u}_i + s_i, && i = 1,\dots,n \\
& \bar{u}_i, s_i \ge 0, && i = 1,\dots,n
\end{aligned} \qquad (5)$$

The first two constraints in (5) describe the size of the positive or negative residuals $\bar{u}_i$ after shortening them by s_i. The convexity of (5) can be verified easily, hence the Karush-Kuhn-Tucker (KKT) conditions yield a unique global optimum solution (see Bazaraa et al. 1993). Mangasarian and Musicant (2000) propose alternative convex quadratic programming formulations for the Huber-type estimator, and provide a comparison study which demonstrates that the quadratic programming approach for computing Huber-type estimators is simpler and faster than the IRLS methods.
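For illustration, formulation (5) can be written almost verbatim in a general-purpose convex modeling tool. The sketch below uses Python with cvxpy, which is an assumption about tooling (the paper's own experiments use a FortMP/QMIP code); the function name and arguments are illustrative.

```python
import numpy as np
import cvxpy as cp

def huber_type_qp(X, y, c, sigma):
    """Minimal sketch of the Huber-type QP (5).

    X     : n x (p+1) design matrix with an intercept column
    y     : response vector of length n
    c     : length-n vector of tuning constants c_i
    sigma : known residual scale
    """
    n, p1 = X.shape
    beta = cp.Variable(p1)              # regression coefficients
    u = cp.Variable(n, nonneg=True)     # shortened residuals  \bar{u}_i
    s = cp.Variable(n, nonneg=True)     # pulling distances    s_i

    r = X @ beta - y
    constraints = [r <= u + s, -r <= u + s]
    objective = cp.Minimize(cp.sum_squares(u) + 2 * sigma * (c @ s))
    cp.Problem(objective, constraints).solve()
    return beta.value
```

Since (5) is a convex QP, any standard QP solver returns the global optimum, in line with the KKT argument above.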
2.3 Least trimmed squares

In the robust regression literature the breakdown point is defined as the smallest data contamination (number of outliers) that can destroy the regression estimator. Unfortunately the previously mentioned Huber-type estimator does not have a high breakdown point, that is, it cannot handle a group of outliers. A well known estimator that preserves a high breakdown point is the Least Trimmed Squares estimator of Rousseeuw and Leroy (1987), which minimizes the sum of the k smallest squared residuals. The LTS estimator fits the best subset of k observations, dropping the remaining n − k observations. It is defined by the following problem:

$$\min_{\beta}\ \sum_{i=1}^{k} u_{(i)}^{2} \qquad (6)$$
$$\text{s.t.}\quad u_{(1)}^{2} \le u_{(2)}^{2} \le \cdots \le u_{(n)}^{2} \qquad (7)$$
where k is called the coverage and is chosen a priori to a value greater than or equal to (n + p − 1)/2, in order to maximize the so-called breakdown point. The estimator has a high breakdown point but loses efficiency, since n − k observations have to be removed from the sample even if they are not outliers. The main disadvantages of the LTS estimator are the fact that in real applications the coverage k is unknown, and also that its exact computation is difficult due to the combinatorial nature of the problem. Given a coverage k, we have to find the best subset among all $\binom{n}{k}$ possible subsets. An exact algorithm for LTS would have to at least partially enumerate all possible k-subsets of the n observations, and is therefore suitable only for small data sets. Such an algorithm is presented in Agulló (2001) for data sets with size n < 30. Fast local search based algorithms have been developed for larger samples, which provide approximate solutions close to the optimum (see Rousseeuw and Van Driessen 2006; Hawkins 1994). Giloni and Padberg (2002) studied the LTS estimator from a mathematical programming perspective, in order to obtain an exact solution for this estimator.
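To make the combinatorial cost concrete, the following sketch computes LTS exactly by full enumeration of the k-subsets; it is only practical for very small n and is not the algorithm of Agulló (2001) or the FAST-LTS heuristics cited above (the function name is illustrative).

```python
from itertools import combinations
import numpy as np

def exact_lts(X, y, k):
    """Brute-force LTS: fit OLS on every k-subset and keep the fit with
    the smallest sum of squared residuals over its own subset."""
    n = X.shape[0]
    best_sse, best_beta = np.inf, None
    for subset in combinations(range(n), k):
        idx = list(subset)
        beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        sse = float(np.sum((y[idx] - X[idx] @ beta) ** 2))
        if sse < best_sse:
            best_sse, best_beta = sse, beta
    return best_beta, best_sse
```

The loop visits $\binom{n}{k}$ subsets, which is exactly the exponential growth that motivates the QMIP formulation of the next subsection.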
2.3.1 Quadratic mixed integer programming for LTS

The LTS estimator has been formulated as a quadratic mixed integer program by Zioutas and Avramidis (2005). In what follows we present a modified QMIP formulation for LTS:

$$\begin{aligned}
\min_{\beta,\delta}\quad & \sum_{i=1}^{n} \bar{u}_i^{2} \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i \le \bar{u}_i + \delta_i M, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i \le \bar{u}_i + \delta_i M, && i = 1,\dots,n \\
& \boldsymbol{\delta}^{T}\boldsymbol{\delta} \le n - k \\
& \bar{u}_i \ge 0, && i = 1,\dots,n \\
& \delta_i \in \{0,1\}, && i = 1,\dots,n
\end{aligned} \qquad (8)$$

The formulation in (8) has to make a binary decision, namely which points should be deleted. This is modeled by the {0, 1} decision variables δ_i, where δ_i = 1 if and only if observation i is to be deleted from the data set, and δ_i = 0 otherwise. Due to the third constraint the total number of deleted points is at most n − k. Moreover, M is an upper limit on the residual u_i, which in practice should be a very large number, e.g. M = 10^6. In formulation (8) there is no need to use the pulling distance s_i as a decision variable, as we did in formulation (5). The term δ_iM in the first two constraints reduces the residual ū_i to zero when δ_i = 1. The convexity of problem (8) for fixed δ can be established by noticing that the Hessian of the objective function has nonnegative eigenvalues and is therefore positive semidefinite, which in turn implies a unique global optimum solution due to the KKT optimality conditions. Combining this with the statistical hypothesis that the observations are in general position (Rousseeuw and Leroy 1987), which assumes that no two distinct subsets of p observations give the same regression hyperplane, we can conclude that (8) has a unique solution.

2.4 Penalized trimmed squares

A problem with the LTS estimator is that the size n − k of the outlier subset is rarely known. A new approach called Penalized Trimmed Squares has been proposed by Zioutas and Avramidis (2005) that does not require presetting the number n − k of outliers to delete from the data set. The basic idea is to insert fixed penalty costs into the loss function for each possible deletion. Thus, only observations that produce a reduction larger than their penalty costs are deleted from the data set. The proposed PTS estimator minimizes the sum of the k squared residuals in the clean data and the sum of the penalties for deleting the remaining n − k observations, and is defined as follows:
$$\begin{aligned}
\min_{\beta,k}\quad & \sum_{i=1}^{k} u_{(i)}^{2} + \sum_{i=k+1}^{n} (c\sigma)^{2} \\
\text{s.t.}\quad & u_{(1)}^{2} \le u_{(2)}^{2} \le \cdots \le u_{(n)}^{2}
\end{aligned} \qquad (9)$$
The term (cσ)² can be interpreted as a penalty cost for deleting any observation, where σ is a robust residual scale and c is a cut-off parameter. The robust residual scale σ is taken
from the LTS estimator (Rousseeuw and Leroy 1987). Actually, it can be obtained by regularizing the mean squared error resulting from the QMIP problem (8). The constant c is well known in the robust literature as the cut-off parameter, and can take values 2 ≤ c ≤ 3. The value cσ can be interpreted as a threshold for the allowable size of the residuals. We have found that 2.5σ or 3σ is a reasonable threshold under Gaussian conditions. The penalty cost is defined a priori, and the estimator's performance is very sensitive to this penalty, which regulates the robustness and the efficiency of the estimator. It should be noted that the PTS estimate is equivalent to the OLS estimate on the clean data set with k observations.

2.4.1 Robust penalties for PTS

The general principle of the PTS estimator is to delete an observation if the resulting reduction in the sum of squared residuals in (9) exceeds the penalty cost. As is known from the robust statistics literature (Atkinson and Riani 2000), a transformation of residuals that has been useful for outlier diagnostics is the square of the adjusted residual. A data point is deleted if its squared adjusted residual exceeds the penalty cost

$$\frac{u_i^{2}}{1 - h_i} > (c\sigma)^{2}$$

where h_i (0 < h_i < 1) measures the leverage of the i-th observation, defined as

$$h_i = \mathbf{x}_i^{T}\left(X^{T}X\right)^{-1}\mathbf{x}_i$$

where X is the n × p matrix whose rows are the x_i. The leverage value h_i can be distorted by the presence of a collection of points which individually have small leverage values but collectively form a high leverage group. Peña and Yohai (1999) point out that the individual leverage h_i of each such point may be small, i.e. h_i ≪ 1, whereas the final residual u_i may appear to be very close to 0, something which is called masking. Thus the deletion of a masked leverage point will result in only a small reduction in the sum of squared residuals. In order to eliminate the distortion of the masking problem, appropriate penalties for high-leverage observations are proposed in this work. We propose a robust measure of the leverage h_i of the i-th observation, by using the Minimum Covariance Determinant (MCD) matrix X_k (Rousseeuw and Van Driessen 1999). In the MCD matrix X_k with coverage k, the n − k observations with the largest distances d(x_i) from the X-space have been removed from the sample, and a robust measure of the leverage of each observation x_i is given by

$$h_i^{*} = \begin{cases}
\mathbf{x}_i^{T}\left(X_{k+1}^{T}X_{k+1}\right)^{-1}\mathbf{x}_i, & \text{for } i = k+1,\dots,n \\
\mathbf{x}_i^{T}\left(X_{k}^{T}X_{k}\right)^{-1}\mathbf{x}_i, & \text{for } i = 1,2,\dots,k
\end{cases}$$

The robust leverage values h*_i of the masked points will appear much larger than the original masked h_i, therefore the value √(1 − h*_i) is small, i.e. √(1 − h*_i) ≪ 1. In order to overcome the masking problem in PTS, we propose to reduce the penalty (cσ)² for every observation i with weights w_i = √(1 − h*_i), which gives the following individual penalties for each observation i:

$$(c_i\sigma)^{2} = \left(\sqrt{1 - h_i^{*}}\; c\sigma\right)^{2}$$

The above penalties will be small for masked high leverage observations, which will result in their deletion.
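A minimal numpy sketch of these individual penalties is given below. It assumes the index set of the k clean observations (for example, the support returned by a FAST-MCD implementation) is already available, and it reads X_{k+1} as the clean subset augmented with observation i itself, which is one plausible interpretation of the notation above; the function and argument names are illustrative.

```python
import numpy as np

def robust_penalties(X, clean_idx, c, sigma):
    """Individual PTS penalties (c_i*sigma)^2 = (1 - h*_i) * (c*sigma)^2,
    with robust leverages h*_i computed from the clean subset X_k."""
    n = X.shape[0]
    Xk = X[clean_idx]
    G = np.linalg.inv(Xk.T @ Xk)
    clean = np.zeros(n, dtype=bool)
    clean[clean_idx] = True

    h_star = np.empty(n)
    for i in range(n):
        if clean[i]:
            h_star[i] = X[i] @ G @ X[i]
        else:
            # augment the clean subset with observation i (assumed reading of X_{k+1})
            Xk1 = np.vstack([Xk, X[i]])
            h_star[i] = X[i] @ np.linalg.inv(Xk1.T @ Xk1) @ X[i]

    return (1.0 - h_star) * (c * sigma) ** 2
```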
2.4.2 Quadratic mixed integer programming for PTS

Given sample data (x_i, y_i), i = 1, ..., n, in order to obtain a PTS estimate, problem (9) is formulated as a quadratic mixed integer program as follows:

$$\begin{aligned}
\min_{\beta,\delta}\quad & \sum_{i=1}^{n}\left(\bar{u}_i^{2} + (c_i\sigma)^{2}\delta_i\right) \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i \le \bar{u}_i + \delta_i M, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i \le \bar{u}_i + \delta_i M, && i = 1,\dots,n \\
& \bar{u}_i \ge 0, && i = 1,\dots,n \\
& \delta_i \in \{0,1\}, && i = 1,\dots,n
\end{aligned} \qquad (10)$$

Robust penalties (c_iσ)² are added in the loss function for deleting the potential outliers. If δ_i = 1, the residual is reduced to zero in the first two constraints of (10) and the loss function is penalized with (c_iσ)². Problem (10) has a unique optimum solution in a similar way as problem (8), under the assumption that the observations are in general position. Due to the combinatorial nature of (10), PTS estimation is suitable for a small number of observations (i.e. n < 50). In the following section we propose an ε-insensitive PTS procedure where the QMIP formulation gains sparseness and becomes computationally reasonable even for larger data sets.
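As a sketch, (10) can be stated with boolean variables in cvxpy; this is only an illustration (the experiments in Sect. 4 use a FortMP/QMIP code), a mixed-integer QP/SOCP solver must be installed, and the penalties vector below is assumed to come from a computation such as the illustrative robust_penalties sketch of Sect. 2.4.1. Swapping the penalty term for the cardinality constraint δᵀδ ≤ n − k turns the same sketch into the LTS formulation (8).

```python
import numpy as np
import cvxpy as cp

def pts_qmip(X, y, penalties, M=1e6):
    """Sketch of the PTS problem (10): delta_i = 1 deletes observation i
    at a cost of penalties[i] = (c_i*sigma)^2."""
    n, p1 = X.shape
    beta = cp.Variable(p1)
    u = cp.Variable(n, nonneg=True)        # residual magnitudes \bar{u}_i
    delta = cp.Variable(n, boolean=True)   # deletion indicators

    r = X @ beta - y
    constraints = [r <= u + M * delta, -r <= u + M * delta]
    objective = cp.Minimize(cp.sum_squares(u) + penalties @ delta)
    prob = cp.Problem(objective, constraints)
    # A solver that handles mixed-integer quadratic/second-order cone problems
    # (e.g. ECOS_BB, SCIP, GUROBI) must be available and may need to be passed
    # explicitly via prob.solve(solver=...).
    prob.solve()
    return beta.value, delta.value
```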
3 Support vectors for PTS

3.1 ε-Insensitive loss function

Support Vector Regression (SVR) is a powerful technique for predictive data analysis with many applications in various areas of study (see Mangasarian and Musicant 2000). In Vapnik (1998), Smola and Scholkopf (1998) the support vector methodology has been introduced for linear and nonlinear regression, and it is considered an alternative robust procedure against noisy data. In this work we concentrate on the case of a linear regression function f(x), using Vapnik's ε-insensitive loss function defined as follows:

$$\bar{u} = |y - f(\mathbf{x})|_{\varepsilon} = \begin{cases} 0, & \text{if } |y - f(\mathbf{x})| \le \varepsilon \\ |y - f(\mathbf{x})| - \varepsilon, & \text{otherwise,} \end{cases} \qquad (11)$$

which is shown in Fig. 2. The value of ε in (11) is the accuracy that one requires for the approximation. The SVR approach, when applied to robust regression, consists mainly of the following two goals:

1. To reduce the complexity of the regression model, by regularizing the regression plane with respect to the parameter β.
2. To ignore small errors, that is those with u < ε, by fitting a tube of radius ε to the data set.

The first goal deals with the problem of overfitting. In order to reduce the complexity of the regression model we prefer simpler regressions, i.e. a "flatter" line, for reducing overfitting. To achieve this goal the magnitude of the β vector should be reduced. In the second goal,
since the exact fitting is very sensitive to small errors due to noisy data, a tube of radius ε is fitted instead of a line. The points that lie outside an ε-tube will be called support vectors. The ε-insensitive regression allows an interval of size ε with uniform error ū_i = 0. The estimation of the linear function f(x) = x^T β is implemented by formulating the following problem

$$\begin{aligned}
\min_{\beta}\quad & \sum_{i=1}^{n} \bar{u}_i + C\|\boldsymbol{\beta}\|^{2} \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i \le \varepsilon + \bar{u}_i, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i \le \varepsilon + \bar{u}_i, && i = 1,\dots,n \\
& \bar{u}_i \ge 0, && i = 1,\dots,n
\end{aligned} \qquad (12)$$
The objective function in (12) contains two competing terms. The linear sum of the ū_i represents the sum of the residuals of the observations from the ε-tube, and the quadratic term C‖β‖² is a regularization term which helps to avoid overfitting. Since these two terms cannot typically both be simultaneously minimized, the parameter C indicates how much emphasis is placed on one goal versus the other. It is possible to generalize the results of SVM regression to any convex loss function. According to Vapnik (1998) and Smola and Scholkopf (1998), let us consider the loss function

$$L_{\varepsilon}(y - f(\mathbf{x})) = \begin{cases} 0, & \text{if } |y - f(\mathbf{x})| \le \varepsilon \\ L(y - f(\mathbf{x})) - \varepsilon, & \text{otherwise} \end{cases} \qquad (13)$$

where L_ε(·) is convex. One can then formulate the problem as
$$\begin{aligned}
\min_{\beta}\quad & \sum_{i=1}^{n} L_{\varepsilon}(\bar{u}_i) + C\|\boldsymbol{\beta}\|^{2} \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i + b \le \varepsilon + \bar{u}_i, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i - b \le \varepsilon + \bar{u}_i, && i = 1,\dots,n \\
& \bar{u}_i \ge 0, && i = 1,\dots,n
\end{aligned} \qquad (14)$$

For example L_ε(ū) = ū as in (12), or L_ε(ū) = ū², or some other robust convex loss function from robust statistics.

3.2 ε-Insensitive penalized trimmed squares

Although ε-insensitive regression is an alternative robust procedure for handling noisy data, it is not designed to remove groups of outliers. In this section we combine the computationally effective ε-insensitive regression with the robust PTS estimator. In order to combine the PTS estimator as defined in (9) with the SVM regression given in (14), we have to fit an ε-tube to the data such that every observation within the tube will have a uniform cost in the loss function, while every observation that lies outside the tube will have a cost which is the minimum of its residual and a given penalty.
Fig. 2  ε-insensitive regression. Any point which is not a support vector will have a residual ū_i = 0
Since the residuals in (14) are defined from the boundary of the ε-tube, while the penalties described in Sect. 2.4.1 are defined from the axis of the ε-tube, we either have to modify the penalties or the residuals in order to be consistent. We choose the latter, and propose the following modified SVR model:

$$\begin{aligned}
\min_{\beta}\quad & \sum_{i=1}^{n} L_{\varepsilon}(\bar{u}_i) + C\|\boldsymbol{\beta}\|^{2} \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i + b \le \bar{u}_i, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i - b \le \bar{u}_i, && i = 1,\dots,n \\
& \bar{u}_i \ge \varepsilon, && i = 1,\dots,n
\end{aligned} \qquad (15)$$
The residuals in (15) are now clearly measured from the axis of the ε-tube, while due to the last constraint every observation within the tube has a uniform cost, thereby maintaining the ε-insensitive property. The ε-Insensitive Penalized Trimmed Squares (IPTS) estimator is thus defined by including the PTS loss function from (10) in (15), resulting in the following QMIP problem:

$$\begin{aligned}
\min_{\beta,\delta}\quad & \sum_{i=1}^{n}\left(\bar{u}_i^{2} + (c_i\sigma)^{2}\delta_i\right) + \|\boldsymbol{\beta}\|^{2} \\
\text{s.t.}\quad & \mathbf{x}_i^{T}\boldsymbol{\beta} - y_i \le \bar{u}_i + \delta_i M, && i = 1,\dots,n \\
& -\mathbf{x}_i^{T}\boldsymbol{\beta} + y_i \le \bar{u}_i + \delta_i M, && i = 1,\dots,n \\
& \bar{u}_i \ge \varepsilon, && i = 1,\dots,n \\
& \delta_i \in \{0,1\}, && i = 1,\dots,n
\end{aligned} \qquad (16)$$
The term C‖β‖² in the loss function is simplified by setting C = 1, emphasizing the regularization of the constant term β_0, since the regressor coefficients β_i (i = 1, ..., p) often appear with small values in regression problems. We note that due to the third constraint, for any observation point inside the tube (Fig. 3) the objective function is penalized with ε². A final note must be made regarding the sparseness of the problem in (16). All points inside the ε-tube do not contribute to the solution. We could remove any one of them locally and still have the same solution, or ε-tube, since the solution is strictly defined by the exterior points, which are called support vectors. The value of the parameter ε defines the trade-off between accuracy and sparseness.
Fig. 3  ε-insensitive PTS regression. Any point which is not a support vector will have a residual ū_i = ε
Under Gaussian conditions, good efficiency can be obtained for ε = 0.612σ (Smola and Scholkopf 1998). From our empirical results, ε = σ is a good choice for faster computation and efficiency. Finally, we should note that in order to maintain the general position assumption, and thereby unique solutions in (16), the value of ε should be small enough that any ε-tube will have at least p + 1 associated support vectors. Under Gaussian conditions, u ~ N(0, σ*²), the size of ε should therefore satisfy the condition

$$2\left[1 - \Phi\!\left(\frac{\varepsilon}{\sigma^{*}}\right)\right] n \ge p + 1 \qquad (17)$$

where Φ(·) is the cumulative probability of the standard normal distribution. For instance, in a regression problem with n = 500 observations and p = 2 regressors it should be Φ(ε/σ) = 1 − 3/1000 = 0.997, and more specifically ε/σ = Φ^{-1}(0.997), or from statistical tables ε = 2.75σ.

3.2.1 The radius ε for large data sets

As the size of the data set increases, it is reasonable to increase the sparseness of the optimization problem in (16) by increasing the value of ε in order to reduce the computational time. It should be noted that small changes in the parameter ε may increase the sparseness without affecting the correct identification of the outliers. However, as the radius ε of the tube increases, we lose precision in the regression parameter estimate. Thus for large data sets, or equivalently for large values of ε, the IPTS estimate is obtained in the following two stages:

• 1st stage. Perform IPTS by solving (16) on the data for some ε which satisfies (17), and identify the outliers. Delete the outliers from the data.
• 2nd stage. Perform OLS on the clean data.

It should be mentioned at this point that for large data sets, when we follow the above two stages, the IPTS estimate coincides with the PTS estimate, given that both procedures detect the same outliers. This arises from the fact that both procedures lead to the OLS estimate on the clean data.
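A compact sketch of this two-stage procedure, reusing the illustrative cvxpy formulation of Sect. 2.4.2 with the ε-floor and the ‖β‖² term of (16), is given below; as before, a mixed-integer-capable solver is assumed and the names are illustrative.

```python
import numpy as np
import cvxpy as cp

def ipts_two_stage(X, y, penalties, eps, M=1e6):
    """Stage 1: solve the IPTS QMIP (16) to flag outliers (delta_i = 1).
    Stage 2: refit OLS on the remaining (clean) observations."""
    n, p1 = X.shape
    beta = cp.Variable(p1)
    u = cp.Variable(n)
    delta = cp.Variable(n, boolean=True)

    r = X @ beta - y
    constraints = [r <= u + M * delta,
                   -r <= u + M * delta,
                   u >= eps]                         # epsilon-insensitive floor
    objective = cp.Minimize(cp.sum_squares(u) + penalties @ delta
                            + cp.sum_squares(beta))  # C = 1 as in the text
    cp.Problem(objective, constraints).solve()       # mixed-integer solver required

    keep = delta.value < 0.5                         # stage 2: OLS on the clean data
    beta_ols, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    return beta_ols, np.where(~keep)[0]
```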
Table 1  Small artificial data set, 50 points in R², including no outliers (0% contamination). Results of 500 simulation runs

|                 | β̂0     | β̂1    | β̂2     | Mean sq. fitting error | CPU (s) with QMIP |
|-----------------|--------|-------|--------|------------------------|-------------------|
| True function   | 0.000  | 1.200 | −0.800 | 253.86                 |                   |
| OLS             | −0.646 | 1.184 | −0.787 | 262.90                 |                   |
| LTS             | 1.699  | 1.129 | −0.828 | 278.30                 | 12.54             |
| PTS             | −1.671 | 1.210 | −0.753 | 263.41                 | 8.80              |
| IPTS, ε = 0.8σ  | −1.260 | 1.207 | −0.774 | 261.86                 | 2.11              |
4 Monte Carlo results

In this section we perform Monte Carlo experiments to evaluate the performance of the proposed robust procedures from a computational point of view. To carry out one simulation run, we proceed as follows. The observations y_i were obtained following a regression model with p = 2 regressors, y_i = β_0 + β_1 x_{1i} + β_2 x_{2i} + u_i, where the coefficient values are β_1 = 1.20, β_2 = −0.80 and a zero constant term β_0 = 0.0. The error term follows the Gaussian distribution u ~ N(μ = 0, σ² = 16²), while x_{1i}, x_{2i} are values drawn from the normal distributions N(μ = 20, σ² = 6²) and N(μ = 30, σ² = 8²) respectively. Some of the samples may contain outliers, which arise by drawing extra values from the uniform distribution U(a = 80, b = 220) and adding them to x_{1i}, x_{2i} or y_i. We have also included a typical benchmark instance from Hawkins et al. (1984), which is frequently used to evaluate robust estimators in the literature. We report the results of the OLS estimator and the exact solutions obtained by solving the convex quadratic mixed integer programming formulations for three different types of robust estimators:

• the LTS estimator for k = 0.6n,
• the proposed PTS estimator for c = 3,
• the proposed IPTS estimator for c = 3 and various values of ε.

The performance criteria used for comparing the estimators are the following:

• CPU time,
• mean square fitting error on the non-contaminated data,
• β parameter estimate.

The experimental runs were performed on a 1200 MHz AMD Athlon processor, while the QMIP solver used is the FortMP/QMIP Fortran code provided by CARISMA, Brunel University, U.K., 2003. The simulation results are summarized in Tables 1–6. Tables 1–3 present the measures of the performance criteria for the four estimators in the absence of outliers and in the presence of average and high contamination, respectively. Taking into account all performance criteria, we see that the IPTS estimator outperforms the other estimators. In the case of clean data, shown in Table 1, all estimators have a good performance, with IPTS having the advantage both in fitting error and in computation time.
Table 2  Small artificial data set, 50 points in R², including 12 outliers (24% contamination). Results of 500 simulation runs

|                 | β̂0     | β̂1    | β̂2     | Mean sq. fitting error | CPU (s) with QMIP |
|-----------------|--------|-------|--------|------------------------|-------------------|
| True function   | 0.000  | 1.200 | −0.800 | 253.86                 |                   |
| OLS             | 60.545 | 0.565 | −1.034 | 2122.79                |                   |
| LTS             | −0.672 | 1.020 | −0.703 | 317.15                 | 17.05             |
| PTS             | 0.031  | 1.134 | −0.809 | 271.23                 | 11.06             |
| IPTS, ε = 0.8σ  | −1.117 | 1.210 | −0.803 | 263.41                 | 3.11              |
Table 3  Small artificial data set, 50 points in R², including 16 outliers (32% contamination). Results of 500 simulation runs

|                 | β̂0     | β̂1    | β̂2     | Mean sq. fitting error | CPU (s) with QMIP |
|-----------------|--------|-------|--------|------------------------|-------------------|
| True function   | 0.000  | 1.200 | −0.800 | 253.86                 |                   |
| OLS             | 64.372 | 0.015 | −0.683 | 2547.86                |                   |
| LTS             | 0.631  | 0.985 | −0.683 | 322.87                 | 13.85             |
| PTS             | −1.006 | 1.159 | −0.742 | 282.23                 | 9.27              |
| IPTS, ε = 0.8σ  | −1.313 | 1.166 | −0.750 | 274.16                 | 2.91              |
Table 4  Hawkins et al. (1984) artificial data, 75 points in R³, including 10 outliers (13% contamination)

|                 | svs | β̂0     | β̂1    | β̂2    | β̂3     | Mean sq. fitting error | CPU (s) with QMIP |
|-----------------|-----|--------|-------|-------|--------|------------------------|-------------------|
| LTS             |     | −1.101 | 0.196 | 0.210 | 0.160  | 0.47                   |                   |
| PTS             |     | −0.050 | 0.062 | 0.012 | −0.107 | 0.30                   | 102.00            |
| IPTS, ε = 0.8σ  | 28  | 0.088  | 0.009 | 0.014 | −0.105 | 0.30                   | 3.93              |
| IPTS, ε = 1.1σ  | 11  | −0.050 | 0.062 | 0.012 | −0.107 | 0.30                   | 0.54              |
| IPTS, ε = 1.3σ  | 4   | −0.050 | 0.062 | 0.012 | −0.107 | 0.30                   | 0.10              |
For the medium contamination data in Table 2, the LTS estimator has a reasonable performance, but the PTS and IPTS estimators significantly improve the parameter estimates and the fitting error. The IPTS estimator is the best among all estimators with respect to all criteria. It is of great interest to see the results presented in Table 3, where the data is heavily contaminated with 16 outliers. Again, IPTS is the best among all the robust estimators, and its computation time has improved significantly. For large data sets we can increase the radius ε in order to delete the outliers from the original data set. From Tables 4–6 it is obvious that for large data sets the IPTS computation time decreases significantly as we increase the tube radius. We note that in Tables 4–6 we also report the number of resulting support vectors (svs) for each choice of radius ε. In the case of high contamination, IPTS has shown remarkable improvement in both robustness and efficiency. We can empirically conclude from the results that the ε-insensitive PTS procedure provides a significant improvement with respect to PTS, in terms of both robustness and computational time.
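For completeness, the artificial samples described at the beginning of this section can be generated as in the sketch below; the distribution parameters are taken from the text, while the choice of which coordinate (x_1, x_2 or y) receives the uniform shift is an assumption about the contamination mechanism, and the function name is illustrative.

```python
import numpy as np

def simulate_sample(n=50, n_out=12, seed=0):
    """One contaminated sample: y = 0 + 1.2*x1 - 0.8*x2 + u, with n_out
    points contaminated by adding a U(80, 220) shift to x1, x2 or y."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(20, 6, n)
    x2 = rng.normal(30, 8, n)
    u = rng.normal(0, 16, n)
    y = 0.0 + 1.2 * x1 - 0.8 * x2 + u

    outliers = rng.choice(n, size=n_out, replace=False)
    for i in outliers:
        shift = rng.uniform(80, 220)
        target = rng.integers(3)          # contaminate x1, x2 or y
        if target == 0:
            x1[i] += shift
        elif target == 1:
            x2[i] += shift
        else:
            y[i] += shift

    X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
    return X, y, outliers
```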
Table 5  Medium artificial data set, 100 points in R², no outliers (0% contamination). Results of 500 simulation runs

|                 | svs | β̂0     | β̂1    | β̂2     | Mean sq. fitting error | CPU (s) with QMIP |
|-----------------|-----|--------|-------|--------|------------------------|-------------------|
| True function   |     | 0.000  | 1.200 | −0.800 | 253.86                 |                   |
| OLS             |     | −1.147 | 1.179 | −0.741 | 261.75                 |                   |
| LTS             |     | −1.148 | 1.135 | −0.722 | 269.25                 | 95.00             |
| PTS             |     | −4.301 | 1.180 | −0.663 | 261.50                 | 51.17             |
| IPTS, ε = 1.0σ  | 24  | −4.301 | 1.180 | −0.663 | 261.50                 | 4.90              |
| IPTS, ε = 1.5σ  | 9   | −4.301 | 1.180 | −0.663 | 261.50                 | 0.21              |
| IPTS, ε = 2.0σ  | 3   | −4.301 | 1.180 | −0.663 | 261.50                 | 0.08              |
Table 6  Large artificial data set, 500 points in R², including 120 outliers (24% contamination). Results of 500 simulation runs

|                 | svs | β̂0     | β̂1    | β̂2     | Mean sq. fitting error | CPU (s) with QMIP |
|-----------------|-----|--------|-------|--------|------------------------|-------------------|
| True function   |     | 0.000  | 1.200 | −0.800 | 253.86                 |                   |
| OLS             |     | 60.649 | 0.606 | −1.068 | 2079.16                |                   |
| LTS             |     | 0.774  | 1.154 | −0.766 | 275.64                 | 1380.00           |
| PTS             |     | −1.840 | 1.171 | −0.754 | 262.15                 | 1029.00           |
| IPTS, ε = 1.5σ  | 35  | −1.840 | 1.171 | −0.754 | 262.15                 | 755.00            |
| IPTS, ε = 2.0σ  | 10  | −1.840 | 1.171 | −0.754 | 262.15                 | 24.17             |
| IPTS, ε = 2.5σ  | 4   | −1.840 | 1.171 | −0.754 | 262.15                 | 11.20             |
4.1 Conclusions and future research

A new methodology for deleting outliers in robust regression has been presented, which reduces the problem to a quadratic mixed integer program. The approach is shown to perform well when compared to other algorithms in the robust regression literature. Further research is needed to improve the computational time for large sample sizes (n > 500), possibly by solving the dual problem of (16). Moreover, modifications of (16) should be investigated that will provide the optimum value of ε.
References

Agulló, J. (2001). New algorithms for computing the least trimmed squares regression estimator. Computational Statistics and Data Analysis, 36, 425–439.
Arthanari, T. S., & Dodge, Y. (1993). Mathematical programming in statistics. New York: Wiley.
Atkinson, A., & Riani, M. (2000). Robust diagnostic regression analysis. Berlin: Wiley.
Bazaraa, M., Sherali, H., & Shetty, C. (1993). Nonlinear programming: Theory and algorithms. New York: Wiley.
Camarinopoulos, L., & Zioutas, G. (2002). Formulating robust regression estimation as an optimum allocation problem. Journal of Statistical Computation and Simulation, 72(9), 687–705.
Giloni, A., & Padberg, M. (2002). Least trimmed squares regression, least median squares regression, and mathematical programming. Mathematical and Computer Modelling, 35, 1043–1060.
Hampel, F. R. (1978). Optimally bounding the gross error sensitivity and influence of position in factor space. In Proceedings of the ASA statistical computing section (pp. 59–64). ASA.
Hawkins, D. M. (1994). The feasible solution algorithm for least trimmed squares regression. Data Mining and Knowledge Discovery, 17, 185–196.
Hawkins, D. M., Bradu, D., & Kass, G. V. (1984). Location of several outliers in multiple regression data using elemental sets. Technometrics, 26, 197–208.
Huber, P. J. (1981). Robust statistics. New York: Wiley.
Mangasarian, O. L., & Musicant, D. R. (2000). Robust linear and support vector regression. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 950–955.
Peña, D., & Yohai, V. J. (1999). A fast procedure for outlier diagnostics in large regression problems. Journal of the American Statistical Association, 94, 434–445.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: Wiley.
Rousseeuw, P. J., & Van Driessen, K. (1999). A fast algorithm for the minimum covariance determinant estimator. Technometrics, 41, 212–223.
Rousseeuw, P. J., & Van Driessen, K. (2006). Computing LTS regression for large data sets. Data Mining and Knowledge Discovery, 12, 29–45.
Smola, A. J., & Scholkopf, B. (1998). On a kernel-based method for pattern recognition, regression, approximation and operator inversion. Algorithmica, 22, 211–231.
Vapnik, V. N. (1998). Statistical learning theory. New York: Wiley.
Wright, S. J. (2000). On reduced convex QP formulations of monotone LCP problems (Technical Report ANL/MCS-P808-0400). Argonne National Laboratory.
Zioutas, G., & Avramidis, A. (2005). Deleting outliers in robust regression with mixed integer programming. Acta Mathematicae Applicatae Sinica, 21, 323–334.