International Journal of Information Technology & Decision Making, Vol. 10, No. 4 (2011) 589–612. © World Scientific Publishing Company. DOI: 10.1142/S0219622011004476
A MIXED INTEGER PROGRAMMING MODEL FOR MULTIPLE-CLASS DISCRIMINANT ANALYSIS
MINGHE SUN Department of Management Science and Statistics College of Business, The University of Texas at San Antonio San Antonio, TX 78249-0632
[email protected] http://business.utsa.edu/faculty/msun
A mixed integer programming model is proposed for multiple-class discriminant and classification analysis. When multiple discriminant functions, one for each class, are constructed with the mixed integer programming model, the number of misclassified observations in the sample is minimized. This model is an extension of the linear programming models for multiple-class discriminant analysis and may be considered a generalization of mixed integer programming formulations for two-class classification analysis. Properties of the model are studied. The model is immune to the difficulties that afflict many mathematical programming formulations for two-class classification analysis, such as nonexistence of optimal solutions, improper solutions, and instability under linear data transformation. In addition, meaningful discriminant functions can be generated under conditions where other techniques fail. Examples are provided. Results on publicly accessible datasets show that this model is very effective in generating powerful discriminant functions.

Keywords: Discriminant analysis; classification; mixed integer programming; optimization; nonparametric procedures.
1. Introduction

Extensive studies have been undertaken in mathematical programming (MP) approaches for discriminant and classification analysis, but the focus has been on two-class classification techniques. Generalizations to multiple-class techniques have been attempted; however, simple and powerful MP models for multiple-class discrimination and classification deserve further investigation. A mixed integer programming (MIP) model is proposed in this study as a nonparametric procedure for multiple-class discriminant and classification analysis. Although it is an extension of the linear programming (LP) model of Sun,1 this MIP model may be considered a generalization of the MIP models for two-class classification analysis. With the addition of this MIP model, MP approaches may become more attractive and better alternatives to other discriminant analysis techniques.
Sun1 proposed a simple but powerful LP model for this purpose. This LP model minimizes the sum of deviations of misclassified observations in the sample, or the L1-norm. The MIP model proposed in this study minimizes the number of misclassified observations in the sample, or the L0-norm. The LP model of Sun1 has very good properties, is immune to the pathologies of many other earlier MP models, and should work well under most situations. However, an MIP model is appealing because it directly minimizes the number of misclassified observations in the sample, and some researchers have reported that MIP models for two-class classification outperformed other models under certain conditions. In addition, an MIP model should always be the choice if the purpose of the application is discrimination rather than classification. MIP models can also be used to establish the maximum classification rates attainable for a dataset because their classification results provide benchmarks for other models of similar structure.2 Some properties of the MIP model are studied. Like the LP model,1 the MIP model is immune to the difficulties caused by pathologies of earlier MP models for two-class classification analysis. One shortcoming of the MIP model is the demand for extensive computation time.

Discriminant analysis and classification analysis are closely related and are major techniques for data processing. They study the differences between two or more classes of objects that are described by measurements of different characteristics or attributes, called prediction variables. Using the observations in a sample with known values of attributes or prediction variables and known class memberships, called a training sample, mathematical discriminant functions are constructed. The attribute values of an observation can be evaluated by these discriminant functions to obtain discriminant scores, and the observation is assigned to the class with the highest discriminant score. In discriminant analysis, discriminant functions are constructed in such a way that the observations in the sample are optimally separated based on a certain criterion. In classification analysis, the generalization capability of the discriminant functions is a consideration and new observations are expected to be classified as accurately as possible.

Discriminant analysis and classification analysis are used in virtually all fields of study. For many decades, statistical techniques, such as Fisher's linear discriminant function (LDF), Smith's quadratic discriminant function (QDF), and logistic regression, have been standard tools for discriminant and classification analysis. More recently, other techniques, such as MP approaches including support vector machines (SVM), neural networks, and classification trees, have become alternative tools. A spectrum of techniques is needed because no single technique always outperforms the others under all situations.3 Discriminant and classification analysis techniques will play even more important roles in data analysis as the huge amounts of data collected through information technology need to be analyzed.

2. MP Approaches for Discriminant Analysis

The original LP models for two-class classification4,5 inspired a series of studies. A considerable number of studies on MP approaches for discriminant and classification
analysis have been reported in the literature. Some of these works reported limitations and pathologies of some of the earlier MP models, some provided diagnoses, and others offered alternative or improved MP models as remedies.6–14 The different MP models introduced in the literature include LP, MIP, goal programming, nonlinear programming, multiple objective programming, and quadratic programming (i.e. SVM) models.15–17

Three difficulties caused by pathologies of some earlier MP formulations have raised concerns: an unbounded objective function, degenerate or improper solutions, and solution instability under linear data transformation.7,10–14 The objective function of an MP formulation is unbounded if its value can be made arbitrarily large when maximized or arbitrarily small when minimized, resulting in no meaningful solutions. A solution is degenerate or trivial if all the estimated coefficients in the classification function or in the discriminant functions are 0. A solution is improper if all the resulting discriminant functions for different classes are the same and therefore the observations cannot be definitely assigned to any class. A solution is proper if it is not improper. Solution instability occurs when an MP formulation produces different sets of discriminant functions, with different classification results, when the data are linearly transformed. As will be shown later, the MIP model proposed in this study is immune to these difficulties. The diagnoses include the conditions under which these difficulties may occur. The remedies and improvements include normalizations of the coefficients in classification functions to avoid these difficulties.

The concept of a trivial solution in the context of multiple-class discriminant and classification analysis is different from that in two-class classification. In two-class classification analysis, a classification function representing a hyperplane separating the two classes is constructed. In multiple-class discriminant and classification analysis, multiple discriminant functions, one for each class, are constructed, and the hyperplane separating any pair of classes is where the two discriminant functions have the same discriminant scores. Therefore, the equation of the separating hyperplane is obtained by setting the difference of the two discriminant functions to 0. As a result, when all discriminant functions have the same coefficients, i.e. an improper solution, the equations representing the boundaries of any pair of classes all have coefficients of 0, i.e. a degenerate solution. In multiple-class discriminant analysis, improper solutions may occur and the trivial solution becomes a special case of improper solutions.

Unlike statistical methods, MP approaches, as nonparametric methods, do not make strict assumptions about the data analyzed. Many studies comparing the performances of the more traditional statistical methods, such as Fisher's LDF and Smith's QDF, and MP approaches have been reported.18–20 Many computational experiments have been undertaken.2,17,18,20–24 Some of these studies also compared the performances of different MP formulations. Some of them used real data and others used simulated data. In general, the conclusion is that no single technique performs the best under all conditions. MP approaches outperform
statistical approaches when the assumptions underlying the statistical approaches are seriously violated.17,25 Being able to perform well on a variety of types of data is an advantage of MP approaches. Another advantage of MP approaches over the traditional statistical techniques is that the fitted model is less influenced by outlier observations. Nath and Jones,26 Glen,27,28 and Sun and Xiong29 have addressed the problem of variable selection in MP approaches for discriminant analysis.

Although most of the research in MP approaches is centered on two-class classification, attempts have been made to extend the approaches to multiple-class discriminant and classification analysis.1,30–37 Without models for multiple-class problems, the MP approaches are handicapped and may not be able to compete with other techniques. Researchers and practitioners will be more willing to accept MP approaches as nonparametric procedures when simple but powerful multiple-class MP models are available. Given the difficulty of multiple-class classification problems, some researchers focused on three-class classification problems.38–40

Conceiving that an extension from two-class classification to multiple-class classification was straightforward, Freed and Glover31 proposed a decomposition of a p-class discriminant problem into p(p − 1)/2 two-class discriminant problems. Each problem represents a pair of classes, and each is formulated as an LP model and solved separately to determine a classification function representing the hyperplane separating the two classes. This approach was later called the one-against-one approach in the literature. The drawback of this approach is that the classification functions may be suboptimal because they are not estimated in an aggregate form. As a result, the classification of observations in some segments of the variable space is not clear.16,36,39 Furthermore, this approach is tedious because too many subproblems are formed, too many LP models are formulated and solved, and too many classification functions are estimated.

Another extension is to decompose a p-class discriminant problem into p two-class problems. In the kth two-class problem, the observations in class k are treated as one class and all the rest are treated as the other. This approach was later called the one-against-all approach. It has the same drawbacks as the one-against-one approach except that p, instead of p(p − 1)/2, classification functions are estimated. The advantage of this approach is that the p two-class problems are computationally easier to solve than a multiple-class model.

Gehrlein32 proposed two MIP formulations. These MIP models set up the foundations of most of the later studies in this area. One MIP model uses a single discriminant function with class-specific cutoff discriminant scores and is referred to as the single function model. A new observation is evaluated with the classification function to obtain a discriminant score and is assigned to a specific class if its discriminant score falls into the interval of this class. This model implicitly assumes that the hyperplanes separating the classes are all parallel and, therefore, cut the variable space into layers. However, the observations in different classes rarely fall into layers in the variable space in practical problems. Hence, this model has pathological flaws, as noted by many researchers.16 Pavur and Loucopoulos36
modified the original single function model from the minimization of the number of misclassified cases to the minimization of the sum of deviations of misclassified cases. This modified model, as an LP model, is much easier to solve. Östermark and Höglund41 pointed out that the ordering of the classes is important for accurate classification because the model requires that class 1 scores be lower than class 2 scores, which in turn must be lower than class 3 scores, and so on. Therefore, they suggested the investigation of alternative sequencings of the classes. Choo and Wedley30 used multiple criteria decision-making techniques to determine the coefficients in a single classification function for multiple-class discriminant problems. Kou et al.37 developed a one-function model for multiple-class classification using multiple objective programming. Kernels are introduced into the classification function to deal with nonlinearity. The objectives are to maximize the distances between the classes and to minimize total overlapping among the classes. Single function models may perform well on certain datasets or under certain conditions. Although it is possible to design computational experiments using datasets with each class falling into layers in the variable space to obtain good classification results, single function models are pathologically flawed and are not general enough to be applied to all types of problems in practice.

Models using multiple discriminant functions, one for each class, are theoretically more reasonable, are more general and flexible, and, therefore, are preferred to single function models.16 Gehrlein32 also proposed a multiple discriminant function MIP model in which each class has a discriminant function. All the discriminant functions are estimated simultaneously through the solution of the MIP model. A new observation is evaluated by each discriminant function and is assigned to the class with the highest discriminant score. As a drawback, this MIP model requires a considerable number of binary variables and is computationally infeasible for problems with large datasets.16 Wilson42 introduced a multiple function MIP model as an alternative to the one by Gehrlein.32 Each observation is represented by 2p constraints and is associated with p binary variables for a problem with p classes. This model is even more complicated and difficult to solve. The model proposed by Bennett and Mangasarian33 is almost identical in structure to the multiple function model proposed by Gehrlein.32 However, instead of minimizing the L0-norm, this model minimizes a weighted L1-norm, which makes the MP model computationally much easier to solve. The weight assigned to each term in the objective function, representing the deviation if an observation is misclassified, is the reciprocal of the sample size of the class that the observation belongs to. Gochet et al.34 proposed an LP model for multiple-class discriminant analysis. Unlike in the study of Bennett and Mangasarian,33 the terms in the objective function are not weighted. In addition to the same set of constraints as in the study of Bennett and Mangasarian,33 a normalization constraint is used to restrict the sum of goodness of fit to be larger than the sum of badness of fit. Because no integer variables are involved in the formulation, the model is computationally very
efficient to solve. Östermark and Höglund41 extended the model of Gochet et al.34 to include quadratic terms of the variables in the discriminant functions. All MP models for multiple-class problems using multiple discriminant functions mentioned above use the difference of two discriminant functions as a constraint and use p − 1 error terms or p − 1 binary variables for each observation in the model for a problem with p classes. The LP model of Sun1 introduced a cutoff point in the discriminant scores for each observation and uses a single error term for each observation. Therefore, this LP model is simpler and more reasonable.

Rather than constructing a hyperplane to separate the two classes, Bertsimas and Shioda43 divide the variable space into different polyhedral regions. All data points in a polyhedral region are assigned to one of the two classes. This approach works in a way similar to a classification tree and performs well for problems with observations of each class clustered in different regions rather than in a single region in the variable space. The approach used by Üney and Türkay44 is similar to that used by Bertsimas and Shioda43 but is for multiple-class classification. Instead of constructing discriminant functions to separate the classes, this approach constructs hyperboxes in the variable space. All points in a hyperbox are assigned to the same class.

The major attractive feature of MIP models is the direct minimization of the number of misclassified cases in the sample. However, MIP models are greedier than other methods. Therefore, it is possible for the MIP approach to achieve a higher in-sample classification rate but a lower validation classification rate than other methods. If the purpose of the study is discrimination of the observations in the sample rather than classification of new observations, MIP models perform better than other models of similar structure.

3. Model Development

Assume there are p ≥ 2 known classes in the dataset and K = {1, . . . , p} is used to denote their index set. A sample of m observations or cases is available and the class membership of each observation is unique and known. Among the m observations, mk are in class k, for each k ∈ K, such that m = Σ_{k∈K} mk. The index set of all observations in the sample is denoted by I and that of the observations in class k is denoted by Ik for each k ∈ K. The observations represent the objects to be classified. Assume n attributes or prediction variables are used to describe the observations. The index set of all prediction variables is denoted by J = {1, . . . , n}. The value of prediction variable j ∈ J of observation i ∈ I is denoted by xij. With xi0 ≡ 1 for all i ∈ I, xi = (xi0, xi1, . . . , xin) represents the realized values of all prediction variables of observation i. When convenient, x = (x0, x1, . . . , xn) is used to denote the prediction variables of a generic observation. Some of the prediction variables are real and others may be nominal or categorical. When nonlinear discriminant functions are constructed, some of the xj measure the attributes of the observations and, therefore, are independent variables, and others may be functions,
such as squares or cross products, of other independent variables. When nonlinear discriminant functions are constructed, the MIP model is still linear because the functions are linear in the parameters although nonlinear in the variables.

It is assumed that there is a vector of unknown parameters βk ∈ R^(n+1), with βk = (βk0, βk1, . . . , βkn) for each k ∈ K, such that βk x is the discriminant function for class k. The elements of βk need to be estimated using the data in the sample in such a way that as many observations in the sample as possible are correctly classified. The sample estimate of βk is denoted by bk = (bk0, bk1, . . . , bkn). The estimated discriminant functions bk x for k ∈ K are then used to evaluate and classify observations. For any observation represented by x ∈ R^(n+1), discriminant scores bk x are computed. The observation is assigned to the class k* with the highest discriminant score, i.e. bk* x = max{bk x | k ∈ K}.

Let M > 0 be a constant that is sufficiently large. For each observation i ∈ Ik, there always exist ci and di such that the following p inequalities hold:

    bk xi + M di > ci,                        (1)
    bk' xi ≤ ci,          for k' ≠ k,         (2)

where ci is a cutoff point and di is a binary variable associated with observation i. In inequalities (1) and (2), ci = max{bk' xi | k' ≠ k} is sufficient for any i ∈ Ik. If these inequalities hold with di = 0, observation i ∈ Ik is correctly classified into class k. Otherwise, if di = 1 is needed for these inequalities to hold, observation i ∈ Ik is incorrectly classified into a class other than k.

Similar to the LP formulation by Sun,1 the following MIP formulation is proposed to estimate the parameters, i.e. the elements of βk for all k ∈ K, for multiple-class discriminant analysis:

    min   Σ_{i∈I} di                                        (3)
    s.t.  bk xi − ci + M di ≥ ε    for i ∈ Ik and k ∈ K      (4)
          bk xi − ci ≤ 0           for i ∉ Ik and k ∈ K      (5)
          bk unrestricted          for k ∈ K                 (6)
          ci unrestricted          for i ∈ I                 (7)
          di = 0 or 1              for i ∈ I.                (8)
In total, there are m binary variables in this MIP model, one for each observation in the sample. In (4), ε > 0 is a constant that is sufficiently small such that M ≥ ε. In the implementation, ε = 1 is sufficient. Intuitively, ε creates a classification gap for each i ∈ I. In addition, ε plays a normalization role to eliminate improper solutions. According to (4) and (5), (bk xi + M di) − max{bk' xi | k' ≠ k} ≥ ε must hold for each i ∈ Ik. If di = 0 in the final solution, then bk xi − max{bk' xi | k' ≠ k} ≥ ε and observation i ∈ Ik is correctly classified into class k with a clear margin greater than or equal to ε. Otherwise, if di = 1 in the final solution, bk xi − max{bk' xi | k' ≠ k} ≥ ε does not hold. In this case, observation i ∈ Ik cannot be correctly classified
into class k. Hence, the value of di indicates whether observation i ∈ Ik is correctly classified.

The objective function (3) represents the total number of misclassified observations, or the L0-norm. Because the objective function is minimized, as many di as possible are set to 0 in the optimal solution of the MIP model. Consequently, the MIP model is not affected by outlier observations because each misclassified observation is counted equally in the objective function regardless of the magnitude of its deviation.

For each observation i ∈ I, one constraint in (4) and p − 1 constraints in (5) are in the model. Altogether, there are m constraints in (4), one for each observation i ∈ I, and m(p − 1) constraints in (5), p − 1 for each observation i ∈ I. In addition to the binary variables di, the estimated parameters, i.e. the elements of bk, and the cutoff points ci are the continuous variables of the model. Altogether the model has mp constraints, m binary variables, and (n + 1)p + m continuous variables. Hence, the model is an MIP model.

The MIP formulation in (3)–(8) is different from that in the study by Gehrlein.32 In the MIP model by Gehrlein,32 each observation i ∈ Ik is associated with p − 1 constraints, one for each k' ≠ k. For an observation i ∈ Ik, each constraint is of the form bk xi − bk' xi + M dik' ≥ ε for i ∈ Ik, k ∈ K, k' ∈ K, and k' ≠ k. This constraint is the difference between the constraint in (4) for the observation and a constraint in (5) for the same observation but for k' ≠ k. As a result, p − 1 binary variables of the form dik' are used for each observation i ∈ Ik, one for each k' ≠ k, in the model by Gehrlein.32 In total, m(p − 1) binary variables are in the MIP model of Gehrlein32 but only m are in the model in (3)–(8). Each xij appears 2(p − 1) times in the MIP model of Gehrlein32 but appears only p times in the MIP model in (3)–(8). Using the property of order preservation under shifting and positive rescaling stated in the next section, each xij appears in the MIP model only p − 1 times. Therefore, the MIP model in (3)–(8) is a much sparser model than previous MIP formulations.

When p = 2, each observation i ∈ I is associated with one constraint in (4) and one constraint in (5). The inequality b2 xi − b1 xi − M di ≤ −ε is obtained by subtracting the constraint in (4) from the constraint in (5) for each i ∈ I1. Similarly, b2 xi − b1 xi + M di ≥ ε is obtained by subtracting the constraint in (5) from the constraint in (4) for each i ∈ I2. With b = b2 − b1, the MIP model in (3)–(8) for p = 2 can be written as

    min   Σ_{i∈I} di                            (9)
    s.t.  b xi − M di ≤ −ε    for i ∈ I1        (10)
          b xi + M di ≥ ε     for i ∈ I2        (11)
          b unrestricted                        (12)
          di = 0 or 1         for i ∈ I.        (13)
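As a concrete illustration of the general formulation in (3)–(8), the following is a minimal sketch of how the model could be assembled and solved with the open-source PuLP modeler in Python. It is an illustration only, not the CPLEX implementation used in this paper; the data layout, the default values of M and ε, and the helper name fit_mip_discriminant are assumptions made for the example.

```python
# Sketch of the MIP model in (3)-(8), assuming the PuLP modeler (pip install pulp).
# x[i] is the attribute vector of observation i, with a leading 1 for the intercept,
# and y[i] in {1, ..., p} is its known class membership.
import pulp

def fit_mip_discriminant(x, y, p, M=100.0, eps=1.0):
    m, n1 = len(x), len(x[0])          # m observations, n + 1 coefficients per class
    prob = pulp.LpProblem("mip_discriminant", pulp.LpMinimize)
    b = [[pulp.LpVariable(f"b_{k}_{j}") for j in range(n1)] for k in range(p)]   # (6) unrestricted
    c = [pulp.LpVariable(f"c_{i}") for i in range(m)]                            # (7) unrestricted
    d = [pulp.LpVariable(f"d_{i}", cat="Binary") for i in range(m)]              # (8) binary
    prob += pulp.lpSum(d)              # objective (3): number of misclassified observations
    for i in range(m):
        for k in range(p):
            score = pulp.lpSum(b[k][j] * x[i][j] for j in range(n1))
            if k + 1 == y[i]:
                prob += score - c[i] + M * d[i] >= eps    # constraint (4)
            else:
                prob += score - c[i] <= 0                 # constraint (5)
    # M must be sufficiently large relative to eps; Sec. 4 gives a guideline based
    # on the deviations of the corresponding LP model.
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [[v.value() for v in bk] for bk in b]

# Toy usage: an observation is assigned to the class whose discriminant
# function b_k x gives the highest score.
x = [(1, 2.0, 2.0), (1, 5.5, 4.5), (1, 2.5, 3.0), (1, 2.0, 1.5), (1, 1.0, 2.5), (1, 5.0, 4.5)]
y = [1, 1, 2, 2, 3, 3]
b = fit_mip_discriminant(x, y, p=3, M=10.0, eps=1.0)
scores = [sum(bkj * xj for bkj, xj in zip(bk, x[0])) for bk in b]
print("observation 1 assigned to class", scores.index(max(scores)) + 1)
```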
The MIP model in (9)–(13) is similar to that in the study by Stam and Joachimsthaler.17 The difference is that ε = 0 is used by Stam and Joachimsthaler.17 In this sense, the MIP model in (3)–(8) may be considered a generalization of the one proposed by Stam and Joachimsthaler17 for two-class classification. In a similar manner, other formulations for two-class problems can be generalized to multiple-class problems. More formulations for two-class problems are summarized by Erenguc and Koehler,15 Joachimsthaler and Stam,45 and Stam.16

When p = 2, the model in (9)–(13) constructs an equation bx = 0 that represents a hyperplane separating the two classes in the prediction variable space, with one class on each side of the hyperplane. When p > 2, the MIP model in (3)–(8) constructs p discriminant functions. The equation representing the hyperplane separating two classes is obtained by equating the pair of discriminant functions representing the two classes.

MIP formulations for discriminant problems are NP-complete46 and, therefore, are computationally much more demanding than LP formulations, especially when the classes have substantial overlaps. This limitation may be overcome by developing heuristic solution methods. The MIP model may have many alternative optimal solutions. When the classes in the sample are not completely separable, the discriminant functions constructed with the MIP model always yield the highest classification rate of the observations in the sample among all models of similar structure. Therefore, the classification results of MIP models can be used as benchmarks to measure the classification quality of other approaches.2

4. Some Properties of the MIP Model

The MIP model in (3)–(8) possesses some properties that the LP model1 has and some others that the LP model does not have. These properties concern the three difficulties encountered in earlier MP models for two-class discriminant and classification analysis. In this section, some properties that the MIP model in (3)–(8) has but the LP model does not have are presented first, and the properties that both the MIP and the LP models have are then discussed briefly.

The first issue to consider is whether the MIP model in (3)–(8) has an optimal solution, or even any feasible solution. Theorem 1 guarantees that an optimal solution can always be found.

Theorem 1. An optimal solution to the MIP model in (3)–(8) always exists if M > 0 is sufficiently large and ε > 0 is sufficiently small such that M ≥ ε.

Proof. For a minimization problem to have an optimal solution, the objective function must be bounded from below and a feasible solution must exist. From (8), the objective function (3) is bounded from below by 0. Because M ≥ ε, the trivial solution with bk = 0 for all k ∈ K, ci = 0 for all i ∈ I, and di = 1 for all i ∈ I is a feasible solution.
The objective function is also bounded from above by m. Therefore, a solution with di = 1 for all i ∈ I is the worst possible feasible solution. Only an improper solution has such an objective function value. With an improper solution, bk = b̄ for all k ∈ K and the discriminant functions cannot definitely classify any observation into any class. Although Theorem 1 ensures that an optimal solution always exists, the solution is useless if it is an improper solution. The second issue of concern is whether the optimal solution is an improper solution. Theorem 2 ensures that the MIP model in (3)–(8) always has proper solutions because proper solutions always have lower values of the objective function than improper solutions.

Theorem 2. If M > 0 is sufficiently large and ε > 0 is sufficiently small such that M ≥ ε, the MIP model in (3)–(8) always has proper solutions.

Proof. Assume an improper solution with bk = b̄ for all k ∈ K is optimal. The constraints in (5) are satisfied with ci = b̄ xi for all i ∈ I. Hence, with M ≥ ε, the constraints in (4) are satisfied only with di = 1 for all i ∈ I. As a result, the objective function (3) has the value Σ_{i∈I} di = m. If all the observations are classified into a single class k* ∈ K such that mk* = max{mk | k ∈ K}, the objective function (3) has the value Σ_{i∈I} di = m − mk* < m. For any observation i to be classified into a class k, bk xi > bk' xi, i.e. bk ≠ bk', must hold for all k' ≠ k. This contradicts the assumption that the improper solution with bk = b̄ for all k ∈ K is optimal.

Because the trivial solution is a special case of improper solutions, by Theorem 2 the MIP model in (3)–(8) never generates the trivial solution. Many LP models for two-class classification suffer from trivial solutions.7,10–14 For multiple-class discriminant and classification analysis, the LP approaches1,33,34 generate improper solutions bk = b̄ for all k ∈ K under the special condition that all classes in the sample have the same class centroid and equal sample sizes. The centroid of a class k ∈ K is defined as x̄k = (1/mk) Σ_{i∈Ik} xi. Hence, these LP models generate improper solutions when x̄1 = x̄2 = · · · = x̄p and m1 = m2 = · · · = mp. However, even under this special condition, the MIP model in (3)–(8) always generates a proper solution with a set of discriminant functions that classify as many observations in the sample as possible into their respective classes.

The proof of Theorem 2 assumes that the discriminant functions classify all the observations in the sample into a single class k* with mk* = max{mk | k ∈ K}. In fact, better solutions can always be found if the observations in different classes are distinct. Assume all observations are classified into class k*. If there is an observation i ∈ Ik' with k' ≠ k* that appears only in class k', reassigning this observation to class k' reduces the value of the objective function (3) by 1. Under the extreme condition where all the classes have exactly the same observations, the model still generates a proper solution, but the discriminant functions can only classify all the observations into a single class. Under this extreme condition, the observations in the sample should not be differentiated anyway.
With discriminant functions constructed with LP models, some observations in the sample fall into the classification gap.40,47 For such an observation i ∈ Ik, 0 < bk xi − ci < ε holds. Such observations are correctly classified but without a clear margin ε. Theorem 3 shows that this situation does not exist with discriminant functions constructed with the MIP model in (3)–(8). Hence, an observation is either incorrectly classified or correctly classified with a clear margin ε. Therefore, for discrimination purposes, the MIP solution is at least as good as the LP solution in terms of the number of misclassified observations in the sample. Thus, the MIP model may be used to improve the solution obtained with the LP model, and the LP solution may be used as the initial incumbent solution of the MIP model.

Theorem 3. If M > 0 is sufficiently large and ε > 0 is sufficiently small such that M ≥ ε, the condition in (14) does not hold for any i ∈ Ik and k ∈ K in an optimal solution:

    0 < bk xi − ci < ε.    (14)
Proof. Any observation i ∈ Ik and k ∈ K for which (14) holds must have di = 1 to satisfy the constraint in (4). Assume (14) holds in an optimal solution for at least one i ∈ Ik and k ∈ K; therefore, di = 1. By (14), φ = ε/(bk xi − ci) > 1. Let b̂k = αbk for all k ∈ K with α ≥ φ > 1. Then the b̂k x for all k ∈ K preserve the order of the bk x and, therefore, the b̂k x for k ∈ K can also be used as discriminant functions. Because M > 0 is sufficiently large and ε > 0 is sufficiently small with M ≥ ε, replacing bk with b̂k = αbk for each k ∈ K and replacing ci with ĉi = αci for all i ∈ I in (4) and (5) will not violate the feasibility of the MIP model in (3)–(8). However, b̂k xi − ĉi = α(bk xi − ci) ≥ φ(bk xi − ci) = ε implies that di = 0 is feasible. This contradicts the assumption that the solution is optimal.

The MIP model in (3)–(8) also has other properties that the LP model1 has. One property is stability, or invariance, under linear data transformation. A linear data transformation transforms xij into yij using yij = αj xij + γj for each i ∈ I and j ∈ J. Both αj and γj are scalars with αj ≠ 0 and may be different for different j ∈ J. The yij instead of the xij are then used to set up the MIP model and to construct discriminant functions. The MIP model in (3)–(8) is invariant under linear data transformation as long as M > 0 is sufficiently large and ε > 0 is sufficiently small such that M ≥ ε. After transformation, the estimated parameters in the discriminant functions become b̂k = (bk0 − Σ_{j∈J} (bkj/αj)γj, bk1/α1, . . . , bkn/αn) for all k ∈ K. Then the b̂k y for k ∈ K are the discriminant functions using the transformed data. Being stable, or invariant, the b̂k y preserve the order of the bk x, i.e. b̂k y ≥ b̂k' y for any k' ≠ k if and only if bk x ≥ bk' x for the same k' ≠ k. With this property, the classification results will be the same whether the original or the transformed data are used. This property is important and desirable because linear data transformation is a common technique in data preprocessing
and because the transformed data are usually easier to analyze. Some LP models proposed for two-class classification do not have this property.14

Another property is the freedom in selecting a value for ε in (4). Varying the value of ε in (4) only rescales the components of bk for k ∈ K and the cutoff values ci for i ∈ I but does not affect the classification results, as long as M > 0 is sufficiently large and ε > 0 is sufficiently small such that M ≥ ε. Suppose ε > 0 in (4) is replaced with ε̂ = αε for any α > 0, hence ε̂ > 0, and M in (4) is changed accordingly to keep M ≥ ε̂. A new MIP model is formulated with ε̂ in (4). Then b̂k = αbk for k ∈ K and ĉi = αci for i ∈ I are optimal in the new MIP model if and only if bk for k ∈ K and ci for i ∈ I are optimal in the original MIP model. The binary variables di for i ∈ I will be the same in both MIP models. Consequently, the b̂k x for k ∈ K preserve the order of the bk x for k ∈ K, i.e. b̂k x ≥ b̂k' x for any k' ≠ k if and only if bk x ≥ bk' x for the same k' ≠ k. As a result, the b̂k x and the bk x for k ∈ K produce the same classification results. With this property, users can use a convenient value for ε. The value of ε determines the magnitudes of the components of bk for k ∈ K and of ci for i ∈ I.

Gochet et al.34 and Sun1 studied the order-preserving property of the discriminant functions for the LP models under shifting and positive rescaling. Shifting refers to adding the same constant β to each discriminant function and positive rescaling refers to multiplying each discriminant function by the same positive constant α. Discriminant functions constructed with statistical methods or other MP models all have this property. If a function of the form b̄x is added to each bk x to obtain b̂k x = (bk + b̄)x for each k ∈ K, the discriminant functions b̂k x and bk x produce the same classification results. If each bk x is multiplied by the same constant α > 0 to obtain b̂k x = αbk x, the discriminant functions b̂k x and bk x also produce the same classification results. Gochet et al.34 and Sun1 showed that the LP model can be simplified by using this property. The MIP model in (3)–(8) can also be simplified by using this property.

Choose a k* ∈ K such that mk* = min{mk | k ∈ K} and add −bk* x to each bk x to obtain b̂k x = (bk − bk*)x for all k ∈ K. Hence, b̂k* = 0. Using b̂k = bk − bk* instead of bk in the MIP model, the constraint in (4) for each i ∈ Ik* becomes −ci + M di ≥ ε, and the constraint in (5) for each i ∉ Ik* and for the chosen k* becomes ci ≥ 0. The number of constraints in (5) is reduced by m − mk*, and mk* constraints in (4) take the form −ci + M di ≥ ε. The number of continuous variables is reduced by n + 1. Therefore, the MIP model is simplified. Choosing k* such that mk* = min{mk | k ∈ K} to set b̂k* = 0 takes full advantage of this property.

All properties of the MIP model in (3)–(8) assume that M > 0 is sufficiently large and ε > 0 is sufficiently small such that M ≥ ε. No guidelines for determining M were provided for the two-class models published in the literature. The following is a rough guideline for determining M in (4). Solve the LP model of Sun1 with the given ε. Suppose e*i for each i ∈ I are the values of the deviation variables in an optimal solution of the LP model. Then the value of M can be determined by M ≥ max{e*i | i ∈ I}. Because the CPU time taken to solve the LP model
can be negligible compared to that needed to solve the MIP model, this step does not cause an extra computational burden. Furthermore, an optimal solution of the LP model provides an initial incumbent solution for the MIP model.

5. Examples

Two structured examples are provided in this section. The main purpose of these examples is to show how the model works, especially the formulation of the MIP model, and to show the features that the model has, rather than to demonstrate the performance of this approach.

5.1. Example 1

The purpose of this example is to demonstrate how the MIP model in (3)–(8) is set up. The example has m = 9 observations and n = 2 prediction variables as shown in Table 1. The m = 9 observations are divided equally into p = 3 classes. The observations are not completely linearly separable. The objective function of this example problem is min d1 + d2 + d3 + d4 + d5 + d6 + d7 + d8 + d9. With M = 10 and ε = 1 and with all elements of b1 set to 0, the constraints in (4) and (5) are listed in Table 2. These constraints are grouped into blocks in the table. The constraints in (4) are in the diagonal blocks and those in (5) are in the off-diagonal blocks of the table. In addition, bkj is unrestricted for all j = 0, 1, 2 and k = 1, 2, 3; ci is unrestricted for i = 1, 2, 3; ci ≥ 0 for i = 4, . . . , 9; and di = 0 or 1 for i = 1, . . . , 9. The discriminant functions b1 x = 0.00, b2 x = 9.5 − 2.09x1 − 1.09x2, and b3 x = −24.56 − 9.02x1 + 15.70x2 are obtained. With this set of discriminant functions, observation 1 is misclassified into class 2 and observation 6 is misclassified into class 1, but all other observations are correctly classified. The equations representing the hyperplanes separating the three classes, obtained from this set of discriminant functions, are 2.09x1 + 1.09x2 = 9.5 between classes 1 and 2, −9.02x1 + 15.70x2 = 24.56 between classes 1 and 3, and −6.93x1 + 16.80x2 = 34.06 between classes 2 and 3.

5.2. Example 2

The data of this example, as presented in Table 3, are randomly generated in such a way that the p = 3 classes have the same class centroid and equal sample size,
Table 1. Data in Example 1.

Class 1:  i = 1: (x1, x2) = (2.0, 2.0);   i = 2: (5.5, 4.5);   i = 3: (4.5, 1.0)
Class 2:  i = 4: (x1, x2) = (2.5, 3.0);   i = 5: (2.0, 1.5);   i = 6: (6.5, 4.5)
Class 3:  i = 7: (x1, x2) = (1.0, 2.5);   i = 8: (5.0, 4.5);   i = 9: (2.5, 4.0)
Table 2. Constraints in Example 1.

Class 1 observations:
  i = 1:  (k = 1)  −c1 + 10d1 ≥ 1.0;   (k = 2)  b20 + 2.0b21 + 2.0b22 − c1 ≤ 0.0;   (k = 3)  b30 + 2.0b31 + 2.0b32 − c1 ≤ 0.0
  i = 2:  (k = 1)  −c2 + 10d2 ≥ 1.0;   (k = 2)  b20 + 5.5b21 + 4.5b22 − c2 ≤ 0.0;   (k = 3)  b30 + 5.5b31 + 4.5b32 − c2 ≤ 0.0
  i = 3:  (k = 1)  −c3 + 10d3 ≥ 1.0;   (k = 2)  b20 + 4.5b21 + 1.0b22 − c3 ≤ 0.0;   (k = 3)  b30 + 4.5b31 + 1.0b32 − c3 ≤ 0.0

Class 2 observations:
  i = 4:  (k = 2)  b20 + 2.5b21 + 3.0b22 − c4 + 10d4 ≥ 1.0;   (k = 3)  b30 + 2.5b31 + 3.0b32 − c4 ≤ 0.0
  i = 5:  (k = 2)  b20 + 2.0b21 + 1.5b22 − c5 + 10d5 ≥ 1.0;   (k = 3)  b30 + 2.0b31 + 1.5b32 − c5 ≤ 0.0
  i = 6:  (k = 2)  b20 + 6.5b21 + 4.5b22 − c6 + 10d6 ≥ 1.0;   (k = 3)  b30 + 6.5b31 + 4.5b32 − c6 ≤ 0.0

Class 3 observations:
  i = 7:  (k = 2)  b20 + 1.0b21 + 2.5b22 − c7 ≤ 0.0;   (k = 3)  b30 + 1.0b31 + 2.5b32 − c7 + 10d7 ≥ 1.0
  i = 8:  (k = 2)  b20 + 5.0b21 + 4.5b22 − c8 ≤ 0.0;   (k = 3)  b30 + 5.0b31 + 4.5b32 − c8 + 10d8 ≥ 1.0
  i = 9:  (k = 2)  b20 + 2.5b21 + 4.0b22 − c9 ≤ 0.0;   (k = 3)  b30 + 2.5b31 + 4.0b32 − c9 + 10d9 ≥ 1.0

Table 3. Data in Example 2.

Class 1 (k = 1):
  i     j=1     j=2     j=3
  1     6.26   17.02   11.25
  2     5.81   18.57   14.19
  3     9.81   19.30   12.42
  4     8.06   17.69   11.26
  5     8.15   15.60   11.63
  6     6.42   17.45   14.91
  7     6.78   17.86   13.63
  8     6.40   17.39   14.63
  9     6.19   15.57   13.69
 10     7.38   17.44   11.99

Class 2 (k = 2):
  i     j=1     j=2     j=3
 11     5.64   16.10   10.80
 12     8.44   16.37   15.07
 13     6.33   17.55   14.03
 14     7.39   19.43   11.74
 15     7.81   19.97   15.34
 16     6.89   15.64   14.77
 17     8.33   16.50   12.60
 18     6.45   18.22   12.16
 19     8.47   18.80   11.10
 20     5.51   15.31   11.99

Class 3 (k = 3):
  i     j=1     j=2     j=3
 21     5.86   18.39   13.87
 22     6.37   18.43   12.27
 23     6.14   15.51   13.11
 24     6.32   18.07   16.07
 25     8.30   18.60   14.32
 26    10.19   18.95   14.37
 27     5.36   18.66   11.20
 28     7.59   15.25   11.16
 29     8.87   15.87   12.07
 30     6.26   16.16   11.16
i.e. x̄1 = x̄2 = x̄3 = (7.126, 17.389, 12.960) and m1 = m2 = m3 = 10. Therefore, this dataset meets the conditions for improper solutions of LP models.1,33,34 As a result, these LP models cannot generate any meaningful discriminant functions for this dataset. Fisher's LDF also generated an improper solution for this dataset. With M = 10 and ε = 1 and with all elements of b1 set to 0, the discriminant functions b1 x = 0.00, b2 x = 125.45 − 28.68x1 + 14.91x2 − 14.58x3, and b3 x = −2641.33 − 49.71x1 + 139.11x2 + 35.68x3 are constructed with the MIP solution. The classification results of these observations are presented in Table 4. Among the m = 30 observations, 19 are correctly classified. This example shows that the MIP model may generate meaningful discriminant functions and distinguish the observations in the sample when the conditions for improper solutions in LP models are met.
Table 4. Classification results for Example 2.

From        Classified into                      Total
            Class 1    Class 2    Class 3
Class 1        8          1          1            10
Class 2        3          5          2            10
Class 3        3          1          6            10
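Once the discriminant functions have been estimated, classification itself is a simple argmax over the class scores. The short sketch below, an illustration only, applies the discriminant functions reported for Example 1 to two of its observations; the function name classify is chosen for the example.

```python
# Assign an observation to the class with the highest discriminant score, using
# the Example 1 functions b1'x = 0, b2'x = 9.5 - 2.09x1 - 1.09x2 and
# b3'x = -24.56 - 9.02x1 + 15.70x2 reported in Sec. 5.1.
B = [
    (0.00,    0.00,  0.00),   # class 1
    (9.50,   -2.09, -1.09),   # class 2
    (-24.56, -9.02, 15.70),   # class 3
]

def classify(x1, x2):
    scores = [b0 + b1 * x1 + b2 * x2 for (b0, b1, b2) in B]
    return scores.index(max(scores)) + 1

print(classify(2.5, 3.0))   # observation 4 -> class 2 (correctly classified)
print(classify(2.0, 2.0))   # observation 1 -> class 2 (misclassified, as noted in the text)
```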
6. Computational Results

Computational results for six publicly accessible datasets are reported in this section. The callable library of the software package CPLEXa was used to solve the MIP problems. CPLEX was allowed to reach optimality for the first three datasets because the classes of these datasets are completely or nearly completely separable and the MIP models are easy to solve. The classes of the last three datasets are far from completely separable and the MIP models take extensive computation time to solve. Hence, CPLEX was limited to examining 1000 nodes and the final solution is the best integer solution found. All computations were conducted on a SUN Enterprise 3000 computer running the UNIX operating system.

Both in-sample and validation classification results are reported for each dataset. The in-sample classification results are obtained with observation resubmission, i.e. all the observations in the sample are used to construct discriminant functions that are then used to classify each of the observations in the sample. The validation classification results are obtained using the leave-one-out (LOO) validation procedure for the first three datasets and the ten-fold validation procedure for the last three datasets. With the LOO validation procedure, one observation i ∈ I is left out in turn and the other m − 1 observations are used to construct discriminant functions that are then used to classify the observation left out. This process is repeated m times, once for each i ∈ I, and the reported results are a summary over the m observations left out in turn. With the ten-fold validation procedure, the observations in the dataset are randomly partitioned into 10 subsets with the observations in each class distributed into the subsets as evenly as possible. Each subset is left out in turn and the observations in the other subsets are used to construct discriminant functions that are then used to classify the observations in the subset left out. This process is repeated 10 times, once for each subset, and the classification results of the 10 subsets are added together. Both in-sample and ten-fold validation results of Fisher's LDF obtained in this study are also reported for the last three datasets. After the computational results are presented for each dataset, these results for all datasets are summarized.
a ILOG S.A. and ILOG, Inc., ILOG CPLEX 10.0, 2006.
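The two validation procedures described above are easy to script. The sketch below assumes that a fitting routine such as the one outlined after Sec. 3 and a scoring routine are available; the function names and the fold layout are illustrative only.

```python
# Sketches of the leave-one-out (LOO) and stratified ten-fold procedures (illustrative only).
import random

def leave_one_out(x, y, fit, classify):
    """Fit on m - 1 observations, classify the one left out, repeat for every observation."""
    correct = 0
    for i in range(len(x)):
        model = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        correct += int(classify(model, x[i]) == y[i])
    return correct / len(x)

def stratified_ten_folds(y, seed=0):
    """Partition observation indices into 10 subsets, spreading each class as evenly as possible."""
    rng = random.Random(seed)
    folds = [[] for _ in range(10)]
    for cls in sorted(set(y)):
        idx = [i for i, yi in enumerate(y) if yi == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % 10].append(i)
    return folds
```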
6.1. The wine recognition dataset

This dataset was originally used by Aeberhard et al.48 The data are the results of chemical analysis of wines grown in the same region in Italy but derived from p = 3 different cultivars. The chemical analysis determined the quantities of n = 13 constituents, i.e. prediction variables, found in each of the m = 178 wine samples, with m1 = 59, m2 = 71, and m3 = 48. The dataset itself, with more information, is available at the UCI Machine Learning Repository.49 The three classes in the sample are completely separable. The MIP model in (3)–(8) for this dataset took CPLEX about 0.15 s to solve. The classification results are presented in Table 5. In the validation results obtained with the LOO procedure, 171 or 96.07% of the m = 178 observations are correctly classified. This dataset has been used by other researchers to test different discriminant methods, e.g. by Bennett and Mangasarian.33 The results obtained with the MIP model are comparable to those published in the literature.

6.2. The Iris dataset

This dataset was originally from Fisher50 and was used as a standard test dataset by many authors, such as Kendall,51 Gehrlein,32 and Bennett and Mangasarian,33 and as a standard example in many commercial software packages, such as BMDP52 and NeuralWorks.b The dataset has n = 4 prediction variables measuring four plant characteristics and m = 150 observations evenly divided into p = 3 species of iris plants, Iris setosa, Iris versicolor, and Iris virginica, with mk = 50 in each. The p = 3 classes are nearly completely separable and the MIP model is not difficult to solve, taking CPLEX a little over 1 s. The classification results are given in Table 6. With observation resubmission, 149 or 99.33% of the m = 150 observations are correctly classified. With the LOO validation procedure, 144 or 96.00% of the m = 150 observations are correctly classified. These results are comparable to those obtained with other methods as published in the literature. With observation resubmission, both the single function and multiple function MIP models of Gehrlein32 classified 149 observations and
Table 5. Classification results for the wine recognition dataset.

From        Classified into (in-sample)          Classified into (validation)        Total
            Class 1   Class 2   Class 3          Class 1   Class 2   Class 3
Class 1        59         0         0               57         2         0            59
Class 2         0        71         0                1        68         2            71
Class 3         0         0        48                1         1        46            48

b NeuralWare, Using NeuralWorks: A Tutorial for NeuralWorks Professional II/Plus and NeuralWorks Explorer, NeuralWare, Inc., Pittsburgh, Pennsylvania, 1993.
Table 6. Classification results for the Iris dataset.

From               Classified into (in-sample)                    Classified into (validation)                  Total
                   I. setosa   I. versicolor   I. virginica       I. setosa   I. versicolor   I. virginica
Iris setosa            50            0               0                50            0               0             50
Iris versicolor         0           49               1                 0           48               2             50
Iris virginica          0            0              50                 0            4              46             50
Fisher's LDF52 classified 147 observations correctly to their corresponding classes for this dataset.

6.3. The MBA admission dataset

Johnson and Wichern3 provided an MBA admission dataset as an example of multiple-class discriminant analysis. The dataset has a total of m = 85 observations divided into p = 3 classes of applicants, with m1 = 31 in Admit, m2 = 28 in Not Admit, and m3 = 26 in Borderline, and n = 2 prediction variables, undergraduate GPA (x1) and GMAT score (x2). The details of the dataset are described, and the dataset itself is available, in the book.3 The observations in the p = 3 classes are nearly completely separable and the MIP model is easy to solve. It took CPLEX less than 0.1 s of CPU time. The classification results are summarized in Table 7. With observation resubmission, 84 or 98.82% of the m = 85 observations are correctly classified. With the LOO validation procedure, 80 or 94.12% of the m = 85 observations are correctly classified. This dataset has been used to test other MP models for discriminant analysis in different studies, such as that of Loucopoulos.38 The results obtained by the MIP model in (3)–(8) are in line with those published in the literature.

6.4. The personnel management dataset

This dataset was used by Rulon et al.53 throughout their text as an example to demonstrate different classification techniques. The dataset has m = 244 observations representing employees of an airline company in p = 3 classes, with m1 = 85 Passenger Agents, m2 = 93 Mechanics, and m3 = 66 Operations Control Agents, respectively.
Table 7. Classification results for the MBA admission dataset.

From          Classified into (in-sample)              Classified into (validation)            Total
              Admit   Not Admit   Borderline           Admit   Not Admit   Borderline
Admit           30         0           1                 30         0           1               31
Not Admit        0        28           0                  0        26           2               28
Borderline       0         0          26                  1         1          24               26
Table 8. Classification results of the personnel management dataset.

                 Classified into (in-sample)          Classified into (validation)
From             Class 1   Class 2   Class 3          Class 1   Class 2   Class 3         Total
MIP   Class 1       77         7         1               67        15         3            85
      Class 2       18        69         6               15        67        11            93
      Class 3        9        10        47                5        12        49            66
LDF   Class 1       65        16         4               66        15         4            85
      Class 2       16        65        12               16        63        14            93
      Class 3        9         9        48                8        11        47            66
The n = 3 variables are aggregate scores for outdoor activities (x1), convivial activities (x2), and conservative activities (x3) obtained from a survey measuring employee preferences. The p = 3 classes are far from completely separable and therefore the MIP model needs extensive computation time. CPLEX takes about 1 min of CPU time to examine 1000 nodes. The classification results are presented in Table 8. For in-sample classification, 193 or 79.10% of the m = 244 observations are correctly classified. These results are much better than those reported by Gochet et al.34 and Sun.1 Hence, the in-sample classification results obtained with the MIP approach can be used as benchmarks to measure the classification quality of other approaches. For the ten-fold validation, 183 or 75.00% of the m = 244 observations are correctly classified. Gochet et al.34 reported LOO validation results of their LP model, Fisher's LDF, and the k-nearest neighbor method for this dataset. Sun1 reported LOO validation results of his LP approach and Fisher's LDF for this dataset. The ten-fold validation results obtained in this study are comparable to or better than the LOO validation results reported in these studies. With Fisher's LDF, 178 or 72.95% of the m = 244 observations are correctly classified in-sample, and 176 or 72.13% are correctly classified in the ten-fold validation.
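As a small check on how the percentages quoted in this section follow from the confusion matrices, the snippet below sums the diagonal of the in-sample MIP matrix of Table 8; the variable names are chosen for the example.

```python
# Classification rate = correctly classified / total, computed from the in-sample
# MIP confusion matrix of Table 8 (rows: true class, columns: assigned class).
confusion = [
    [77,  7,  1],
    [18, 69,  6],
    [ 9, 10, 47],
]
correct = sum(confusion[r][r] for r in range(len(confusion)))
total = sum(sum(row) for row in confusion)
print(correct, total, f"{100 * correct / total:.2f}%")   # 193 244 79.10%
```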
6.5. The glass identification dataset

This dataset is from the UCI Machine Learning Repository.49 The purpose of the analysis is to identify the p = 6 different types of glass using the n = 9 physical characteristics and chemical contents as prediction variables. The dataset has m = 214 observations with m1 = 70, m2 = 76, m3 = 17, m4 = 13, m5 = 9, and m6 = 29. There is considerable overlap among the different classes in this dataset. The MIP model takes about 15 min of CPU time for CPLEX to examine 1000 nodes. The MIP and LDF results are given in Table 9. With the MIP approach, 179 or about 83.64% of the observations are correctly classified for the in-sample classification and 137 or 64.02% of the observations are correctly classified for the ten-fold validation. With Fisher's LDF, 130 or about 60.75% of the observations are correctly classified for the in-sample classification and 129 or 60.28% of the observations are
Table 9. Classification results of the glass identification dataset.

                    Classified into (in-sample)            Classified into (validation)
From           1    2    3    4    5    6             1    2    3    4    5    6        Total
MIP     1     55   11    4    0    0    0            50   11    6    0    3    0          70
        2     13   60    3    0    0    0            17   45    5    4    2    3          76
        3      0    3   14    0    0    0             3    5    9    0    0    0          17
        4      0    1    0   12    0    0             0    5    1    3    0    4          13
        5      0    0    0    0    9    0             0    0    0    1    6    2           9
        6      0    0    0    0    0   29             1    0    0    2    2   24          29
LDF     1     45   13   12    0    0    0            44   15   11    0    0    0          70
        2     12   37   18    6    3    0            16   40   11    6    3    0          76
        3      9    3    5    0    0    0             5    3    9    0    0    0          17
        4      0    2    0   10    0    1             0    5    1    6    0    1          13
        5      1    1    0    0    7    0             1    1    0    0    7    0           9
        6      0    0    1    2    0   26             0    1    1    2    1   24          29
correctly classified for the ten-fold validation. Both the Fisher LDF and the MIP results were obtained with linear discriminant functions. Better results may be obtained if nonlinear terms are also used.

6.6. The E-coli dataset

Contributed by Nakai,54–56 this dataset is available at the UCI Machine Learning Repository.49 The purpose of the analysis is to classify each of the observations into one of the p = 8 different protein localization sites using n = 7 prediction variables. The m = 336 observations in the dataset are not evenly distributed, with m1 = 143, m2 = 77, m3 = 2, m4 = 2, m5 = 35, m6 = 20, m7 = 5, and m8 = 52. Because of the overlap among the classes, the MIP models take CPLEX a long time to solve. CPLEX takes about an hour of CPU time to examine 1000 nodes. The MIP and LDF classification results are presented in Table 10. The MIP approach correctly classified 308 or 91.67% of the m = 336 observations for in-sample classification and 284 or 84.52% of the observations for the ten-fold validation. Fisher's LDF correctly classified 285 or about 84.82% of the observations for in-sample classification and 281 or 83.63% of the observations for the ten-fold validation. Horton and Nakai56 reported 84.2% in-sample and 81.1% ten-fold validation classification rates.

6.7. Summary of computational results

The computational results of the six datasets used in this study are summarized in Table 11. Both in-sample and validation classification results with the MIP approach and with another approach for each dataset are given in the table. Classification results are represented by the number and percentage of correctly classified observations in the dataset. For each approach, m′ and m″ in Table 11 represent the
Table 10. Classification results of the E-coli dataset.

                      Classified into (in-sample)                     Classified into (validation)
From         1    2    3    4    5    6    7    8          1    2    3    4    5    6    7    8       Total
MIP    1   139    1    0    0    0    0    0    3        138    1    0    1    0    0    0    3         143
       2     1   63    0    0   11    0    0    2          2   64    0    0   10    0    0    1          77
       3     0    0    2    0    0    0    0    0          1    0    0    0    1    0    0    0           2
       4     0    1    0    1    0    0    0    0          0    1    0    0    0    0    0    1           2
       5     0    4    0    0   31    0    0    0          1   11    2    0   20    0    0    1          35
       6     0    0    0    0    0   19    0    1          0    0    0    0    0   17    0    3          20
       7     0    0    0    0    0    0    5    0          1    0    0    0    0    1    2    1           5
       8     2    1    0    0    0    1    0   48          3    3    0    0    0    3    0   43          52
LDF    1   139    1    0    1    0    0    0    2        138    1    0    1    0    0    0    3         143
       2     2   53    0    7   14    0    1    0          2   54    0    7   13    0    1    0          77
       3     0    0    1    0    0    0    1    0          0    0    1    0    0    0    1    0           2
       4     0    0    0    1    1    0    0    0          0    0    0    0    1    0    0    1           2
       5     1    2    0    4   27    0    1    0          1    2    0    5   26    0    1    0          35
       6     0    0    0    0    0   18    1    1          0    0    0    1    0   17    1    1          20
       7     0    0    0    0    0    0    5    0          0    0    0    0    0    0    5    0           5
       8     3    1    0    4    0    3    0   41          3    1    0    3    0    3    2   40          52
Table 11. Summary of computational results.

                                         MIP                                  Other approaches
Dataset      p    n     m       m′       %        m″       %          m′       %        m″       %       Source
Wine         3   13    178     178    100.00     171     96.07       178    100.00     162     91.01       33
Iris         3    4    150     149     99.33     144     96.00       148     98.67     145     96.67       33
MBA          3    2     85      84     98.82      80     94.12        —        —        83     97.65       38
Personnel    3    3    244     193     79.10     183     75.00       185     75.82     179     73.36       34
Glass        6    9    214     179     83.64     137     64.02       130     60.75     130     60.75        —
E-coli       8    7    336     308     91.67     284     84.52       285     84.82     281     83.63        —
33 33 38 34
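As noted in the next paragraph, the validation columns of Table 11 were produced with leave-one-out (LOO) for the three smaller datasets and with ten-fold cross-validation for the three larger ones. A generic sketch of such a procedure is given below; it assumes a scikit-learn-style classifier with fit and predict methods, and the 200-observation cutoff used to switch between the two schemes is an illustrative assumption rather than a rule stated in the paper.

```python
# Illustrative validation wrapper: LOO for small samples, stratified ten-fold
# cross-validation otherwise.  X and y are NumPy arrays; clf is any classifier
# object exposing fit and predict.
import numpy as np
from sklearn.model_selection import LeaveOneOut, StratifiedKFold

def validated_correct_count(clf, X, y, small_sample_cutoff=200):
    """Return the number of held-out observations classified correctly."""
    if len(y) <= small_sample_cutoff:
        splitter = LeaveOneOut()
    else:
        splitter = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    correct = 0
    for train_idx, test_idx in splitter.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        correct += int(np.sum(clf.predict(X[test_idx]) == y[test_idx]))
    return correct

# Tiny usage example on synthetic three-class data (60 observations, so LOO is used).
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
rng = np.random.default_rng(0)
y_demo = np.repeat(np.arange(3), 20)
X_demo = rng.normal(size=(60, 3)) + y_demo[:, None]
print(validated_correct_count(LinearDiscriminantAnalysis(), X_demo, y_demo))
```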
For each approach, the two m columns in Table 11 give the number of correctly classified observations for the in-sample and the validation results, respectively. With the MIP approach, the validation results are obtained with LOO for the first three datasets and with the ten-fold validation procedure for the last three datasets. The other approach for each of the first four datasets is an MP approach given in the reference in the last column of Table 11, and the results are taken from the corresponding literature. These results give an indirect comparison of the performance of the MIP approach with that of other MP approaches. For the last two datasets, the other approach is Fisher's LDF; these results are obtained in this study, with the validation results produced by the ten-fold validation procedure. For each dataset, p, n, and m are also listed in the table.

7. Conclusions

An MIP model for multiple-class discriminant and classification analysis is proposed. The model directly minimizes the number of misclassifications in the sample.
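To make the structure of such a model concrete, the sketch below writes a misclassification-minimizing MIP with one linear discriminant function per class in the open-source PuLP modeling library. It is a plausible reading of the general idea rather than a reproduction of formulation (4); the coefficient bounds, the values of M and ε, the toy data, and the CBC solver are all illustrative assumptions (the computations in this paper were performed with CPLEX).

```python
# A rough sketch of a misclassification-minimizing MIP with one linear
# discriminant function f_k(x) = w_k . x + b_k per class.
import pulp

def build_mip(X, y, classes, M=100.0, eps=1.0):
    m, n = len(X), len(X[0])
    prob = pulp.LpProblem("multiclass_discriminant_mip", pulp.LpMinimize)

    # Discriminant function coefficients, bounded so the big-M stays valid (assumed bounds).
    w = {(k, j): pulp.LpVariable(f"w_{k}_{j}", -1, 1)
         for k in classes for j in range(n)}
    b = {k: pulp.LpVariable(f"b_{k}", -1, 1) for k in classes}

    # z_i = 1 if observation i is allowed to be misclassified.
    z = [pulp.LpVariable(f"z_{i}", cat="Binary") for i in range(m)]

    # Objective: minimize the number of misclassified observations.
    prob += pulp.lpSum(z)

    for i in range(m):
        c = y[i]
        own = pulp.lpSum(w[c, j] * X[i][j] for j in range(n)) + b[c]
        for k in classes:
            if k == c:
                continue
            other = pulp.lpSum(w[k, j] * X[i][j] for j in range(n)) + b[k]
            # The own-class score must exceed every other class score by eps
            # unless the big-M term is switched on by z_i = 1.
            prob += own - other >= eps - M * z[i]
    return prob, w, b, z

# Tiny separable example with three classes and two prediction variables.
X = [[0.1, 0.2], [0.2, 0.1], [2.0, 2.1], [2.1, 1.9], [4.0, 0.1], [3.9, 0.2]]
y = [1, 1, 2, 2, 3, 3]
prob, w, b, z = build_mip(X, y, classes=[1, 2, 3])
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("misclassified in sample:", sum(int(v.value()) for v in z))
```

On separable data such as this toy example the optimal objective value is zero; when the classes overlap heavily, as in the glass and E-coli datasets, the binary variables make the branch-and-bound tree large, which is consistent with the CPU times reported above.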
The formulation is simple, easy to understand, and easy to use. Properties of the model are discussed. The model does not suffer from any of the difficulties caused by pathologies of some of the earlier MP models for two-class classification analysis. With this MIP approach, practitioners have one more tool for their discriminant problems.

In general, LP models are preferred to MIP models because LP models are much easier to solve and the resulting discriminant functions may have better generalization capabilities. Under some conditions, however, the MIP approach may be the better alternative. For example, the MIP approach may be preferred to the LP approach if the purpose of the application is discrimination rather than classification. Although MIP models are generally much more difficult to solve, they can be solved within reasonable computation time under certain conditions, for example, when the sample sizes are small or when the different classes in the sample are completely or nearly completely separable. Under these conditions, the MIP model may not take much more CPU time than an LP model. When the MIP model is hard to solve, it may not be necessary to solve it to optimality; a good approximate solution is usually sufficient.

One direction of future research is to determine, through computational experiments, the effect of the relative values of M and ε in (4) on the computational complexity of the MIP model. Another direction is to address the issue of variable selection when observations on a large number of variables are available. One more set of binary variables will be involved in the variable selection model and, therefore, the MIP model will demand more computation time. A further direction is to develop heuristic methods to solve the MIP models, possibly with variable selection capability, especially for applications with large datasets. With effective heuristics, the disadvantage of demanding too much computation time can be at least partially overcome. One more direction is software implementation. The MP approaches will be much easier for practitioners to use if user-friendly software is available, possibly with variable selection features and with heuristic procedures to solve the MIP models.

Acknowledgments

This research was supported in part by a 2008–2009 grant from the College of Business at The University of Texas at San Antonio. This work benefited significantly from the comments made by the four anonymous reviewers over three revisions. The author greatly appreciates the time and effort of these reviewers in reviewing and commenting on this paper.

References

1. M. Sun, Linear programming approaches for multiple-class discriminant and classification analysis, International Journal of Strategic Decision Sciences 1(1) (2010) 57–80.
2. O. Asparouhov and P. A. Rubin, Oscillation heuristics for the two-group classification problem, Journal of Classification 21(2) (2004) 255–277.
3. R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 2nd edn. (Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1988).
4. N. Freed and F. Glover, A linear programming approach to the discriminant problem, Decision Sciences 12(1) (1981) 68–74.
5. D. J. Hand, Discrimination and Classification (Wiley and Sons, New York, 1981).
6. T. P. Cavalier, J. P. Ignizio and A. L. Soyster, Discriminant analysis via mathematical programming: Certain problems and their causes, Computers & Operations Research 16(4) (1989) 353–362.
7. N. Freed and F. Glover, Resolving certain difficulties and improving the classification power of LP discriminant analysis formulations, Decision Sciences 17(4) (1986) 589–595.
8. F. Glover, Improved linear programming models for discriminant analysis, Decision Sciences 21(4) (1990) 771–785.
9. F. Glover, S. Keene and B. Duea, A new class of models for the discriminant problem, Decision Sciences 19(2) (1988) 269–280.
10. G. J. Koehler, Characterizations of unacceptable solutions in LP discriminant analysis, Decision Sciences 20(2) (1989) 239–257.
11. G. J. Koehler, Unacceptable solutions and the hybrid discriminant model, Decision Sciences 20(4) (1989) 844–848.
12. G. J. Koehler, Considerations for mathematical programming models in discriminant analysis, Managerial and Decision Economics 11 (1990) 227–234.
13. G. J. Koehler, Improper linear discriminant classifiers, European Journal of Operational Research 50 (1991) 188–198.
14. E. P. Markowski and C. A. Markowski, Some difficulties and improvements in applying linear programming formulations to the discriminant problem, Decision Sciences 16(3) (1985) 237–247.
15. S. S. Erenguc and G. J. Koehler, Survey of mathematical programming models and experimental results for linear discriminant analysis, Managerial and Decision Economics 11 (1990) 215–225.
16. A. Stam, Nontraditional approaches to statistical classification: Some perspectives on Lp-norm methods, Annals of Operations Research 74 (1997) 1–36.
17. A. Stam and E. A. Joachimsthaler, A comparison of a robust mixed-integer approach to existing methods for establishing classification rules for the discriminant problem, European Journal of Operational Research 46(1) (1990) 113–122.
18. N. Freed and F. Glover, Evaluating alternative linear programming models to solve the two-group discriminant problem, Decision Sciences 17(2) (1986) 151–162.
19. R. Nath, W. M. Jackson and T. W. Jones, A comparison of the classical and the linear programming approaches to the classification problem in discriminant analysis, Journal of Statistical Computation and Simulation 41 (1992) 73–93.
20. E. A. Joachimsthaler and A. Stam, Four approaches to the classification problem in discriminant analysis: An experimental study, Decision Sciences 19(2) (1988) 322–333.
21. S. M. Bajgier and A. V. Hill, An experimental comparison of statistical and linear programming approaches to the discriminant problem, Decision Sciences 13 (1982) 604–618.
22. C. A. Markowski and E. P. Markowski, An experimental comparison of several approaches to the discriminant problem with both qualitative and quantitative variables, European Journal of Operational Research 28 (1987) 74–87.
23. P. A. Rubin, Evaluating the maximize minimum distance formulation of the linear discriminant problem, European Journal of Operational Research 41 (1989) 240–248.
24. P. A. Rubin, A comparison of linear programming and parametric approaches to the two-class discriminant problem, Decision Sciences 21 (1990) 373–386.
25. C. T. Ragsdale and A. Stam, Mathematical programming formulations for the discriminant problem: An old dog does new tricks, Decision Sciences 22(2) (1991) 296–307.
26. R. Nath and T. W. Jones, A variable selection criterion in linear programming approaches to discriminant analysis, Decision Sciences 19(3) (1988) 554–563.
27. J. J. Glen, Integer programming methods for normalization and variable selection in mathematical programming discriminant analysis models, Journal of the Operational Research Society 50(10) (1999) 1043–1053.
28. J. J. Glen, Classification accuracy in discriminant analysis: A mixed integer programming approach, Journal of the Operational Research Society 52(3) (2001) 328–339.
29. M. Sun and M. Xiong, A mathematical programming approach for gene selection and tissue classification, Bioinformatics 19(10) (2003) 1243–1251.
30. E.-U. Choo and W. C. Wedley, Optimal criteria weights in repetitive multicriteria decision-making, Journal of the Operational Research Society 36 (1985) 983–992.
31. N. Freed and F. Glover, Simple but powerful goal programming formulations for the discriminant problem, European Journal of Operational Research 7(1) (1981) 44–60.
32. W. V. Gehrlein, General mathematical programming formulations for the statistical classification problem, Operations Research Letters 5(6) (1986) 299–304.
33. K. P. Bennett and O. L. Mangasarian, Multicategory discrimination via linear programming, Optimization Methods and Software 3 (1994) 27–39.
34. W. Gochet, A. Stam, V. Srinivasan and S. Chen, Multigroup discriminant analysis using linear programming, Operations Research 45(2) (1997) 213–225.
35. R. Pavur, Dimensionality representation of linear discriminant function space for the multiple-group problem: An MIP approach, Annals of Operations Research 74 (1997) 37–50.
36. R. Pavur and C. Loucopoulos, Examining optimal criterion weights in mixed integer programming approaches to the multiple-group classification problem, Journal of the Operational Research Society 46 (1995) 626–640.
37. G. Kou, Y. Peng, Z. Chen and Y. Shi, Multiple criteria mathematical programming for multi-class classification and application in network intrusion detection, Information Sciences 179(4) (2009) 371–381.
38. C. Loucopoulos, Three-group classification with unequal misclassification costs: A mathematical programming approach, Omega 29(3) (2001) 291–297.
39. C. Loucopoulos and R. Pavur, Computational characteristics of a new mathematical programming model for the three-group discriminant problem, Computers & Operations Research 24(2) (1997) 179–191.
40. R. Pavur and C. Loucopoulos, Evaluating the effect of gap size in a single function mathematical programming model for the three-group classification problem, Journal of the Operational Research Society 52(8) (2001) 896–904.
41. R. Östermark and R. Höglund, Addressing the multiclass discriminant problem using multivariate statistics and mathematical programming, European Journal of Operational Research 108(1) (1998) 224–237.
42. J. M. Wilson, Integer programming formulation of statistical classification problems, Omega 24(6) (1996) 681–688.
43. D. Bertsimas and R. Shioda, Classification and regression via integer optimization, Operations Research 55(2) (2007) 252–271.
44. F. Üney and M. Türkay, A mixed-integer programming approach to multi-class data classification problem, European Journal of Operational Research 173(3) (2006) 910–920.
45. E. A. Joachimsthaler and A. Stam, Mathematical programming approaches for the classification problem in two-group discriminant analysis, Multivariate Behavioral Research 25 (1990) 427–454.
46. C. Chen and O. L. Mangasarian, Hybrid misclassification minimization, Advances in Computational Mathematics 5(2) (1996) 127–136.
47. A. Stam and C. T. Ragsdale, On the classification gap in MP-based approaches to the discriminant problem, Naval Research Logistics 39 (1992) 545–559.
48. S. Aeberhard, D. Coomans and O. de Vel, Comparison of classifiers in high dimensional settings, Technical Report No. 92-02, Department of Computer Science and Department of Mathematics and Statistics, James Cook University of North Queensland (1992).
49. A. Asuncion and D. J. Newman, UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html, School of Information and Computer Science, University of California, Irvine, CA (2007).
50. R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188.
51. M. G. Kendall, Discrimination and classification, in Multivariate Analysis, ed. P. R. Krishnaiah (Academic Press, New York, 1966), pp. 165–185.
52. R. Jennrich and P. Sampson, Stepwise discriminant analysis, in BMDP Statistical Software, ed. W. J. Dixon (University of California Press, Berkeley, CA, 1983), pp. 519–525.
53. P. J. Rulon, D. V. Tiedeman, M. M. Tatsuoka and C. R. Langmuir, Multivariate Statistics for Personnel Classification (Wiley & Sons, New York, 1967).
54. K. Nakai and M. Kanehisa, Expert system for predicting protein localization sites in Gram-negative bacteria, PROTEINS: Structure, Function, and Genetics 11(2) (1991) 95–110.
55. K. Nakai and M. Kanehisa, A knowledge base for predicting protein localization sites in eukaryotic cells, Genomics 14(4) (1992) 897–911.
56. P. Horton and K. Nakai, A probabilistic classification system for predicting the cellular localization sites of proteins, Proceedings of the Fourth International Conference on Intelligent Systems for Molecular Biology (St. Louis, USA, 1996), pp. 109–115.