Expert Systems with Applications xx (2005) 1–9 www.elsevier.com/locate/eswa
Constructing response model using ensemble based on feature subset selection

Enzhe Yu¹, Sungzoon Cho*

Department of Industrial Engineering, Seoul National University, San 56-1 Shilimdong Kwanakgu, 151-744 Seoul, South Korea
Abstract

In building a response model, determining the inputs to the model has been an important issue because of the complexities of the marketing problem and the limitations of mental models for decision-making. It is common that customers' historical purchase data contain many irrelevant or redundant features, which result in poor model performance. Furthermore, single complex models based on feature subset selection may not always report improved performance, largely because of overfitting and instability. An ensemble is a widely adopted mechanism for alleviating such problems. In this paper, we propose an ensemble creation method based on a GA-based wrapper feature subset selection mechanism. Through experimental studies on the DMEF4 data set, we found that the proposed method has at least two distinct advantages over other models: first, the ability to account for the important inputs to the response model; second, improved prediction accuracy and stability.
© 2005 Elsevier Ltd. All rights reserved.

Keywords: Response model; Ensemble; Feature selection; SVM
1. Introduction

Response modeling is concerned with computing the likelihood that a customer will respond to a marketing campaign. The lack of mental models for decision making makes theoretical models unattainable. Statistical models are therefore developed from a historical data set, which contains customers' past purchasing behavior and demographic information. Marketers can then use the model to compute likelihood values or scores in order to reduce the number of customers they target in the actual campaign. As long as the number of responding customers remains reasonably similar, the effectiveness of the campaign is raised. Conventional statistical methods include logistic regression and multi-normal regression. However, the complexities of human behavior leave these inherently linear models with much to be desired. Recently, nonlinear and complex machine learning approaches have been proposed, such as neural networks
* Corresponding author. Tel.: +82 2 880 6275; fax: +82 2 889 8560. E-mail address: [email protected] (S. Cho).
¹ Samsung Data Systems Inc., Seoul, South Korea, [email protected].
0957-4174/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2005.07.026
(Bentz & Merunkay, 2000; Bounds & Ross, 1997; Ha, Cho, & MacLachlan, 2005; Moutinho, Curry, Davies, & Rita, 1994; Kim & Street, 2004; Zahavi & Levin, 1997a), decision trees (Haughton & Oulabi, 1997) and support vector machines (Cheung, Kwok, Law, & Tsui, 2003). They are known to be more flexible, so that almost any degree of complexity can be learned or handled. Experimental results support that these novel methods outperform conventional methods in terms of prediction accuracy and lift.

There are two issues that significantly affect the performance of complex machine learning models: the issue of feature subset selection, or determination of the predictor input variables, and the 'bias-variance' dilemma, which is concerned with determining adequate model complexity. Let us briefly review these issues. The first issue results from the fact that a customer-related data set usually contains hundreds of features or variables, many of which are irrelevant and heavily correlated with others. Without prescreening or feature selection, they tend to deteriorate the performance of the model as well as increase the model training time. Feature subset selection can be formulated as an optimization problem which involves searching the space of possible feature subsets to identify one that is optimal or near-optimal with respect to performance measures such as accuracy. Various ways to perform feature subset selection exist (Yang & Honavar, 1998). They can be classified into two categories based on their relation to the base learner: filter and wrapper.
The filter approach usually chooses features independently of the base learner, by human experts or by statistical methods such as principal component analysis (PCA). It is generally computationally more efficient than a wrapper approach, but its major drawback is that an optimal selection of features may not be independent of the inductive and representational biases of the learning algorithm that is used to construct the classifier. The wrapper approach, on the other hand, evaluates candidate feature subsets by training a base learner such as a neural network or support vector machine on a given training data set using each feature subset under consideration, and then testing it against a separate validation data set. Since the number of feature subsets is usually very large, this scheme is feasible only if training is relatively fast (Yang & Honavar, 1998). Furthermore, since the wrapper approach usually involves search, choosing a proper way of searching is also an important issue. A simple exhaustive search is computationally infeasible in practice; therefore, a suboptimal yet practical search method is desired. The well-known stepwise regression is a wrapper approach based on a greedy search: it is computationally efficient yet achieves a suboptimal solution. Since more computing power is available now, randomized or probabilistic search such as the genetic algorithm (GA) is used more often. It is computationally more involved, yet promises a solution close to the optimum.

As for the second issue, the bias-variance dilemma, we first review the notions of bias and variance and then discuss ways to overcome the associated problem. The variance represents the extent to which the output of a classifier is sensitive to the data on which it was trained, i.e. the extent to which the same set of results would be obtained if a different set of training data were used (Geman, Bienenstock, & Doursat, 1992). The bias, on the other hand, reflects the ability of the learner to generalize correctly to a test set once trained, as an average over the set of possible training sets. The dilemma comes from the fact that a decrease in bias leads to an increase in variance, and vice versa. A complex neural network, for instance, has a low bias but a high variance. The balancing act in training classifiers is that of achieving a compromise between the conflicting requirements of a small variance and a small bias. However, it is generally accepted that the effect of combining a set of different classifiers is to reduce the variance, or dependence on a particular data set, without affecting the bias (Bishop, 1995). Thus, the best results could be obtained from an ensemble of classifiers each of which exhibits high variance (Bishop, 1995; Hashem, 1997; Krogh & Vedelsby, 1995; Perrone & Cooper, 1993; Sharkey, 1996).

As we can see from the descriptions above, most literature treats the two issues separately. The model building procedure usually deals with feature selection first, and then with ensemble creation (Ho, 1998; Sharkey, Sharkey, & Chandroth, 1995). However, separation of the issues is not desirable, since a feature subset selected may be optimal for a single classifier but not for an ensemble.
In fact, an ensemble may be better off if each member classifier employs a different feature subset. Since they are interrelated, these two issues need to be addressed at the same time. In this paper, we propose a framework where these two issues are addressed together, with the goal of obtaining a globally optimal solution rather than a combination of two locally optimal solutions. In particular, we propose an ensemble of classifiers trained with different subsets of features chosen by a GA-based wrapper feature selection approach.

This paper is structured as follows. In Section 2, the GA-based wrapper approach for feature subset selection is introduced, followed by a description of the proposed FSE model. Then, the experimental setting and results are presented, respectively. A summary and a discussion of future work conclude this paper.
2. The proposed method

In a GA-based wrapper approach (Yang & Honavar, 1998), each feature is encoded as a gene and a subset of features as a chromosome. Presence or absence of a feature corresponds to a 1 or 0 value, respectively, in the chromosome. Each chromosome corresponds to a base learner that is trained with only those features present in the chromosome. Initially, a population of chromosomes is randomly created. The validation accuracy of the trained base learner is taken as the fitness of the corresponding chromosome. The selection operation screens out those chromosomes that have a low fitness value. A next generation of chromosomes is created from the selected chromosomes through crossover and mutation. These operations are designed so that the next generation is 'fitter.' With this process repeated, the population of chromosomes evolves with the expectation that fitter chromosomes emerge and survive. In most practical situations, chromosomes do improve and become fitter. In the end, only the fittest ones survive and they qualify as 'good' feature subsets.

The base learner can be any classifier with reasonably good generalization performance and quick training. We found that the Support Vector Machine (SVM) suits this requirement very well (Vapnik, 1995). In one of our previous studies, an SVM achieved performance comparable to that of a four-layer multilayer perceptron neural network while its learning time was three orders of magnitude shorter (Yu & Cho, 2004). The GA wrapper approach with an SVM as the base learner is depicted in Fig. 1.

Previous research on ensembles is either based on differences among training sets, such as bagging (Breiman, 1996) and boosting (Schapire, 1990), or on differences among classifiers (Yao & Liu, 1998). The ensemble creation method proposed here is based on differences among feature subsets, which correspond to different chromosomes.
Fig. 1. GA-SVM wrapper approach for feature subset selection (initial population → pool of candidate feature subsets → evaluation by SVM → selection → crossover and mutation → new pool of candidate feature subsets → best solution).
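To make the loop of Fig. 1 concrete, the sketch below shows one way such a GA-SVM wrapper could be coded. It is only an illustration under stated assumptions, not the authors' implementation: the population size, number of generations, the truncation-style selection, and the scikit-learn SVC settings are placeholders chosen for brevity.

```python
# Minimal GA wrapper sketch (assumed implementation, not the authors' code).
# Each chromosome is a 0/1 mask over features; fitness is the validation
# accuracy of an SVM trained only on the selected features.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(mask, X_tr, y_tr, X_val, y_val):
    if mask.sum() == 0:                       # an empty subset is useless
        return 0.0
    cols = mask.astype(bool)
    clf = SVC(kernel="rbf", gamma=0.5, C=20)  # parameter values taken from Section 3.2
    clf.fit(X_tr[:, cols], y_tr)
    return clf.score(X_val[:, cols], y_val)   # validation accuracy

def ga_wrapper(X_tr, y_tr, X_val, y_val, pop_size=30, generations=20,
               cx_rate=0.6, mut_rate=0.05):
    n_feat = X_tr.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))
    history = []                              # keep every generation for early stopping
    for _ in range(generations):
        fits = np.array([fitness(ind, X_tr, y_tr, X_val, y_val) for ind in pop])
        history.append((pop.copy(), fits))
        # selection: keep the better half (a simple stand-in for fitness-based selection)
        keep = pop[np.argsort(fits)[::-1][: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(keep):
            p1, p2 = keep[rng.integers(len(keep), size=2)]
            child = p1.copy()
            if rng.random() < cx_rate:        # single-point crossover
                cut = rng.integers(1, n_feat)
                child[cut:] = p2[cut:]
            flip = rng.random(n_feat) < mut_rate   # bit-flip mutation
            child[flip] ^= 1
            children.append(child)
        pop = np.vstack([keep, children])
    return history
```

Recording every generation's population and fitness values makes it possible to apply the early-stopping idea of the FSE procedure described next, by reading the population off at a chosen fraction of the run.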
Since the GA wrapper approach involves generating a population of 'good' feature subsets, it is quite natural to base our ensemble member selection process on it. There is one caveat, however: one needs to balance accuracy and diversity. An ensemble is effective only when the member classifiers are both accurate and diverse (Perrone & Cooper, 1993). If one selects only those classifiers with a high accuracy, the ensemble may not have sufficient diversity. If, on the other hand, one focuses on the diversity of classifiers, or of their corresponding feature subsets, the individual members' accuracy may not be high. Thus, it is critical to balance these two contradictory objectives in a structured way. Our approach is based on the observation that in an early stage of genetic evolution the chromosomes tend to be more diverse but less accurate, while in a later stage the surviving chromosomes tend to be more accurate but less diverse. Thus, we propose to stop the genetic evolution process early, so that diversity is still kept while accuracy is acceptable. Then, a 'diverse' subset of chromosomes is chosen from the finalists. The proposed Feature Selection Ensemble (FSE) procedure is described in Fig. 2.

The proposed FSE procedure consists of two phases, a GA phase and a classifier selection phase. First, in the GA phase, a typical GA wrapper is run to find T, the generation at which the population converges. Then, those chromosomes that emerge at (1 − ε) × 100% of T are identified for phase 2. As mentioned earlier, if the GA process is allowed to run to convergence, very accurate but similar classifiers will result, thus lowering diversity.
Fig. 2. Proposed feature selection ensemble (FSE) procedure.
They are considered to be reasonably accurate and fairly diverse. Second, in the classifier selection phase, we select a subset of classifiers from phase 1 for the ensemble. Since they are already reasonably accurate, we focus here on diversity. The problem of choosing the most diverse subset of classifiers based on their validation predictions can be transformed into a set packing problem, which is known to be NP-complete (Karp, 1972). Thus, we employed a greedy heuristic approach, where the pair of classifiers with the largest difference is selected first, then the pair with the second largest difference, and so on. The difference between two classifiers is estimated by the Hamming distance between the two respective validation output vectors. The first step in phase 2 identifies k groups of SVM classifiers, each of which employs the same set of features. Note that even if the same set of features were used, different classifiers could result in general, depending on the particular SVM parameters, such as the kernel width γ and the cost C, used during training.
In the second step, the SVM classifier f_i with the smallest validation error in each group is chosen, resulting in a total of k (≤ n) classifiers. Then, for each classifier f_i, the validation output vector f_i = (f_i(1), f_i(2), ..., f_i(N)) is computed, where f_i(k) is the output of classifier i for the kth validation pattern. Based on these vectors, the Hamming distance HD(i, j) between every pair of validation output vectors f_i and f_j is computed (there are a total of k(k − 1)/2 such distances). Finally, at step 5, those classifiers involved in the largest HD are selected, then those involved in the second largest HD, and so on, until the pre-determined number of classifiers is obtained. If HD(3, 7) were the largest, for instance, classifiers 3 and 7 would be selected. Alternatively, the top x% of the HDs could be considered and all the classifiers involved selected; in that case, the exact number of classifiers cannot be determined a priori.

The proposed FSE procedure employs three kinds of parameters, which must be set and tuned by the user. First is the set of parameters related to the GA, such as the number of generations and the crossover and mutation rates. In particular, the GA is usually allowed to run until the change in population fitness becomes very small; here, we set the number of generations to 100. Second is the set of parameters related to the base learner, the SVM. For Gaussian kernels, the width parameter γ and the cost parameter C need to be set. They could have been put into the chromosome and 'optimal' values obtained through the GA search, but, to limit the computational time, their ranges were set instead and the values tuned within those ranges in a randomized fashion. The third kind involves the parameters particular to our proposed approach: the number of ensemble members and the 'early stop' parameter ε. We set the ensemble size to 10 in this study. Parameter ε controls the level of accuracy and diversity of the population. If ε is too large, the chromosomes will show a high level of diversity but low quality; on the contrary, if ε is too small, the chromosomes will have high quality but a low level of diversity. Setting ε requires empirical justification through a priori studies of the problem complexity against GA convergence. The values of ε were set between 10 and 20%, that is, the final chromosomes were 80–90% mature.
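The classifier selection phase described above can be sketched as follows; this is one straightforward reading of the greedy Hamming-distance procedure, assuming the candidate classifiers' 0/1 validation outputs are already available, and it is not the authors' exact code.

```python
# Greedy diversity-based selection sketch (an illustrative reading of phase 2,
# not the authors' exact implementation).
import numpy as np
from itertools import combinations

def select_ensemble(val_outputs, ensemble_size=10):
    """val_outputs: (k, N) array of 0/1 validation predictions, one row per classifier."""
    k = val_outputs.shape[0]
    # Hamming distance between every pair of validation output vectors: k(k-1)/2 values,
    # sorted from most different pair to least different.
    pairs = sorted(combinations(range(k), 2),
                   key=lambda ij: np.sum(val_outputs[ij[0]] != val_outputs[ij[1]]),
                   reverse=True)
    chosen = []
    for i, j in pairs:                      # take the most different pairs first
        for c in (i, j):
            if c not in chosen:
                chosen.append(c)
        if len(chosen) >= ensemble_size:
            break
    return chosen[:ensemble_size]

def ensemble_predict(member_preds):
    """Majority voting over the selected members (ties broken toward the positive class)."""
    votes = np.asarray(member_preds)        # (m, N) array of 0/1 predictions
    return (votes.mean(axis=0) >= 0.5).astype(int)
```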
3. Modeling details

3.1. Data set

In this study, we used the DMEF4 data set from the Direct Marketing Educational Foundation (http://www.the-dma.org/dmef/dmefdset.shtml), which contains data from an upscale gift business that mails general and specialized catalogs to its customer base several times each year. There are two time periods, a 'base time period' from December 1971 through June 1992, and a 'latter time period' from September 1992 through December 1992. Every customer in the latter time period received at least one catalog in early autumn of 1992.
The original problem was to model how much each customer would spend during the Fall 1992 season, i.e. from October 1992 through December 1992, based on purchase history information. Often in direct marketing, however, one only cares about whether or not a customer will respond to an offer. Thus, we converted the regression task into a classification task in which customers were categorized as respondents (class 1) or non-respondents (class 0) depending on whether or not they made orders during the latter time period. In DMEF4, there are 101,532 customers in the sample and 91 candidate predictor variables. The response rate was 9.4%. For practical reasons such as computational cost and the imbalanced data distribution, only a subset of DMEF4 customers was selected and used for the experiments, based on the 'weighted dollar amount spent' as defined in Ha, Cho, and MacLachlan (2005):

Weighted Dollar = 0.4 × Dollars This Year + 0.3 × Dollars Last Year + 0.1 × Dollars Two Years Ago + 0.1 × Dollars Three Years Ago + 0.1 × Dollars Four Years Ago

The customers with Weighted Dollar in the top 20% were extracted for this study. The reduced data set has 20,306 customers with an 18.9% response rate. Although the degree of data imbalance improved from 9.4 to 18.9%, the data still show a high level of skewness. Most classifiers, such as neural networks and SVMs, do not perform well on this kind of data set. Several remedies have been proposed in previous research (Ling & Li, 1998): resampling the small class at random so that it contains as many samples as the other class (oversampling), eliminating patterns of the larger class so that it matches the size of the other class (undersampling), or converting the problem into a one-class classification, i.e. novelty detection, problem by ignoring one of the two classes and building a classifier only for the other class (Baek & Cho, 2003). In this study, we defined the problem as a two-class classification problem and used the undersampling approach, so that the training data contain an equal number of respondents and non-respondents.
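For illustration, a minimal sketch of this data preparation is given below; the pandas column names are hypothetical placeholders, since the actual DMEF4 field names are not listed here.

```python
# Data preparation sketch; column names are hypothetical stand-ins for DMEF4 fields.
import pandas as pd

def prepare(df, label_col="respondent"):
    weighted = (0.4 * df["dollars_this_year"]
                + 0.3 * df["dollars_last_year"]
                + 0.1 * df["dollars_2_years_ago"]
                + 0.1 * df["dollars_3_years_ago"]
                + 0.1 * df["dollars_4_years_ago"])
    top = df[weighted >= weighted.quantile(0.80)]   # keep the top 20% by weighted dollar

    # Undersample the majority (non-respondent) class to match the minority class size.
    pos = top[top[label_col] == 1]
    neg = top[top[label_col] == 0].sample(n=len(pos), random_state=0)
    return pd.concat([pos, neg]).sample(frac=1, random_state=0)  # shuffle rows
```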
3.2. The response models and model settings

In addition to the proposed FSE, we built six other models for comparison. These models are:

† MLP-FF: Single MLP with Full Feature Set,
† SVM-FF: Single SVM with Full Feature Set,
† EMLP: Ensemble MLP with Full Feature Set,
† FSVM-PCA: SVM with feature subsets selected by PCA,
† FSVM-DT: SVM with feature subsets selected by DT,
† SVM-FS: SVM with Feature Selection (by GA wrapper),
† FSE: SVM Ensemble based on Feature Selection.

In the MLP-FF model, a single MLP was trained with all 90 features. The number of hidden nodes was set to 50, and 'tansig' and 'purelin' were employed as the activation functions for the hidden layer and the output layer, respectively. The learning rate and the number of epochs were set to 0.05 and 1000, respectively. For the SVM-FF model, a Gaussian kernel was used, with the Gamma and Cost values set to 0.5 and 20, respectively. The ensemble of 10 MLPs, EMLP, used the full feature set and returned the final result by majority voting. In order to investigate the performance of filter feature selection, FSVM-PCA and FSVM-DT were built, which are SVM models trained with feature subsets selected by PCA and a decision tree, respectively. Finally, SVM-FS, an SVM trained with the feature subset identified by the GA wrapper, was built. The population size, number of generations, crossover rate and mutation rate were set to 100, 150, 0.6 and 0.05, respectively. For the SVM base learner, a Gaussian kernel was employed with the Gamma value set to 0.5 and the Cost value set randomly to 20 × (1 + a), where a ranges from 0 to 1. The reason for using a random mechanism for the cost value is two-fold. First, each wrapper iteration cycle may return a different feature subset. Second, a fixed cost value is not practically reasonable, since a change of feature subsets may require different parameter values to perform well. For the FSE model, the parameters of the GA and the SVM were similar to those in the wrapper feature selection procedure. In Phase 1, the early stop parameter ε was set to 0.1. In Phase 2, the number of ensemble members was set to 10. The fitness function used in the GA combined three different criteria, namely validation accuracy (Acc), learning time (LrnT), and dimension reduction ratio (DimRat), as follows:

Fitness(x) = 10 Acc(x) + 1/DimRat(x) + 1/(100 LrnT(x))
In addition to the usual accuracy, we also included the dimension reduction ratio and the training time in the fitness function, since, when models show comparable results, the one with a smaller dimension and a smaller training time is less likely to include irrelevant or redundant features and is thus preferred. Proper tradeoff values among the multiple objectives may be based on knowledge of the problem domain or on experimental results.
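Under this reading of the fitness formula, a sketch of the fitness evaluation might look as follows; treating Acc as validation accuracy in [0, 1], DimRat as the fraction of features retained, and LrnT as the training time in seconds are assumptions, and the exact scaling used in the paper may differ.

```python
# Fitness combining accuracy, dimension reduction ratio and learning time
# (a sketch of the reconstructed formula; the paper's exact scaling may differ).
import time
from sklearn.svm import SVC

def fse_fitness(mask, X_tr, y_tr, X_val, y_val):
    cols = mask.astype(bool)
    if cols.sum() == 0:
        return 0.0
    dim_rat = cols.sum() / len(mask)          # fraction of features kept
    clf = SVC(kernel="rbf", gamma=0.5, C=20)
    start = time.perf_counter()
    clf.fit(X_tr[:, cols], y_tr)
    lrn_t = time.perf_counter() - start       # learning time in seconds
    acc = clf.score(X_val[:, cols], y_val)    # validation accuracy
    return 10 * acc + 1 / dim_rat + 1 / (100 * lrn_t)
```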
Table 1
The confusion matrix

                 Predicted
True             Negative    Positive
Negative         A           B
Positive         C           D
3.3. Performance measurements

The most basic performance measures can be described by the confusion matrix in Table 1. Other basic measures are summarized in Table 2.

3.3.1. Balanced correct classification rate (BCR)

The accuracy AC is not an adequate performance measure when the data distribution is unbalanced. Given a data set of 950 negative cases and 50 positive cases, the accuracy will be 95% even if every case is blindly classified as negative. Instead, both precision P and recall TP need to be employed. Some researchers proposed combining precision and recall into a more elaborate function called the F-measure (Lewis & Catlett, 1994). Other researchers (Kubat, Holte, & Matwin, 1997) used the geometric mean sqrt(P × TP) as a performance measure for problems with imbalanced data sets, since it is an appropriate measure of central tendency. We propose yet another measure, the 'balanced correct classification rate (BCR)', which incorporates TP and TN in the following way:

BCR = sqrt(TP × TN)    (1)

For example, a high TN combined with a low TP will result in a low BCR. Furthermore, the measure becomes 0 if all positive cases are classified incorrectly.

3.3.2. ROC graph

Another popular measure is the ROC graph, which plots the true positive rate (TP) against the false positive rate (FP) (see Fig. 3). Point (0, 1) in the figure corresponds to perfect classification. Classifiers such as neural networks involve adjusting a threshold that can increase TP at the cost of an increased FP, or decrease FP at the cost of a decrease in TP. In such cases, a set of (FP, TP) pairs can be obtained and used to constitute a curve. Examples are the curves labeled C1 and C2 in Fig. 3.

Table 2
Basic measures

Measure                                  Formula
Accuracy (AC)                            (A + D)/(A + B + C + D)
True positive (TP) rate or recall        D/(C + D)
False positive (FP) rate                 B/(A + B)
True negative (TN) rate                  A/(A + B)
False negative (FN) rate                 C/(C + D)
Precision (P)                            D/(B + D)
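A short sketch that transcribes Tables 1 and 2 and Eq. (1) into code follows; it is given only to make the definitions concrete.

```python
# Basic measures from the confusion matrix of Table 1:
# A = true negatives, B = false positives, C = false negatives, D = true positives.
from math import sqrt

def basic_measures(A, B, C, D):
    return {
        "accuracy":  (A + D) / (A + B + C + D),
        "tp_rate":   D / (C + D),                      # recall
        "fp_rate":   B / (A + B),
        "tn_rate":   A / (A + B),
        "fn_rate":   C / (C + D),
        "precision": D / (B + D) if (B + D) else 0.0,  # guard the degenerate case
    }

def bcr(A, B, C, D):
    m = basic_measures(A, B, C, D)
    return sqrt(m["tp_rate"] * m["tn_rate"])           # Eq. (1)

# The 950/50 example from the text: classifying everything as negative
# gives 95% accuracy but a BCR of 0.
m = basic_measures(A=950, B=0, C=50, D=0)
print(m["accuracy"], bcr(950, 0, 50, 0))               # 0.95  0.0
```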
Fig. 3. ROC curves and points.
Some classifiers, such as support vector machines, do not have such a threshold. In such cases, performance is denoted by a single (FP, TP) point in the figure (see points M1, M2, and M3). Compared to other measures, an ROC curve has the following characteristics. An ROC curve or point is independent of class distribution or error costs. An ROC graph encapsulates all information contained in the confusion matrix, since FN is the complement of TP and TN is the complement of FP. ROC curves provide a visual tool for examining the tradeoff between the ability of a classifier to correctly identify positive cases and the number of negative cases that are incorrectly classified.

3.3.3. Degree of perfection

The distance to the perfect classification point (0, 1) may be used as a measure, or degree, of perfection. Since the distance between the perfect and the worst point is sqrt(2), we can define the degree of perfection based on the Euclidean distance as follows:

DeP = (sqrt(2) − DisP)/sqrt(2)    (2)

where

DisP = sqrt((w(1 − TP))^2 + ((1 − w)FP)^2)    (3)

and w is a factor, ranging from 0 to 1, that is used to assign relative importance to false positives and false negatives. The value DeP of Eq. (2) ranges from 0 to 1, and the larger the value, the better the model performance. As extreme cases, when the model's ROC point lies on (1, 0), DeP is 0; when it lies on the perfect point (0, 1), DeP is 1.
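Eqs. (2) and (3) can be computed directly from a model's (FP, TP) point; the sketch below uses w = 0.5 as an example weighting, which is an assumption since the weighting is not fixed here.

```python
# Degree of perfection (DeP) sketch following Eqs. (2) and (3).
from math import sqrt

def dep(fp_rate, tp_rate, w=0.5):
    dis_p = sqrt((w * (1 - tp_rate)) ** 2 + ((1 - w) * fp_rate) ** 2)  # Eq. (3)
    return (sqrt(2) - dis_p) / sqrt(2)                                  # Eq. (2)

print(dep(0.0, 1.0))    # perfect point (0, 1) -> 1.0
print(dep(0.29, 0.74))  # an illustrative (FP, TP) point
```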
4. Experimental results

4.1. The selected inputs to the response model

We have built and tested seven different response models using three different ways of input feature selection:
no selection, filter, and wrapper. We now examine which features were found by the filter and wrapper approaches.

First, the filter approach employed PCA and DT. For PCA, the eigenvectors with large corresponding eigenvalues were selected, using the elbow search method: each eigenvalue was examined one by one, from largest to smallest, until the difference between two contiguous eigenvalues became too large. A total of 15 components were selected. Each of them is a linear combination of all features, but for display purposes the top 15 variables with the largest coefficient magnitudes are as follows: First Season Dollars, Latest Season Dollars, Dollars This Year, Dollars Last Year, LTD Credit Card Dollars, LTD AmerExp Card Dollars, LTD MC & VISA Card Dollars, LTD Phone Orders, LTD Gift Dollars, Dollars 2 Years Ago, Dollars 3 Years Ago, Dollars 4 Years Ago, LTD Dollars, Years with Purchase, and Date Added to File. For the Decision Tree (DT), 10 features were selected: First Season Items, Items This Year, Lines This Year, LTD MC & VISA Card Orders, LTD MC & VISA Card Dollars, Orders 4 Years Ago, LTD Purchases in ProdGrp 10, This Year Purchases in ProdGrp 10, Latest Purchase Season, and Years with Purchase.

Second, the wrapper approach employed a GA wrapper, where each GA run resulted in a different subset of features. A total of 10 runs were made. Table 3 lists the selected features and the number of times each feature was selected in the 10 runs. Among a total of 90 variables, five features were selected in all 10 runs: LTD AmerExp Card Dollars, LTD Purchases in ProdGrp 7, First Purchases had ProdGrp 11, LTD Fall Orders, and Date Added to File. Another 22 features were chosen nine times, another seven features eight times, and so on. Features that were chosen more than five times made up approximately 52% of the 90 features. A reasonable degree of consistency was observed without the aid of a marketing expert. These features were used to improve the model performance, as shown in the next section. Five features were commonly chosen by the decision tree and the GA wrapper; they are marked with an asterisk (*) in Table 3.

4.2. Model performance

In this section, the response model performances were compared in terms of accuracy, BCR, the ROC graph, and the degree of perfection. Fig. 4 depicts the average accuracies of the seven response models built and tested in the experiment. In terms of accuracy, these models showed similar performance, but the proposed FSE performed best with an accuracy of 0.7268, while the worst performance was reported by the single SVM model using the full feature set, with an accuracy of 0.7109. The proposed FSE outperformed MLPF, SVMF, EMLP and SVMFS by 1.72, 2.23, 0.65 and 1.71%, respectively. As mentioned before, a model that ignores a small percentage of respondents could still achieve a reasonably good accuracy; thus, a different performance measure needs to be examined.
Table 3
The GA wrapper selected features and their frequencies

Selected in all 10 runs: LTD AmerExp card dollars; LTD purchases in ProdGrp 7; First purchases had ProdGrp 11; LTD fall orders; Date added to file.
Selected 9 times: First season dollars; Dollars this year; Items last year; Lines last year; LTD credit card orders; LTD AmerExp card orders; LTD MC & VISA card orders (*); LTD dollars; LTD purchases in ProdGrp 1; LTD purchases in ProdGrp 4; LTD purchases in ProdGrp 8; First purchases had ProdGrp 3; First purchases had ProdGrp 8; This year purchases in ProdGrp 4; This year purchases in ProdGrp 5; This year purchases in ProdGrp 10 (*); This year purchases in ProdGrp 11; LTD purchases in ProdGrp 9; Latest purchase season (*); Date of last purchase.
Selected 8 times: LTD credit card dollars; Orders 3 years ago; LTD lines; LTD purchases in ProdGrp 9; First purchases in ProdGrp 10; Latest purchase year; Years with purchase (*).
Selected 7 times: Date of first purchase; First season orders; Latest season items; LTD purchases in ProdGrp 3; LTD purchases in ProdGrp 10 (*); First purchases in ProdGrp 7; First purchases had ProdGrp 12; First purchases had ProdGrp 14; This year purchases in ProdGrp 7; This year purchases in ProdGrp 14.
Selected 6 times: Orders this year.
(*) Also selected by the decision tree.
The balanced correct classification rate (BCR) takes both TP and TN into consideration. Figs. 5 and 6 display the average and the box plot of BCR over the 10 experiments, except for the single SVM model, which used the full feature set. The proposed FSE model outperformed the single models using the full feature set by approximately 5%. Both SVMFS and EMLP showed relatively unstable performance, whereas FSE was more stable. Fig. 7 presents the (FP, TP) pairs for the seven models. A perfect model would fall on (0, 1). The proposed FSE lies closest to it, while SVMFS and EMLP are also fairly close. SVMF has a good TP, but its FP is too large. The actual distance to the perfect point (0, 1) is captured by the 'degree of perfection' (DeP) measure shown in Fig. 8.
Fig. 4. Average accuracies for the seven response models.
Fig. 5. Average BCR for the seven response models.
Fig. 6. BCR box plot for the seven response models.
Fig. 7. ROC points of the seven response models (false positive rate vs. true positive rate).
Fig. 8. Average degree of perfection of the seven response models.
The paired comparison hypothesis test was performed between SVM-FF and FSE using the results from the 10 experiments, with H0: μd = 0 and H1: μd > 0, where the random variable D denotes the difference between the error rates of the two methods, that is, e_SVM-FF − e_FSE. Assuming both error values come from normal distributions, the statistic d̄/(S_d/√n) follows a t-distribution with 9 degrees of freedom. Because the t statistic value of 2.0072 is larger than t_0.05 = 1.895, H0 is rejected with 95% confidence. Similarly, when this t-test was carried out for the three other measures, H0 was rejected with 98% confidence. In conclusion, the superiority of the FSE method's performance is statistically significant.
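A sketch of this paired one-sided test is given below; the per-run error rates are hypothetical placeholders, since the ten individual error values are not reported in the paper.

```python
# Paired one-sided t-test sketch on per-run error rates (values are hypothetical).
import numpy as np
from scipy import stats

err_svm_ff = np.array([0.292, 0.288, 0.301, 0.295, 0.290, 0.297, 0.289, 0.293, 0.300, 0.287])
err_fse    = np.array([0.274, 0.270, 0.280, 0.272, 0.268, 0.277, 0.275, 0.271, 0.282, 0.269])

d = err_svm_ff - err_fse                     # paired differences, n = 10, df = 9
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
p_value = stats.t.sf(t_stat, df=len(d) - 1)  # one-sided: H1 is mean(d) > 0
print(t_stat, p_value)
```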
5. Conclusions

Building nonlinear complex response models based on neural networks and support vector machines with a limited amount of data requires one to solve both the feature selection problem and the bias-variance dilemma. These problems have been dealt with separately, although they interact with each other. In this paper, we proposed a framework where these two issues are addressed together, with the goal of obtaining a globally optimal solution rather than a combination of two locally optimal solutions. In particular, we proposed an ensemble of classifiers trained with different subsets of features chosen by a GA-based wrapper feature selection approach. The idea was to first create a population of 'good' classifiers, each of which generalizes well to test data, then select a subset of diverse classifiers from them, and finally use them in an ensemble to obtain response scores. The experiments involving a large real-world public domain data set (DMEF4) showed that the proposed method was better than the six other competing methods in terms of both accuracy and stability. This preliminary result assures us that it is worth investigating this new approach further, now that an automatic feature selection scheme as well as an accurate and stable classifier has been obtained. In addition, we discussed sampling strategies for class imbalance problems and presented relevant performance measures for them: the balanced correct classification rate (BCR), the ROC point, and the degree of perfection (DeP).

There are limitations of the current approach, which call for further investigation. First, the proposed FSE model is based on a GA wrapper approach, which involves a large computational cost; a computationally more efficient way of searching would be of great interest. Second, the proposed FSE employed the support vector machine as its base learner, but the approach is not limited to that particular learner; identifying a proper learner according to the characteristics of the problem is also a topic for future investigation. Third, for data sets with an unbalanced class distribution, how to apply a proper sampling mechanism and a proper model evaluation method would also be an interesting topic for the future.
References

Baek, J., & Cho, S. (2003). Bankruptcy prediction for credit risk using an auto-associative neural network in Korean firms. In IEEE 2003 international conference on computational intelligence for financial engineering (pp. 359–365), Hong Kong.
Bentz, Y., & Merunkay, D. (2000). Neural networks and the multinomial logit for brand choice modeling: A hybrid approach. Journal of Forecasting, 19(3), 177–200.
Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Bounds, D., & Ross, D. (1997). Forecasting customer response with neural networks. Handbook of Neural Computation, G6.2, 1–7.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Cheung, K.-W., Kwok, J. K., Law, M. H., & Tsui, K.-C. (2003). Mining customer product ratings for personalized marketing. Decision Support Systems, 35, 231–243.
Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural networks and the bias/variance dilemma. Neural Computation, 4(1), 1–58.
Ha, K., Cho, S., & MacLachlan, D. (2005). Response models based on bagging neural networks. Journal of Interactive Marketing, 19(1), 17–30.
Hashem, S. (1997). Optimal linear combinations of neural networks. Neural Networks, 10(4), 599–614.
Haughton, D., & Oulabi, S. (1997). Direct marketing modeling with CART and CHAID. Journal of Direct Marketing, 11(4), 42–52.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 823–844.
Karp, R. M. (1972). Reducibility among combinatorial problems. In R. E. Miller, & J. W. Thatcher (Eds.), Complexity of computer computations. New York: Plenum Press.
Kim, Y., & Street, N. (2004). An intelligent system for customer targeting: A data mining approach. Decision Support Systems, 37, 215–228.
Krogh, A., & Vedelsby, J. (1995). Neural network ensembles, cross validation, and active learning. Advances in Neural Information Processing Systems, 2, 231–238.
Kubat, M., Holte, R., & Matwin, S. (1997). Learning when negative examples abound. In Proceedings of the 9th European conference on machine learning, ECML'97, Prague.
Lewis, D., & Catlett, J. (1994). Heterogeneous uncertainty sampling for supervised learning. In Proceedings of the 11th international conference on machine learning, ICML'94 (pp. 148–156). New Jersey: Morgan Kaufmann.
Ling, C. X., & Li, C. (1998). Data mining for direct marketing: Problems and solutions. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (KDD-98) (pp. 73–79).
Moutinho, L., Curry, B., Davies, F., & Rita, P. (1994). Neural network in marketing. New York: Routledge.
Perrone, M. P., & Cooper, L. N. (1993). When networks disagree: Ensemble methods for neural networks. In Neural networks for speech and image processing. London: Chapman and Hall.
Schapire, R. (1990). The strength of weak learnability. Machine Learning, 5, 197–227.
Sharkey, A. J. (1996). On combining artificial neural nets. Connection Science, 8, 299–314.
Sharkey, A. J. C., Sharkey, N. E., & Chandroth, G. O. (1995). Diverse neural net solutions to a fault diagnosis problem. Neural Computing and Applications, 4, 218–227.
Vapnik, V. (1995). The nature of statistical learning theory. New York: Springer-Verlag.
Yang, J., & Honavar, V. (1998). Feature subset selection using a genetic algorithm. In H. Liu, & H. Motoda (Eds.), Feature selection for knowledge discovery and data mining (pp. 117–136). Dordrecht: Kluwer Academic Publishers.
Yao, X., & Liu, Y. (1998). Making use of population information in evolutionary artificial neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 28(3), 417–425.
Yu, E., & Cho, S. (2004). Keystroke dynamics identity verification—its problems and practical solutions. Computers and Security, 23(5), 428–440.
Zahavi, J., & Levin, N. (1997a). Issues and problems in applying neural computing to target marketing. Journal of Direct Marketing, 11(4), 63–75.