Intelligent Data Analysis 16 (2012) 265–278 DOI 10.3233/IDA-2012-0523 IOS Press
A hybrid customer prediction system based on multiple forward stepwise logistic regression model

AliReza Soroush a, Ardeshir Bahreininejad a,b,* and Jan van den Berg c

a Department of Industrial Engineering, Tarbiat Modares University, Tehran, Iran
b Department of Engineering Design and Manufacture, Faculty of Engineering, University of Malaya, Kuala Lumpur, Malaysia
c Faculty of Technology, Policy, and Management, Section of ICT, Delft University of Technology, Delft, The Netherlands

* Corresponding author: Ardeshir Bahreininejad, Faculty of Engineering, University of Malaya, Kuala Lumpur, Malaysia. Tel.: +60 379675266; Fax: +60 379675330; E-mail: [email protected].
Abstract. In today's world, customer purchasing behavior prediction is one of the most important aspects of customer attraction. Good prediction can help to develop marketing strategies more accurately and to spend resources more effectively. When designing a customer prediction system (CPS), two issues are key, namely feature selection and the prediction method to be used. Furthermore, it seems necessary to design CPSs with both high computational speed and good prediction abilities. The purpose of this paper is to develop such a system by using a hybrid approach. The resulting system is a hybrid CPS (HCPS) and is based on a Multiple Forward Stepwise Logistic Regression (MFSLR) model. The MFSLR model combines a forward stepwise regression (FSR) technique, which rapidly selects an optimal subset of features, with a multiple logistic regression (MLR) technique. In practice, the new MFSLR model provides very good prediction results. Since customer identification is one of the principal concerns in the insurance industry, an insurance company dataset has been used. The obtained results show that the FSR selects around 55% of the initially available features, thereby considerably reducing computational costs. In addition, the results show that the MLR method leads to more accurate prediction than the other methods we tried, namely feedforward neural networks, radial basis networks, and regression trees.

Keywords: Customer relationship management, Feature selection, Forward stepwise regression, Multiple logistic regression, Neural networks
1. Introduction

Machine learning (ML) algorithms for solving pattern recognition problems are often only successful if the available data are preprocessed based on appropriate feature selection. However, performing appropriate feature selection is a hard job, and no generally applicable method is available. The feature selection process can be considered a problem of global combinatorial optimization in ML. It can help to reduce the total number of features and to remove irrelevant and redundant data. Feature selection has received considerable attention in various areas where thousands of features are available. The main goal of feature selection is to identify a subset of features that are most informative, and therefore most predictive, for a given response variable. Finding an optimal feature subset is usually
hard to control [1], and many problems related to feature selection have been shown to be NP-hard [2]. Successful implementation of feature selection not only provides important information for prediction or segmentation, but also reduces the computational and analytical effort required for the analysis of high-dimensional data. Appropriate feature selection has seen much success in real-world applications because it can often notably reduce the dimensionality of the input space.

In recent years, customer relationship management (CRM) has become one of the fields of research and development in which feature selection is employed [3]. CRM is highly necessary today because of the increasing rate of competition among companies. In addition, the rapid changes in the wishes and expectations of customers should be taken into account. CRM is the main means by which businesses can meet these challenges, since it helps them grasp the varied demands of customers and, by doing so, earn competitive advantage [4]. For this reason, the effects of feature selection in a CRM setting have been studied [5].

Traditionally, the optimal selection of customer targets has been considered one of the most important factors for a successful CRM program. Many models have been proposed to identify as many customers as possible who will buy a specific product, or who will continue their relationship with the firm. Thus, firms try to develop predictive models that accurately identify which customers are most likely to buy. These models are described as a customer prediction system (CPS).

The number of features at the disposal of the designer of a CPS is usually very large; it can easily be of the order of a few dozen or even hundreds. Several reasons can be identified for the necessity to reduce this number. Low computational complexity is the obvious first one, good interpretability of the system a second one. A related reason is that two different features, each containing good recognition information, may together – because of their high mutual correlation – result in little performance improvement. In the latter case, model complexity increases without much gain. Another major reason is imposed by the required generalization properties of the predictor: good customer prediction performance should especially be achieved for new, yet unknown customers. It is well known that a higher ratio (N/M) of the number N of training patterns to the number M of free parameters of the customer predictor model usually results in better prediction. Based on these considerations, it can be argued that there is a high need to create CPSs that have both limited complexity (especially in terms of the number of attributes that are taken into account) and good prediction abilities. The latter concerns the performance evaluation stage in the design of a customer recognition system, in which the prediction error probabilities of various customer predictors are estimated and compared in order to select the best predictor model.

To limit the total number of attributes of the model, different scenarios can be adopted. One is to examine the features individually and discard those with little discriminatory capability. Another is to examine attributes in different combinations. Sometimes the application of a linear or nonlinear transformation of (combinations of) attributes into a feature vector of new attribute variables leads to a new prediction model with better recognition properties.
2. Related works

In recent years, several algorithms have been applied for appropriate feature selection in CRM, and some comparative studies have been carried out. Since each additional feature variable can increase the cost and running time of a recognition system, there is strong motivation within the customer recognition community to design and implement systems with a small feature set. At the same time, there is a high need to include a sufficient set of features to achieve high recognition rates under difficult conditions. This has led to the development of a variety of techniques within the customer recognition community
for finding an optimal subset of features from a larger set of possible features. For example, Ng and Liu [3] have performed feature selection by running an induction algorithm on the dataset; they also recommend the decision tree approach for feature selection. Yan et al. [6] suggested the receiver operating characteristic (ROC) curve for feature selection; more specifically, they state that the area under the curve (AUC) can be used for feature selection. Kim and Street [7] presented a novel approach for customer targeting in database marketing: a standard genetic algorithm (GA) is used to search through the possible combinations of features, and the input features selected by the GA are used to train neural networks. Yu and Cho [8] proposed an ensemble creation method based on a GA-based wrapper feature subset selection mechanism to forecast, based on purchase history information, how much each customer will spend. Kim [5] employed a simple statistical procedure, forward feature selection based on a chi-square score, which starts with the empty set of variables and greedily adds variables that most improve performance. Ahn et al. [9] proposed a case-based reasoning system with a two-dimensional reduction technique for customer classification that predicts customers' buying behavior for a specific product using their demographic characteristics. To avoid overfitting, Buckinx et al. [10] employed a multiple linear regression model for the feature selection procedure. Yan and Changrui [11] combined a nested partition algorithm with simulated annealing (SA) in order to select features for customer recognition. Tseng and Huang [12] applied rough set theory to feature selection in CRM. Lessmann and Voß [13] proposed a hierarchical reference model for support vector machine (SVM) based classification in real-world CRM applications, in which recursive feature elimination is proposed as an iterative backward-elimination procedure for SVM-based feature ranking.

Among the described feature selection algorithms, case-based reasoning and multiple linear regression models do not consider nonlinear relations in customer behavior, and rough set theory cannot deal with very large databases. The ROC curve, the simple statistical procedure, and SVM are inaccurate and do not select an optimal number of features. Other weaknesses of SVM are its limitations in kernel selection, its low speed, and its size. A disadvantage of GA and SA is their expensive runtime cost. Moreover, according to [14], GAs can suffer from excessively slow convergence before finding an accurate solution, because they use minimal a priori knowledge and fail to exploit local information. Another drawback of SA is that once a cell is assigned to one side of the cut line, it will never move to the other side to improve the placement [15], i.e., SA falls into local minima and stops.

The principal goal of this study is to build a CPS for effective CRM programs. Looking at the literature, we discovered that the available papers on CPSs usually apply a highly complex and time-consuming approach, since it is always difficult to deal with both requirements of high computational speed and good performance. Based on the executed literature review, we observed that a combination of the forward stepwise regression (FSR) and multiple logistic regression (MLR) techniques had not been applied to customer purchasing behavior prediction, while our hypothesis was that this could be an appropriate approach.
To this end, we developed a hybrid CPS (HCPS), namely a multiple forward stepwise logistic regression (MFSLR) model, which is computationally efficient and effective. The new method can be described as follows. We first employ the FSR technique to identify the best set of features from the given training dataset. Second, the MLR technique is used to predict future customers. Prediction results are analyzed with and without performing feature selection. For comparison, we used one alternative feature selection technique, the pruned regression tree (PRT), and three alternative prediction techniques: a PRT, a feedforward neural network (FFNN), and a radial basis network (RBN). The remainder of this paper is organized as follows. In Section 3, the MFSLR model is described in more detail. In Section 4, the performed case study is introduced. In Section 5, the adopted feature selection technique is described. In Section 6, the chosen prediction technique is described and executed, and the results obtained are compared. Finally, in Section 7, our conclusions are presented.
3. Framework of the Multiple Forward Stepwise Logistic Regression model

In many ML problems, a wide range of variables is available that can be used as inputs of an MLR model, but it is hard to define which of them are most relevant, or useful at all. The situation is often further confused when there are interdependencies between some of the input variables. In the latter case, appropriate subsets might be sufficient for finding a well-performing model. Input feature selection (IFS) covers a variety of techniques that seek to identify input variables that do not contribute significantly to model performance, so that they can be removed. IFS is an intrinsically hard problem since it concerns an exponential search where the number of feature subsets to be considered is 2^n (n being the number of features). For 19 features, there are already over half a million possible feature subsets. Therefore, techniques that combine search algorithms with MLR are often used. These can be stepwise algorithms that progressively add or remove variables; alternatively, a (pruned) regression tree can be used. These algorithms may discover subsets of inputs that are not discovered by other techniques. Feature selection is preferable to feature transformation when the original units and meaning of the features are important to the users and the modeling goal is to identify an influential subset. When categorical features are present and numerical transformations are inappropriate, feature selection becomes the preferred means of dimension reduction.

The FSR technique is a sequential feature selection technique designed specifically for least-squares fitting; it makes use of optimizations that are only possible with least-squares criteria. Unlike generalized sequential feature selection, forward stepwise regression may remove features that have been added, or add features that have been removed. Features are sequentially added to an empty candidate set until the addition of further features no longer decreases the error criterion [16]. Park et al. [17] proposed a stepwise feature selection using the generalized logistic loss and noted that, for both real and simulated datasets, the proposed method can improve the quality of feature selection compared to the support vector machine with recursive feature elimination. The described technique is appropriate for identifying selected features for improved customer prediction, since its computation speed is also very high. On the other hand, the MLR technique is used when the dependent variable is nominal and there is more than one independent variable. One goal is to see whether the probability of getting a particular value of the nominal variable is associated with the measurement variables; the other goal is to predict the probability of getting a particular value of the nominal variable, given the measurement variables [18].

Based on the two techniques described above, the main framework of the MFSLR model for predicting customer purchasing behavior is configured: see Fig. 1. As can be seen from Fig. 1, the implementation of the proposed working framework consists of three steps: feature selection, customer prediction, and model measurement. We explain these steps in more detail in the next sections. To measure the efficiency and effectiveness of the customer prediction model, a dataset from an insurance company has been used.
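To make the exponential IFS search space mentioned above concrete, the following trivial snippet (an illustration of ours, not part of the original study) prints 2^n for a few values of n; 85 is the number of candidate features in the dataset of Section 4:

```python
# Exhaustive subset search is infeasible: the candidate space grows as 2^n.
for n in (10, 19, 47, 85):
    print(f"n = {n:2d}: 2^n = {2**n:,} candidate subsets")
# n = 19 already yields 524,288 subsets (over half a million);
# n = 85 yields roughly 3.9e25.
```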
4. Data description: A case study of an insurance company

The insurance industry concerns a business field where the identification of the right customers is a key issue. The dataset used in our application study is owned and supplied by the Dutch data mining company Sentient Machine Research and is based on real-world business data. The data describe 9822 customers and whether they bought an insurance policy for their mobile home. These data have been used in the CoIL Challenge 2000 prediction competition and provide an opportunity to assess the capability of the HCPS model in a customer prospecting application.
Fig. 1. Framework of the MFSLR model to predict customer purchasing behavior. (The flowchart comprises three stages. Feature selection: enter all features into the FSR model on the training data to build models; select the best model based on the lowest root mean square error criterion; separate the optimal subset of features based on the best model. Customer prediction: enter the optimal input features into the MLR model on the training data to fit the model; forecast customers on the training data and predict customers on the validation data; measure the error based on the MSE criterion on both datasets; select the 20% of each dataset with the highest purchasing probability; compute the number of purchasers in both datasets. Model measurement: compare the MFSLR model to three models without feature selection, five models with feature selection, and the best previously presented prediction.)
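To make the three-stage pipeline of Fig. 1 concrete, the sketch below wires the stages together with scikit-learn stand-ins. The paper's own FSR step uses F-test p-values (see Section 5.1), whereas SequentialFeatureSelector scores subsets by cross-validated error, so this is an approximation; the function name mfslr_pipeline and all defaults are ours, not the paper's.

```python
# A minimal end-to-end sketch of the Fig. 1 pipeline, assuming scikit-learn.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error

def mfslr_pipeline(X_train, y_train, X_eval, y_eval, n_features=47):
    # Stage 1: feature selection, forward search with a least-squares model
    # (a stand-in for the paper's FSR step).
    selector = SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=n_features,
        direction="forward")
    selector.fit(X_train, y_train)
    Xtr, Xev = selector.transform(X_train), selector.transform(X_eval)

    # Stage 2: customer prediction with (multiple) logistic regression.
    clf = LogisticRegression(max_iter=1000).fit(Xtr, y_train)
    proba = clf.predict_proba(Xev)[:, 1]

    # Stage 3: model measurement, the MSE of the predicted probabilities and
    # the number of actual purchasers among the top 20% ranked customers.
    mse = mean_squared_error(y_eval, proba)
    top = np.argsort(proba)[::-1][: int(0.2 * len(proba))]
    hits = int(np.asarray(y_eval)[top].sum())
    return mse, hits
```

In the case study below, the pipeline's inputs correspond to a 5822-record training set and a 4000-record evaluation set.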
In the research described in this paper, two separate datasets have been employed: a training set with 5822 customers and an evaluation set with 4000 customers. Each record consists of 86 features, containing socio-demographic data (features 1–43) and product ownership data (features 44–86). Attribute 86, named 'CARAVAN: Number of mobile home policies', is the target variable. The training data are used to calibrate the MFSLR model and to estimate the expected hit rate in the evaluation set. Inspecting the 5822 prospects in the training dataset revealed that 348 persons purchased a mobile home policy, resulting in a hit rate of 348/5822 = 5.97%. From the manager's perspective, this is the hit rate that would be obtained if offers were sent out randomly to consumers in the firm's database. The evaluation data are used to validate the MFSLR model. Of the 4000 prospects in the evaluation dataset, 238 persons purchased a mobile home policy, resulting in a hit rate of 238/4000 = 5.95%. As can be seen, the hit rate is almost equal in both datasets. Our MFSLR model is designed to identify the 20% of customers in the evaluation dataset judged to be most likely to buy a mobile home policy. The MFSLR prediction accuracy is examined by computing the observed hit rate among the selected customers. It is important to understand that only information in the training dataset is used for developing the MFSLR model; data in the evaluation dataset are exclusively used for prediction estimation.

5. Feature selection using Forward Stepwise Regression

Forward stepwise regression (FSR)-based feature selection starts by finding the input variable that, by itself, best predicts the output variable. It then looks for a second variable which most improves the model when added to the first.
This process is continued until either all variables have been selected or no further improvement is observed. The FSR functions as a preprocessor for MLR and does not transform or prepare the data. Generally, this technique has two components:
– An objective function, called the criterion, that should be minimized over all feasible feature subsets. A common criterion is the mean squared error, as used in classical linear regression models.
– A sequential search algorithm, which adds or removes features from a candidate subset while evaluating the criterion value. Since an exhaustive comparison of the criterion value at all 2^n subsets of an n-feature dataset is typically infeasible (depending on the size of n and the cost of objective calls), sequential searches move in only one direction, always growing or always shrinking the candidate set.

While being processed, each of the implemented algorithms displays messages showing how it is progressing. With the FSR selection technique, these messages serve to give a ranking of the importance of the variables.

An alternative feature selection technique is the pruned regression tree (PRT). Regression trees have become one of the simplest as well as most successful learning algorithms in data mining; they are well known because they are easy to interpret and computationally inexpensive [19]. The pruning algorithm is general in that it applies to trees that are not necessarily classification trees but regression trees [20]. PRT has been shown to be an effective feature selection and prediction method when one tries to reduce the number of inputs in order to obtain more accurate prediction results in significantly shorter training times [19]. Other benefits of using a regression tree as a preprocessor include [20]: no need to transform or prepare the data, automatic handling of missing values, automatic handling of categorical (nominal) predictors, ability to handle very large numbers of predictors, and ability to handle very large training data files.

5.1. FSR algorithm

FSR concerns a systematic method for adding and removing terms from a multilinear model based on their statistical significance in a regression. The method begins with an initial model and then compares the explanatory power of incrementally larger and smaller models. At each step, the p-value of an F-statistic is computed to test models with and without a potential term. If a term is not currently in the model, the null hypothesis is that the term would have a zero coefficient if added to the model; if there is sufficient evidence to reject the null hypothesis, the term is added to the model. Conversely, if a term is currently in the model, the null hypothesis is that the term has a zero coefficient; if there is insufficient evidence to reject this null hypothesis, the term is removed from the model. The algorithm proceeds as follows [16] (a code sketch is given after the list):
1. Fit the initial model;
2. If any terms not in the model have p-values less than an entrance tolerance (that is, if it is unlikely that they would have a zero coefficient if added to the model), add the one with the smallest p-value and repeat this step; otherwise, go to step 3;
3. If any terms in the model have p-values greater than an exit tolerance (that is, if it is unlikely that the hypothesis of a zero coefficient can be rejected), remove the one with the largest p-value and go to step 2; otherwise, end.
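A minimal sketch of this algorithm, assuming the statsmodels package is available. The entrance and exit tolerances (p_enter, p_remove) are illustrative defaults, not values reported in the paper; for a single added term, the t-test p-value reported by OLS coincides with the F-test p-value used above.

```python
import numpy as np
import statsmodels.api as sm

def forward_stepwise(X, y, p_enter=0.05, p_remove=0.10):
    """X: (n_samples, n_features) array; y: target. Returns selected indices."""
    # Normalize inputs and target to zero mean / unit std, as in the paper
    # (assumes no constant feature columns).
    X = (X - X.mean(0)) / X.std(0)
    y = (y - y.mean()) / y.std()
    selected, remaining = [], list(range(X.shape[1]))
    while True:
        # Step 2: add the candidate term with the smallest entry p-value.
        pvals = {}
        for j in remaining:
            fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
            pvals[j] = fit.pvalues[-1]          # p-value of the new term
        best = min(pvals, key=pvals.get) if pvals else None
        if best is not None and pvals[best] < p_enter:
            selected.append(best); remaining.remove(best)
            continue
        # Step 3: remove the included term with the largest exit p-value.
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
            worst = int(np.argmax(fit.pvalues[1:]))   # skip the intercept
            if fit.pvalues[1:][worst] > p_remove:
                remaining.append(selected.pop(worst))
                continue
        return selected                          # no step improves the model
```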
Depending on the terms included in the initial model and the order in which terms are moved in and out, the method may build different models from the same set of potential terms. The method terminates
when no single step improves the model. There is no guarantee, however, that a different initial model or a different sequence of steps will not lead to a better fit. In this sense, stepwise models are locally optimal, but may not be globally optimal [18].

Furthermore, performance may be compared in several ways, and it is important to use an assessment criterion appropriate to the real problem under investigation. Therefore, the root mean square error (RMSE) criterion is used to determine the optimal model. In many practical situations, a designer is confronted with features whose values lie in different dynamic ranges. As a consequence, features with large values may have a larger influence on the cost function than features with small values, although this does not necessarily reflect their respective significance in the design of the predictor. Performing preprocessing steps on inputs and targets can lead to better estimations. Constraining the inputs and targets to a specific range by scaling them is often the best solution here: it is all about normalization of feature values. In this research, we have normalized the inputs and target so that they have zero mean and unit standard deviation.

5.2. FSR results and comparison

Based on the algorithm described above, computations have been done for the 5822 customer records with 85 features. At first, looking at the size of the model, one can say that the model complexity is considerably high and, therefore, the produced MLR model is expected to be non-optimal. To obtain a more accurate description of the data, we removed some of the features based on the FSR algorithm. As a result, the model that has the lowest cost with the lowest number of features is selected. Figure 2 indicates the model that is optimal when feature selection is applied based on the minimum RMSE.

Fig. 2. Optimal model for selecting features based on the minimum RMSE.

As can be seen in Fig. 2, the line shows the trend of the estimated cost for each model. With an increasing number of features (entering features based on the smallest p-value criterion), the cost criterion (i.e., the root mean square error) decreases.
Table 1
Optimal sequential input features based on the FSR algorithm (input priority number, input feature, coefficient)

 1. Contribution car policies (0.030)
 2. Number of boat policies (0.030)
 3. Purchasing power class (0.006)
 4. Contribution private third party insurance (0.027)
 5. Lower level education (−0.026)
 6. Married (0.017)
 7. Contribution fire policies (0.025)
 8. Number of social security insurance policies (0.016)
 9. Farmer (−0.018)
10. Number of bicycle policies (0.003)
11. Contribution third party insurance (−0.0003)
12. Contribution disability insurance policies (−0.118)
13. Average age (−0.111)
14. Income >123.000 (0.037)
15. Average income (−0.033)
16. High level education (−0.001)
17. Number of disability insurance policies (−0.008)
18. High level education (0.0003)
19. Protestant (−0.006)
20. Private health insurance (0.010)
21. National Health Service (−0.134)
22. Contribution family accidents insurance policies (−0.019)
23. Number of family accidents insurance policies (0.013)
24. Middle management (0.008)
25. Contribution agricultural machines policies (−0.005)
26. Home owners (0.009)
27. Number of private third party insurance (−0.010)
28. Contribution boat policies (0.024)
29. Social class C (0.005)
30. Rented house (−0.140)
31. Number of surfboard policies (0.011)
32. Contribution surfboard policies (−0.008)
33. Other religion (0.009)
34. Contribution lorry policies (−0.005)
35. Income 45–75.000 (0.008)
36. 1 car (−0.012)
37. Household without children (−0.008)
38. Number of houses (−0.003)
39. Skilled labourers (−0.001)
40. Roman catholic (−0.006)
41. Number of tractor policies (−0.010)
42. Contribution trailer policies (0.013)
43. Number of trailer policies (−0.010)
44. 2 cars (0.009)
45. No car (0.008)
46. Number of life insurances (0.014)
47. Contribution life insurances (−0.014)
However, this decreasing trend stops at model number 33: for models with higher numbers, the cost function starts to rise slowly. The trend then suddenly dips again at model number 47 as the number of features grows further. The cost function reaches its (global) minimum value of 0.22945 for 47 features; for higher numbers, the cost value fluctuates along an ascending trend. Overall, model number 47 is the optimal point for determining the best set of features. Table 1 shows the input priority number of each feature, the input feature name, and the coefficient of each feature based on the application of the FSR algorithm. As can be observed in Table 1, the FSR algorithm selects a set of 47 features as the optimal set. Features appearing with a smaller priority number got smaller p-values, and features with a larger coefficient have more influence on the customer purchasing behavior with respect to buying a mobile home policy (or not). The features named 'contribution car policies', 'number of boat policies', 'contribution private third party insurance', 'lower level education', 'contribution fire policies', 'contribution disability insurance policies', 'private health insurance', 'national health service', 'contribution family accidents insurance policies', 'number of family accidents insurance policies', 'rented house', and 'home owners' are the most influential features. Overall, this yields a considerable reduction (equal to 44.7%) of the number of input features.

To evaluate the effectiveness of the 47 features selected by FSR, we also applied feature selection based on the PRT technique (a sketch is given below). The results show that the PRT algorithm selected six features ('contribution car policies', 'customer main type', 'contribution fire policies', 'contribution boat policies', 'unskilled labourers', and 'contribution private third party insurance'), which is a considerably stronger reduction of features than that achieved by the FSR algorithm. However, this does not ensure good prediction performance. In addition, comparing the results obtained with the two techniques indicates that two features selected by PRT ('customer main type' and 'unskilled labourers') are not among the 47 features selected by FSR. This can be explained by the fact that the optimization procedures of the two algorithms are quite different.
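A minimal sketch of PRT-based feature selection, assuming scikit-learn's cost-complexity pruning: grow a regression tree, prune it, and keep only the features the pruned tree actually splits on. The pruning strength ccp_alpha is illustrative, since the paper does not report its pruning parameter.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def prt_select(X, y, ccp_alpha=0.001):
    # ccp_alpha controls pruning strength; larger values prune harder and
    # therefore select fewer features (an assumption, not a paper value).
    tree = DecisionTreeRegressor(ccp_alpha=ccp_alpha, random_state=0)
    tree.fit(X, y)
    # Features with nonzero impurity reduction form the selected subset.
    return np.flatnonzero(tree.feature_importances_ > 0)
```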
6. Prediction by developing MLR

Feature selection concerns the process of identifying the attributes that are supposed to be the most appropriate for prediction. Once an initial list of features has been extracted from a given dataset, the list needs to be validated, i.e., it should be tested whether good prediction results can be obtained using the selected features. Here, the MLR technique is used for predicting the probability of an event by fitting the data to a logistic curve. It is a generalized linear model used for binomial regression. MLR is a useful way of describing the relationship between several independent variables and a binary response variable [21]. In addition, since artificial neural networks have shown good results in various disciplines of science and engineering [22], we have employed them here as well: a radial basis network (RBN) and a feedforward neural network (FFNN) have been tried, and the results obtained have been compared to those found by the MLR technique. In the following subsection, we proceed by describing our setup of the MLR model.

6.1. Prediction techniques used

The logistic function used in the MLR model takes as input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. Here, the input is denoted as P and the output as f(P). The input P represents the 'exposure' to some set of independent variables, while f(P) represents the probability of a particular outcome. The input P is a measure of the total contribution of all the independent variables used in the model and is known as the logit. In general terms, the MLR input-output function can be written as [21]:

f(P) = e^P / (1 + e^P),  with P = B^T X        (1)
where B is the vector containing the model parameters and X is the design vector, in our case composed of the 47 selected features. Based on the feature selection results described above, the input P can be written as:

P = 0.06 + 0.03 var1 + 0.03 var2 + 0.005 var3 + ... + 0.014 var46 − 0.014 var47        (2)
where 0.06 is the intercept and 0.03, 0.03, 0.005, and so on are the regression coefficients of var1, var2, var3, etc., respectively. Each regression coefficient describes the size of the contribution of that feature (see the sketch below).
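A minimal sketch of this prediction step, implementing Eqs. (1) and (2) directly. The coefficient vector below is a placeholder filled with only the first three reported coefficients, not the paper's full fitted model.

```python
import numpy as np

def purchase_probability(x, b, intercept=0.06):
    """x: values of the 47 selected features; b: their coefficients."""
    P = intercept + np.dot(b, x)            # the logit, P = B^T X
    return np.exp(P) / (1.0 + np.exp(P))    # f(P) = e^P / (1 + e^P)

# Example with the first three reported coefficients and zeros elsewhere:
b = np.zeros(47)
b[:3] = [0.030, 0.030, 0.005]
print(purchase_probability(np.zeros(47), b))  # baseline: f(0.06) ~ 0.515
```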
In order to train the FFNN, the Levenberg–Marquardt algorithm has been used, with a two-layer FFNN consisting of one hidden layer. To find the best number of neurons in the hidden layer, a range of 1 to 50 neurons has been examined, and for each number of neurons 100 iterations have been executed in order to increase the chance of ending up in an absolute minimum. There is only one neuron in the output layer. A logistic sigmoid transfer function has been employed in the hidden layer and the output layer. The network has been trained for 1000 epochs. After training, the FFNN with the lowest MSE is chosen (a sketch of this search follows).
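A minimal sketch of this hidden-size search, assuming scikit-learn. Note that scikit-learn provides no Levenberg–Marquardt trainer, so the 'lbfgs' solver is used as a stand-in, and the search range and restart count are shortened here for illustration.

```python
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

def best_ffnn(X_train, y_train, X_val, y_val, max_neurons=10, restarts=3):
    best, best_mse = None, float("inf")
    for h in range(1, max_neurons + 1):      # candidate hidden-layer sizes
        for seed in range(restarts):         # random restarts per size
            net = MLPRegressor(hidden_layer_sizes=(h,),
                               activation="logistic", solver="lbfgs",
                               max_iter=1000, random_state=seed)
            net.fit(X_train, y_train)
            mse = mean_squared_error(y_val, net.predict(X_val))
            if mse < best_mse:               # keep the lowest-MSE network
                best, best_mse = net, mse
    return best, best_mse
```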
It has been shown that an RBN can require more neurons than a standard FFNN, but it can often be trained in a fraction of the time; it works best when many training vectors are available [23]. RBN creates a two-layer network. Initially, the first layer has no neurons and calculates its weighted inputs using the Euclidean distance weight function. Notice that the expression for the net input of a radial basis neuron is different from that of other neurons: here, the net input to the radial basis transfer function is the vector distance between its weight vector and the input vector, multiplied by the bias. The second layer has linear transfer function neurons and calculates its weighted input using the dot product weight function. Both layers have biases [24]. The network training is continued, for 1 to 50 neurons (similar to the FFNN), until the network's MSE stops decreasing or the maximum number of neurons has been reached.

After MLR training, the prediction performance of the found model should be examined on the evaluation dataset. In our case, the 20% of all customers with the highest probability of purchasing a mobile home policy is determined. Finally, the number of actual purchasers among these 20% selected customers is counted (for comparison with the best solution on this data); this determines the quality of the designed MFSLR model. A sketch of this measurement step follows.
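A minimal sketch of this measurement step (the function name is ours): rank customers by predicted purchase probability, keep the top 20%, and count the actual purchasers among them.

```python
import numpy as np

def top20_hits(proba, y_true, fraction=0.20):
    """proba: predicted purchase probabilities; y_true: 0/1 actual purchases."""
    k = int(fraction * len(proba))        # e.g. 800 of the 4000 evaluation customers
    top = np.argsort(proba)[::-1][:k]     # indices of the k most likely buyers
    hits = int(np.asarray(y_true)[top].sum())
    return hits, hits / k                 # count and hit rate among the selected
```

On the evaluation set this selects 800 of the 4000 customers; the MFSLR model finds 123 actual purchasers among them (see Table 2 below).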
6.2. Results and model measurement

As mentioned above, the purchasers' prediction of a mobile home policy has been obtained using the MFSLR model with the 47 selected features as its inputs. It is important to note that the dataset on which the feature selection was based should be completely independent from the evaluation dataset; otherwise, there is a high risk of overfitting during the training process [22]. To evaluate the effectiveness of feature selection using the FSR, another MLR has been developed that uses the data without feature selection for the prediction. Furthermore, to indicate the superiority of the MLR over the FFNN (without feature selection) and the regression tree without pruning, their performance is compared by looking at the MSE and the accuracy of the techniques on the evaluation dataset. The overall results obtained by using the MFSLR model and those obtained by using the MLR, FFNN, and regression tree (which use the data with all 85 features) are shown in Table 2.

Table 2
Overall results obtained with and without feature selection (purchasers = number of mobile home policy purchasers among the 20% selected customers)

                                                 Training dataset        Evaluation dataset
Prediction technique                             MSE      Purchasers     MSE      Purchasers
MFSLR model (47 selected features)               0.0522   203            0.0539   123
MLR (85 features)                                0.0521   200            0.0540   118
FFNN without feature selection (85 features)     0.0523   172            0.0544   119
Regression tree without pruning (85 features)    0.0442   269            0.0618   92

Table 2 shows the optimal results obtained for the four designed configurations, with varying numbers of features and prediction techniques, according to the MSE and the number of mobile home policy purchasers. As can be seen in the table, directly employing a regression tree, an FFNN, and an MLR with 85 features produces MSEs of 0.0618, 0.0544, and 0.0540 respectively on the evaluation dataset, whereas the MSE decreases to 0.0539 when the reduced number (47) of features obtained by developing the MFSLR is used. Considering the other performance criterion (i.e., the number of mobile home policy purchasers), the MFSLR model also improves the prediction results, namely 123 purchasers against 92, 119, and 118. In addition, if 49 features were considered, the prediction output would be 122 purchasers. Therefore, having 47 features instead of 85 considerably reduces the computational complexity and improves the prediction results, while no highly relevant or partially relevant features are lost.

To evaluate the accuracy of the prediction results obtained by the MFSLR model, the PRT technique has been employed for feature selection, and the FFNN, RBN, MLR, and PRT techniques have been applied to predict mobile home policy purchases; the results obtained from all these experiments have been compared. To do so, a model that uses the 47 FSR-selected features as inputs for an FFNN (we call it the hybrid FSR (HFSR) model) has been run. In addition, four prediction models (FFNN, RBN, MLR, and PRT) that use the six-feature dataset resulting from PRT have been developed; we call them the hybrid FFNN (HFFNN), hybrid RBN (HRBN), multiple pruned logistic regression tree (MPLRT), and double PRT (DPRT) models. Table 3 shows the predictions obtained by the six models used, i.e., the MFSLR, HFSR, HFFNN, HRBN, MPLRT, and DPRT models, for predicting mobile home policy purchasers.

Table 3
Predictions obtained for mobile home policy purchasers by developing six models (FS = feature selection technique; PT = prediction technique; N = number of purchasers among the 20% selected customers; %T = percentage of total purchasers; %P = percentage of predicted customers)

                        Training dataset               Evaluation dataset
Model   FS     PT       MSE     N    %T     %P         MSE     N    %T     %P
MFSLR   FSR    MLR      0.0522  203  58.3%  17.4%      0.0539  123  51.7%  15.4%
HFSR    FSR    FFNN     0.0511  197  56.6%  16.9%      0.0543  119  50.0%  14.9%
HFFNN   PRT    FFNN     0.0524  193  55.5%  16.6%      0.0541  119  50.0%  14.9%
HRBN    PRT    RBN      0.0511  191  54.9%  16.4%      0.0538  117  49.2%  14.6%
MPLRT   PRT    MLR      0.0537  178  51.2%  15.3%      0.0544  105  44.1%  13.1%
DPRT    PRT    PRT      0.0513  178  51.2%  15.3%      0.0549  100  42.0%  12.5%

Comparing the prediction results in Table 3, it can be seen that the predictions made by the MFSLR model (on the evaluation dataset) are superior, on all performance criteria, to those of the HFSR, HFFNN, HRBN (except for MSE), MPLRT, and DPRT models. Employing the FFNN as prediction technique after using FSR produces a mean squared error of 0.0543, and applying the PRT as feature selection technique before developing the PRT, MLR, and FFNN yields mean squared errors of 0.0549, 0.0544, and 0.0541 respectively, whereas the error reduces to 0.0539 through developing the MFSLR. Considering the other three performance criteria, i.e., the number of purchasers, the percentage of total purchasers, and the percentage of predicted customers, also shows that using the MFSLR model improves the prediction results. These values are 123, 51.7%, and 15.4% for the MFSLR model, against 119, 50.0%, and 14.9% for the HFSR and HFFNN models, 117, 49.2%, and 14.6% for the HRBN model, 105, 44.1%, and 13.1% for the MPLRT model, and 100, 42.0%, and 12.5% for the DPRT model, all among the 20% selected customers of the evaluation dataset. In addition, while the results of MFSLR, HFSR, HFFNN, and HRBN are quite close, MFSLR outperforms the others in terms of computation (CPU) time: the MLR technique used for prediction is much faster than the FFNN and RBN, because the latter require several runs until they reach their best solution.

Generally, the results in Table 3 show that the MFSLR model gives a better and more robust representation of the data, as it was able to reduce (i) the number of features by 44.7% of the original data and (ii) the error in predicting new customers.
In addition, comparing the best number of predicted customers using the same dataset (121 purchasers by the Naïve Bayes method; see http://www.liacs.nl/~putten/library/cc2000/report2.html) to the results obtained by MFSLR shows the performance superiority of the latter. Furthermore, we noted in the literature that GA is a very time-consuming method, but it is a powerful feature selection tool, especially when the dimensions of the original feature set are large; GAs are best known for their ability to efficiently search large spaces about which little is known a priori [25]. Thus, the authors suggest employing GA for feature selection as future research.

7. Conclusions

Recent developments in feature selection have shown promising results in improving the performance of predictors. Following this line of pragmatic research, we have developed a hybrid customer prediction system (HCPS) for the prediction of mobile home policy purchasers. A novel model has been proposed for predicting customer behavior by designing an MFSLR based upon a reduced set of 47 features obtained using FSR. This MFSLR model led to an unconstrained combinatorial optimization problem in which the lowest MSE rate is the evaluation criterion. The minimum RMSE criterion was employed to determine the optimal FSR-based model, which ultimately led to 47 features. Applying the proposed feature selection procedure showed a considerable improvement in feature reduction; the procedure also reduces the computation time of the feature selection process. The purchasers' prediction has also been performed based on an MLR that makes use of the 47 selected features. To evaluate the effectiveness of the MFSLR model, predictions have been obtained using only an MLR, an FFNN, and a regression tree, all of them without feature selection. The results proved the superiority and effectiveness of the MFSLR model on the performance criteria. In addition, to evaluate the robustness of the feature selection output, the PRT technique has been applied, and to measure the accuracy of the obtained prediction results, three further techniques (FFNN, RBN, and PRT) have been developed. Comparing the prediction results shows the superiority of the combination of FSR and MLR (the MFSLR model) over the five other models in terms of the performance criteria. Furthermore, the results illustrated the superiority of the MFSLR model over the so-far best-obtained result on the data used.

Acknowledgements

The authors wish to thank the Iran Telecommunication Research Center (ITRC) for the dedicated research grant, the Dutch data mining company Sentient Machine Research for supplying the data required for this research, and M. Davarynejad for his guidance.

Glossary of terms

Customer relationship management (CRM): It is a widely implemented strategy for managing a company's interactions with customers, clients, and sales prospects. It involves using technology to organize, automate, and synchronize business processes – principally sales activities, but also those for marketing, customer service, and technical support. The overall goals are to find, attract, and win new clients, nurture and retain those the company already has, entice former clients back into the fold, and reduce the costs of marketing and client service.
Double PRT (DPRT) model: It is a combination model of two pruned regression trees, one to select features and one to predict customers.

Feedforward neural network (FFNN): It is an artificial neural network where connections between the units do not form a directed cycle. The FFNN was the first and arguably simplest type of artificial neural network devised. In this network, the information moves in only one direction, forward, from the input nodes, through the hidden nodes (if any), to the output nodes. There are no cycles or loops in the network.

Forward stepwise regression (FSR) technique: It comprises regression models in which the choice of predictive variables is carried out by an automatic procedure. Usually, this takes the form of a sequence of F-tests, but other techniques are possible, such as t-tests. Forward selection involves starting with no variables in the model, trying out the variables one by one, and including them if they are 'statistically significant'.

Hybrid customer prediction system (HCPS): It is a system to predict customers that consists of a combination of at least two techniques or methods.

Hybrid FFNN (HFFNN) model: It is a combination model of a pruned regression tree to select features and a feedforward neural network to predict customers.

Hybrid FSR (HFSR) model: It is a combination model of a forward stepwise regression to select features and a feedforward neural network to predict customers.

Hybrid RBN (HRBN) model: It is a combination model of a pruned regression tree to select features and a radial basis network to predict customers.

Logistic regression (LR) technique: It is used for prediction of the probability of occurrence of an event by fitting data to a logistic curve.

Machine learning (ML): A branch of artificial intelligence; it is a scientific discipline concerned with the design and development of algorithms that allow computers to evolve behaviors based on empirical data, such as sensor data or databases.

Mean square error (MSE): It is one of many ways to quantify the difference between the values implied by an estimator and the true values of the quantity being estimated. MSE is a risk function.

Multiple forward stepwise logistic regression (MFSLR) model: It is a combination system of forward stepwise regression (FSR) and multiple logistic regression (MLR).

Multiple pruned logistic regression tree (MPLRT) model: It is a combination model of a pruned regression tree to select features and a multiple logistic regression to predict customers.

Radial basis network (RBN): It is an artificial neural network that uses radial basis functions as activation functions; its output is a linear combination of radial basis functions. RBNs are used in function approximation, time series prediction, and control.

Receiver operating characteristic (ROC) curve: It is a graphical plot of the sensitivity, or true positive rate, vs. the false positive rate (1 – specificity, or 1 – true negative rate), for a binary classifier system as its discrimination threshold is varied.

Pruned regression tree (PRT): A data-analysis method that recursively partitions data into sets, each of which is simply modeled using regression methods. The pruning algorithm is general in that it applies to trees that are not necessarily classification trees but regression trees.

Root mean square error (RMSE) criterion: It is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed from the thing being modeled or estimated.
RMSE is a good measure of precision.
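For completeness, the standard definitions of the two error measures used throughout the paper, with y_i the observed and ŷ_i the predicted target over n customers:

```latex
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \hat{y}_i\bigr)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
```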
References

[1] R. Kohavi and G.H. John, Wrappers for feature subset selection, Artif Intell 97(1–2) (1997), 273–324.
[2] A.L. Blum and R.L. Rivest, Training a 3-node neural network is NP-complete, Neural Networks 5 (1992), 117–127.
[3] K.S. Ng and H. Liu, Customer retention via data mining, Artif Intell Rev 14(6) (2000), 569–590.
[4] E.T. Anderson, Sharing the wealth: when should firms treat customers as partners? Manage Sci 48(8) (2002), 955–971.
[5] Y.S. Kim, Toward a successful CRM: variable selection, sampling, and ensemble, Decis Support Syst 41 (2006), 542–553.
[6] L. Yan, R. Wolniewicz and R. Dodier, Predicting customer behaviour in telecommunications, IEEE Intelligent Systems 19 (2004), 50–58.
[7] Y.S. Kim and W.N. Street, An intelligent system for customer targeting: a data mining approach, Decis Support Syst 37 (2004), 215–228.
[8] E. Yu and S. Cho, Constructing response model using ensemble based on feature subset selection, Expert Syst Appl 30 (2006), 352–360.
[9] H. Ahn, K. Kim and I. Han, A case-based reasoning system with the two-dimensional reduction technique for customer classification, Expert Syst Appl 32 (2007), 1011–1019.
[10] W. Buckinx, G. Verstraeten and D.V. Poel, Predicting customer loyalty using the internal transactional database, Expert Syst Appl 32 (2007), 125–134.
[11] L. Yan and Y. Changrui, A new hybrid algorithm for feature selection and its application to customer recognition, LNCS 4616 (2007), 102–111.
[12] T.L. Tseng and C.C. Huang, Rough set-based approach to feature selection in customer relationship management, Omega 35 (2007), 365–383.
[13] S. Lessmann and S. Voß, A reference model for customer-centric data mining with support vector machines, Eur J Oper Res 199 (2009), 520–530.
[14] J.M. Renders and S.P. Flasse, Hybrid methods using genetic algorithms for global optimization, IEEE Trans Syst Man Cybern B 26(2) (1996), 243–258.
[15] C.A. Tovey, Simulated simulated annealing, Amer J Math Management Sci 8(3–4) (1988), 389–407.
[16] N.R. Draper and H. Smith, Applied Regression Analysis, Wiley-Interscience, Hoboken, NJ, 1998.
[17] C. Park, J.Y. Koo, P.T. Kim and J.W. Lee, Stepwise feature selection using generalized logistic loss, Comput Stat Data Anal 52(7) (2008), 3709–3718.
[18] P. McCullagh and J.A. Nelder, Generalized Linear Models, Chapman and Hall, New York, 1990.
[19] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z.H. Zhou, M. Steinbach, D.J. Hand and D. Steinberg, Top 10 algorithms in data mining, Knowl Inf Syst 14 (2008), 1–37.
[20] A.R. Webb, Statistical Pattern Recognition, 2nd ed., John Wiley and Sons, Ltd., 2002.
[21] J.M. Hilbe, Logistic Regression Models, Chapman & Hall/CRC Press, 2009.
[22] E.A. Wan, Neural network classification: a Bayesian interpretation, IEEE Trans Neural Netw 1(4) (1990), 303–305.
[23] S. Chen, C.F.N. Cowan and P.M. Grant, Orthogonal least squares learning algorithm for radial basis function networks, IEEE Trans Neural Netw 2(2) (1991), 302–309.
[24] H. Demuth, M. Beale and M. Hagan, Neural Network Toolbox User's Guide, Version 6.0.3, The MathWorks, Inc., 2009.
[25] W. Siedlecki and J. Sklansky, A note on genetic algorithms for large-scale feature selection, Pattern Recogn Lett 10 (1989), 335–347.