Document de travail du LEM 2011-06

A PROBABILITY-MAPPING ALGORITHM FOR CALIBRATING THE POSTERIOR PROBABILITIES: A DIRECT MARKETING APPLICATION

Kristof Coussement (first and corresponding author)
IESEG School of Management (LEM-CNRS), Department of Marketing, 3 Rue de la Digue, F-59000 Lille, France.
[email protected], Tel.: +33 3 20 54 58 92

Wouter Buckinx (second author)
Managing partner (PhD), Python Predictions, Avenue R. Van den Driessche 9, B-1150 Brussels, Belgium.
[email protected]
This paper is accepted for publication in European Journal of Operational Research
A Probability-Mapping Algorithm for Calibrating the Posterior Probabilities: A Direct Marketing Application
Abstract

Calibration refers to the adjustment of the posterior probabilities output by a classification algorithm towards the true prior probability distribution of the target classes. This adjustment is necessary to account for the difference in prior distributions between the training set and the test set. This article proposes a new calibration method, called the probability-mapping approach. Two types of mapping are proposed: linear and non-linear probability mapping. These new calibration techniques are applied to 9 real-life direct marketing datasets and compared with the original, non-calibrated posterior probabilities and with the adjusted posterior probabilities obtained using the rescaling algorithm of Saerens, Latinne, & Decaestecker (2002). The results show that marketing researchers should calibrate the posterior probabilities obtained from the classifier. Moreover, a ‘simple’ rescaling algorithm is not sufficient as a first, workable solution: the results favor the newly-proposed non-linear probability-mapping approach for the best calibration performance.
Keywords: data mining, direct marketing, response modeling, calibration, decision support systems
1. Introduction

Due to recent developments in IT infrastructure and the ever-increasing trust placed in complex computer systems, analysts are showing a growing interest in classification modeling in a variety of disciplines such as credit scoring (Martens et al., 2010; Paleologo et al., 2010), medicine (Conforti & Guido, 2010), text classification (Bosio & Righini, 2007), SME fund management (Kim & Sohn, 2010), revenue management (Morales & Wang, 2010), and so on. The same interest is shared by the direct marketing community. Direct marketing analysts increasingly build prediction models that assign a probability of response to each and every individual customer in the database (Lamb et al., 1994). The classification task is made even more appealing by the fact that current marketing environments store incredible amounts of customer information at a very low cost, including socio-demographics, transactional buying behavior, attitudinal data, etc. (Naik et al., 2000), while at the same time there has been a tremendous increase in academic interest in direct marketing applications (e.g. Allenby et al., 1999; Baumgartner & Hruschka, 2005; Hruschka, 2010; Lee et al., 2010; Piersma & Jonker, 2004). In this context, response models are defined as classification models that attempt to discriminate between responders and non-responders to a certain company mailing.
In the past, purely statistical methods like logistic regression, discriminant analysis and naive Bayes models have been proposed to discriminate between responders and non-responders in a direct marketing context (Baesens et al., 2002; Bult, 1993; Deichmann et al., 2002). Although these techniques may be very effective, they make a stringent assumption about the underlying relationship between the independent variables and the dependent or response variable. In response to this, more advanced data mining algorithms like decision tree-generating techniques, artificial neural networks and support vector machines have been applied (Baesens et al., 2002; Bose & Chen, 2009; Crone et al., 2006; Haughton & Oulabi, 1997; Zahavi & Levin, 1997). All these binary classification models are used for two reasons. First, researchers rely on them to obtain robust parameter estimates of the independent variables by modeling the probability of response as a function of the independent variables. Second, these models are used to obtain consistent predicted probabilities of response, which are then used (i) to rank the customers based on their
responsiveness to the campaign, (ii) to optimize the overall campaign strategy by offering the customer the product with the highest response probability over the different response models and (iii) for the discrimination task of the response event itself where one classifies customers into responders and non-responders. For (ii) and (iii), the absolute size of the posterior response probabilities is crucial. This study focuses on the process of obtaining correct response probabilities, where calibrating the posterior probabilities could have a positive impact on the optimization of the overall campaign strategy and the efficiency of the discrimination task.
In practice, a classification model is built on a training set, i.e. a set of customers for whom both the independent variables and the dependent variable are present. In order to correctly measure the discrimination power of the trained classifier, the classification model is applied to a group of customers who have not been used for training, called the scoring or test set. The purpose is to obtain robust and consistent predictions for the response probability of these unseen customers. As one is interested in dividing the customers into responders and non-responders, a judicious classification based on the posterior response probabilities of the customers is needed. In other words, customers with a response probability exceeding a certain threshold will be classified as responders and vice versa.
However, it often happens that a classifier is trained using a dataset that does not reflect the true prior probabilities of the target classes in the real-life population. This may have serious negative consequences for the discrimination performance because the posterior probabilities do not reflect the true probability of interest. This phenomenon occurs in a direct marketing context as well, where the prior probabilities of the training set and the (out-of-sample) test set are significantly different. More specifically, the training set consists of customers who were preselected by an earlier response model as customers with a high response probability, while the test set does not impose any restrictions on the customer profiles in the database. In such a case, a large discrepancy exists between the response distributions of the training set and the test set. The incidence, which is the percentage of responders in a data set, is much higher in the training set than the incidence of real response in the out-of-sample test set. This inconsistency has a negative effect on the discrimination performance on the test set, especially because the classifier’s decision to classify customers into
responders or non-responders is based on setting a threshold on the raw posterior probabilities of class membership. For instance, when a classifier is trained on a dataset with a higher incidence than the one in the test set, the posterior probabilities on the test set are inflated. Thus making a classification decision based on the absolute value of the posterior probabilities may significantly harm the discrimination performance. Moreover, optimizing the campaign strategy by offering the product with the highest response probability to the customer becomes useless because the response probabilities for different products for a particular customer are not comparable. This paper focuses on how researchers can adjust the posterior probabilities based on the true prior distribution of the response variable. This process of adjustment is called calibration.
This paper proposes a new methodology, called probability-mapping, to calibrate the posterior probabilities on the test set to the real-world situation. It maps the posterior response probabilities obtained from the classifier onto the prior distribution of real response. The new probability-mapping approaches, based on generalized linear models and non-parametric generalized additive models, are compared with the original, non-calibrated posterior probabilities and with the calibrated probabilities obtained using the rescaling methodology of Saerens et al. (2002).
This paper is structured as follows. Section 2 describes the methodological framework, while Section 3 explores the different calibration approaches (rescaling and probability-mapping approaches). Section 4 explains the empirical validation setup, while Section 5 presents the results. Section 6 gives managerial recommendations, and finally Section 7 concludes the paper.
2. Methodological framework

Figure 1 shows the methodological framework for the different calibration methods applied in this study.
[INSERT FIGURE 1 OVER HERE]
Define a training set $TRAIN_M = \{(x_i, y_i)\}_{i=1}^{m}$ consisting of m customers. Each customer $(x_i, y_i)$ is a combination of an input vector $x_i$ representing the independent variables and a dependent variable $y_i \in \{0,1\}$ corresponding to whether or not the customer responded to a certain mailing. TRAIN M consists of all customers who were selected by a previous response model, thus received a direct mailing to buy the product, and were therefore flagged as customers with a high response probability. During the training phase, a classifier C maps the input vector space onto the binary response variable using the training set observations. For the test set $TEST_N = \{x_i\}_{i=1}^{n}$ consisting of n customers, the trained classifier C is applied and for every customer in TEST N a response probability Porg is obtained. The purpose of this paper is to adjust the posterior probabilities Porg to the real response distribution, because the training sample TRAIN M is not representative of TEST N, which corresponds to the true population. Therefore, for every observation $x_i$ in TEST N, the real response is collected and summarized in $REAL_N = \{y_i\}_{i=1}^{n}$, with $y_i \in \{0,1\}$ corresponding to whether or not the customer spontaneously bought that particular product in a time window without direct mailing actions. The real response represents a response of pure interest in the product. In other words, REAL N is used to represent the true prior probabilities.

The purpose of the calibration phase is to adjust Porg, the non-calibrated posterior probabilities of TEST N, in order to truly represent the probability of response. With the aim of methodologically benchmarking the different calibration methods, a k-fold cross-validation is applied. In a k-fold cross-validation, the dataset is randomly split into k equal parts, each of which is used in turn during the scoring phase, while the other k-1 parts are used for training the calibration model. Note that TEST kN (REAL kN) represents the k-th fold of TEST N (REAL N), while Pkorg represents the non-calibrated posterior probabilities of TEST kN.
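As an illustration of this framework, the sketch below sets up the k-fold calibration benchmark in Python. This is a minimal sketch under assumed names (p_org for the non-calibrated probabilities on TEST N, y_real for the real responses in REAL N, fit_calibrator for any of the calibration methods of Section 3); the paper does not prescribe an implementation.

```python
import numpy as np
from sklearn.model_selection import KFold

# p_org: non-calibrated posterior probabilities output by classifier C on TEST_N
# y_real: real responses (0/1) observed for the same customers (REAL_N)
def kfold_calibration_benchmark(p_org, y_real, fit_calibrator, k=10, seed=42):
    """Train a calibration mapping on k-1 folds and score the held-out fold."""
    p_new = np.empty_like(p_org, dtype=float)
    kf = KFold(n_splits=k, shuffle=True, random_state=seed)
    for train_idx, score_idx in kf.split(p_org):
        # fit the calibration model on the training folds ...
        calibrator = fit_calibrator(p_org[train_idx], y_real[train_idx])
        # ... and apply it to the held-out scoring fold
        p_new[score_idx] = calibrator(p_org[score_idx])
    return p_new
```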
3. Calibration approaches
Two types of calibration methods are applied: (i) the rescaling algorithm of Saerens et al. (2002) and (ii) the newly-proposed probability-mapping approaches. The former algorithm rescales Pkorg, the posterior probabilities of TEST kN, taking into account the real incidence of REAL kN (Saerens et al., 2002), while the latter type adjusts the posterior probabilities of TEST kN by mapping them onto the real responses of REAL kN.
3.1 Rescaling algorithm (SAERENS)

This section explains the methodology of Saerens et al. (2002). The starting point of the Saerens et al. (2002) calibration approach is based on Bayes’ rule, i.e. the posterior probabilities of response depend in a non-linear way on the prior probability distribution of the target classes. The prior probability distribution of the target class is defined as the incidence of the target class, or in this setting the percentage of responders in the dataset. Therefore, a change in the prior probability distribution of the target classes changes the posterior response probabilities of the classification model. Saerens et al. (2002) describe a process that adjusts the posterior probabilities of response output by the classifier to the new prior probability distribution of the target classes making use of a predefined rescaling formula. In detail, the calibrated posterior probabilities of response for the customers in the test set of fold k are obtained by weighting the non-calibrated posterior probabilities, Pkorg, by the ratio of the response incidence of REAL kN, i.e. the new prior probability distribution, to the response incidence in the training set, i.e. the old prior probability distribution. The denominator is a scaling factor to make sure that the calibrated posterior probabilities sum up to one. In summary,
$$P_k^{new} = \frac{\dfrac{P_k(c_1)}{P_k^t(c_1)}\, P_k^{org}}{\dfrac{P_k(c_0)}{P_k^t(c_0)}\,\big(1 - P_k^{org}\big) + \dfrac{P_k(c_1)}{P_k^t(c_1)}\, P_k^{org}} \qquad (1)$$

with Pknew representing the calibrated posterior response probabilities in fold k, and Pk(ci) and Pkt(ci) the new and old prior probabilities for class i with $i \in \{0,1\}$. A data set NEWkN is
obtained which contains Pknew, the calibrated posterior probabilities for the test data of TEST kN .
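A minimal Python sketch of the rescaling formula in equation (1); the function and argument names are illustrative, with the new prior taken as the response incidence in REAL kN and the old prior as the incidence in the training set.

```python
import numpy as np

def saerens_rescale(p_org, new_prior_c1, old_prior_c1):
    """Rescale posteriors to new class priors (Saerens et al., 2002), equation (1)."""
    p_org = np.asarray(p_org, dtype=float)
    w1 = new_prior_c1 / old_prior_c1                    # weight for the response class c1
    w0 = (1.0 - new_prior_c1) / (1.0 - old_prior_c1)    # weight for the non-response class c0
    return (w1 * p_org) / (w0 * (1.0 - p_org) + w1 * p_org)
```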
3.2 Probability-mapping approaches

The purpose of the probability-mapping approaches is to map Pkorg, the old posterior probabilities of TEST kN, onto the real response probabilities of REAL kN. As such, one is able to build a classification model that maps the non-calibrated probabilities onto the real response probabilities. This model is then used to calibrate the old probabilities with the corrected probabilities of response. However, the real probability distribution of the target classes is not directly available from REAL kN, which only contains the real responses $y_i \in \{0,1\}$ on an individual customer level. In order to convert these individual real responses in REAL kN into a real response probability distribution, a number of bins b are constructed. The incidence of response is calculated per bin and equals the percentage of real response. This incidence is used as an approximation for the real probability of response per bin. In practice, both TEST kN and REAL kN are split into b bins using the equal-frequency binning approach based on the posterior probabilities of TEST kN. TEST kb (REAL kb) represents the b-th bin in the k-th fold of TEST kN (REAL kN, respectively). TEST kb and REAL kb logically contain identical observations, while Pkborg is the average non-calibrated posterior probability in the b-th bin of TEST kN and Pkbreal is the percentage of real responders in the b-th bin of REAL kN. Pkbreal serves as a proxy for the true prior probability. In order to formalize the relationship between the average posterior probabilities of TEST kN and the approximate real probabilities obtained from REAL kN, a formal mapping is obtained using the binned training set of fold k by

$$P_{kb}^{real} = f_k\big(P_{kb}^{org}\big) \qquad (2)$$
with fk being the classifier that maps the non-calibrated posterior probabilities onto the real probabilities in fold k. After the classifier fk is built, it is applied to the unseen test data of
TEST kN to obtain the new posterior probabilities, Pknew, for every individual in the test data
set of the k-th fold. A new data set NEWkN is obtained, which contains Pknew, the calibrated posterior probabilities.
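The equal-frequency binning that produces the (Pkborg, Pkbreal) pairs can be sketched as follows; the number of bins follows Section 4 (b = 200), but the function names are illustrative, and any of the fitted mappings f_k from Sections 3.2.1-3.2.4 can then be applied to the raw probabilities of the scoring fold.

```python
import numpy as np

def bin_probabilities(p_org, y_real, n_bins=200):
    """Equal-frequency binning on the posterior probabilities.

    Returns, per bin, the average non-calibrated probability (P_kb_org)
    and the observed incidence of real response (P_kb_real)."""
    order = np.argsort(p_org)
    bins = np.array_split(order, n_bins)                       # roughly equal-sized bins
    p_kb_org = np.array([p_org[idx].mean() for idx in bins])
    p_kb_real = np.array([y_real[idx].mean() for idx in bins])
    return p_kb_org, p_kb_real
```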
There are several possibilities for fk, the function that links the estimated, non-calibrated probabilities of TEST kb to the approximated real probabilities of REAL kb. This study uses one probability-mapping approach based on generalized linear models (Section 3.2.1) and three non-linear approaches: one based on generalized linear models with log-transformed non-calibrated probabilities (Section 3.2.2) and two approaches based on generalized additive models (Sections 3.2.3 and 3.2.4).
3.2.1 Generalized linear model (GLM)

Given $y_i$ as the dependent variable, with $y_i \in [0,1]$ representing Pkbreal, the averaged true prior probabilities from REAL kb, and $x_i$ equal to Pkborg, the averaged posterior probabilities of TEST kb, a generalized linear model with logit link function is employed to model $f_k(x_i) \in [0,1]$. It assumes that the relationship between Pkborg and Pkbreal is linear in the log-odds via

$$\operatorname{logit}(y_i) = \log\!\left(\frac{y_i}{1 - y_i}\right) = \alpha_k + \beta_{ki} x_i \qquad (3)$$

or

$$y_i \equiv f_k(x_i) = \operatorname{logit}^{-1}\!\big(\alpha_k + \beta_{ki} x_i\big) \qquad (4)$$

with $\alpha_k$ as the intercept and $\beta_{ki} x_i$ as the predictor. The parameters $\alpha_k$ and $\beta_{ki}$ are estimated using maximum likelihood (Tabachnick & Fidell, 1996).
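A sketch of the linear probability mapping of equations (3)-(4) in Python, fitted on the binned pairs with statsmodels' GLM (binomial family, logit link). Fitting on bin-level proportions and the helper names are our assumptions; the paper does not give code.

```python
import numpy as np
import statsmodels.api as sm

def fit_glm_mapping(p_kb_org, p_kb_real):
    """Fit P_kb_real = logit^{-1}(alpha_k + beta_k * P_kb_org) on the binned data."""
    X = sm.add_constant(np.asarray(p_kb_org, dtype=float))   # intercept alpha_k + slope beta_k
    model = sm.GLM(np.asarray(p_kb_real, dtype=float), X,
                   family=sm.families.Binomial()).fit()

    def f_k(p_org):
        X_new = sm.add_constant(np.asarray(p_org, dtype=float))
        return model.predict(X_new)                           # calibrated probabilities
    return f_k
```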
3.2.2. Generalized linear model with log transformation (LOG)
Another approach is to log-transform $x_i$ in equations (3) and (4); in this way one captures the non-linearity in the log-odds space between $y_i$ (Pkbreal, the true prior probabilities from REAL kb) and $x_i$ (Pkborg, the posterior probabilities of TEST kb).
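The LOG variant only changes the input of the mapping: the non-calibrated probabilities are log-transformed before entering the GLM of Section 3.2.1. A minimal sketch, reusing the hypothetical fit_glm_mapping helper shown above:

```python
import numpy as np

def fit_log_mapping(p_kb_org, p_kb_real):
    """GLM mapping of Section 3.2.1 fitted on log-transformed probabilities."""
    f_on_log = fit_glm_mapping(np.log(p_kb_org), p_kb_real)
    return lambda p_org: f_on_log(np.log(np.asarray(p_org, dtype=float)))
```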
3.2.3. Generalized additive models

An attractive alternative to standard generalized linear models is generalized additive models (Hastie & Tibshirani, 1986, 1987, 1990). Generalized additive models relax the linearity constraint and apply a non-parametric, non-linear fit to the data. In other words, the data themselves decide on the functional form of the relationship between the independent variable and the dependent variable. Define $y_i$ as the dependent variable, with $y_i \in [0,1]$ representing Pkbreal, the true prior probabilities from REAL kb, and $x_i$ equal to Pkborg, the posterior probabilities of TEST kb. To model $f_k(x_i) \in [0,1]$, generalized additive models with logit link function are employed. Methodologically, generalized additive models generalize the generalized linear model principle by replacing the linear predictor $\beta_{ki} x_i$ in equation (4) with an additive component, where

$$y_i \equiv f_k(x_i) = \operatorname{logit}^{-1}\!\big(\alpha_k + s_{ki}(x_i)\big) \qquad (5)$$

with $s_{ki}(x_i)$ a smooth function. This study uses penalized regression splines $s_{ki}(x_i)$ to estimate the non-parametric trend of the dependency of $y_i$ on $x_i$ (Wahba, 1990; Green & Silverman, 1994). These smooth functions use a large number of knots, leading to a model that is quite insensitive to the knot locations, while the penalty term is used to avoid the danger of over-fitting that would otherwise accompany the use of many knots. The complexity of the model is controlled by a parameter λ, which is inversely related to the degrees of freedom (df). If λ is small (i.e. the df are large), a very complex model that closely matches the data is employed. When λ is large (i.e. the df are small), a smooth model is considered. The generalized additive model is fitted by penalized likelihood maximization using penalized iteratively reweighted least squares (Wood, 2000; 2004; 2008).
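The paper fits these penalized-regression-spline GAMs in the spirit of Wood's mgcv machinery. The sketch below is only a rough Python analogue using statsmodels' GAM support; the basis dimension, the smoothing penalty alpha and the helper names are assumptions, and they do not reproduce the paper's exact df = {3,4,5} or GCV settings.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.gam.api import GLMGam, BSplines

def fit_gam_mapping(p_kb_org, p_kb_real, n_basis=8, alpha=1.0):
    """Penalized-spline GAM mapping with logit link on the binned data (cf. equation (5)).

    n_basis sets the spline basis dimension and alpha the smoothing penalty;
    both are illustrative and differ from the paper's effective-df / GCV setup."""
    x = np.asarray(p_kb_org, dtype=float).reshape(-1, 1)
    y = np.asarray(p_kb_real, dtype=float)
    smoother = BSplines(x, df=[n_basis], degree=[3])
    intercept = np.ones((len(y), 1))                      # explicit alpha_k term
    gam = GLMGam(y, exog=intercept, smoother=smoother, alpha=[alpha],
                 family=sm.families.Binomial()).fit()

    def f_k(p_org):
        # clip to the training range, since the spline basis is not defined outside it
        x_new = np.clip(np.asarray(p_org, dtype=float), x.min(), x.max()).reshape(-1, 1)
        return gam.predict(exog=np.ones((len(x_new), 1)), exog_smooth=x_new)
    return f_k
```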
3.2.4. Generalized additive models with monotonicity constraint

Because generalized additive models produce a non-linear relationship between the independent variable Pkborg and the dependent variable Pkbreal, the original ranking of the posterior probabilities of TEST kN and its calibrated version may differ. However, marketing analysts could argue that the mapping from TRAIN M onto TEST N and the corresponding ranking of the customers in TEST N (and respectively TEST kN) given by the initial classifier C should be preserved. Therefore, a non-decreasing monotonicity constraint on the generalized additive model predictions is introduced to retain the original ranking of the customers. Inspired by rule-set creation advances in the post-learning phase (e.g. pedagogical rule-based extraction techniques as employed in Martens et al. (2007)), a rule set is produced on the training set of fold k in the post-estimation phase of the generalized additive model to obtain a non-decreasing monotone function fk'. This ensures that the initial ranking of Pkborg is maintained in the corresponding calibrated predictions of fold k. Practically, the training set is sorted by Pkborg. Afterwards, the rule-based algorithm detects all violations of non-decreasing monotonicity in the prediction values fk(Pkborg) on the training set. For instance, if the prediction value for bin X+1 is lower than the prediction value for bin X, the rule-based algorithm adds a rule to the rule base that raises the prediction value of bin X+1 to the larger prediction value of bin X. In the end, the generalized additive model and the rule base describe a non-decreasing monotone, generalized additive model based function fk' with the following characteristic (Denlinger, 2010):

$$P_{kbX}^{org} \le P_{kb(X+1)}^{org} \;\Rightarrow\; f_k'\big(P_{kbX}^{org}\big) \le f_k'\big(P_{kb(X+1)}^{org}\big) \qquad (7)$$

with PkborgX and PkborgX+1 the original non-calibrated posterior probabilities for bins X and X+1 in the training data set, and fk'(PkborgX) and fk'(PkborgX+1) the calibrated posterior probabilities in fold k for bins X and X+1.
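The rule-based repair described above amounts to propagating a running maximum over the bin-level GAM predictions once the bins are sorted by Pkborg. A minimal sketch with illustrative names:

```python
import numpy as np

def enforce_nondecreasing(p_kb_org, gam_predictions):
    """Post-process GAM predictions so they are non-decreasing in P_kb_org (equation (7)).

    Whenever a bin's prediction drops below that of the previous bin, it is raised
    to the previous bin's value, mirroring the rule base of Section 3.2.4."""
    p_kb_org = np.asarray(p_kb_org, dtype=float)
    preds = np.asarray(gam_predictions, dtype=float)
    order = np.argsort(p_kb_org)                 # sort bins by the original probabilities
    monotone = np.maximum.accumulate(preds[order])
    repaired = np.empty_like(monotone)
    repaired[order] = monotone                   # restore the original bin order
    return repaired
```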
4. Empirical validation

The calibration methods are employed on a test bed of 9 real-life direct marketing datasets provided by a large European financial institution. Each of these datasets corresponds to a typical financial product. Table 1 shows the characteristics of the response datasets.
[INSERT TABLE 1 OVER HERE]
With the aim of methodologically comparing the different algorithms, a 10-fold cross-validation is applied. Furthermore, the classifier C which links TRAIN M and TEST N and outputs Porg is a logistic regression with forward variable selection, as it is a robust and well-known classification technique in the marketing environment (Neslin et al., 2006). Moreover, the calibration approaches based on generalized additive models use different levels of degrees of freedom (df) representing the non-linearity of the model. The higher the df, the higher the non-linearity. On the one hand, the df can be set manually by the researcher (user-specified); on the other hand, the df can be estimated automatically in correspondence with the shape of the response function (automatic). This study opts to manually set the df equal to {3,4,5} (resulting in GAMdf and GAMdf MONO). This df range is inspired by the recommendations and applications in Hastie & Tibshirani (1990) and Hastie et al. (2001), which use a relatively small number of df to account for different levels of non-linearity. Additionally, the generalized cross-validation procedure (GCV) is employed to automatically select the ideal number of df, resulting in GAMgcv and GAMgcv MONO (Gu & Wahba, 1991; Wood, 2000; 2004). The number of bins b for TEST kN and REAL kN is set to 200. Furthermore, Porg, the non-calibrated posterior probabilities of TEST N, are used as a benchmark (ORIGINAL). The different algorithms are compared on an individual customer level using the log-likelihood (LL),

$$LL = \ln\!\left(\prod_{i=1}^{N} p(x_i)^{y_i}\,\big(1 - p(x_i)\big)^{1 - y_i}\right) = \sum_{i=1}^{N}\Big[y_i \ln p(x_i) + (1 - y_i)\ln\!\big(1 - p(x_i)\big)\Big] \qquad (8)$$

with N the number of customers, $p(x_i)$ equal to Pknew, the calibrated posterior response probability, and $y_i \in \{0,1\}$ the real response variable. The LL is a well-known
metric in (direct) marketing to evaluate the performance of an algorithm (e.g. Baumgartner & Hruschka, 2005). The higher the LL, the better the posterior probabilities are calibrated to the true response distribution. Moreover, the non-parametric Friedman test (Demšar, 2006; Friedman, 1937, 1940) combined with the Bonferroni-Dunn test (Dunn, 1961) is used to test whether the different approaches differ significantly from the best performing algorithm.
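The evaluation criterion of equation (8) and the omnibus Friedman comparison can be computed as follows. This is an illustrative sketch, not the authors' code; scipy's friedmanchisquare covers only the omnibus test, so the Bonferroni-Dunn post-hoc comparison would still have to be added.

```python
import numpy as np
from scipy.stats import friedmanchisquare

def log_likelihood(y_real, p_new, eps=1e-12):
    """Log-likelihood of the calibrated probabilities against the real responses, equation (8)."""
    p = np.clip(np.asarray(p_new, dtype=float), eps, 1.0 - eps)   # guard against log(0)
    y = np.asarray(y_real, dtype=float)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

# Omnibus comparison over the 9 datasets: each argument is the vector of
# per-dataset LL values for one calibration method (illustrative names).
# stat, p_value = friedmanchisquare(ll_original, ll_saerens, ll_glm, ll_log, ll_gam)
```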
5. Results

Table 2 presents the 10-fold cross-validated log-likelihood values for the different datasets and the different algorithms. Three panels (a, b, c) are included, representing the various levels of the user-selected degrees of freedom for the generalized additive model mappings. For each dataset, the best performing algorithm in terms of log-likelihood is put in italics. Moreover, the average ranking (AR) per algorithm over the different datasets is given: the lower the ranking, the better the algorithm. The best performing algorithm is underlined and set in bold, while the algorithms that are not significantly different from the best one at a 5% significance level are only set in bold.
[INSERT TABLE 2 OVER HERE]
The algorithms are split into four categories: the original, non-calibrated posterior probabilities (ORIGINAL), the rescaling methodology (SAERENS), the linear probability-mapping approach (GLM) and the non-linear probability-mapping approaches (LOG, GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO).
Table 2 reveals that calibrating the posterior probabilities has a beneficial impact when a discrepancy exists between the true prior probabilities of the training set and the test set: ORIGINAL always performs worse than the calibration approaches.
Comparing the rescaling approach (SAERENS) with the best performing calibration approaches, one concludes that SAERENS always performs significantly worse than the non-linear probability-mapping approaches, while it performs better than the linear probability-mapping approach (GLM). These results show that the analyst is better off shifting towards a non-linear probability-mapping approach, even though SAERENS is an easy and workable solution to the calibration problem.
Contrasting the various probability-mapping approaches, Table 2 shows that the non-linear calibration approaches (LOG, GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO) are always amongst the best performing algorithms. The linear mapping approach (GLM) is never significantly competitive with its non-linear counterparts. However, the generalized linear model with log-transformation (LOG) is competitive with the more advanced GAM approaches (GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO). Within the non-linear calibration setting, one concludes that GAMgcv MONO always performs best, followed by the other non-linear calibration approaches.
Table 3 contains the performance measures for all generalized additive model approaches (GAMdf, GAMdf MONO, GAMgcv and GAMgcv MONO), for all levels of degrees of freedom. On a dataset level, the best performing algorithm is put in italics. Furthermore, the average ranking (AR) for each algorithm is given, and the best performing algorithm (i.e. the one with the lowest ranking) is underlined and set in bold, while the ones that are not significantly different from the best at a 5% significance level are simply put in bold.
[INSERT TABLE 3 OVER HERE]
Table 3 reveals that GAM5 MONO is the best performing algorithm amongst the GAM and GAM MONO approaches, closely followed by GAMgcv MONO. Table 3 also shows that the performance of the GAM approaches improves as the number of df is increased: GAM3 performs less well than GAM4, while GAM4 performs less well than GAM5. Furthermore, it is clear that including the monotonicity constraint has a beneficial impact on
the calibration performance of the GAM approaches. The average ranking of the GAM approaches including the monotonicity constraint is always better than that of their original GAM counterparts (i.e. GAMdf versus GAMdf MONO and GAMgcv versus GAMgcv MONO). Moreover, the automatic smoothness parameter selection procedure proves its beneficial impact. For the non-monotonic models, GAMgcv always has a better ranking than the GAMdf approaches. For the monotonic models, GAMgcv MONO always performs better than GAM3 MONO and GAM4 MONO, while GAMgcv MONO is very competitive with GAM5 MONO.
6. Discussion

The results suggest that marketing analysts should calibrate the posterior probabilities when the training set does not represent the true prior distribution. In general, calibrating the posterior probabilities is more beneficial than using the non-calibrated posterior probabilities. Moreover, it is shown that a ‘simple’ rescaling algorithm (SAERENS), which takes into account the ratio of the old and the new priors, is not sufficient as a first, workable solution to the calibration problem: SAERENS always performs significantly worse than the more complex non-linear probability-mapping approaches. Furthermore, marketing researchers are advised not to apply the linear probability-mapping approach in this specific setting. Indeed, amongst the different probability-mapping approaches, it has been shown that the non-linear approaches are preferable over the linear mappings. The LOG approach is competitive with the more complex GAM-based calibration approaches, and because it is based on the common generalized linear model framework, LOG could be seen as a first, workable approach. However, if one is interested in optimizing the calibration performance, the GAM-based approaches are preferable. Moreover, one concludes that using the automatic smoothing parameter selection procedure and imposing a monotonicity constraint on the GAM method are the most preferred options for GAM models in order to optimize calibration performance.
7. Conclusion

Direct marketing receives considerable attention these days in academia as well as in business, due to a serious drop in the cost of IT equipment and the ever-increasing usage of response
models in a variety of business settings. In a direct marketing context, a discrepancy sometimes exists between the prior distributions on the training set and the scoring set, which is problematic. This may happen because the training set consists entirely of customers previously selected by a response model, and thus this dataset contains a higher percentage of responders. Applying a classification model built on this training set to the complete set of customers will harm the estimation of the response probabilities. Thoroughly adjusting the posterior probabilities to the real response probability distribution will improve the classification performance. This study reveals that the non-linear probability-mapping approaches are amongst the best performing algorithms and their usage is highly recommended in a day-to-day business setting for the following reasons. Firstly, the non-linear probability-mapping approaches deliver a better performance compared to the other calibration algorithms included in this research paper. As a result, the calibrated probabilities better reflect the true probabilities of response. Secondly, there is a possibility to visualize the relationship between Pkborg and Pkbreal. This gives managers a better, visual understanding of the calibration process for a particular setting. For instance, the further the calibration curve lies from the 45-degree line (i.e. the line where Pkborg = Pkbreal and no calibration is necessary), the higher the added value of sending a leaflet, because the incidence in TRAIN M is higher than in REAL N. Finally, the underlying techniques like generalized linear models and generalized additive models are easily implementable in today’s business environment due to the availability of the classifiers in traditional software packages like SAS and R. Whilst we are confident that our study adds significant value to the literature, valuable directions for future research are identified. Besides the probability-mapping approaches, which map Pkborg onto Pkbreal, an extensive research project could be dedicated to investigating the impact of ‘integrated’ calibration approaches, i.e. methods that integrate the calibration process into the initial training phase of classifier C in order to come up with a new classifier C’ which directly outputs calibrated probabilities. For instance, a workable ‘integrated’ calibration approach could be a two-stage Bayesian logistic regression approach that directly outputs calibrated posterior probabilities. In order to obtain this ‘integrated’ Bayesian calibration model, the following procedure is proposed. Under the assumption that the commonly-used prior distribution for $\beta_{ki}$ is multivariate Gaussian, i.e. $p(\beta_{ki}) \sim N(\beta_0, \Sigma_0)$, the empirical Bayesian approach could be used to specify the values of
$\beta_0$ and $\Sigma_0$ by fitting a Bayesian logistic regression to TRAIN kM using non-informative priors.
Consequently, the resulting posterior mean vector and variance-covariance matrix of this initial model could then be used as the values of $\beta_0$ and $\Sigma_0$ for the second Bayesian logistic regression on REAL kN. The resulting ‘integrated’ Bayesian logistic regression approach C’ will directly output adapted, calibrated posterior probabilities¹. Furthermore, the probability-mapping approaches are validated in a direct marketing setting, whereas future research efforts could be spent on investigating their external validity in other operational research settings.
Acknowledgements

The authors would like to thank the anonymous company for freely distributing the datasets. We would like to thank our friendly reviewers and the journal reviewers for their fruitful comments on earlier versions of this paper, and the editor, Jesus Artalejo, for guiding this paper through the reviewing process.
¹ Nevertheless, this approach is not tested in the current version of the paper for confidentiality reasons.
References

Allenby, G. M., Leone, R. P., & Jen, L. C. (1999). A dynamic model of purchase timing with application to direct marketing. Journal of the American Statistical Association, 94, 365-374.

Baesens, B., Viaene, S., Van den Poel, D., Vanthienen, J., & Dedene, G. (2002). Bayesian neural network learning for repeat purchase modeling in direct marketing. European Journal of Operational Research, 138, 191-211.

Baumgartner, B., & Hruschka, H. (2005). Allocation of catalogs to collective customers based on semiparametric response models. European Journal of Operational Research, 162, 839-849.

Bose, I., & Chen, X. (2009). Quantitative models for direct marketing: A review from systems perspective. European Journal of Operational Research, 195, 1-16.

Bosio, S., & Righini, G. (2007). Computational approaches to a combinatorial optimization problem arising from text classification. Computers & Operations Research, 34, 1910-1928.

Bult, J. R. (1993). Semiparametric versus parametric classification models: An application to direct marketing. Journal of Marketing Research, 30, 380-390.

Conforti, D., & Guido, R. (2010). Kernel based support vector machine via semidefinite programming: Application to medical diagnosis. Computers & Operations Research, 37, 1389-1394.

Crone, S. F., Lessmann, S., & Stahlbock, R. (2006). The impact of preprocessing on data mining: An evaluation of classifier sensitivity in direct marketing. European Journal of Operational Research, 173, 781-800.

Deichmann, J., Eshghi, A., Haughton, D., Sayek, S., & Teebagy, N. (2002). Application of multiple adaptive regression splines (MARS) in direct response modeling. Journal of Interactive Marketing, 16, 15-27.
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1-30.

Denlinger, C. G. (2010). Elements of real analysis. Jones and Bartlett Publishers.

Dunn, O. J. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56, 52-64.

Friedman, M. (1937). The use of ranks to avoid the assumption of normality implicit in the analysis of variance. Journal of the American Statistical Association, 32, 675-701.

Friedman, M. (1940). A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11, 86-92.

Green, P. J., & Silverman, B. W. (1994). Nonparametric regression and generalized linear models. Chapman and Hall/CRC Press.

Gu, C., & Wahba, G. (1991). Minimizing GCV/GML scores with multiple smoothing parameters via the Newton method. SIAM Journal of Scientific and Statistical Computing, 12, 383-398.

Hastie, T., & Tibshirani, R. (1986). Generalized additive models. Statistical Science, 1, 297-318.

Hastie, T., & Tibshirani, R. (1987). Generalized additive models: Some applications. Journal of the American Statistical Association, 82, 371-386.

Hastie, T., & Tibshirani, R. (1990). Generalized Additive Models. London: Chapman and Hall.

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. New York: Springer-Verlag.

Haughton, D., & Oulabi, S. (1997). Direct marketing modeling with CART and CHAID. Journal of Direct Marketing, 11, 42-52.
Hruschka, H. (2010). Considering endogeneity for optimal catalog allocation in direct marketing. European Journal of Operational Research, 206, 239-247.

Kim, H. S., & Sohn, S. Y. (2010). Support vector machines for default prediction of SMEs based on technology credit. European Journal of Operational Research, 201, 838-846.

Lamb, C. W., Hair, J. F., & McDaniel, C. (1994). Principles of marketing (second ed.). Cincinnati: South-Western Publishing Co.

Lee, H. J., Shin, H., Hwang, S. S., Cho, S., & MacLachlan, D. (2010). Semi-supervised response modeling. Journal of Interactive Marketing, 24, 42-54.

Martens, D., Baesens, B., Van Gestel, T., & Vanthienen, J. (2007). Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183, 1466-1476.

Martens, D., Van Gestel, T., De Backer, M., Haesen, R., Vanthienen, J., & Baesens, B. (2010). Credit rating prediction using Ant Colony Optimization. Journal of the Operational Research Society, 61, 561-573.

Morales, D. R., & Wang, J. B. (2010). Forecasting cancellation rates for services booking revenue management using data mining. European Journal of Operational Research, 202, 554-562.

Naik, P. A., Hagerty, M. R., & Tsai, C. L. (2000). A new dimension reduction approach for data-rich marketing environments: Sliced inverse regression. Journal of Marketing Research, 37, 88-101.

Neslin, S. A., Gupta, S., Kamakura, W., Lu, J. X., & Mason, C. H. (2006). Defection detection: Measuring and understanding the predictive accuracy of customer churn models. Journal of Marketing Research, 43, 204-211.

Paleologo, G., Elisseeff, A., & Antonini, G. (2010). Subagging for credit scoring models. European Journal of Operational Research, 201, 490-499.
Piersma, N., & Jonker, J. J. (2004). Determining the optimal direct mailing frequency. European Journal of Operational Research, 158, 173-182.

Saerens, M., Latinne, P., & Decaestecker, C. (2002). Adjusting the outputs of a classifier to new a priori probabilities: A simple procedure. Neural Computation, 14, 21-41.

Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics. HarperCollins Publishers, New York.

Wahba, G. (1990). Spline models for observational data. Society for Industrial and Applied Mathematics (SIAM) Capital City Press, Montpelier (Vermont).

Wood, S. N. (2000). Modelling and smoothing parameter estimation with multiple quadratic penalties. Journal of the Royal Statistical Society B, 62, 413-428.

Wood, S. N. (2004). Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673-686.

Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society B, 70, 495-518.

Zahavi, J., & Levin, N. (1997). Applying neural computing to target marketing. Journal of Direct Marketing, 11, 76-93.
[Figure 1: Methodological framework. The flow diagram links TRAIN M, the classifier C and TEST N to the calibration approaches (ORIGINAL, SAERENS, LIN, LOG, GAM, GAM MONO), the binned folds TEST kN and REAL kN, and the calibrated output sets NEW kN and NEW N, for k = 1 to 10.]
Dataset ID | TRAIN M # customers | TRAIN M % responders | TEST N # customers | TEST N % responders | # variables used by C
1 | 70,463 | 1.29% | 119,329 | 0.18% | 10
2 | 56,301 | 2.40% | 119,104 | 0.44% | 16
3 | 23,328 | 7.57% | 117,7433 | 0.14% | 19
4 | 9,027 | 11.94% | 305,567 | 0.57% | 12
5 | 14,946 | 17.11% | 1,073,346 | 0.18% | 22
6 | 14,586 | 5.04% | 1,223,703 | 0.05% | 11
7 | 25,660 | 3.10% | 748,602 | 0.18% | 14
8 | 12,603 | 0.56% | 127,651 | 0.24% | 10
9 | 19,190 | 0.95% | 113,496 | 0.23% | 18
Table 1: Dataset characteristics.
Panel a
DATASET | ORIGINAL | SAERENS | GLM | LOG | GAM3 | GAMgcv | GAM3 MONO | GAMgcv MONO
1 | -242.91 | -179.07 | -202.76 | -178.02 | -177.88 | -180.22 | -177.58 | -177.40
2 | -479.55 | -306.81 | -323.14 | -304.73 | -306.83 | -304.23 | -307.03 | -303.60
3 | -998.78 | -1280.32 | -1064.30 | -980.99 | -982.94 | -980.74 | -981.08 | -979.81
4 | -223.14 | -206.79 | -223.00 | -206.29 | -206.69 | -207.16 | -206.56 | -206.61
5 | -243.69 | -140.90 | -246.53 | -140.36 | -142.71 | -145.56 | -140.35 | -139.96
6 | -9884.39 | -1192.41 | -1189.09 | -1173.78 | -1165.40 | -1163.68 | -1165.18 | -1163.46
7 | -3823.20 | -1032.46 | -1025.02 | -1016.21 | -1017.24 | -1016.52 | -1016.53 | -1016.04
8 | -17802.90 | -1290.20 | -1297.41 | -1294.45 | -1291.47 | -1290.74 | -1292.57 | -1291.86
9 | -5493.35 | -510.03 | -525.81 | -506.11 | -523.01 | -515.27 | -507.20 | -505.91
AR | 7.67 | 5.00 | 6.89 | 3.22 | 4.44 | 3.78 | 3.44 | 1.56

Panel b
DATASET | ORIGINAL | SAERENS | GLM | LOG | GAM4 | GAMgcv | GAM4 MONO | GAMgcv MONO
1 | -242.91 | -179.07 | -202.76 | -178.02 | -178.06 | -180.22 | -177.40 | -177.40
2 | -479.55 | -306.81 | -323.14 | -304.73 | -305.68 | -304.23 | -305.68 | -303.60
3 | -998.78 | -1280.32 | -1064.30 | -980.99 | -983.81 | -980.74 | -980.17 | -979.81
4 | -223.14 | -206.79 | -223.00 | -206.29 | -206.48 | -207.16 | -206.37 | -206.61
5 | -243.69 | -140.90 | -246.53 | -140.36 | -146.48 | -145.56 | -140.02 | -139.96
6 | -9884.39 | -1192.41 | -1189.09 | -1173.78 | -1164.36 | -1163.68 | -1164.06 | -1163.46
7 | -3823.20 | -1032.46 | -1025.02 | -1016.21 | -1016.80 | -1016.52 | -1016.27 | -1016.04
8 | -17802.90 | -1290.20 | -1297.41 | -1294.45 | -1290.92 | -1290.74 | -1292.02 | -1291.86
9 | -5493.35 | -510.03 | -525.81 | -506.11 | -522.89 | -515.27 | -507.05 | -505.91
AR | 7.66 | 5.22 | 6.88 | 3.22 | 4.55 | 3.88 | 2.77 | 1.77

Panel c
DATASET | ORIGINAL | SAERENS | GLM | LOG | GAM5 | GAMgcv | GAM5 MONO | GAMgcv MONO
1 | -242.91 | -179.07 | -202.76 | -178.02 | -178.70 | -180.22 | -177.38 | -177.40
2 | -479.55 | -306.81 | -323.14 | -304.73 | -305.46 | -304.23 | -304.86 | -303.60
3 | -998.78 | -1280.32 | -1064.30 | -980.99 | -982.74 | -980.74 | -979.81 | -979.81
4 | -223.14 | -206.79 | -223.00 | -206.29 | -206.52 | -207.16 | -206.33 | -206.61
5 | -243.69 | -140.90 | -246.53 | -140.36 | -149.81 | -145.56 | -139.94 | -139.96
6 | -9884.39 | -1192.41 | -1189.09 | -1173.78 | -1163.91 | -1163.68 | -1163.62 | -1163.46
7 | -3823.20 | -1032.46 | -1025.02 | -1016.21 | -1016.59 | -1016.52 | -1016.11 | -1016.04
8 | -17802.90 | -1290.20 | -1297.41 | -1294.45 | -1290.75 | -1290.74 | -1291.86 | -1291.86
9 | -5493.35 | -510.03 | -525.81 | -506.11 | -522.79 | -515.27 | -506.63 | -505.91
AR | 7.66 | 5.11 | 6.88 | 3.33 | 4.66 | 4.00 | 2.33 | 1.88

* 10-fold CV LL values, AR = average ranking.
Table 2: The 10-fold cross-validated log-likelihood values. Panel a: overview with GAM3 & GAM3 MONO. Panel b: overview with GAM4 & GAM4 MONO. Panel c: overview with GAM5 & GAM5 MONO.
DATASET | GAM3 | GAM3 MONO | GAM4 | GAM4 MONO | GAM5 | GAM5 MONO | GAMgcv | GAMgcv MONO
1 | -177.88 | -177.58 | -178.06 | -177.40 | -178.70 | -177.38 | -180.22 | -177.40
2 | -306.83 | -307.03 | -305.68 | -305.68 | -305.46 | -304.86 | -304.23 | -303.60
3 | -982.94 | -981.08 | -983.81 | -980.17 | -982.74 | -979.81 | -980.74 | -979.81
4 | -206.69 | -206.56 | -206.48 | -206.37 | -206.52 | -206.33 | -207.16 | -206.61
5 | -142.71 | -140.35 | -146.48 | -140.02 | -149.81 | -139.94 | -145.56 | -139.96
6 | -1165.40 | -1165.18 | -1164.36 | -1164.06 | -1163.91 | -1163.62 | -1163.68 | -1163.46
7 | -1017.24 | -1016.53 | -1016.80 | -1016.27 | -1016.59 | -1016.11 | -1016.52 | -1016.04
8 | -1291.47 | -1292.57 | -1290.92 | -1292.02 | -1290.75 | -1291.86 | -1290.74 | -1291.86
9 | -523.01 | -507.20 | -522.89 | -507.05 | -522.79 | -506.63 | -515.27 | -505.91
AR | 6.55 | 5.55 | 5.88 | 3.66 | 5.22 | 2.11 | 4.55 | 2.33
* 10-fold CV LL values, AR = average ranking.
Table 3: The 10-fold cross-validated log-likelihood values for the GAM and GAM MONO calibration models.