RESPONSE MODELS BASED ON BAGGING NEURAL NETWORKS

KYOUNGNAM HA, SUNGZOON CHO, AND DOUGLAS MACLACHLAN

Identifying customers who are likely to respond to a product offering is an important issue in direct marketing. Response models are typically built from historical purchase data. A popular method of choice, logistic regression, is easy to understand and build, but limited in that the model is linear in parameters. Neural networks are nonlinear and have been found to improve predictive accuracies for a variety of business applications. Neural networks have not always demonstrated clear supremacy over traditional statistical competitors, largely because of over-fitting and instability. Combining multiple networks may alleviate these problems. A systematic method of combining neural networks is proposed, namely bagging or bootstrap aggregating, whereby over-fitted multiple neural networks are trained with bootstrap replicas of the original data set and then averaged. We built response models using a publicly available DMEF data set with three methods: bagging neural networks, single neural networks, and conventional logistic regression. The proposed method not only improved but also stabilized the prediction accuracies over the other two.

KYOUNGNAM HA is a graduate student at the Department of Industrial Engineering, Seoul National University, Korea; e-mail: [email protected]

SUNGZOON CHO is currently a visiting scholar at the Department of Marketing and International Business, University of Washington Business School, Seattle, WA. He is an Associate Professor, Department of Industrial Engineering, Seoul National University, Korea; e-mail: [email protected]

DOUGLAS MACLACHLAN is a Professor, Department of Marketing and International Business, University of Washington Business School, Seattle, WA; e-mail: [email protected]

This research was financially supported by the Brain Science and Engineering Research Program sponsored by the Korean Ministry of Science and Technology.

© 2005 Wiley Periodicals, Inc. and Direct Marketing Educational Foundation, Inc. JOURNAL OF INTERACTIVE MARKETING VOLUME 19 / NUMBER 1 / WINTER 2005. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/dir.20028

INTRODUCTION

As part of relationship marketing programs, marketing executives are taking advantage of the vast quantities of customer data newly available. Models commonly used in the direct marketing arena to predict response to mailings and other forms of direct marketing promotions (including e-mail and targeted Internet) are increasingly being used, for example, to up-sell or cross-sell customers who contact companies through call centers. The models can be used to decide which of various possible products or services to offer the customer, based on a predicted probability of accepting an offer that is estimated on the fly from data already available on the customer or obtained with a couple of questions.

A class of such models is the so-called response models, in which the dependent variable is simply whether the customer responds or not. The reason it is "so-called" is that similar models might be used whenever the dependent variable is dichotomous, as in attrition (or churn) models, or in modeling the presence or absence of anything a customer has or does, such as a mortgage, long-term care insurance, overseas travel, etc. Notice that the types of models described here are not developed to understand the process generating the predicted responses. Thus, we are not concerned with assessing the relative importance of the variables employed, or with econometric issues such as omitted variable bias.

Traditionally, a linear statistical method such as logistic regression has been used to model response based on a test of a random sample of customers from the complete list (Aaker, Kumar, & Day, 2001). In order to overcome the limitations of logistic regression, other approaches such as ridge regression (Malthouse, 1999), stochastic RFM response models (Colombo & Jiang, 1999), and hazard function models (Gönül, Kim, & Shi, 2000) have been proposed recently.
Neural networks, a class of non-linear models that mimic brain function, have been shown to produce better predictive accuracies for a wide variety of business problems (Smith & Gupta, 2000) in areas such as retail, banking, finance, insurance, telecommunications, and operations management. Neural networks have also been employed in marketing because no a priori knowledge or assumption about the error distribution is required (Zahavi & Levin, 1997b). It has been shown in one instance that neural network models improved the response rate up to 95% in direct marketing (Bounds & Ross, 1997). In another application, bank customers' response was predicted using a neural network (Moutinho, Curry, Davies, & Rita, 1994), yielding superior results. A neural network was also shown to outperform multinomial logistic regression (Bentz & Merunka, 2000). Input variables have also been selected successfully in direct marketing applications using neural networks (Viaene, Baesens, Van den Poel, Dedene, & Vanthienen, 2001).

But there have also been reports that neural networks did not outperform simpler logistic regression models (Suh, Noh, & Suh, 1999; Zahavi & Levin, 1997a). Indeed, it is often the case that a simple logistic regression predicts better than a neural network. One major reason is that a neural network model has to be built with great care. In particular, its performance is sensitive to its complexity, determined by the number of synapses or weight parameters. If a network is more complex than the problem at hand or the available data set requires, then the network learns not only the underlying function but also the noise peculiar to the finite training data set (Hansen & Salamon, 1990). The "over-fitted" neural network model will fit the training data perfectly, but will fail to predict well for the unseen "test" data. An overly complex neural network is said to have a large variance; the performance of the network varies greatly over different data sets drawn from an identical population distribution. Simple models such as logistic regression do not have such a problem. When models have a large discrepancy between the true target and the expectation of the model output over different data sets, the models are said to have a large bias. Both bias and variance create classification error. A complex model has a large variance and a small bias, while a simple model has a large bias and a small variance.
One can typically improve one type of error only at the expense of the other, hence the "bias-variance" dilemma (Geman, Bienenstock, & Doursat, 1992). It is a difficult task to determine the "optimal" complexity of a neural network given a finite training data set. What is usually practiced is a model selection process, a tedious, time-consuming trial-and-error search for the optimal complexity. Numerous research papers have been written about ways to find the right complexity, but it appears impossible to find an easy-to-understand and easy-to-follow procedure.

One way to alleviate the instability of a single neural network is to combine multiple networks. A simple averaging combination of L neural networks can reduce the prediction error by a factor of L if each network's prediction error is statistically uncorrelated with the others (Perrone & Cooper, 1993). The key here is how to make each network different. A couple of multiple-model or ensemble approaches can be found in the marketing literature. Zahavi and Levin (1997a) suggested a double-scoring model that combines a neural network with logistic regression in 15 different heuristic ways. Suh, Noh, and Suh (1999) proposed to combine (1) neural networks and RFM models, (2) neural networks and logistic regression models, and (3) RFM and logistic regression models in 26 different ways. But these previous ensemble models share the shortcoming that the combining methods are all ad hoc and arbitrary. In the machine learning community, more principled and structured methods have been proposed, such as bagging (Breiman, 1996; Perrone & Cooper, 1993), boosting (Freund & Schapire, 1997), and the observational learning algorithm (OLA; Jang & Cho, 2002). Bagging, or bootstrap aggregating, is a method that aggregates the outputs of many models that were trained separately with bootstrap replicates of the original training data set. The procedure is particularly effective with complex learners such as decision trees and neural networks. Bagging reduces variance without increasing bias, thus reducing the overall generalization error. The other advantage is modeling simplicity, due to the method's insensitivity to the complexity of the basic learners; a group of over-fitted learners works very well as basic learners, so the need for a tedious and time-consuming model-selection process vanishes. This is particularly important in marketing applications.
While marketing analysts may have a basic idea about how to develop and use response models such as logistic regression or even neural networks, they rarely become experts in those modeling efforts. In the case of neural networks in particular, better models are typically "trained" by analysts with considerable experience and expertise. Becoming an expert in neural network modeling is not likely to justify the time and effort of marketing analysts, who usually have many other activities and assignments to which they are better suited and more inclined. Thus, what marketing analysts need is a way to ensure that the models they develop are "pretty good" with minimum expertise in neural networks. Bagging exactly meets this requirement.

In this paper, we introduce the bagging procedure and apply it to a group of neural networks, or multi-layer perceptrons to be more specific, for response modeling. A purchase history database called DMEF-4 from the Direct Marketing Educational Foundation was used for the experiment. The performance of the proposed model is compared with that of single neural network models and conventional logistic regression models. This paper is structured as follows. First, in the next section, the multi-layer perceptron (MLP), the most popular type of neural network, is introduced, followed by a description of the bagging method. In the ensuing two sections, the experimental settings and results are presented, respectively. Last, we conclude the paper with a summary of results and recommendations for future work.

METHOD

Multi-Layer Perceptron Neural Network

Neural networks are artificial intelligence models that mimic the structure and function of the brain. We employ the most popular kind of neural network, the multi-layer perceptron (MLP), here, but we will simply refer to it as a neural network. Statistically speaking, it may be viewed as a non-linear, non-parametric regression model. Usually, the three layers of nodes (or neurons) are called the input, hidden, and output layers, respectively (see Figure 1). Each input node corresponds to an independent (predictor) variable, while the output node corresponds to the dependent (response) variable in response modeling. Hidden nodes introduce non-linearity into the model. One can use as many hidden layers as one wishes, but it has been shown that a single hidden layer makes the model complex enough to be a universal function approximator (Cybenko, 1989; Hornik, Stinchcombe, & White, 1989).

FIGURE 1 Structure of MLP

All the input nodes are connected to every hidden node, and every hidden node is connected to the output nodes. Each connection (or synapse) has an associated weight value, which represents the strength of the connection. The network's output, i.e., the dependent variable's value y_k, is computed by the following equations, which remotely mimic what happens in and between neurons:

    z_j = σ( Σ_i w_ji x_i )
    y_k = σ( Σ_j w_kj z_j )                                            (1)

where x_i is the ith input, z_j is the jth hidden node's output, y_k is the kth output, w_ji is the weight from the ith node to the jth node, and σ denotes the transfer function of the hidden and output layers, respectively. The crucial non-linearity of the model is introduced by the use of a sigmoid function, σ(x) = 1 / (1 + e^(−x)), as the transfer function. Note that the computation that takes place in each node is identical.

Training a neural network involves finding the synaptic weight values such that the output of the model is close to the target t in the given training data set {(x, t)}. The weight vector is initialized randomly and then updated to minimize the mean squared error (MSE):

    E(w) = (1/2n) Σ_{i=1}^{n} (t_i − y_i)²                             (2)

Minimizing the error with a gradient descent method results in the back-propagation algorithm (Rumelhart, Hinton, & Williams, 1986). In this paper, however, we use a more advanced Newton-type algorithm called Levenberg-Marquardt, which is known to have the fastest convergence in training (Bishop, 1995; Levenberg, 1944; Marquardt, 1963).

Building a response model with a neural network also involves a process called model selection. A network's complexity is determined by the number of synapses, which in turn is determined by the number of hidden nodes. If a network is too large, it learns not only the input-output relation but also the noise present in that particular data set. Such an "over-fitted" network may fit the training data perfectly, but it does not predict well for unseen data, resulting in poor generalization. If a network is too small, on the other hand, it cannot even fit the given data set. Such an "under-fitted" network is not desirable either. A middle ground has to be found, but it is not possible to know in advance how complex the network should be. Thus, in practice, a variety of networks with different numbers of hidden nodes are trained and then tested over a validation set. The number of hidden nodes that yields the smallest validation error is chosen for final training, which uses not only the training data but also the validation data. This trial-and-error process is tedious and time-consuming, yet critical.

Bagging
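To make Equations (1) and (2) concrete, here is a minimal sketch of the forward pass and the MSE. The layer sizes echo the networks used later in the paper (15 inputs, nine hidden nodes), but the random weights, input, and function names are purely illustrative, not the authors' code.

```python
import numpy as np

def sigmoid(x):
    # Transfer function used in Equation (1): 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, W_hidden, W_output):
    # Equation (1): z_j = sigmoid(sum_i w_ji x_i); y_k = sigmoid(sum_j w_kj z_j)
    z = sigmoid(W_hidden @ x)
    y = sigmoid(W_output @ z)
    return y

def mse(t, y):
    # Equation (2): E(w) = (1/2n) * sum_i (t_i - y_i)^2
    t, y = np.asarray(t), np.asarray(y)
    return float(np.sum((t - y) ** 2) / (2 * len(t)))

rng = np.random.default_rng(0)
W_h = rng.normal(size=(9, 15))   # nine hidden nodes, 15 inputs
W_o = rng.normal(size=(1, 9))    # one output node (response score)
x = rng.normal(size=15)
y = mlp_forward(x, W_h, W_o)     # a score strictly between 0 and 1
```

Training would then adjust W_h and W_o to minimize E(w); the paper uses the Levenberg-Marquardt algorithm for that step.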

The bagging process begins with bootstrapping the training data (Step 1 in Figure 2). Given a data set of N patterns (i.e., customer records), a bootstrap sample is constructed by randomly sampling N patterns with replacement. Due to the replacement, some patterns are picked more than once while others are not picked at all. On average, a bootstrap sample contains about 63% of the distinct patterns of the original data set (Breiman, 1996). Next, each bootstrap data set is used to train a neural network (Step 2 in Figure 2). One then has L neural networks, whose outputs are all potentially different. The final model output is computed by averaging the L outputs for a regression problem, or by majority voting for a classification problem (Step 3 in Figure 2).

Step 1. Generate L bootstrap samples
    {B_i}, i = 1, 2, …, L: a set of bootstrap replicates of the original data set B
    {f_i}, i = 1, 2, …, L: a set of networks in the ensemble
Step 2. Train L different neural networks
    f_i(x, y), i = 1, 2, …, L, (x, y) ∈ B_i: train each learner on its bootstrap sample
Step 3. Aggregate the L outputs
    f_Bag(x) = Σ_{i=1}^{L} α_i f_i(x): aggregate the learners' outputs with weights α_i, or by majority voting

FIGURE 2 Conceptual Model of Bagging With L MLPs, Each Trained With N Patterns

It is obvious that bagging takes at least L times the duration of a single neural network training, plus the bootstrap time. One positive aspect of bagging in terms of computation time is that the training of the L neural networks can be done in parallel, if sufficient hardware and/or software systems are available. Bagging reduces variance (Breiman, 1996), i.e., model variability over different data sets from a given distribution, without increasing bias, which results in a reduced overall generalization error and improved stability. This has been demonstrated in numerous published studies. The other advantage of using bagging is related to model selection. Since bagging transforms a group of over-fitted networks into a better-than-perfectly-fitted network, the tedious, time-consuming model selection is no longer necessary. This could even offset the computational overhead introduced by bagging's training of L neural networks.
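The three steps of Figure 2 reduce to a few lines of generic code. In this sketch the base learner is a toy one-dimensional threshold rule standing in for an over-fitted MLP (the paper's learners are Levenberg-Marquardt-trained MLPs), and all names are ours:

```python
import numpy as np

def bagging_fit(X, y, train_fn, L=25, seed=0):
    # Step 1: draw L bootstrap samples (N draws with replacement);
    # Step 2: train one learner per bootstrap replicate.
    rng = np.random.default_rng(seed)
    N = len(X)
    return [train_fn(X[idx], y[idx])
            for idx in (rng.integers(0, N, size=N) for _ in range(L))]

def bagging_predict(models, X):
    # Step 3: aggregate the L outputs by majority vote (classification).
    votes = np.array([m(X) for m in models])     # shape (L, n)
    return (votes.mean(axis=0) >= 0.5).astype(int)

def train_stump(X, y):
    # Toy base learner: threshold halfway between the two class means.
    pos = X[y == 1].mean() if (y == 1).any() else X.mean()
    neg = X[y == 0].mean() if (y == 0).any() else X.mean()
    thresh = (pos + neg) / 2
    return lambda Z: (np.asarray(Z) > thresh).astype(int)

X = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
y = np.array([0, 0, 0, 1, 1, 1])
ensemble = bagging_fit(X, y, train_stump, L=25)
pred = bagging_predict(ensemble, X)
```

Because each learner sees a different bootstrap replicate, their individual errors differ, and the vote averages those differences away.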

EXPERIMENTAL SETTING

Problem Definition and Data Set

DMEF4 is a public-domain data set provided by the Direct Marketing Educational Foundation (DMEF, 2002). The data set is from an upscale gift business that mails general and specialized catalogs to its customers several times a year. There are two time periods: the "base time period," from December 1971 through June 1992, and the "later time period," from September 1992 through December 1992. Every customer in the "later time period" received at least one catalog in early autumn of 1992. Our task is to build a response model for the October 1992 through December 1992 time period. The data set consists of 101,532 customers (records) and 91 predictor variables (columns). The response rate is 9.4%. Since the experiments were repeated 30 times, the number of customers had to be reduced. Instead of randomly selecting 20% of the customers, we chose to select the 20% of customers who are "important" in terms of their recent purchases. Recency was implemented


using a "weighted dollar amount spent" defined as

    Weighted Dollar = 0.4 × (dollars this year) + 0.3 × (dollars last year)
                    + 0.1 × (dollars 2 years ago) + 0.1 × (dollars 3 years ago)
                    + 0.1 × (dollars 4 years ago)                      (3)

The particular weight values were arbitrarily determined; however, a different choice of values would not alter the outcome of the experiments. The reduced data set has 20,300 "important" customers with an 18.9% response rate. We randomly chose 90% of these data for training. The remaining 10%, or 2,030 records, were set aside for testing. The response rate of the test data set was 17.73%.

Typically, direct mail data have a low response rate. Such data sets are called unbalanced. Neural networks compute the posterior probability of response given the predictor variable values, so they tend to ignore rare classes. In other words, neural networks do not work very well with unbalanced data sets. As Zahavi and Levin (1997a) have pointed out, this is an important issue in applying MLP to target marketing. One common way to circumvent the problem is to balance the training data set, either by subsampling the non-respondent class or by over-sampling the respondent class. We chose the former approach. The resulting training data set consists of 7,326 records with a 50% response rate. For logistic regression (LR), we used both balanced and unbalanced data sets for training, to be fair.
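Equation (3) and the subsampling step can be sketched as follows; the function names and the toy response vector are ours, not the paper's:

```python
import numpy as np

def weighted_dollar(d0, d1, d2, d3, d4):
    # Equation (3): recency-weighted dollars, this year back to 4 years ago
    return 0.4 * d0 + 0.3 * d1 + 0.1 * d2 + 0.1 * d3 + 0.1 * d4

def balance_by_subsampling(y, seed=0):
    # Keep every respondent; randomly subsample the non-respondents
    # down to the same count, yielding a 50% response rate.
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = rng.choice(neg, size=len(pos), replace=False)
    return np.sort(np.concatenate([pos, keep]))

y = np.array([0] * 8 + [1] * 2)      # toy 20% response rate
idx = balance_by_subsampling(y)      # 2 respondents + 2 non-respondents
```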

Input and Output Variables

Selection of input variables is a critical step in response modeling. No matter how powerful a model is, irrelevant input variables lead to poor accuracy. However, that is not the issue we are dealing with in this paper. To be fair and comparable, we chose the 17 input variables that were used to predict response in a previously published study (Malthouse, 2001). The rationale was to remove the important "variable selection" issue from this study. After the 20% sampling, the minimum and maximum values of two variables became equal, so those two were eliminated, leaving a total of 15 input variables (see Table 1).

TABLE 1 Input Variables

NAME      FORMULATION                      DESCRIPTION

Original Variables
Purseas                                    Number of seasons with a purchase
Falord                                     LTD fall orders
Ordtyr                                     Number of orders this year
Puryear                                    Number of years with a purchase
Sprord                                     LTD spring orders

Derived Variables
Recency                                    Order days since 10/1992
Tran53    180 < recency ≤ 270
Tran54    270 < recency ≤ 366
Tran55    366 < recency ≤ 730
Tran38    1/recency
Comb2     Σ_{i=1}^{14} ProdGrp_i           Number of product groups purchased from this year
Tran46    √comb2
Tran42    log(1 + ordtyr × falord)         Interaction between the numbers of orders
Tran44    √(ordhist × sprord)              Interaction between LTD orders and LTD spring orders
Tran25    1/(1 + lorditm)                  Inverse of latest-season items

In order to speed up neural network training, we scaled each variable. For instance, variable r_i was converted to r_i* as follows:

    r_i* = (r_i − min(r_i)) / (max(r_i) − min(r_i))                    (4)

If the orders during fall 1992 (labeled TARORD) were larger than zero, the target variable Response was set to 1; otherwise, it was set to 0. One final transformation we performed, as a typical neural-network preprocessing step, was to apply the log transformation to all the input variables, because they were highly skewed.

Model Selection

For single neural networks, it is vital to choose the proper model complexity. The most common practice is to set aside a part of the training set as a validation set and to choose the best-performing model on that validation set. We set aside 40% of the training data set for validation, trained neural networks with 7 through 17 hidden nodes, and chose the one with nine hidden nodes (see Figure 3). The measure of performance we used was the misclassification error (MCE), defined as

    MCE = [number of (T = 0, O = 1) + number of (T = 1, O = 0)] / total number    (5)

where T denotes the target and O denotes the output of the neural network model. After validation, the validation data were merged back into the training data to train the final model.

For bagging neural networks, on the other hand, the model selection process was not necessary, since an ensemble of over-fitted neural networks tends to cancel out each member's peculiarities resulting from the noise of its particular data set. Thus, we simply chose a large enough number of hidden units, around the number of input variables: 14 in this experiment. A total of 25 bootstrap replicas were generated, each of which was used to train a neural network. A majority-voting scheme was used for aggregating the 25 outputs, as suggested in Breiman (1996). Since training neural networks is stochastic in nature, both single neural networks and bagging neural networks were built 20 times, and their average, best, and worst performances were recorded. Logistic regression, however, was trained only once, since its training is deterministic rather than stochastic.
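Equations (4) and (5) are straightforward to implement; `minmax` and `mce` are our own names, and the data are toy values:

```python
import numpy as np

def minmax(r):
    # Equation (4): r* = (r - min(r)) / (max(r) - min(r))
    r = np.asarray(r, dtype=float)
    return (r - r.min()) / (r.max() - r.min())

def mce(target, output):
    # Equation (5): misclassified patterns over total patterns
    target, output = np.asarray(target), np.asarray(output)
    return float(np.mean(target != output))

scaled = minmax([10.0, 20.0, 30.0])         # -> [0.0, 0.5, 1.0]
error = mce([0, 0, 1, 1], [0, 1, 1, 0])     # -> 0.5
```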

FIGURE 3 MCE Versus Number of Hidden Nodes (validation MCE plotted for 6 to 16 hidden nodes at decision thresholds TH = 0.1 through 0.9)

TABLE 2 Confusion Matrix With O and T Denoting Model Output and Target, Respectively, and NR and R Denoting No Response and Response, Respectively

           O = NR    O = R    SUM
T = NR       A         B      A + B
T = R        C         D      C + D
SUM        A + C     B + D    Total

Evaluation Measures

Calder and Malthouse (2003) suggested fit and performance as criteria for evaluating a scoring model. Fit concerns how close the model output is to the target, while performance concerns how many of the direct mail recipients will actually respond. The fit of a model can be assessed with various measures. Let us first consider the confusion matrix (Table 2), which shows the numbers of predictions of response (R) and non-response (NR) for the model output (O) and the target (T), respectively. Entry A denotes the number of NR outputs when the target is NR, while B denotes the number of R outputs when the target is NR. C denotes the number of NR outputs when the target is R, while D denotes the number of R outputs when the target is R. A model with large A and large D is a better one. The accuracy, defined as the ratio of the sum of the diagonal entries, A + D, over the total number, is a popular fit measure. However, it is an overall measure; it does not distinguish accuracy in predicting R from accuracy in predicting NR. Response (or non-response) precision measures the ratio of D (or A) over B + D (or A + C), while response (or non-response) recall measures the ratio of D (or A) over C + D (or A + B). The performance of a binary classifier with a decision threshold is often plotted as a Receiver Operating Characteristic (ROC) curve. It plots 1 − sensitivity against 1 − specificity for various decision-threshold values, where

    1 − sensitivity = C / (C + D)
    1 − specificity = B / (A + B)                                      (6)
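The fit measures above follow directly from the four entries of Table 2. As a worked check, here they are packaged in a function of our own naming, fed with the Bagging NN cells reported later in Table 3, panel A:

```python
def fit_measures(A, B, C, D):
    # Table 2 layout: A = (T=NR, O=NR), B = (T=NR, O=R),
    #                 C = (T=R,  O=NR), D = (T=R,  O=R)
    total = A + B + C + D
    return {
        "accuracy": (A + D) / total,
        "response_precision": D / (B + D),
        "response_recall": D / (C + D),
        "miss_rate": C / (C + D),          # 1 - sensitivity, Equation (6)
        "false_alarm_rate": B / (A + B),   # 1 - specificity, Equation (6)
    }

# Bagging NN confusion matrix (Table 3, panel A):
m = fit_measures(A=1407, B=263, C=65, D=295)
# m["accuracy"] is 1702/2030 (about 84%) and m["miss_rate"] is 65/360
# (about 18%), matching the figures reported for BMLP in the Results.
```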

Different threshold values lead to different pairs of rates. A decrease in one error causes an increase in the other error, and vice versa. So there is a tradeoff between the two errors. A model is preferred to another if its ROC curve is located closer to the origin. Gains tables or charts are also used to assess the performance of a model. Gain represents the ratio of actual responders over the classified responders. Every customer in the test set is sorted in a descending order based on the response score. Then, they are grouped into 10 roughly equal subsets (deciles) and each subset’s accumulated gain is computed and plotted.
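A gains table of the kind just described can be computed by sorting customers on their model score and accumulating responders decile by decile; the scores and responses below are synthetic, for illustration only:

```python
import numpy as np

def decile_gains(scores, responded):
    # Sort customers by descending score, split into 10 roughly equal
    # deciles, and report the cumulative share of all responders captured.
    order = np.argsort(-np.asarray(scores))
    resp = np.asarray(responded)[order]
    deciles = np.array_split(resp, 10)
    cum = np.cumsum([d.sum() for d in deciles])
    return cum / resp.sum()

rng = np.random.default_rng(1)
scores = rng.random(100)                                  # model scores
responded = (scores + 0.3 * rng.random(100) > 0.8).astype(int)
gains = decile_gains(scores, responded)
# A model better than random rises above the diagonal: gains[0] exceeds 0.1
```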

RESULTS

Model Fit

Table 3 shows the confusion matrices of the four models we built: Bagging NN, Single NN, LR with the balanced data set, and LR with the unbalanced data set. In order to convert a model's output values into a binary decision, a threshold is employed. A lower threshold makes the model more liberal, in the sense that it tends to classify a customer as a respondent; a larger threshold makes the model more conservative, in the sense that it tends to classify a customer as a non-respondent. Various threshold values were employed to compare the accuracies of the different models. To give a glimpse of how the models did, a threshold of 0.5 was used in Table 3. Of the 20 repetitions, the one whose classification accuracy was closest to the average is shown.

TABLE 3 Confusion Matrices of (A) Bagging NN, (B) Single NN, (C) LR With Balanced Data, and (D) LR With Unbalanced Data

(A) Bagging NN
           O = NR    O = R    SUM
T = NR      1407      263     1670
T = R         65      295      360
SUM         1472      558     2030

(B) Single NN
           O = NR    O = R    SUM
T = NR      1297      373     1670
T = R        103      257      360
SUM         1400      630     2030

(C) LR with balanced data
           O = NR    O = R    SUM
T = NR      1377      293     1670
T = R         71      289      360
SUM         1448      582     2030

(D) LR with unbalanced data
           O = NR    O = R    SUM
T = NR      1620       50     1670
T = R        222      138      360
SUM         1842      188     2030

If one considers only the accuracies (the sum of the diagonal elements of the confusion matrix, A + D), with the models producing 84%, 82%, 77%, and 87%, respectively, logistic regression with the balanced data set (BLR) seemed to do best. Let us see how it achieved the best accuracy. The sum A + C corresponds to the number of customers that the model classified as non-respondents. It amounted to 1,842, which, compared to the other models, is far too large; the second-best model, Bagging MLP (BMLP), classified a mere 1,472 as non-respondents. The tendency of BLR to classify a customer as a non-respondent helped it achieve high accuracy, since the test data set is highly unbalanced: 1,670 non-respondents vs. 360 respondents. Note that a "blind" classifier that classifies every customer as a non-respondent would achieve an accuracy of 82% (1,670/2,030). BLR pays the price in the form of a high miss rate, or false negative rate, of 62% (222/360), which translates into a huge business opportunity cost. BMLP's miss rate is only 18% (65/360).

Figure 4 displays ROC charts for the best, worst, and median cases out of the 20 repetitions. Each curve was

formed by connecting adjacent points, each of which corresponds to a (false positive, false negative) pair obtained from one of nine threshold values ranging from 0.1 through 0.9. Three observations can be made. First, the neural-network-based models tend to lie near the origin, i.e., they have smaller error measures. Second, BLR and UBLR did not seem to differ much in terms of false positives and false negatives. Third, the single MLP (SMLP) did better than any other model in the best case, but did almost worst in the worst case. In contrast, the bagging MLP (BMLP) did best in the worst and median cases. That is exactly what we set out to demonstrate by using BMLP in the first place: building a model that is good as well as stable in fit.

FIGURE 4 ROC Curves for Four Models: BLR, UBLR, SMLP, and BMLP (panels a-c: best, worst, and median cases)

FIGURE 5 Pairwise Consideration of Accuracy, Precision, and Recall Between Single MLP and Bagging MLP

The stability of BMLP is clearly shown in Figure 5, which plots the accuracy, precision, and recall of the single MLP versus the bagging MLP for a threshold value of 0.5. The dotted horizontal lines inside the boxes denote the median values, while the solid horizontal lines denote the averages. The top and bottom of each box correspond to the 75th and 25th percentiles, respectively, while the tops and bottoms of the lines correspond to the maximum and minimum values. In all performance measures, the averages of BMLP were higher than those of SMLP, while the ranges were much smaller. The differences were verified as statistically significant by t tests.

Model Performance

Model performance may be measured in terms of cumulative gains charts, sometimes referred to as "banana charts." Figure 6 shows the cumulative gains made by the four response models we built. The straight line marked with * corresponds to a "random" model, or no model (i.e., the first 10% of customers yield only 10% of the responses, the first 20% yield only 20%, and so on). Curves located farther above the straight line are preferred (i.e., yielding a "fatter banana"). Here we can also make three observations. First, the neural-network-based models tend to lie farther from the straight line, i.e., they have larger gains. Second, BLR and UBLR did not seem to differ much in terms of accumulated response rate. Third, the single MLP (SMLP) did better than any other model in the best case, but did not do well in the worst case. In contrast, the bagging MLP (BMLP) did best in the worst and median cases. Identical observations were thus made in terms of performance as well as fit. We found that BMLP gave the best as well as the most stable performance.

FIGURE 6 Gains Chart for Four Models: BLR, UBLR, SMLP, and BMLP

CONCLUSION

In this paper, we introduced bagging as a "principled" way to combine multiple neural networks. The approach was applied to a real-world marketing data set and was compared to other methods, namely the single MLP (SMLP) and logistic regression (LR), in terms of fit and performance. Fit was measured with confusion matrices and ROC charts, while performance was measured with gains charts. Among the four, the proposed BMLP came out as the best model overall. LR trained with balanced data achieved a higher

accuracy than BMLP, but at the expense of an intolerably high miss rate, thus resulting in a potentially huge loss of business opportunities. Concrete data on mailing cost and opportunity cost would have enabled us to compute and compare the exact revenue and cost for the competing models. Single MLP yielded better fit and performance than bagging MLP in one run out of 20 repetitions, but it did much poorer than bagging MLP in its worst and median cases. The stability of the model was another property that we set out to examine. Both range and variance of the model fit out of 20 repetitions were smaller for BMLP. The third desirable property of BMLP was that it requires neither a time-consuming trial and error procedure of model selection nor the modeling expertise that is usually required in building a single MLP. However, there still exist limitations of the proposed bagging MLP for response modeling in direct marketing or other marketing situations. Just like a single MLP, bagging MLP is a black box model that sheds little light on what is going on inside the model. The input variables are combined in a complicated, nonlinear way to produce an output. Subsequently, the 25 outputs are combined again to produce the final output by averaging or majority voting. Those marketers who would like to understand how individual predictor variables influence the target and how they

RESPONSE MODELS AND NEURAL NETWORKS

29

interact might be baffled by the neural network model’s inability to provide any insight in that regard. One way to circumvent this difficulty is to run two models, i.e., a neural network model for prediction and a decision tree or regression model for understanding.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer Feedforward Networks Are Universal Approximates. Neural Networks, 2, 359–366. Jang, M., & Cho, S. (2002). Observational Learning Algorithm for an Ensemble of Neural Networks. Pattern Analysis & Applications, 5, 154–167.

Future research should include investigation of other “principled” ways to combine models, e.g., boosting with simple models such as logistic regression or decision trees.
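The bagging procedure described above — train each learner on a bootstrap replica of the training set, then average the outputs — can be sketched as follows. The `fit` and `predict` callables are hypothetical stand-ins for any base learner; the paper trained 25 over-fitted MLPs, while the toy learner below merely memorizes each replica's base response rate.

```python
import numpy as np

rng = np.random.default_rng(0)

def bagged_predict(X, y, fit, predict, n_models=25):
    """Bagging sketch: train n_models learners on bootstrap replicas
    of (X, y) and average their predicted scores."""
    n = len(y)
    all_preds = []
    for _ in range(n_models):
        idx = rng.integers(0, n, size=n)      # bootstrap replica: n rows drawn with replacement
        model = fit(X[idx], y[idx])           # over-fitted learners are acceptable here,
        all_preds.append(predict(model, X))   # since averaging damps their variance
    return np.mean(all_preds, axis=0)         # averaged score; threshold at 0.5 for a label

# Toy stand-in learner (hypothetical): each "model" is just the bootstrap
# sample's base response rate. A real BMLP would fit an MLP at this step.
fit = lambda Xb, yb: yb.mean()
predict = lambda model, X: np.full(len(X), model)
```

Averaging the 25 scores, rather than majority voting on 25 class labels, is the variant emphasized in the text; the loop structure is identical either way.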
