PERFORMANCE-BASED VARIABLE SELECTION FOR SCORING MODELS

Edward C. Malthouse

ABSTRACT

The performance of a direct marketing scoring model at a particular mailing depth, d, is usually measured by the total amount of revenue generated by sending an offer to the customers with the 100d% largest scores (predicted values). Commonly used variable selection algorithms optimize some function of model fit (squared difference between training and predicted values). This article (1) discusses issues involved in selecting a mailing depth, d, and (2) proposes a variable selection algorithm that optimizes performance as the primary objective. The relationship between fit and performance is discussed. The performance-based algorithm is compared with fit-based algorithms using two real direct marketing data sets. These experimental results indicate that performance-based variable selection is 3–4% better than corresponding fit-based models, on average, when the mailing depth is between 20% and 40%.

© 2002 Wiley Periodicals, Inc. and Direct Marketing Educational Foundation, Inc. JOURNAL OF INTERACTIVE MARKETING, VOLUME 16 / NUMBER 4 / AUTUMN 2002. Published online in Wiley InterScience (www.interscience.wiley.com). DOI: 10.1002/dir.10043


EDWARD C. MALTHOUSE is Assistant Professor, Integrated Marketing Communications, at the Medill School of Journalism, Northwestern University, Evanston, Illinois.


INTRODUCTION

Direct marketing scoring models predict some dependent variable from other variables. For example, catalog companies use scoring models to determine who should receive an upcoming catalog. The dependent variable usually is either response (whether or not someone will respond if sent a catalog) or demand (amount spent if sent a catalog). If the company is picking from names that have made purchases before, variables usually include various versions of recency (how recently a customer has made a purchase), frequency (number of previous purchases), and monetary value (dollar amount spent in the past). Not-for-profit organizations use similar models to determine which people they should solicit for donations. E-commerce companies that use direct mail to drive traffic to a website use scoring models to determine who should receive a promotion. Credit card companies use scoring models to determine who should receive an offer for a new credit card. Publishers use scoring models when prospecting for new subscribers or selling memberships to book clubs and the like. Insurance companies also use scoring models to generate leads for agents. See Shepard (1999) for further discussion of scoring models.

Scoring models are usually implemented using some sort of regression or classification method (see, e.g., Bult, 1993; Bult & Wansbeek, 1995; Colombo & Jiang, 1999; Hansotia & Wang, 1997; Magidson, 1988; Malthouse, 1999; Zahavi & Levin, 1997). This article focuses on the variable selection problem. When building a scoring model there are usually a large number of candidate predictor variables available. The variable selection problem involves choosing a subset of these candidate variables to be included in the model. For example, recency, frequency, and monetary value are known to be strong predictors in many situations. These are referred to as RFM variables. A catalog company might have many versions of RFM variables to choose from. If the catalog features clothing for women, should it define the RFM variables based on women's clothing purchases or all purchases? How should frequency be defined? Should purchases from 20 years ago be included, or should frequency be defined as the number of purchases in the past three years? Should monetary value be defined by average order size, or total dollar amount?

As a second example, consider prospecting scoring models, where an organization is attempting to establish a relationship with a customer for the first time. In this case, the organization has no information on RFM variables. It can usually get a large set of overlay variables including demographics, hobbies, interests, etc.

This article focuses on scoring models with a dependent variable that is an amount. There are similar issues in variable selection for dichotomous (e.g., response or nonresponse) or polytomous responses. Methods for response models are mentioned briefly in the discussion section.

Stepwise regression is the most commonly used tool by practitioners. Implementations in commercial statistical and data mining software packages make decisions about which variables should be included based on the fit of the model, defined as some function of the expected squared difference between a future amount and its prediction from the model:

$$E(y - \hat{y})^2 \qquad (1)$$

Optimizing fit, however, is usually not the primary objective of a scoring model. Suppose there are N people in the database and that n < N offers are to be mailed. The depth of the mailing is d = n/N. Specific issues that must be considered in selecting the value of d are discussed in a later section. The objective of a direct marketing scoring model is to choose n names such that the expected total sales generated from these offers is maximized. I refer to this objective as scoring model performance. The reasoning goes that a model providing a good fit to the data will rank the customers so that the scoring model performance is also good, that is, good fit implies good performance. This strategy is intuitively appealing and has been highly successful in practice. This article examines whether one can do better by selecting variables to optimize scoring model performance directly. It proposes a performance-based variable selection algorithm and compares it with traditional fit-based selection methods.

ILLUSTRATIVE EXAMPLE

This section describes an example that motivates the need for performance-based measures. The example uses the Direct Marketing Educational Foundation data set 1 (DMEF1), which is also used later in this article for a more thorough evaluation of the methods considered here. DMEF1 is a data set from a charity that uses direct mail to solicit donations. The objective is to predict how much someone will donate in response to a specific solicitation (the dependent variable) based on their prior donation history (predictor variables), which includes many different flavors of recency, frequency, monetary value, and promotions mailed in the past. I applied stepwise regression to a random sample of 33,067 observations from this data set. At each step, stepwise regression adds the variable that increases the value of R-squared the most, provided that the increase is "significant." It is implicitly attempting to optimize the fit of the model. The first four variables that enter the model produce an overall R-squared of .0978. If we were to use this four-variable model to rank the file from best to worst and we mailed solicitations to the top 20%, we would generate an average of $4.73 per solicitation.

I also applied the performance-based forward selection algorithm proposed in this article to the same data. At each step this algorithm selects the variable that increases the amount of money generated by the mailing the most. Thus it attempts to maximize the performance of the model. The first four variables that enter the model have a lower R-squared value of .0954, but the model does a better job of identifying the donors, generating an average of $4.90 per solicitation from the top 20% of the file. This is a difference of $.17 per solicitation. If the database has 5 million names and the charity intends to send 1 million solicitations, the performance-based model would generate $170,000 more than the fit-based model.

SCORING MODEL PERFORMANCE

This section gives a rigorous definition of scoring model performance and how it can be estimated. The scoring models considered here predict the behavior of a group of customers in response to some marketing intervention. We consider only behavior measurements that are amounts, for example, dollars spent in response to an offer, solicitation, etc. Training data for the specific behavior are often unavailable and a proxy must be used. For example, catalog companies often do not have data on how customers will respond to a particular catalog they will be mailing, but they might have data on the responses to a similar catalog they mailed in the past. There are often many differences between the previous catalog and the one under consideration, including the merchandise, people modeling the merchandise, competitive offers in the market, economic conditions, etc. As a result, practitioners often find that the absolute predictions are biased one way or the other, but that the relative ranking of the customers can still help the company pick better prospects and have a more profitable mailing.

Suppose we have a training data set consisting of pairs $(x_i, y_i)$, where $i = 1, \ldots, N$, $x_i = (x_{i1}, \ldots, x_{ip})'$ contains measurements of p candidate variables on subject i, and $y_i$ is the value of the dependent variable. Likewise, suppose we have a test set $(x^*_i, y^*_i)$, where $i = 1, \ldots, N^*$. I assume that the training and test sets are both sampled from the same universe. A scoring model is a function f that maps x vectors into unidimensional scores

$$\hat{y}_i = f(x_i) \quad \text{and} \quad \hat{y}^*_i = f(x^*_i)$$

where $\hat{y}_i$ and $\hat{y}^*_i$ are the scores for training and test set customer i, respectively. I assume that larger values of f are associated with better customers. Scoring models are fitted using the training data and evaluated using the test data.
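As a concrete illustration of this setup, the following sketch (mine, not the paper's) fits a linear scoring model on training data and ranks test-set customers by score. The array names X_train, y_train, and X_test are hypothetical:

```python
import numpy as np

def fit_linear_scoring_model(X, y):
    """Least-squares estimate of a linear scoring model (with intercept)."""
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return coef

def score(coef, X):
    """Map predictor vectors x into unidimensional scores y-hat = f(x)."""
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Fit on the training set, score the test set, and rank customers so that
# the largest scores come first (the customers who would be mailed first):
# coef = fit_linear_scoring_model(X_train, y_train)
# ranked = np.argsort(-score(coef, X_test))
```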


Let $[\cdot]$ be a function that orders the values of $\hat{y}^*$ so that $\hat{y}^*_{[1]} \ge \hat{y}^*_{[2]} \ge \cdots \ge \hat{y}^*_{[N^*]}$. I now define scoring model performance. Suppose we wish to mail to 100d% of the customers within the universe, referred to as "mailing to a depth of d." The scoring model performance at depth d is the cumulative amount of revenue we expect to generate from such a mailing. Practitioners use a gains table to evaluate this quantity (see, e.g., Shepard, 1999, pp. 341–344). This is usually estimated using the test-set data. If all of the scores ($\hat{y}^*_i$) were unique, the cumulative revenue for mailing depth d would be easy to compute:

$$R^*_d = \sum_{i=1}^{\lfloor dN^* \rfloor} y^*_{[i]}$$

where $\lfloor \cdot \rfloor$ denotes the greatest integer function.¹ In practical modeling situations the scores are sometimes not unique. This is often the case with tree-based models or models where the predictor variables take only a small number of possible values; for example, in the data set used in the empirical evaluation, frequency takes only values from {1, 2, …, 48, 51, 55} and recency (measured in months) takes only 110 possible values. With large training and test sets and only discrete predictors in the model, there are bound to be ties in scores. When large groups of people all have the same score, the details of computing $R^*_d$ are a bit tedious. These details are given at the end of the next section.

¹ The greatest integer function returns the largest integer that is less than or equal to the argument.

FORWARD SELECTION OPTIMIZING PERFORMANCE

The variable selection problem amounts to selecting a subset of $q \le p$ candidate variables to be included in the scoring model. Let set $\mathcal{S} \subset \{1, \ldots, p\}$ contain the indices of the variables selected. Assume that each possible subset of variables is associated with exactly one scoring model. The scoring model associated with a particular subset $\mathcal{S}$ will be denoted by $f_{\mathcal{S}}$, that is, the functional form and method of estimation are fixed. Consider two examples. When a linear functional form is used,

$$f_{\mathcal{S}}(x) = a + \sum_{j \in \mathcal{S}} b_j x_j \qquad (2)$$

where $a$ and $b_j$ ($j \in \mathcal{S}$) are estimated to satisfy some objective function such as least squares. When a neural network is used,

$$f_{\mathcal{S}}(x) = a_0 + \sum_{k=1}^{K} c_k\, \sigma\!\left(a_k + \sum_{j \in \mathcal{S}} b_{kj} x_j\right)$$

where K is the number of hidden nodes, $\sigma(\cdot)$ is a sigmoidal function, and $a_k$, $b_{kj}$, and $c_k$ are weights that must be estimated under some objective function.

The forward selection algorithm is one way to arrive at a set $\mathcal{S}$. Forward selection is a greedy algorithm (Horowitz & Sahni, 1978, Chapter 4) that arrives at $\mathcal{S}$ by adding one variable at a time. In a particular step, it considers all variables that are not currently in the model and evaluates how much each would improve the model. It then augments the set of predictors with the variable that yields the greatest improvement, provided that the improvement is "substantial." The algorithm terminates either when all variables are in the model, or when no remaining variable yields a "substantial improvement."

Step 0 (Initialize) Set $\mathcal{S} = \emptyset$, i.e., begin with no variables in the model.

Step 1 (Find variable giving greatest improvement) Let

$$j' = \underset{j \in \bar{\mathcal{S}}}{\operatorname{argmax}}\ g(\mathcal{S} \cup \{j\})$$

where $\bar{\mathcal{S}}$ is the complement of $\mathcal{S}$ and $g(\mathcal{S})$ is some measure of the quality of scoring model $f_{\mathcal{S}}$. Assume that larger values of g are associated with better models.

Step 2 If $g(\mathcal{S} \cup \{j'\})$ is a "substantial improvement" over $g(\mathcal{S})$, then set $\mathcal{S} \leftarrow \mathcal{S} \cup \{j'\}$; otherwise terminate.

Step 3 If $\bar{\mathcal{S}} \ne \emptyset$, return to step 1; otherwise terminate.
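The following sketch shows one way this greedy loop might be implemented for the linear form in (2). It is an illustration under stated assumptions, not the paper's code: the names forward_select and fit_and_score are mine, g stands for a performance measure like the one defined below, the termination rule is the "any improvement" criterion proposed later, and the tie-breaking rule discussed below is omitted for brevity.

```python
import numpy as np

def fit_and_score(X, y, X_new):
    """OLS fit of the linear form in (2) on (X, y), then score X_new."""
    X1 = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return np.column_stack([np.ones(len(X_new)), X_new]) @ coef

def forward_select(X_train, y_train, X_test, y_test, g):
    """Greedy forward selection: repeatedly add the candidate variable that
    most improves the performance measure g (larger is better); stop when
    no candidate yields any improvement."""
    p = X_train.shape[1]
    selected, best_g = [], -np.inf
    while len(selected) < p:
        # Step 1: evaluate g(S ∪ {j}) for every variable j not in the model.
        trials = {}
        for j in (j for j in range(p) if j not in selected):
            cols = selected + [j]
            scores = fit_and_score(X_train[:, cols], y_train, X_test[:, cols])
            trials[j] = g(scores, y_test)
        j_best = max(trials, key=trials.get)
        # Steps 2-3: accept only a strict improvement, otherwise terminate.
        if trials[j_best] <= best_g:
            break
        selected.append(j_best)
        best_g = trials[j_best]
    return selected
```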


With forward selection for linear regression, g is usually defined in terms of fit, for example, the familiar sequential F statistic (Neter, Kutner, Nachtsheim, & Wasserman, 1996, section 7.1). Substantial improvement is taken to mean that we can reject the null hypothesis that the coefficient for variable j′ is 0 at some specified level of significance. This version of forward selection shall be called fit-based forward selection, in contrast to the performance-based forward selection algorithm proposed in this article. See Miller (1990) for a thorough discussion of fit-based selection methods. Breiman (1996) discusses the use of test sets and cross validation with fit-based model selection algorithms. It is interesting to note that there are several commonly used definitions of g for logistic regression. For example, SPSS offers several options, and proc logistic in SAS uses a different definition than the step function in S-Plus.

This article considers definitions of g based on scoring model performance instead of fit.² Before providing a definition, it is necessary to note that scoring models are often used more than one time. A scoring model for, e.g., a back-to-school catalog may be used in several subsequent years. The mailing depth may vary from year to year, so a scoring model should perform well within a range of values of d. If a model were to be used only at a fixed depth, the natural way to define g is the cumulative revenue in the test set, $R^*_d$, where d is the desired depth. The training set is used to estimate models and the test set is used for evaluating the performance of the model. I have done some testing of this definition of g and found that it selects models that perform better at the particular depth under consideration than fit-based forward selection algorithms, as measured on a third sample of data that was not used for training or for model selection (hereafter referred to as the holdout sample). The performance of these models, however, deteriorates relative to fit-based selection at other depths; that is, when a model is chosen to perform well at a particular mailing depth, the resulting model does perform well at this level, but does not do as well as fit-based selection at other depths.

Models are often used multiple times with different mailing depths. In practice, it is desirable to select a model that performs well over a range of depths. When a model is to be used for a range of depths, I propose defining g as an average of $R^*_d$ values over the range. Figure 1 shows a plot of $R^*_d$ against mailing depth for one of the models discussed in the evaluation section. Suppose that the model is to be used in the 20–40% depth range. The area under the curve within the range of interest is a reasonable way to define g. The details for computing this area are given at the end of this section.

FIGURE 1
Plot of Cumulative Revenue ($R^*_d$) Against Mailing Depth (d) for an Example Model

The meaning of "substantial improvement" for the performance-based forward selection algorithm is taken to be any improvement; that is, step 2 of the algorithm becomes: if $g(\mathcal{S} \cup \{j'\}) > g(\mathcal{S})$ then $\mathcal{S} \leftarrow \mathcal{S} \cup \{j'\}$; otherwise terminate. It is possible that there will be ties in step 1, which identifies the candidate variable giving the greatest improvement. We need a tie-breaking rule: when there are ties, select the variable that improves fit the most.

² Software available from http://www.medill.northwestern.edu/faculty/malthouse.


Ties are very common with some classes of functional forms. For example, consider the linear functional form in (2) and suppose that the two best candidate variables are recency (years since last donation), labeled "linear," and 1/(recency + 1), labeled "inverse." Figure 2 shows the relationship between the dependent variable and recency for one of the models in the next section. The dependent variable is the logarithm of the dollars generated by the offer (labeled targdol). The logarithm is used as a variance-stabilizing transformation, as the dispersion of an amount often increases with its mean. It also reduces the influence of extreme values of the dependent variable on the least-squares estimation procedure. The plot shows the fit of a smoothing spline with 4 degrees of freedom. Of the three fits shown, the spline is thought to be the best estimate of the relationship (as summarized by E(log(targdol) | recency)), since splines offer the most flexible functional form of the three models. Thus, consider the spline to be the "ideal" fit.

FIGURE 2
Illustration Showing the Dollar Amount Predictions Using Three Different Versions of Recency

A problem with splines is that it is difficult to compute scores (evaluate the estimated function) in a relational database package. Relational database packages can easily compute products (necessary for the linear model), quotients (necessary for the inverse model), and some transcendental functions such as the logarithm. As a result, linear transformations of variables or functions of variables are preferred by many practitioners. The fitted values of the linear and inverse models are also plotted. Clearly 1/(recency + 1) gives a better fit than the linear model, since its fit is closer to that of the spline. It is interesting to note that in the first pass through the performance-based forward selection algorithm, the linear and inverse models will be tied, since one is a monotonic transformation of the other. Even though the linear model gives negative predictions for many values of recency, both will produce identical rank orderings of customers.
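A quick numerical check of this tie, using hypothetical recency values and hypothetical fitted coefficients: the two fitted score versions differ in value but order customers identically, so they produce identical gains tables and hence identical performance.

```python
import numpy as np

recency = np.array([0, 1, 2, 5, 10, 25])  # hypothetical years since last donation
linear = -1.2 * recency + 0.5             # a fitted "linear" score (negative slope)
inverse = 3.0 / (recency + 1) - 0.4       # a fitted "inverse" score

# One is a monotonic transformation of the other, so both rank customers
# identically even though the fitted values themselves differ.
assert np.array_equal(np.argsort(-linear), np.argsort(-inverse))
```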

The introduction stated that the rationale for using fit-based variable selection was that good fit implies good performance. This example shows that the converse of this statement need not be true; that is, good performance does not necessarily imply good fit. Ideally, the final model should perform well and have a good fit. This tie-breaking rule would seem to help achieve this goal.

The proposed definition of g is related to the Gini coefficient, which is a measure of concentration (Kendall & Stuart, 1977, pp. 49–50; Mulhern, 1999, p. 33). The Gini coefficient is proportional to the area between the cumulative revenue curve in Figure 1 and a diagonal line across all mailing depths. The main difference is that the proposed function g considers only the area within a range of mailing depths. If a scoring model is never to be used for certain mailing depths, these depths should not be considered when selecting variables.

I now give the details on computing $g(\mathcal{S})$, corresponding to the area under the curve in Figure 1. Denote the lower and upper bounds for the range of mailing depths by L and U, respectively ($0 \le L < U \le 1$). Group the observations by unique score values ($\hat{y}^*_i = f_{\mathcal{S}}(x^*_i)$), denote the unique values by $\hat{u}_h$, and sort the values in descending order, i.e., $\hat{u}_1 > \hat{u}_2 > \cdots$. Let $m_h$ be the number of observations in group h (note that $\sum_h m_h = N^*$). Let $s_h$ be the sum of all $y^*_i$ such that $\hat{y}^*_i = \hat{u}_h$. Compute cumulative counts and totals as follows:


$$c_h = \sum_{k=1}^{h} m_k \quad \text{and} \quad t_h = \sum_{k=1}^{h} s_k$$

Define weights $w_h$ ($h = 1, 2, \ldots$) as

$$w_h = \begin{cases} 0 & \text{if } c_h/N^* < L \text{ or } c_{h-1}/N^* > U \\[4pt] \dfrac{c_h/N^* - L}{U - L} & \text{if } c_{h-1}/N^* < L \text{ and } L \le c_h/N^* \le U \\[4pt] \dfrac{m_h/N^*}{U - L} & \text{if } c_{h-1}/N^* \ge L \text{ and } c_h/N^* \le U \\[4pt] \dfrac{U - c_{h-1}/N^*}{U - L} & \text{if } L \le c_{h-1}/N^* \le U \text{ and } c_h/N^* > U \\[4pt] 1 & \text{if } c_{h-1}/N^* < L \text{ and } c_h/N^* > U \end{cases}$$

where $c_0 = t_0 = 0$. With these definitions,

$$g(\mathcal{S}) = \sum_h w_h \hat{u}_h$$
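For illustration, the sketch below estimates the depth-range performance measure numerically: it builds the grouped gains curve (handling tied scores by letting revenue accrue at each group's average rate) and averages $R^*_d$ over a grid of depths in [L, U], rather than evaluating the closed-form grouped weights above. The function names are mine, not the paper's.

```python
import numpy as np

def gains_curve(scores, amounts):
    """Grouped gains curve: depths c_h/N* and cumulative revenue t_h after
    each tie group of scores, with groups sorted by descending score."""
    order = np.argsort(-scores, kind="stable")
    s, a = scores[order], amounts[order]
    new_group = np.r_[True, s[1:] != s[:-1]]        # True at each group start
    group_id = np.cumsum(new_group) - 1
    m = np.bincount(group_id)                       # group sizes m_h
    rev = np.bincount(group_id, weights=a)          # group revenues s_h
    depth = np.r_[0.0, np.cumsum(m) / len(scores)]  # c_h / N*
    cumrev = np.r_[0.0, np.cumsum(rev)]             # t_h
    return depth, cumrev

def g_measure(scores, amounts, L=0.2, U=0.4, grid=200):
    """Average R*_d over depths d in [L, U]; linear interpolation between
    group boundaries lets revenue accrue at each group's average rate."""
    depth, cumrev = gains_curve(np.asarray(scores), np.asarray(amounts))
    d = np.linspace(L, U, grid)
    return np.interp(d, depth, cumrev).mean()
```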

Computational complexity. The proposed performance-based variable selection algorithm is computationally more difficult than the corresponding fit-based methods. The SWEEP algorithm allows variables to be added to or deleted from a linear regression model estimated under OLS without recomputing the entire regression (Thisted, 1988, Sections 3.4.3, 3.12.3.2). While this algorithm can also be used to estimate the models for performance-based selection, evaluating model performance is substantially more difficult than evaluating fit. The current implementation generates a gains table for each model considered. Generating a gains table requires a sort of n items. Selecting the first variable requires computing p gains tables, the second variable p − 1 gains tables, etc. Quicksort requires roughly n log n operations on average (Sedgewick, 1992, p. 115), so the sorts of the current implementation require roughly

$$\sum_{j=0}^{q-1} (p - j)\, n \log n = \frac{q(2p - q + 1)}{2}\, n \log n$$

operations. Users should be careful to offer a modest number of candidate predictor variables. An open research question is to develop algorithms to evaluate the performance of "families" of models in a more computationally efficient way.

Ratner (1999) presents a method called DMAX that seems similar to the method proposed here. Details are sketchy and no further references are provided. DMAX seems to use genetic algorithms to optimize the mailing for a specific mailing depth, specified a priori by the user.

SELECTING A SOLICITATION DEPTH

This section discusses issues involved in picking a solicitation depth d. Ideally, the solicitation depth should be the break-even point, where the marginal cost of the solicitation equals the expected revenue generated from the contact. Several recent articles on scoring models also include discussion of solicitation depth and use the break-even point objective (e.g., Bult & Wansbeek, 1995, p. 381; Colombo & Jiang, 1999, p. 4). Some of the articles, however, use narrow definitions of marginal cost and expected revenue. They take marginal cost to be the cost of the specific contact and marginal revenue to be the probability of response multiplied by the unit contribution (expected order amount less cost of goods). These definitions lead to maximizing the short-term profit generated from the mailing, but can produce suboptimal circulation plans when other factors are considered. In practice, circulation planners use broader definitions of marginal costs and revenues. Many of the factors that determine costs and revenues are difficult to estimate reliably at the time of mailing, which makes them difficult to include in models. Similar discussion can be found in Malthouse (2002).

● Long-term strategic goals. Corporations often have long-term strategic goals that supersede maximizing short-term profits.

● A catalog company may have a long-term goal to increase in size. Perhaps it has already invested in infrastructure such as Internet servers, call centers, fulfillment centers, warehouses, etc., and it needs to increase the number of active customers to justify these investments. The definitions of cost and revenue are complicated. In the short term, such a company might need to mail below the point where the unit contribution equals the cost of the mailing. If short-term profit maximization were Amazon.com's goal, would Amazon have been unprofitable for so long?

● A credit card company may have a long-term strategic goal to increase its market share by acquiring new cardholders. Estimating revenues is complicated because they depend on many other factors. How much will new cardholders use their card, if at all? Will they carry revolving balances or will they pay the full amount of their balance each month? If the new membership was generated by offering a low interest rate on a balance transfer, will the new member attrite or move the balance to another card when the interest rate is raised?

● Long-term customer value. It is well known (e.g., see Berger & Nasr, 1998; Dwyer, 1997) that the long-term value of a customer is often greater than the unit contribution (price paid less cost of goods) and should be considered when making circulation decisions.

● Inventory/capacity considerations. Catalog and Internet companies that sell merchandise must consider available inventory levels when making circulation decisions. Inventory considerations may prompt these companies to mail above or below the point where unit contribution equals the cost of the contact.


The lead time for acquiring additional units of merchandise can be several months, which is unacceptably long when the products are seasonal, e.g., trendy Christmas toys or summer clothing. If a company knows that it cannot fulfill the orders it would generate by mailing to the point where unit contribution equals contact cost, it should mail fewer names. Unfulfilled orders would seem to have a negative effect on customer loyalty and the company's reputation. Conversely, a company that has ordered too much inventory may have to mail more names to sell the merchandise. Likewise, service providers that use direct marketing programs to generate business should consider their existing capacity and their ability to increase capacity quickly when making circulation decisions.

● Contact strategy constraints. A company that attempts to implement longitudinal contact strategies such as those discussed in Kestnbaum, Kestnbaum, and Ames (1998) will usually not mail to the point where unit contribution equals the cost of the mailing. Contact strategies argue against "effort-by-effort" circulation planning. They involve planning a sequence of contacts over time to achieve some goal. For example, a catalog company that sends out eight catalogs and email/postcard promotions during a six-month season may want to determine which combination of contacts each customer should receive to maximize profitability over the entire season. Here, it is difficult to estimate expected revenues because of the possible interactions between the contacts. Lester Wunderman's "curriculum marketing" programs (Wunderman, 1996, pp. 226–244, 260–263) are akin to longitudinal contact strategies. These programs are useful when more than one customer contact is necessary to produce a sale. Wunderman gives the example of using direct mail in the process of selling cars.

● Biased estimates. Scoring models estimate unit contribution and response probabilities using data from a previous mailing. For example, scoring models for an upcoming spring fashion catalog are usually estimated by modeling response and unit contribution for the spring fashion catalog that was mailed during the previous year. In using data from the previous year, the modeler implicitly assumes that nothing has changed between the two years, which is often not realistic. Many factors change, including the merchandise offered in the catalog, what items are trendy during a particular year, economic conditions, competition, weather conditions, etc. These factors can bias the predictions from a model. For example, sales of winter clothing are higher when a winter is severe. If the competition in a market is more intense this year than last, response probabilities estimated from last year's data may be systematically high. Despite these possible biases, the relative ordering of the customers from a scoring model is usually still useful.

● Rate-base constraints. Publishers of magazines usually have contracts with advertisers guaranteeing a certain paid circulation. If the circulation of the magazine falls below this figure, the publisher incurs a financial penalty. In order to make rate base, publishers often must mail below the point where unit contribution equals contact cost. Here, the definition of mailing cost is particularly complicated; it takes one value if the company makes rate base and, if not, another value that depends on the number of responders.

EMPIRICAL EVALUATION

I evaluate the performance-based forward selection algorithm using real data provided by the Direct Marketing Educational Foundation (DMEF). The DMEF makes available several data sets for research and teaching purposes. The first evaluation uses the DMEF1 data set, which is from a non-profit organization that uses direct mail to solicit additional contributions from past donors. There are 99,200 observations in the file. The behavior to be modeled is the total amount of donations between 10/95 and 12/95. Predictor variables include donation and solicitation history from 10/86 to 6/95. I defined p = 20 good candidate variables. The details of these variables are given in the appendix. The question is, which version of forward selection, fit-based or performance-based, selects better models? I repeated the following procedure 30 times in an attempt to answer the question (a sketch of one replication appears after the list):

1. Split the 99,200 available observations into three data sets (training, test, and holdout). The training and test sets had N = N* = 33,067 observations and the holdout had 33,066 observations. A stratified sampling procedure was used to ensure that all three data sets had almost equal numbers of responders.

2. Select a set of variables using the performance-based algorithm. The training set was used to estimate a linear model of the form in (2) and the test set to estimate scoring model performance. The variable tran13 was forced in as the first variable in the model. The depth range was L = .2 and U = .4. This range is reasonable because scoring models are often used by practitioners to select prospects between these depths.

3. Use fit-based forward selection with the significance level to enter the model set at a conservative .05. All observations in the training and test sets were used to estimate this model.

4. Compute the scoring model performance of the two models using the holdout sample.
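A compact sketch of one replication, reusing the forward_select, g_measure, and fit_and_score sketches from earlier. The splitting details and the simple "all candidates" stand-in for fit-based selection are assumptions for illustration, not the paper's exact protocol:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def one_replication(X, y, seed):
    """One of the 30 replications: stratified train/test/holdout split,
    variable selection two ways, and a holdout performance ratio."""
    resp = (y > 0).astype(int)  # responder indicator used for stratification
    X_tr, X_rest, y_tr, y_rest, r_tr, r_rest = train_test_split(
        X, y, resp, train_size=1/3, stratify=resp, random_state=seed)
    X_te, X_ho, y_te, y_ho = train_test_split(
        X_rest, y_rest, train_size=0.5, stratify=r_rest, random_state=seed)

    # Performance-based selection: fit on train, evaluate g on test.
    perf_vars = forward_select(X_tr, y_tr, X_te, y_te, g=g_measure)

    # Stand-in for fit-based selection (the paper uses a sequential F test
    # at the .05 level on train+test; using all candidates is a placeholder).
    fit_vars = list(range(X.shape[1]))

    # Score the holdout sample with both models and compare performance.
    X_fit, y_fit = np.vstack([X_tr, X_te]), np.r_[y_tr, y_te]
    perf_scores = fit_and_score(X_tr[:, perf_vars], y_tr, X_ho[:, perf_vars])
    fit_scores = fit_and_score(X_fit[:, fit_vars], y_fit, X_ho[:, fit_vars])
    return g_measure(perf_scores, y_ho) / g_measure(fit_scores, y_ho)
```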

The three-sample approach (training, test, and holdout) is used because the performance-based algorithm uses the test data set in selecting the final model. A fair comparison of the two approaches is to set aside holdout samples, which are not used in the estimation or model-selection process.

The results are compared in Figure 3. The ratios of the performances of the two models at various depths for the three data sets are plotted. A ratio of 1 indicates that the performance- and fit-based models perform equally well, and the dotted lines show the value 1. Ratios greater than 1 indicate that the performance-based model did better than the fit-based model; ratios less than 1 indicate the opposite. Boxplots are used to summarize the distribution of the ratios across the 30 splits.

FIGURE 3
Comparison of Performance- and Fit-Based Forward Selection Algorithms across 30 Different Splits of the Data into Training, Test, and Holdout Samples

For the holdout sample, the performance-based model always performs at least as well as the corresponding fit-based model, and usually better (there is one case out of the 30 where the performances are exactly equal). In some cases the performance-based model is about 8–9% better. The improvement is greatest for small depths. Even outside the specified depth range (.2–.4) the performance-based models outperform the fit-based models. In the specified depth range, the performance-based models are about 2–3% better than the fit-based models, on average.

Stepwise selection is usually preferred over forward selection. The stepwise algorithm is very similar to the forward selection algorithm, except that there is an extra step that considers whether any term that has already entered the model can be dropped without substantially reducing the quality of the fit. It is possible that stepwise regression will produce more competitive models than forward selection. I ran stepwise regression on the same 30 splits of the data using a significance level of .05 for entry and removal from the model. Figure 4 compares the resulting models with those identified by the performance-based forward selection algorithm. The conclusions are the same as for the fit-based forward-selection algorithm.

FIGURE 4
Comparison of Performance-Based Forward Selection and Fit-Based Stepwise Selection Algorithms across 30 Different Splits into Training, Test, and Holdout Samples

On average the performance-based algorithm selected 6.8 variables (the minimum was 5 and the maximum 10), while the fit-based algorithm selected 16.7 variables (minimum 15 and maximum 17). It seems unlikely that the forward-selection models are overfitting the data, given the smooth nature of the functional form and the large number of observations used to fit the model (2 × 33,067), but perhaps the difference in scoring model performance is simply due to the number of predictors in the model rather than the variable selection algorithm. I evaluated this possibility by comparing the models from performance-based forward selection with seven-variable models from fit-based forward selection. The seven variables were the first seven to enter the regression. The models are compared in Figure 5. The conclusions are consistent with the other comparisons.

FIGURE 5
Comparison of Performance- and Fit-Based Forward Selection Algorithms across 30 Different Splits of the Data into Training, Test, and Holdout Samples (only the first seven terms in the fit-based algorithm were used)

I repeated the evaluation procedure with the DMEF4 data set, which is from an upscale gift catalog. The objective is to predict demand of former customers during the fall season of 1992 (9/92–12/92) from previous purchase history through 6/92. The data set has 101,532 observations and 93 variables. I defined p = 22 good variables, which are listed in the Appendix. I used the same lower and upper limits of the mailing depths of interest (L = .2 and U = .4). The variable tran2 (1/(recency + 1)) was forced in as the first variable to enter all performance-based models. Figure 6 compares the performances of performance-based forward selection and fit-based stepwise selection. The results suggest that the performance-based method almost always does better at depths of .3 and .4; on average, performance-based selection does 3–5% better in this depth range. At other depths the results are more mixed. Above the upper cutoff depth of .4, the two approaches are roughly equal (the ratios are roughly 1). At the lower cutoff depth (.2) and below, fit-based selection usually does better than performance-based selection. The conclusions are similar when performance-based forward selection is compared with fit-based forward selection.

FIGURE 6
Comparison of Performance-Based Forward Selection and Fit-Based Stepwise Selection Algorithms across 30 Different Splits into Training, Test, and Holdout Samples Using the DMEF4 Data Set

DISCUSSION

This article proposes a forward selection algorithm that takes model performance into consideration when making decisions about which variables should enter the model. The empirical results presented here using the DMEF1 and DMEF4 data sets suggest that, on average, overall model performance was improved by 3–4% over fit-based forward selection. In some cases the improvement was as much as 9%, and there is little chance of doing worse, at least within the specified range of possible mailing depths (20–40%). Database marketers commonly mail millions of pieces; with mailings of this magnitude, improvements of 3–4% translate into large amounts of money.

The results suggest that taking scoring model performance into consideration is important when making variable selection decisions. This does not imply that it should be the only criterion when selecting variables. This article shows that good fit is not a prerequisite for good performance. I suggest that a good scoring model should both perform and fit well. If two models give comparable performances, judgment should be used to pick a final model, by considering the fit as well as which variables make better business sense.

The performance-based algorithm is computationally much more difficult than the fit-based algorithm. When there are hundreds or thousands of candidate variables, the performance-based method is not currently practical. In these cases, I suggest using some fit-based method to reduce the candidate variables to a more manageable number and then using the performance-based algorithm.

Future research. Further testing of the proposed algorithm is certainly necessary using other data sets, different lower and upper limits for mailing depths, other regression approaches (e.g., trees, neural networks, etc.), and other estimation procedures. The algorithm is sensitive to the first variable that enters the model. In my experiments I have found that the first variable chosen by the algorithm does not always lead to the best overall models. I have had better luck forcing in, as the first predictor, a variable that I know is a strong predictor and takes on many possible values, e.g., 1/recency. My analyses so far suggest that highly discrete variables do not seem to work well as a first variable. Further empirical and theoretical work on this heuristic, and the first-variable problem in general, is necessary. The computational cost of generating a gains table for each model under consideration is great; more efficient algorithms for evaluating the performance of families of models are necessary. Understanding the statistical properties of gains tables and performance is another important area for future research. Rosset, Neumann, Eick, Vatnik, and Idan (2001) examined the sampling distribution of gains tables, but there are many more open questions. A deeper theoretical understanding of the properties of "performance" would help us understand under what circumstances performance-based selection is expected to yield an improvement. Finally, response models (binary dependent variable) are very common in direct, database, and electronic commerce marketing. Performance-based variable selection should be evaluated on response models as well.


APPENDIX: CANDIDATE VARIABLES FOR DMEF1

The variables consist of variables provided in DMEF1, hereafter referred to as raw variables, as well as transformations of the raw variables. I started with the raw variables and created additional variables to model two-way interactions between important variables. The variable selection node in SAS Enterprise Miner was used to identify which variables interacted. I coded the interactions as products and quotients. I searched for "power" transformations (square, square root, logarithm, inverse root, and inverse) of the raw and interaction variables. I used a fit-based variable selection procedure to narrow the list down to 20 good candidate variables for a final model. The variables are as follows (a sketch of how such transformations might be coded appears after the DMEF4 list):

recency: months since latest contribution. This was computed from the cndat1 variable, latest contribution month.
cntrlif: monetary value, life-to-date dollars contributed.
cntmlif: frequency, number of contributions life-to-date.
cnmonf: months since first contribution.
cnmonl: months since largest contribution.
sltmlif: number of solicitations life-to-date.
tran1: dollars per contribution (cntrlif/cntmlif).
tran2: dollars per solicitation (cntrlif/sltmlif).
tran3: contributions per solicitation (cntmlif/sltmlif).
tran4: log(recency + 1).
tran5: square root of monetary value.
tran6: inverse of monetary value.
tran7: inverse of tran3, solicitations per contribution.
tran8: square root of dollars per solicitation.
tran9: square of solicitations.
tran10: log of dollars per contribution.
tran11: monetary value/(recency + 1).
tran12: frequency/(recency + 1).
tran13: log(monetary value/(recency + 1)).
tran14: log(frequency/(recency + 1)).

The same procedure was used for DMEF4. The 22 variables are as follows:

dayfp: days since first purchase.
falord: LTD fall orders.
freq: LTD orders.
mon: LTD dollars.
ordlyr: number of orders last year.
ordtyr: number of orders this year.
purseas: number of seasons with a purchase.
recency: days since most recent purchase.
slslyr: dollar demand from last year.
slstyr: dollar demand from this year.
int1: an indicator variable taking the value 1 if recency is less than 1 year and dayfp is greater than 1 year, and 0 otherwise (indicates "old" and recent customers).
int2: ordtyr × ordlyr.
int3: dayfp/recency.
int4: ordtyr × purseas.
int5: ordtyr × falord.
tran1: √(int3).
tran2: 1/(recency + 1).
tran3: √(int2).
tran4: log(slstyr + 1).
tran5: log(slslyr + 1).
tran6: dayfp².
tran7: log(recency).
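Purely for illustration, a few of the DMEF1 transformations above might be coded as follows with pandas, assuming a DataFrame df whose columns cntrlif, cntmlif, sltmlif, and recency follow the definitions in this list (the actual DMEF file layout may differ):

```python
import numpy as np
import pandas as pd

def add_dmef1_candidates(df: pd.DataFrame) -> pd.DataFrame:
    """Derive a few of the candidate transformation variables listed above.
    Assumes positive monetary values so the logarithms are defined."""
    out = df.copy()
    out["tran1"] = out["cntrlif"] / out["cntmlif"]         # dollars per contribution
    out["tran2"] = out["cntrlif"] / out["sltmlif"]         # dollars per solicitation
    out["tran4"] = np.log(out["recency"] + 1)              # log(recency + 1)
    out["tran5"] = np.sqrt(out["cntrlif"])                 # sqrt of monetary value
    out["tran11"] = out["cntrlif"] / (out["recency"] + 1)  # monetary/(recency + 1)
    out["tran13"] = np.log(out["tran11"])                  # log of tran11
    return out
```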

REFERENCES

Berger, P., & Nasr, N. (1998). Customer Lifetime Value: Marketing Models and Applications. Journal of Interactive Marketing, 12(1), 17–30.
Breiman, L. (1996). Heuristics of Instability and Stabilization in Model Selection. The Annals of Statistics, 24(6), 2350–2383.
Bult, J. (1993). Semiparametric versus Parametric Classification Models: An Application to Direct Marketing. Marketing Science, 380–390.
Bult, J., & Wansbeek, T. (1995). Optimal Selection for Direct Mail. Marketing Science, 14(4), 378–394.
Colombo, R., & Jiang, W. (1999). A Stochastic RFM Model. Journal of Interactive Marketing, 13(3), 2–12.
Dwyer, R. (1997). Customer Lifetime Valuation to Support Marketing Decision Making. Journal of Direct Marketing, 11(4), 6–13.
Hansotia, B.J., & Wang, P. (1997). Analytical Challenges in Customer Acquisition. Journal of Direct Marketing, 11(2), 7–19.
Horowitz, E., & Sahni, S. (1978). Fundamentals of Computer Algorithms. Rockville, MD: Computer Science Press.
Kendall, M., & Stuart, A. (1977). The Advanced Theory of Statistics, Volume 1 (4th ed.). New York: MacMillan.
Kestnbaum, R.D., Kestnbaum, K.T., & Ames, P.W. (1998). Building a Longitudinal Contact Strategy. Journal of Interactive Marketing, 12(1), 56–62.
Magidson, J. (1988). Improved Statistical Techniques for Response Modeling. Journal of Direct Marketing, 2(4), 6–18.
Malthouse, E.C. (1999). Ridge Regression and Direct Marketing Scoring Models. Journal of Interactive Marketing, 13(4), 10–23.
Malthouse, E.C. (in press). Scoring Models. In D. Iacobucci & B. Calder (Eds.), Kellogg on Integrated Marketing. New York: Wiley.
Miller, A.J. (1990). Subset Selection in Regression. New York: Chapman & Hall.
Mulhern, F.J. (1999). Customer Profitability Analysis: Measurement, Concentration, and Research Directions. Journal of Interactive Marketing, 13(1), 25–40.
Neter, J., Kutner, M., Nachtsheim, C.J., & Wasserman, W. (1996). Applied Linear Statistical Models (4th ed.). New York: Irwin.
Ratner, B. (1999). Direct Marketing Models Using Genetic Algorithms. In D. Shepard (Ed.), The New Direct Marketing (3rd ed.). New York: Irwin.
Rosset, S., Neumann, E., Eick, U., Vatnik, N., & Idan, I. (2001). Evaluation of Prediction Models for Marketing Campaigns. In Proceedings of KDD-2001 (pp. 456–461). New York: Association for Computing Machinery.
Sedgewick, R. (1992). Algorithms in C++. Reading, MA: Addison-Wesley.
Shepard, D. (1999). The New Direct Marketing (3rd ed.). New York: McGraw-Hill.
Thisted, R.A. (1988). Elements of Statistical Computing. New York: Chapman & Hall.
Wunderman, L. (1996). Being Direct. New York: Random House.
Zahavi, J., & Levin, N. (1997). Applying Neural Computing to Target Marketing. Journal of Direct Marketing, 11(1), 5–22.
