Predictive Modeling in Automotive Direct Marketing: Tools, Experiences and Open Issues
Wendy Gersten, Rüdiger Wirth, Dirk Arndt
DaimlerChrysler AG, Research & Technology, Data Mining Solutions, FT3/AD, PO Box 2360, 89013 Ulm, Germany
{wendy.gersten, ruediger.wirth, dirk.arndt}@daimlerchrysler.com

ABSTRACT
Direct marketing is an increasingly popular application of data mining. In this paper we summarize some of our own experiences from various data mining application projects for direct marketing. We focus on a particular project environment and describe tools which address issues across the whole data mining process. These tools include a Quick Reference Guide for standardizing the process and guiding users, and a library of re-usable procedures in the commercial data mining tool Clementine. We report experiences with these tools and identify open issues requiring further research. In particular, we focus on evaluation measures for predictive models.

Categories and Subject Descriptors
H.2.8 [Database Applications]: Data mining

Keywords
Clementine, CRISP-DM, Data Mining Process, Direct Marketing, Evaluation Measures.

1. INTRODUCTION
There are many opportunities for data mining in the automotive industry. DaimlerChrysler's Data Mining Solutions department works on different projects in fields like finance, manufacturing, quality assurance, and marketing. Here, we focus on direct marketing. Several authors [1, 10, 11] have described problems and solutions for data mining in direct marketing. In this paper we reflect on our own experience working in the automotive industry.

For various reasons, modeling the response to direct marketing actions in the automotive industry is more difficult than for mail order businesses. In this paper, we describe some results from projects with different departments within our company. We present tools we developed to support knowledge transfer and organizational learning.

In the next section, we introduce our application scenario with a focus on technical challenges and requirements for solutions. Then we describe tools we developed while working with several different marketing departments. First, we present a Quick Reference Guide for predictive modeling tasks in marketing applications. This guide standardizes the process and guides end users through data mining projects. The second set of tools consists of re-usable procedures in the commercial data mining tool Clementine. We present practical solutions to common data mining problems like unbalanced training data and the evaluation of models. For each of these tools we discuss our experience and outline open issues for further research.

2. APPLICATION SCENARIO
For the purpose of this paper we restrict ourselves to acquisition campaigns. The challenge is to select prospects from an address list who are likely to buy a Mercedes for the first time. We try to derive profiles of Mercedes customers and use these profiles to identify prospects. Although this sounds like a textbook data mining application, there are many complications.

One of the major challenges is the fact that the buying decision takes a long time and the relationship marketing action can only contribute a small part to it. This makes the assessment of its success and, consequently, the formulation of the proper prediction task difficult. Furthermore, the process is not as stable and predictable as one might expect. Typically, we cannot develop models of the customers' behavior based on previous examples of the same behavior because in many cases we simply lack the data.

A more fundamental problem is that it is not clear which behavior we want to model. Ideally, we want to increase the acquisition rate, not just the response rate. However, the acquisition rate is difficult to handle. It is not easy to measure, it is captured fairly late (up to 9 months after modeling), and the impact of the predictive model is difficult to assess.

Regardless of the success criterion, we face a technical problem during training. Usually, measures like accuracy or lift are used to evaluate models [3], implying that the higher the measure, the better the model. But this is not true: a model with a perfect lift is not necessarily the best one (see section 4.4.1).

In addition, there are many other factors, like the available data or the market situation, that make every predictive modeling project unique in a certain sense. This gets even worse if we want the process to be applicable in different countries with differences in language, culture, laws, and competitive situation.

Nevertheless, we aim for a standard approach and a set of tools which should be usable by marketing people with only little data mining skill. Although the projects differ in details, there is a large set of common issues which come up in most of these projects. In order to support the construction of a common experience base, the solutions and the lessons learned need to be documented and made available throughout the organization.

There are two groups of people involved. First, there are marketing people who will do the address selection three to four times a year and must be able to perform this process reliably. They have little time and insufficient skills to experiment with different approaches. Therefore, they need guidance and easy-to-use software tools. Second, there is a data mining team which does the initial case studies, develops and maintains the process, trains the marketing people, and later supports the marketing people with more challenging application scenarios outside the standard procedure. The data mining team needs sophisticated data mining tools and a common framework for documentation and knowledge transfer.

To fulfill the needs of these two groups of people we developed various tools (see Figure 1). This toolbox is organized along the CRISP-DM data mining process model. We created a Quick Reference Guide which guides the marketing people through a data mining project. The guide contains specific advice for various tasks of the process and points to appropriate tools, like questionnaires for problem understanding and check lists for the evaluation of data providers. Furthermore, it points to a stream library, i.e., executable and adaptable procedures in Clementine. Some of them are meant to be used by marketing people, others facilitate the work of data mining specialists. The experiences gathered during the process are archived for later reuse in an experience base.

In the following sections we describe the Quick Reference Guide and the Clementine stream library. We specify our particular requirements, discuss the state of the art, outline our solutions and identify open issues requiring new research.

3. QUICK REFERENCE GUIDE
Basically, the Quick Reference Guide is a specification of the predictive modeling process with detailed advice for various tasks. As such it should ensure a basic level of quality, guide inexperienced users through the predictive modeling process, help them use the data mining tool, and support the evaluation and interpretation of results.

In the remainder of this section, we explain how we developed this guide and our experience to date. The Quick Reference Guide is based on CRISP-DM, a recently proposed standard data mining process model [4]. CRISP-DM breaks a data mining process down into phases and tasks. The phases are shown in Figure 2.
[Figure 1. Toolbox: Guides (Quick Reference Guide, questionnaires, advice book templates); Experiences (project documentation, data source reports); Stream Library (streams for data understanding, data preparation, modeling, evaluation, and scoring), organized along the CRISP-DM phases.]

[Figure 2. CRISP-DM Process Model: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment.]

At the task level, CRISP-DM distinguishes between a generic and a specific process model (see Figure 3). The generic process model is supposed to cover any data mining project. As such it provides an excellent framework and is useful for planning the projects, for communication both within and outside the project team, for documentation of results and experiences, and for quality assurance. However, it is too abstract to describe repeatable processes for the marketing people. Such processes typically take place in a certain context and under specific constraints. To cover this, CRISP-DM allows for instantiations of the generic process model. The resulting specific process models are meant to reflect the context, constraints and terminology of the application. The Quick Reference Guide is a specific CRISP-DM process model.
When developing the Quick Reference Guide, we faced the following dilemma. On the one hand, it is obviously impossible, and not even desirable, to give a detailed and exhaustive process description. This is due to the complexity resulting from the many unknown factors and unforeseeable situations that will always remain; we experienced that even from campaign to campaign in the same country the circumstances may differ fundamentally. On the other hand, the opposite extreme, a very high-level description like the generic CRISP-DM process model, is not a solution either. While it covers the whole process and is useful for experienced people, it is not suitable for the kind of users one is confronted with when moving into normal business processes. The resulting process description should guide users as much as possible but, at the same time, enable them to handle difficult, unexpected situations. See [12] for more experiences with using CRISP-DM in this application.
[Figure 3. The four levels of the CRISP-DM methodology: Phases, Generic Tasks, Specialized Tasks, and Process Instances, mapping the generic CRISP process model to concrete processes.]

A major challenge is to put a project management framework around this highly iterative, creative process with many parallel activities. There must be firm deadlines for a task to be completed to ensure timely completion and proper usage of resources. But when is a task complete in a data mining project? This question cannot be answered easily. So far, we do not yet have a complete, satisfactory and operational set of criteria in our application. One approach to define key performance indicators is described in [6].
[Figure 4. Specific Tasks from the Quick Reference Guide, grouped by CRISP-DM phase: Business Understanding (Understand Planned Marketing Action, Inventory of Resources, Situation Assessment, Specify Response Modeling Goals, Initial Project Plan); Data Understanding (Initial Data Collection Report, Import Data into Clementine, Data Description, Data Quality Verification, Select Working Data); Data Preparation (Select Attributes and Datasets, Data Cleaning, Derive New Attributes, Integrate Data Sources, Adjust Modeling & Scoring Data); Modeling (Develop Initial Modeling Approach, Review Modeling Approach, Generate Test Design, Set Up Modeling Stream(s), Assess First Model Results, Fine-Tune Model Parameters, Final Model Assessment, Review Process Plan); Evaluation (Evaluate Results, Quality Assurance, Determine Next Steps); Deployment (Plan Scoring, Plan Monitoring and Maintenance, Apply Predictive Model, Run Campaign, Evaluate Outcome of Campaign, Produce Final Report, Review Project).]
The starting point for the Quick Reference Guide was the Generic CRISP-DM User Guide. We deleted tasks and activities that were not relevant for our application, renamed tasks to make them more concrete, and added a few tasks at various places. The additions, however, were not due to omissions in the generic model. It was more like splitting one abstract task into several more concrete tasks. The main work was to generate check lists of specific activities for each task. There, the generic check lists were an excellent guideline. Figure 4 shows the specific tasks of the Quick Reference Guide.
Apart from the specification of the process, the Quick Reference Guide contains pointers to examples from case studies, to documented experiences, and to the library of executable procedures described in the next section.

So far, the experience with the Quick Reference Guide has been mostly positive. We expected it to be useful for planning, documentation, and quality assurance, and this turned out to be the case. However, its use for communication both within and outside the project was much more advantageous than we originally anticipated. Presenting the project plan and status reports in terms of the process model and, of course, the fact that we followed a process, inspired a lot of confidence in users and sponsors. It also facilitated status meetings because the process model provided a clear reference and a common terminology.

In our first case studies, we encountered an unexpected difficulty with the process model. Although we stated frequently that the phases and tasks are not supposed to be strictly sequential, the clear and obvious presentation of the process model inevitably created this impression in decision makers. Despite our arguments, we found ourselves forced into very tight deadlines, which in the end led to sub-optimal solutions. On the other hand, the process model gave us a structured documentation which allowed us to justify how much effort was spent on which task. This made it fairly easy to argue for more realistic resources in later applications.
Based on this experience, the Quick Reference Guide explicitly advises planning for iterations. As a rough guide, we suggest planning for three iterations, where each iteration is allocated half the time of the preceding one. It is also stressed that a phase is never completely finished before the subsequent phase starts; the relation between phases is only that a phase cannot start before the previous one has started. It follows that the distinction between phases is not sharp, but this is a rather academic question. In practice, all we need is a sensible grouping of tasks.
4. CLEMENTINE STREAM LIBRARY
4.1 Overview
As mentioned above, we do different projects in the field of marketing. However, they always have some business-related problems as well as repeatable tasks in common. Additionally, we go through several loops between single tasks within one project, as shown in Figure 2. Therefore, we want to automate parts of the data mining process. The basic idea is to develop tailored software procedures supporting specific tasks of the process. These procedures should be modular, easy to handle and simple to modify. Although they mostly assist within single tasks, they must be able to interact automatically in order to avoid human errors.

For our applications, we chose the data mining tool Clementine from SPSS, mainly because of its breadth of techniques, its process support, and its scripting facilities. Using a visual programming interface, Clementine offers rich facilities for exploring and manipulating data. It also contains several modeling techniques and offers standard graphics for visualization. The single operations are represented by nodes which are linked on a workspace to form a data flow, a so-called stream.

Clementine is geared towards interactive, explorative data mining. A scripting language allows the automation of repetitive tasks, such as repeated experiments with different parameter settings. Therefore, the tool serves our need to automate the data mining process very well. Nevertheless, there are also disadvantages. Some useful functionality, like gain charts or various basic statistics, is laborious to realize.

We built a library of streams supporting the process as described in the Quick Reference Guide. The library contains three kinds of streams:

Sample streams: These streams show marketing people how to implement important procedures in Clementine. They illustrate basic techniques on the basis of tasks common in direct marketing projects.

Automation streams: These streams help to automate the data mining process. They are used to repeat one task very often under different conditions, for example, to train a model with changing parameter settings. They can be used by both marketing people and data mining specialists.

Functionality streams: Some functionality is used in most of our projects (e.g., score calculation) but is not implemented in Clementine from the outset. We added these functions for easy re-use.

The first CRISP-DM phases supported by the library are data understanding and data preparation. Since these phases are highly explorative, we mostly use sample and functionality streams. These phases are the most time consuming ones. There are many traps one can avoid using the systematic discovery procedures provided through the stream library. In our projects we found two main kinds of errors in data.

Logical errors concern the content and meaning of data. Typically, they are caused by false documentation, wrong encoding or human misunderstanding. For instance, in one project we used an attribute (an independent variable) named DealerDistance. This attribute was supposed to measure the distance from a buyer's home to the closest car dealer. First models included rules like "If DealerDistance is larger than 120 miles then the person is a car buyer." These rules contradicted the business knowledge. Closer inspection revealed that, for buyers, DealerDistance contained the distance between their home and the dealer who sold the car, which is not necessarily the nearest one.

Technical errors pertain to the form of data. They are usually caused by faulty technical processing like wrong data formats, incorrect merging of data sources or false SQL statements. For instance, we once received data from two sources, the customer data base and an external provider, both with the same file structure. The modeling algorithm generated rules like "If Wealth-Poverty-Factor is larger than 4.99 and smaller than 5.01 then the person is a car owner." Looking for an explanation, we detected that the export from the data sources was inconsistent: Wealth-Poverty-Factor contained real numbers when exported from the customer database and integers otherwise.

Such errors are not easy to identify. On the one hand, we need strong business knowledge to detect logical errors like the one described above. On the other hand, one needs to be familiar with data mining tools and techniques to detect technical errors. Therefore, data mining specialists and marketing people need to work closely together during data understanding. Examples like the ones above are contained in our toolbox.
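Export inconsistencies of this kind can often be caught with a simple per-source audit of shared attributes before modeling. The following is only an illustrative sketch in Python/pandas, not part of the Clementine toolbox described here; the file and column names are hypothetical.

```python
import pandas as pd

# Hypothetical exports with the same file structure from two different sources.
customers = pd.read_csv("customer_db_export.csv")
external = pd.read_csv("external_provider_export.csv")

def audit(name, df, columns):
    """Print type, range, distinct-value and missing-value counts per attribute."""
    print(f"--- {name} ---")
    for col in columns:
        s = df[col]
        print(f"{col}: dtype={s.dtype}, min={s.min()}, max={s.max()}, "
              f"distinct={s.nunique()}, missing={s.isna().sum()}")

shared = ["WealthPovertyFactor", "DealerDistance"]
audit("customer database", customers, shared)
audit("external provider", external, shared)
# A mismatch in dtype or value granularity (e.g. reals from one source, integers
# from the other for the same attribute) hints at inconsistent exports.
```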
The examples also illustrate the usefulness of interpretable models. Typically, we build a few models early in the process because they are good indicators of data errors. When we are fairly confident about the correctness of our data, we use automation streams for mass modeling, a procedure described below. Each automation stream can load defined files automatically and save the results for the next step in the process.

[Figure 5. Numerical Model Evaluation Stream]
Figure 5 shows an example of an automation stream used for numerical model evaluation. The star-like icons are so-called Super Nodes and represent procedures from our library (in this case, numerical quality measures for modeling; see section 4.4.3).

In the following sections, we take a closer look at the process from the modeling to the deployment phase. We pick some of the most important issues for predictive modeling in direct marketing. Each issue is introduced first; afterwards, we discuss approaches, technical solutions, experiences, and the need for further research.

4.2 Creation of training and test subsets
Before modeling, we have to define a test design. The most common approach is to split the data into training and test sets. The training set is used to generate the models and the test set serves to measure their quality on unseen data. An alternative approach is to use resampling techniques, especially for small data sets. Here we only discuss the former approach because its lower computational requirements make it more efficient for mass modeling (see section 4.3).

The first decision is to choose the sizes of the sets. In the best case, we have enough data to create sufficiently large training and test sets. In practice, we are often limited in some way. In the case of small data sets we do not have a general suggestion for how large the test set should be. It depends on the total number of records available as well as on the number of records with a positive target variable. A large training set results in more stable models, but there have to be enough records in the test set to evaluate the model quality. Normally, we reserve at least 20% of the sample for testing. The remaining data are used to create various training sets.

A major challenge in our marketing applications is the highly unbalanced distribution of the target variable. Typically, we have a large excess of records with negative target variable (i.e., non-buyers) [5, 10]. This makes it difficult to build good models that find characteristics of the minority class. One way to approach this problem is to create more or less balanced training sets. According to our experience, a ratio of buyers to non-buyers between 35:65 and 65:35 is most effective [10]. But it differs between projects and can change dramatically under some circumstances. Of course, the distribution in the test set should be representative of the universe and is not changed.

There are two ways to change the balance within the training set. We can either reduce the negative records or boost the positive ones [8, 9]. This will affect the modeling results because we vary the amount and content of information available for learning. To date, we have not yet sufficiently explored the impact of the two alternatives. The corresponding stream in our library is designed to handle these issues flexibly. First, it allows us to choose the percentage of data to be randomly assigned to the test set. Implicitly, we assume that the sample is representative of the universe. This operation is done only once and the same test set serves to evaluate all models. The remaining data is used for generating multiple training sets, where the stream automatically balances the data according to different properties. With this stream, we set only a few parameters and generate multiple training sets with different sizes, ratios and boosted or reduced records. All the subsets are named systematically and saved for modeling by another stream from the library.
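For illustration only, the splitting and balancing logic just described might look roughly as follows outside Clementine, sketched in Python/pandas. The binary column name target, the ratio values and the seed are assumptions, not the settings of our stream.

```python
import pandas as pd

def make_subsets(df, target="target", test_frac=0.2,
                 ratios=(0.35, 0.50, 0.65), boost=False, seed=1):
    """Split off a representative test set, then build balanced training sets.

    Each ratio is the desired share of buyers in a training set. Balance is
    reached either by reducing non-buyers (undersampling) or, if boost=True,
    by replicating buyers (oversampling with replacement).
    """
    test = df.sample(frac=test_frac, random_state=seed)      # distribution untouched
    rest = df.drop(test.index)
    pos, neg = rest[rest[target] == 1], rest[rest[target] == 0]

    training_sets = {}
    for r in ratios:
        if boost:
            n_pos = int(len(neg) * r / (1 - r))               # buyers needed for ratio r
            part_pos = pos.sample(n=n_pos, replace=True, random_state=seed)
            part_neg = neg
        else:
            n_neg = min(int(len(pos) * (1 - r) / r), len(neg))  # non-buyers kept for ratio r
            part_pos = pos
            part_neg = neg.sample(n=n_neg, random_state=seed)
        train = pd.concat([part_pos, part_neg]).sample(frac=1.0, random_state=seed)
        training_sets[f"train_pos{int(r * 100)}"] = train     # systematic naming
    return test, training_sets
```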
In our projects, we noticed that different training sets strongly affect the model quality. But we have not yet discovered a direct correlation between properties of the training sets and the algorithms or their parameter settings. Since we cannot narrow down the most promising training sets early in the process, we have to handle a large number of possible sets while modeling. During model evaluation the situation gets even worse, as we will discuss. Here, we need systematic tests in order to develop approaches for fine-tuning training sets with respect to the properties of the input data and the modeling algorithms (including their various parameter settings) before the learning itself begins.

4.3 Mass Modeling
Today there are many techniques and software implementations to build effective predictors. The question is not only how to choose between different techniques (like decision trees, rule sets, neural networks, regression, etc.) but also how to choose between several algorithms (e.g., CART, CHAID, C5.0, etc.). One also has to consider the parameter settings of the algorithms and the various training sets discussed before. It is well known that there is no generally best technique or algorithm. The choice depends on many factors like the input data, the target variable, costs, or the output needed. To date we have no practicable approach to select even the most promising combinations early in the process.

Fine-tuning just one modeling parameter after another makes it really hard to consider all cross influences. Besides, the basic functionality of some algorithms is not published in sufficient detail, which makes it difficult to judge the real influence of some parameters. For these reasons we think it is necessary to experiment with different combinations and, consequently, generate a whole palette of possible models.

To tune the parameters, we need to automate the modeling process, especially if we also take the various training sets into consideration. For this purpose, we built several streams to generate large numbers of predictive models. Each of the streams deals with one technique and generates models with a whole range of sensible parameter combinations. Clementine offers several decision trees, rule sets, neural networks and linear regression as techniques to build predictors. The streams can be run independently of each other. All of them load different training sets, train the algorithm with changing parameter combinations on each set and save the generated models. This way hundreds or thousands of models are generated in many variations, taking all reasonable possibilities into consideration. We call this procedure mass modeling. Because of the automation we save time (computers work at night), avoid human mistakes and ensure best-practice model quality.

Mass modeling is a pragmatic approach yielding good results in real-life projects. But there are also some practical problems. Dealing with such a large number of models places high requirements on hard- and software and demands careful preparation. As we will see later, it is also very hard to compare that many models and pick the optimal one reliably. To improve the process we need to develop procedures to focus only on the most promising algorithms and parameter setting combinations. Combinations not fitting the overall project goal, the input data and other constraints should be excluded safely and as early as possible.
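As an illustration of the mass modeling idea only, not of our Clementine automation streams themselves, a rough Python sketch with scikit-learn stand-ins could look as follows; the estimators, parameter grids and file layout are assumptions.

```python
import itertools
import pickle
from pathlib import Path

from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

# Hypothetical stand-in techniques and parameter grids; one grid per technique.
GRIDS = {
    "tree": (DecisionTreeClassifier,
             {"max_depth": [3, 5, 8], "min_samples_leaf": [25, 100]}),
    "net":  (MLPClassifier,
             {"hidden_layer_sizes": [(10,), (20, 10)], "alpha": [1e-3, 1e-1]}),
}

def mass_modeling(training_sets, features, target="target", out_dir="models"):
    """Train every technique with every parameter combination on every training set."""
    Path(out_dir).mkdir(exist_ok=True)
    for set_name, train in training_sets.items():
        for technique, (Estimator, grid) in GRIDS.items():
            keys, values = zip(*grid.items())
            for combo in itertools.product(*values):
                params = dict(zip(keys, combo))
                model = Estimator(**params).fit(train[features], train[target])
                name = f"{set_name}_{technique}_" + "_".join(
                    f"{k}-{v}" for k, v in params.items())
                with open(Path(out_dir) / f"{name}.pkl", "wb") as f:
                    pickle.dump(model, f)       # saved for batch evaluation later
```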
4.4 Evaluation
After models are generated through mass modeling they must be evaluated.

4.4.1 Requirements of measures
[7] contains a thorough discussion of requirements and properties of evaluation measures. In our projects we found the following characteristics to be practically relevant for model quality:

Predictive accuracy: How accurate is the prediction of the model? The predictive accuracy is high if the model assigns a score of 1 to objects belonging to class 1 and a score of 0 to objects belonging to class 0.

Discriminatory power: This characteristic indicates the model's ability to assign objects clearly to one of the two classes, even if it assigns them to the wrong class. Discriminatory power is high if the model assigns mostly high or low scores to the records, but rarely medium scores. Assigning mostly medium scores indicates a lack of discriminating information or mistakes in the data preparation phase.

Stability: Stability indicates how models vary when generated from different training sets. It has three facets: overall stability, stability per object and stability of the output form. Overall stability means that models with identical parameter settings generated on different training sets produce similar results. Stability per object requires that the score for every object is similar for models generated on different training sets. By stability of the output form we mean that the form of the models is similar, e.g., different rule sets contain similar rules.

Plausibility: Plausibility of models means that they do not contradict business knowledge and other expectations. It requires the possibility to interpret the results and is necessary for two reasons. First, it increases the acceptance of predictive modeling results and, second, it facilitates individual and organizational learning by recognizing mistakes and inconsistencies (e.g., logical and technical errors). As such, it is a sort of quality assurance. Plausibility can also be enhanced by univariate analyses of important attributes and of segments with high scores.

Accuracy and discriminatory power are often considered the most important objectives of a good model. Although both properties are important, they are not sufficient. In our setting, there are many non-customers who resemble our customers but, for various reasons, have not yet got around to buying our products. Nevertheless, they are worth contacting to stimulate a purchase. Therefore, models which perfectly separate customers and non-customers are not sensible.

Now the challenge consists in finding one or several measures to operationalize these characteristics. In our toolbox, we use four measures to evaluate the quality of the generated models. After describing their functionality we discuss their weaknesses and strengths.

4.4.2 Approaches to measure model quality
The histogram of scores shows the distribution of all records according to their score, overlaid with their real target value (Mercedes buyers or non-buyers). The diagram gives evidence of how strongly the model assigns records to one of the two classes (see Figure 7 in section 4.5). If there are many records with medium scores, the model's discriminating power is very low. However, the histogram tells us nothing about stability and plausibility.

The cumulative gain chart relates the number of objects with a real positive target value, e.g., Mercedes buyers, to all objects in the data base (buyers and non-buyers). The objects are sorted in descending order of their score. Ideally, all true positive values should be situated at the left of the graph (see Figure 6).

[Figure 6. Cumulative Gain Chart: number of objects with positive target value (Mercedes buyers) against the percentage of all objects (buyers and non-buyers), comparing selection with the predictive model to random selection; the gap between the curves is the advantage of the predictive model.]
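Both graphical views are straightforward to derive from a table of scores and true target values. The sketch below is a small Python illustration of that computation, not our Clementine implementation; the column names score and target are assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def score_diagnostics(scored, score="score", target="target"):
    """Plot the score histogram per class and the cumulative gain curve."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

    # Histogram of scores, overlaid with the real target value.
    for value, label in [(1, "buyers"), (0, "non-buyers")]:
        ax1.hist(scored.loc[scored[target] == value, score],
                 bins=20, alpha=0.5, label=label)
    ax1.set_xlabel("score")
    ax1.legend()

    # Cumulative gain: % of buyers found vs. % of objects contacted (sorted by score).
    ordered = scored.sort_values(score, ascending=False)
    contacted = np.arange(1, len(ordered) + 1) / len(ordered) * 100
    found = ordered[target].cumsum().to_numpy() / ordered[target].sum() * 100
    ax2.plot(contacted, found, label="selection with predictive model")
    ax2.plot([0, 100], [0, 100], linestyle="--", label="random selection")
    ax2.set_xlabel("% of objects contacted")
    ax2.set_ylabel("% of buyers found")
    ax2.legend()
    plt.show()
```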
Taking only objects with a positive target value into account presupposes that predicting a non-buyer as a buyer causes no costs. This is not in accordance with reality. Furthermore, the measure says nothing about the discriminatory power, plausibility and stability of the model.

The histogram of scores and the cumulative gain chart give a good visual insight into model performance. However, it is not feasible to compare hundreds of models visually. We overcome this disadvantage using numerical indicators.
The quadratic predictive error gives evidence of the predictive accuracy over all records and is calculated as the sum of squared deviations between the real target value and the predicted target value [7]. By squaring the deviations, a high predictive error contributes disproportionately more to the resulting measure.
A strength of this indicator is that both false positive and false negative predictive errors are taken into account and that deviations are registered exactly. But false positive and false negative predictive errors are not weighted differently, even though in practice the error of classifying a Mercedes buyer as a non-buyer is worse than the other way round. Therefore, we also consider the squared errors of positive and negative cases separately. Furthermore, there is no benchmark for what constitutes a good or a bad predictive error.
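Concretely, for records with true target y in {0, 1} and score s, the measure sums (y - s)^2 over all records, and splitting the sum by true class yields the separate positive and negative error terms mentioned above. A minimal sketch with assumed column names:

```python
def quadratic_errors(scored, score="score", target="target"):
    """Return the overall, positive-class and negative-class squared prediction errors."""
    sq = (scored[target] - scored[score]) ** 2
    return {
        "total": sq.sum(),
        "positives": sq[scored[target] == 1].sum(),   # buyers scored too low
        "negatives": sq[scored[target] == 0].sum(),   # non-buyers scored too high
    }
```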
The weighted lift [10] pursues the same goal as the lift curve. It indicates whether the model assigns high scores to real positive targets. But in contrast to the graphics, this measure weighs true positive targets with a high score much more heavily than those with a low score. This has high practical relevance because mostly only records with a high score are selected for actions. The maximum value for the weighted lift is 100. A weighted lift of 50 (in the case of an infinite number of partitions) corresponds to random selection.

One problem is the fact that weighted lifts based on different numbers of partitions cannot be compared. The maximum lift of 100 can only be reached if all positive targets could theoretically fall into the first partition. For instance, if there are 20% positive targets in the set, there should not be more than 5 partitions. But the lower the number of partitions, the higher the random lift and the lower the advantage over random selection: in the case of 5 partitions, the random weighted lift is 60. Besides, the weighted lift tells us nothing about the discriminatory power of the model, its stability or plausibility. Furthermore, there is the open question of how many partitions to use. In our applications, we use as many partitions as possible, as long as the size of a partition is larger than the number of true targets, usually between 5 and 10. In summary, we judge the weighted lift as a useful heuristic but not more.
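The exact definition of the weighted lift is given in [10]. Purely as an illustration, one formulation that is consistent with the properties described above (a maximum of 100, random selection approaching 50 for many partitions and giving 60 for five partitions) is sketched below; it is our reading, not necessarily the formula of [10], and the column names are assumptions.

```python
import numpy as np
import pandas as pd

def weighted_lift(scored, n_partitions=10, score="score", target="target"):
    """Weight buyers in high-score partitions more heavily than buyers in low-score ones.

    Partition 1 holds the highest scores; partition k gets weight (P - k + 1) / P.
    A perfect model then reaches 100, while random selection gives 50 * (P + 1) / P,
    i.e. 60 for five partitions and 50 in the limit of many partitions.
    """
    ordered = scored.sort_values(score, ascending=False).reset_index(drop=True)
    partition = pd.cut(ordered.index, bins=n_partitions, labels=False)  # 0 .. P-1
    buyers_per_partition = ordered.groupby(partition)[target].sum()
    share = buyers_per_partition / ordered[target].sum()   # fraction of all buyers
    weights = (n_partitions - np.arange(n_partitions)) / n_partitions
    return 100 * float((share.to_numpy() * weights).sum())
```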
4.4.3 Realization
We have described a few measures for model quality. They are applied in different stages of the process. In the early phase, we build a few models and inspect them carefully for potential data errors. For this purpose, we use rule sets and judge their plausibility. Accuracy and discriminatory power are assessed visually. Only when the obvious data errors are repaired do we start mass modeling.

Since it is infeasible to compare hundreds of models visually, we use the numerical model evaluation stream shown in Figure 5. There, we calculate the weighted lift, the quadratic predictive error and, in addition, the positive and negative quadratic predictive errors. The stream loads all models and calculates the measures automatically. The output is a sorted list of model names and corresponding indicators.

This way, many models can be compared. How many models can be dealt with depends on the actual computer. The performance of Clementine version 5.1, the version we currently use, is mainly limited by physical memory. This means that the whole process must be broken down into steps that Clementine can handle efficiently.

Then, we pick only the best models for closer inspection with the streams for the histogram of scores and the cumulative gain chart. Finally, we test the stability of the best few models using resampling (10-fold cross-validation) and, in the case of decision trees and rule sets, we check the plausibility of the results by inspection and with univariate analyses of the most relevant attributes. Of course, the evaluation process is not always as linear as described here.

None of the presented measures covers accuracy, discriminatory power, stability and plausibility at once. Only a combination of various numerical and graphical measures gives us a complete picture of the models' characteristics.
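Our realization of this batch comparison is the Clementine stream of Figure 5. Purely for illustration, a Python analogue of the batch step could look as follows, assuming models were saved as in the mass modeling sketch above and reusing the indicator functions sketched earlier; none of the names are taken from the stream itself.

```python
import pickle
from pathlib import Path

import pandas as pd

def evaluate_all_models(test, features, target="target", model_dir="models"):
    """Score every saved model on the common test set and return a sorted indicator table."""
    rows = []
    for path in sorted(Path(model_dir).glob("*.pkl")):
        with open(path, "rb") as f:
            model = pickle.load(f)
        scored = pd.DataFrame({
            "target": test[target].to_numpy(),
            "score": model.predict_proba(test[features])[:, 1],
        })
        errors = quadratic_errors(scored)            # sketched above
        rows.append({
            "model": path.stem,
            "weighted_lift": weighted_lift(scored),  # sketched above
            "sq_error": errors["total"],
            "sq_error_pos": errors["positives"],
            "sq_error_neg": errors["negatives"],
        })
    return pd.DataFrame(rows).sort_values("weighted_lift", ascending=False)
```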
4.4.4 Summary of evaluation
Accuracy is well represented in its diverse facets by all measures. Discriminatory power cannot yet be calculated as an indicator; only the histogram of scores gives us an impression of the separability. But in order to compare hundreds of models easily, there is a strong need for a proper indicator of separability. Until now, we are not able to measure stability and plausibility sufficiently exactly. Concerning the stability of models, we will try to measure the confidence and the variability of results and integrate these as confidence bounds in the weighted lift and the predictive error. This would probably allow us to decide whether it makes sense to try to improve a model further.

So far, we have ignored the costs of misclassification. Even where costs are relevant, there are high practical barriers to integrating them into model evaluation. The main problem is that they are mostly impractical to estimate reliably, and consequently their impact can only be guessed. In general, we consider the evaluation of models in settings like ours to be a largely open and challenging research issue. As we mentioned in various places, there is no measure that tells us how well a good predictive model will perform in the field.
4.5 Benefits
Before applying a promising model for the selection of addresses to be included in the direct marketing campaign, its benefit must be evaluated. This decides whether to deploy the model or to go back to a previous step. Furthermore, illustrating the benefit of using data mining enhances the acceptance and further distribution of data mining within the enterprise. But what is the benchmark for calculating the benefit of predictive modeling?

Comparing the model to random selection, as is usually done, is practically not very relevant. This would mean that no sensible selection of prospects is done at all. But typically, some intelligent selection according to common business sense is already performed in the enterprise. It is often based on human experience and two or three attributes (like high income, fewer than two children and middle-aged). This intelligent selection should be chosen as the benchmark. The aim is to find out which objects are assigned to the same class by both the predictive model and the intelligent selection, and which objects are assigned to different classes. Mainly the latter must be examined.

We built a Clementine stream to compare the predictive model with such an intelligent selection. Figure 7 shows the results from one of our business projects. Instead of the 328 buyers correctly recognized by the intelligent selection, our predictive models assigned a high score to 491 buyers. The advantage of the predictive model over the intelligent selection increases further when only objects with a score higher than 0.8 are contacted.
[Figure 7. Benefit of Predictive Modeling compared to Intelligent Selection. Intelligent Selection: 328 buyers, 1273 non-buyers; Predictive Model: 491 buyers, 1022 non-buyers; further cells in the figure: 191 buyers / 1373 non-buyers and 28 buyers / 1624 non-buyers.]
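The comparison behind Figure 7 is essentially a cross-tabulation of the two selections against the real outcome. The following Python sketch only illustrates that idea, not our Clementine stream; the score threshold and the attributes of the hand-made business rule are hypothetical.

```python
import pandas as pd

def compare_selections(scored, score="score", target="target", threshold=0.8):
    """Cross-tabulate a score-based selection against a hand-made business rule."""
    model_selected = scored[score] >= threshold
    # Hypothetical stand-in for the 'intelligent selection' benchmark.
    rule_selected = (scored["income"] >= 4) & (scored["children"] < 2)

    for name, selected in [("intelligent selection", rule_selected),
                           ("predictive model", model_selected)]:
        picked = scored[selected]
        print(f"{name}: {int(picked[target].sum())} buyers, "
              f"{int((picked[target] == 0).sum())} non-buyers")

    # Objects treated differently by the two selections are the interesting ones.
    return pd.crosstab(model_selected, rule_selected,
                       rownames=["predictive model"], colnames=["business rule"])
```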
At this stage, the plausibility of the results comes into play. Although neural nets produce comparable results in terms of the evaluation measures discussed above, we favor decision trees or rule sets. For them, it is relatively easy to demonstrate to business people why they perform better than the common-sense benchmark (besides helping us to find errors). Let us illustrate this with an example from one of our applications. Usually, we achieve good results with C5.0 using boosting. Typically, the first classifier is very close to the common-sense selection, e.g., rules referring to specific income ranges or socio-demographic types, which are common selection criteria for manual scoring. The subsequent classifiers typically are specializations of these initial rules, e.g., socio-demographic types with additional conditions. This way the boosted model confirms the common-sense selection and refines it.
Although this issue is of immediate practical importance, Clementine does not sufficiently support the illustration of boosted rules, requiring rather tedious manual post-processing of the rules. There is certainly room for improvement.

5. CONCLUSIONS
A data mining process is a living process which will be changed by future experiences. Therefore, all documentation, process models and software tools must be flexible and living as well. The Quick Reference Guide is an excellent framework for knowledge transfer, documentation, communication, and quality assurance. It structures the process while allowing for the necessary flexibility. Having a standardized process helps to realize economies of scale and scope, supports individual and organizational learning, and speeds up the learning curve.

When choosing a data mining tool, one has to consider the users of the tool. In our case, we chose Clementine because of its process support, its various techniques and its ability to automate streams via scripting. It was mainly meant to be used by data mining experts. Although Clementine is targeted at business end users, it requires quite skilled users. However, our stream library aims to reduce the skills necessary to use the tool and thus supports both data mining experts and marketing people.

The toolbox we use for our data mining applications in direct marketing is pragmatically useful but not yet complete. Currently, we are putting the toolbox and the experience documentation into a hypermedia experience base to make storage and retrieval more flexible and comfortable [2]. We are also developing additional tools and addressing the important open issues of evaluation. However, in this area, we require a lot more basic research which also takes the business constraints into account.

Another blank area is the question of what potential is really hidden in the data. [11] describes an interesting heuristic for estimating campaign benefits. As the authors note, the estimation is valid only for applications where customer behavior is predicted using previous examples of exactly the same customer behavior. In our applications this assumption is rarely met, especially not for acquisition campaigns, where we rarely know more than simple socio-demographic facts. Estimating the expected lift is a promising approach, but the particular approach of [11] does not exactly fit our experiences and needs. More work is needed along these lines.

As we tried to illustrate in this paper, it is not clear what kind of behavior we need to model in automotive direct marketing. Ideally, we want to model both dialogue affinity and propensity to buy. But we usually cannot get hold of data containing the necessary information. Typically, we have to make approximations. What we then end up with is that we learn a particular behavior of one population and try to adapt it to a different population. It is an open issue how to evaluate models in this setting.

6. REFERENCES
[1] Bhattacharyya, S., Direct Marketing Response Models Using Genetic Algorithms, Proceedings of the 4th International Conference on Knowledge Discovery & Data Mining, pp. 144-148 (1998).
[2] Bartlmae, K., A KDD Experience Factory: Using Textual CBR for Reusing Lessons Learned, Proceedings of the 11th International Conference on Database and Expert Systems Applications (2000).
[3] Berry, M. J. A. and G. Linoff, Data Mining Techniques – For Marketing, Sales and Customer Support, New York et al. (1997).
[4] CRISP-DM: Cross Industry Standard Process Model for Data Mining, http://www.crisp-dm.org/home.html (2000).
[5] Fawcett, T. and F. Provost, Combining Data Mining and Machine Learning for Effective User Profiling, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 8-13 (1996).
[6] Gersten, W., Einbindung des Predictive Modeling-Prozesses in den Customer Relationship Management-Prozess: Formulierung von Key Performance Indicators zur Steuerung und Erfolgsmessung der Database Marketing-Aktionen eines Automobilproduzenten, Diploma Thesis, University of Dresden (1999).
[7] Hand, D. J., Construction and Assessment of Classification Rules, Chichester (1997).
[8] Kubat, M. and S. Matwin, Addressing the Curse of Imbalanced Training Sets: One-Sided Selection, Proceedings of the 14th International Conference on Machine Learning (1997).
[9] Lewis, D. and J. Catlett, Heterogeneous Uncertainty Sampling for Supervised Learning, Proceedings of the 11th International Conference on Machine Learning, pp. 148-156 (1994).
[10] Ling, C. X. and C. Li, Data Mining for Direct Marketing: Problems and Solutions, Proceedings of the 4th International Conference on Knowledge Discovery & Data Mining, pp. 73-79 (1998).
[11] Piatetsky-Shapiro, G. and B. Masand, Estimating Campaign Benefits and Modeling Lift, Proceedings of the 5th International Conference on Knowledge Discovery & Data Mining, pp. 185-193 (1999).
[12] Wirth, R. and J. Hipp, CRISP-DM: Towards a Standard Process Model for Data Mining, Proceedings of the 4th International Conference on the Practical Application of Knowledge Discovery and Data Mining, pp. 29-39 (2000).