2013 Sixth International Conference on Business Intelligence and Financial Engineering
A Dynamic Transfer Ensemble Model for Customer Churn Prediction
Jin Xiao
Business School, Sichuan University, Chengdu, China
E-mail: [email protected]

Yuan Wang
Department of Industrial and Systems Engineering, University of Florida, Florida, USA
E-mail: [email protected]

Shouyang Wang
Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
E-mail: [email protected]
Abstract—It is difficult to obtain satisfactory churn prediction results with traditional models, because the available customer samples in the target domain are usually few and the class distribution of customer data is imbalanced. This study proposes a group method of data handling (GMDH) based dynamic transfer ensemble (GDTE) model for churn prediction. It first transfers the data in related source domains to the target domain by a transfer learning technique, and then adopts a resampling technique to balance the class distribution of the training data. Finally, it trains a series of base classifiers and dynamically selects a proper classifier ensemble for each test sample by GMDH. The experimental results on two datasets show that the performance of GDTE is better than that of one traditional churn prediction strategy as well as three transfer learning strategies.

Keywords—customer churn prediction; transfer ensemble model; imbalanced class distribution; resampling method; group method of data handling
I. INTRODUCTION
Customer churn is a frequently discussed issue in the field of customer relationship management (CRM). A survey of nine U.S. industries shows that if the customer churn rate decreases by 5%, the average industry profit increases by 25%-85% [1]. Therefore, the core competitiveness of enterprises can be improved by accurate customer churn prediction and timely customer retention.

Customer churn prediction is a binary classification problem. Commonly used classification models include decision trees, artificial neural networks, logistic regression, support vector machines (SVM), etc. In practice, the class distribution of customer data for churn prediction is almost always imbalanced, i.e., the number of churn customers is much smaller than that of non-churn ones [2]. In this case, traditional models mainly focus on the majority class, and it is difficult for them to classify the minority samples (churn customers) correctly. However, the cost of misclassifying a churn customer is much higher than that of misclassifying a non-churn one [2]. To balance the training data, resampling techniques such as under-sampling and over-sampling are widely used [3]. In theory, when the sample space is very large and the churn customer samples are plentiful, either under-sampling or over-sampling can achieve satisfactory performance. However, the data available for customer churn prediction are often limited in reality, especially for churn customers. For instance, in a commercial bank in Chongqing, China, because a certain credit card business is new, the number of active customers in the database is only 1244, and the number of churn customers is merely 104 (called the target domain). In this case, it is a challenge to build a satisfactory prediction model with traditional resampling techniques alone.

Note that there are plenty of data available outside the target domain, often coming from different districts, periods, or businesses. In the example of the bank in Chongqing, besides the 1244 samples in the target domain, we can also find data recording the customer information of another credit card business. The data from the two credit card businesses are very similar; however, they are subject to different distributions. Because most traditional methods assume that the training data and test data follow the same distribution, they cannot deal with this issue well [4]. The transfer learning techniques developed in the machine learning area provide a good way to solve this issue [5].

This study combines transfer learning, multiple classifier ensembles [6], and resampling techniques with the group method of data handling (GMDH) [7], and proposes a GMDH based dynamic transfer ensemble model (GDTE) for customer churn prediction. It first transfers the data in related source domains to the target domain, and then adopts a resampling technique to balance the class distribution of the training data. Finally, it trains a series of base classifiers and dynamically selects a proper classifier ensemble for each test sample by GMDH. The experimental results on two datasets show that GDTE obtains better performance than several existing strategies.

The rest of this study is organized as follows: Section 2 briefly introduces related theories, including the GMDH algorithm, multiple classifier ensembles and transfer learning; Section 3 presents the basic idea and detailed steps of GDTE; Section 4 reports the experiments; finally, Section 5 concludes.
II. RELATED THEORIES
A. GMDH Algorithm
GMDH is a learning machine based on the principle of heuristic self-organization, proposed by Ivakhnenko in the 1960s. It randomly divides the training set into two equal parts: a model learning set A and a model selecting set B. Modeling with GMDH starts from an initial model set; parameters are estimated on set A by the inner criterion (least squares) to obtain intermediate candidate models, and the intermediate candidate models are then evaluated and selected on set B by an external criterion, until the optimal complexity model is found [7]. In recent years, the GMDH algorithm has been applied in a broad range of areas.

GMDH expresses the general relationship between the output and the input variables in the form of a mathematical description, also called a reference function. Usually the description is a discrete form of the Volterra functional series or the Kolmogorov-Gabor (K-G) polynomial:

$$Y = f(X_1, X_2, \ldots, X_n) = a_0 + \sum_{i=1}^{n} a_i X_i + \sum_{i=1}^{n} \sum_{j=1}^{n} a_{ij} X_i X_j + \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{k=1}^{n} a_{ijk} X_i X_j X_k + \cdots, \quad (1)$$

where $Y$ is the output, $X = (X_1, X_2, \ldots, X_n)$ is the input vector, and $a$ is the vector of coefficients. In particular, the first-order (linear) K-G polynomial with $n$ variables (neurons) has the form

$$f(X_1, X_2, \ldots, X_n) = a_0 + a_1 X_1 + a_2 X_2 + \cdots + a_n X_n. \quad (2)$$

If a linear reference function of the form of Eq. (2) is chosen, its sub-terms are regarded as the $n$ initial models of the GMDH algorithm, that is, $v_1 = a_1 X_1, \ldots, v_n = a_n X_n$. The detailed modeling process is as follows. The $n$ initial models are fed in pairs to each unit; since the reference function is first order, a total of $C_n^2 = n(n-1)/2$ candidate models of the form

$$w = f(X_i, X_j), \quad i, j = 1, 2, \ldots, n, \; i \neq j, \quad (3)$$

are generated in the first layer, where $f(\cdot)$ is a partial function as in Eq. (2) and $w$ is its estimated output. The outputs of $Q_1 \ (\leq C_n^2)$ functions are then selected according to the external criterion value and passed to the second layer as inputs in pairs. In the second layer we check functions of the form

$$z = f(w_i, w_j), \quad i, j = 1, 2, \ldots, Q_1, \; i \neq j. \quad (4)$$

The number of such functions is $C_{Q_1}^2$. The outputs of $Q_2 \ (\leq C_{Q_1}^2)$ functions are selected and passed to the third layer. The process continues and stops once the optimal complexity model is found according to the termination principle [8]. In this way, the algorithm determines the input variables, structure and parameters of the final model automatically, accomplishes self-organizing modeling, and avoids over-fitting.
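The layer-wise selection process can be made concrete with a small sketch. The following Python fragment is illustrative only, not the authors' implementation: it assumes first-order pairwise units fitted by least squares on set A (the inner criterion) and scored by mean squared error on set B (a one-sided simplification of the external criterion; the full symmetric form appears in Eq. (5) below). All function names are ours.

```python
import numpy as np
from itertools import combinations

def fit_ls(X, y):
    # Inner criterion: least-squares estimate of w = a0 + a1*u + a2*v on set A
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

def gmdh_layer(F_A, y_A, F_B, y_B, keep):
    # Pair all current outputs (Eqs. (3)/(4)), fit on A, score on B, keep the best
    cands = []
    for i, j in combinations(range(F_A.shape[1]), 2):
        coef = fit_ls(F_A[:, [i, j]], y_A)
        err = np.mean((y_B - predict(coef, F_B[:, [i, j]])) ** 2)  # external criterion
        cands.append((err, i, j, coef))
    cands.sort(key=lambda c: c[0])
    sel = cands[:keep]
    F_A_new = np.column_stack([predict(c, F_A[:, [i, j]]) for _, i, j, c in sel])
    F_B_new = np.column_stack([predict(c, F_B[:, [i, j]]) for _, i, j, c in sel])
    return F_A_new, F_B_new, sel[0][0]

def gmdh(X_A, y_A, X_B, y_B, keep=8, max_layers=10):
    # Grow layers until the external criterion stops improving (termination principle)
    F_A, F_B, best = X_A, X_B, np.inf
    for _ in range(max_layers):
        F_A, F_B, err = gmdh_layer(F_A, y_A, F_B, y_B, keep)
        if err >= best:
            break
        best = err
    return best
```

The key design point mirrors the text: parameters come only from set A, while survival to the next layer is decided only on set B, which is what keeps the self-organizing growth from over-fitting.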
B. Multiple Classifiers Ensemble
Because the data in real classification problems contain much noise, it is hard for a single classifier to classify accurately over the whole pattern space [6]. On the contrary, if we integrate the classification results of several classifiers with multiple classifier ensemble technology, letting each classifier play a role in its dominant area, the classification accuracy can be improved. Constructing the ensemble strategy is a key step in multiple classifier ensembles. Existing ensemble strategies can be divided into two types: 1) static classifier ensemble (SCE), which selects a unified ensemble scheme for all test patterns; and 2) dynamic classifier ensemble (DCE). In fact, different test patterns usually present different classification difficulties. Intuitively, if we adopt different classifiers for different test patterns, the classification performance may be better than that of SCE; this is the basic idea of DCE. Further, DCE strategies comprise dynamic classifier selection (DCS) and dynamic classifier ensemble selection (DCES) [9]: the former selects a single best classifier for each test pattern, while the latter selects different ensemble solutions for different test patterns. The GDTE model proposed in this study belongs to the DCES strategy.

C. Transfer Learning
The concept of transfer learning originates from psychology [5]. People can learn new knowledge directly, and can also utilize old knowledge to assist in learning new knowledge. Machine learning has attempted to simulate human learning since it emerged. Learning new knowledge directly is the traditional machine learning paradigm we are familiar with; such methods often suppose that learning tasks are independent of each other, and they discard past learning experience and results when learning a new task. Since the 1990s, transfer learning, which utilizes old knowledge from related source domains to assist in learning new knowledge in a target domain, has gained more and more attention with the development of machine learning. In the past decades, several transfer learning strategies have been proposed; TrBagg [10] and TrAdaBoost [11] are two typical ones. Although both TrBagg and TrAdaBoost have their own advantages, neither takes the imbalanced class distribution of the data into consideration, so they can hardly predict churn customers correctly when applied in the CRM field. Moreover, the existing transfer learning strategies are seldom utilized in CRM.
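To make the instance-transfer idea concrete, here is a minimal, hypothetical sketch loosely following the two phases of TrBagg [10]: learn by bagging over the merged source and target data, then filter the pool by its fit to the target training data. This is not the published algorithm; the filtering rule (keep the better half by target accuracy) and the decision-tree base learner are simplifying assumptions of ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def trbagg_like(X_src, y_src, X_tgt, y_tgt, n_learners=40, seed=0):
    # Learning phase: bagging over the merged source + target data
    rng = np.random.default_rng(seed)
    X = np.vstack([X_src, X_tgt])
    y = np.concatenate([y_src, y_tgt])
    pool, scores = [], []
    for _ in range(n_learners):
        idx = rng.integers(0, len(X), len(X))          # bootstrap sample
        clf = DecisionTreeClassifier().fit(X[idx], y[idx])
        pool.append(clf)
        scores.append(clf.score(X_tgt, y_tgt))         # filtering phase: target fit
    keep = np.argsort(scores)[-(n_learners // 2):]     # retain the better half
    return [pool[k] for k in keep]

def vote(pool, X):
    # Majority vote over the retained classifiers (labels in {0, 1})
    votes = np.stack([clf.predict(X) for clf in pool])
    return (votes.mean(axis=0) >= 0.5).astype(int)
```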
III. DYNAMIC TRANSFER ENSEMBLE MODEL

A. The Basic Idea of GDTE
The GDTE strategy proposed in this study consists of two phases. 1) Transfer data and train base classifiers: GDTE first balances the distributions of the source training set and the target training set respectively by over-sampling, and then transfers the balanced source training data to the target domain to construct a new training set. Further, it obtains N training subsets by the bootstrap technique used in the Bagging algorithm [12] and trains a base classifier on each subset. 2) Classify the test data: for each test sample x_i*, GDTE first finds its K nearest neighbors in the balanced target training set to construct its local area D_K, classifies D_K and x_i* with the N trained base classifiers respectively, and then selects a suitable classifier ensemble for x_i* from the classifier pool by the GMDH algorithm. After ensemble selection, we obtain the optimal complexity ensemble model y_opt. Finally, the classification result of x_i* is obtained through y_opt.

In the GDTE strategy, the symmetric regularity criterion (SRC) is selected as the external criterion, which has the following form:

$$d^2(W) = \Delta^2(A) + \Delta^2(B) = \sum_{t \in B} \left( y_t - \hat{y}_t(A) \right)^2 + \sum_{t \in A} \left( y_t - \hat{y}_t(B) \right)^2, \quad (5)$$
where $y_t$ is the real class label of sample $t$, and $\hat{y}_t(A)$ refers to the predicted output for $t$ by the model constructed on set A. Therefore, $\Delta^2(A)$ in Eq. (5) indicates the prediction error on set B of the model constructed on set A, and $\Delta^2(B)$ indicates the prediction error on set A of the model constructed on set B. The SRC utilizes the information in subsets A and B equally and focuses on the error of the model on different parts of the training set. Sarychev [13] proved that the SRC is a suitable external criterion.
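A direct transcription of Eq. (5) as code may help. The sketch below assumes a generic fitting routine that returns a prediction function; the least-squares fitter is shown only as an example inner criterion, and all names are ours.

```python
import numpy as np

def ls_model(X, y):
    # Fit y ~ a0 + a'X by least squares; return a prediction function
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([np.ones(len(Z)), Z]) @ coef

def src(X_A, y_A, X_B, y_B, fit=ls_model):
    # Eq. (5): fit on each half, accumulate squared error on the opposite half
    m_A = fit(X_A, y_A)                        # model constructed on subset A
    m_B = fit(X_B, y_B)                        # model constructed on subset B
    delta2_A = np.sum((y_B - m_A(X_B)) ** 2)   # error of A's model on B
    delta2_B = np.sum((y_A - m_B(X_A)) ** 2)   # error of B's model on A
    return delta2_A + delta2_B
```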
B. Algorithm Description
Suppose T is the target domain of a customer churn prediction problem with m1 samples, and the related source domain S contains m2 samples, where usually m2 > m1. Further, Tr refers to the training set in the target domain with n1 samples, and Te is the test set in the target domain with n2 samples (m1 = n1 + n2). The class label of a customer is 0 or 1, where 1 refers to churn and 0 to non-churn. The labels of the customers are known in Tr and S and unknown in Te.
Figure 1. The block diagram of the GDTE strategy. (Phase I: over-sample Tr and S to obtain U1 and U2, merge them into the new training set Tr', construct N training subsets by bootstrap sampling and train a base classifier on each. Phase II: for each test sample x_i* in Te, find its K neighbors in U1 to compose the local area D_K, classify D_K and x_i* with the N base classifiers to obtain the results R and r, select the optimal complexity ensemble model y_opt in R by GMDH, and take r into y_opt to get the final classification result of x_i*.)
The process of the GDTE strategy is described in Fig. 1, and its pseudocode is as follows:

Input: Target training set Tr, target test set Te, source domain data S, the number of nearest neighbors K in the local area, and the number of base classifiers N;
Output: The final predicted labels of the customers in Te;
Phase I: Transfer data and train base classifiers
Step 1. Divide Tr into two sets, Tr1 for churn customers and Tr2 for non-churn ones, and divide S into two sets S1 and S2 similarly;
Step 2. Sample |Tr2| - |Tr1| instances randomly from Tr1 with replacement and add them to Tr to obtain the balanced target training set U1, where |Tr2| and |Tr1| denote the numbers of samples in Tr2 and Tr1 respectively;
Step 3. Obtain the balanced source training set U2 as in Step 2, and regard Tr' = U1 ∪ U2 as the new training set;
Step 4. Construct N training subsets of Tr' by the bootstrap technique, and train a classifier on each subset to compose the base classifier pool C = {C1, C2, ..., CN};
Phase II: Classify the test data
Step 5. For each test sample x_i* ∈ Te, i = 1, 2, ..., n2:
a) Find the K nearest neighbors of x_i* in U1 to obtain the local area D_K = {x1, x2, ..., xK};
b) Classify D_K and x_i* with the N base classifiers, obtain the results R = (R1, R2, ..., RN) and r = (r1, r2, ..., rN) respectively, and divide R into two equal subsets A and B in the horizontal direction;
c) Select the K-G polynomial of degree 1 to describe the relationship between the class label Y of D_K and R1, R2, ..., RN: f(R1, R2, ..., RN) = a1 R1 + a2 R2 + ... + aN RN, and obtain the N initial models of the GMDH algorithm: v1 = a1 R1, v2 = a2 R2, ..., vN = aN RN;
d) Form all pairwise combinations of the initial models to obtain C_N^2 intermediate candidate models in the first layer: w_t = v_i + v_j = a_i R_i + a_j R_j, i, j = 1, 2, ..., N, i ≠ j; t = 1, 2, ..., C_N^2, where the coefficients a_i and a_j of each model are estimated by least squares (LS);
e) Calculate the external criterion value of each candidate model according to Eq. (5), and choose a certain number of models with smaller SRC values to enter the next layer;
f) Repeat d) and e) with layer = layer + 1 until y_opt is obtained;
g) Put the classification results r = (r1, r2, ..., rN) of x_i* into y_opt, and obtain the final classification result.
End
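As an illustration of the two phases above, the following Python sketch follows Steps 1-5 under simplifying assumptions of ours: scikit-learn's MLPClassifier stands in for the paper's artificial neural network base learners, and a plain majority vote replaces the GMDH ensemble selection of Steps 5(b)-(g), as flagged in the comments. It is a sketch of the workflow, not the authors' Matlab implementation.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import NearestNeighbors

def oversample(X, y, seed=0):
    # Steps 1-2: random over-sampling, replicating minority samples until balanced
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    extra = rng.choice(np.flatnonzero(y == minority),
                       counts.max() - counts.min(), replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]

def gdte_fit(X_tr, y_tr, X_src, y_src, N=40, seed=0):
    # Phase I: balance Tr and S, merge into Tr' (Step 3), bag N base classifiers (Step 4)
    rng = np.random.default_rng(seed)
    U1_X, U1_y = oversample(X_tr, y_tr, seed)
    U2_X, U2_y = oversample(X_src, y_src, seed + 1)
    X = np.vstack([U1_X, U2_X])
    y = np.concatenate([U1_y, U2_y])
    pool = []
    for _ in range(N):
        idx = rng.integers(0, len(X), len(X))      # bootstrap subset of Tr'
        pool.append(MLPClassifier(max_iter=300).fit(X[idx], y[idx]))
    return pool, (U1_X, U1_y)

def gdte_predict(pool, U1, x_star, K=5):
    # Phase II (Step 5): build the local area D_K and the result matrices R and r.
    U1_X, U1_y = U1
    nn = NearestNeighbors(n_neighbors=K).fit(U1_X)
    _, idx = nn.kneighbors(x_star.reshape(1, -1))
    D_X, D_y = U1_X[idx[0]], U1_y[idx[0]]          # local area D_K and its labels Y
    R = np.stack([clf.predict(D_X) for clf in pool], axis=1)   # K x N results
    r = np.array([clf.predict(x_star.reshape(1, -1))[0] for clf in pool])
    # Steps 5(c)-(g) would run GMDH over the columns of R against D_y using
    # Eq. (5) to obtain y_opt; a simple majority vote on r stands in here.
    return int(r.mean() >= 0.5)
```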
IV. EXPERIMENTAL ANALYSIS
To evaluate the performance of GDTE, two customer churn prediction datasets were selected for the experiments. GDTE was compared with four strategies: (1) the traditional classifier ensemble model (TCEM), which merges Tr and S to train N base classifiers directly, classifies the test data by all base classifiers and obtains the final results by majority voting; (2) the transfer learning strategy TrBagg [10]; (3) the transfer learning strategy TrAdaBoost [11]; (4) the transfer learning strategy STMS. Because the churn customers (minority class) are very few in the target domain, STMS simply transfers the minority class samples to the target domain, and then trains the classifiers and obtains the classification results as in TCEM.
A. Data Description
(1) "Churn" dataset. The "Churn" dataset comes from the well-known machine learning repository of the University of California, Irvine (http://www.ics.uci.edu/~mlearn/MLRepository.html). In this dataset, a churn customer is defined as a person who gave up all the mobile services of a certain telecommunication company within three continuous months. The dataset includes 20 features and 3333 samples (2850 non-churn and 483 churn), so the ratio of non-churn to churn is 5.9006 and the class distribution is highly imbalanced. The customers come from 50 states and the District of Columbia in the USA, and the sample data from different districts may follow different distributions. At the same time, the number of customers in each district is very small; the largest is 106 (West Virginia). Therefore, if we regard a single district as the target domain, the churn customers will be very few and the experimental results may be unstable. Thus, to ensure no fewer than 10 churn customers in the target test set, and without loss of generality, we sorted all the samples by the attribute State from A to Z and selected 400 customers from the last 5 states (Vermont, Washington, Wisconsin, West Virginia and Wyoming) to form the target domain T, letting the remaining customers form the source domain S.

(2) "China-churn" dataset. This dataset comes from the credit card customer database of a commercial bank in Chongqing, China, collected from May to October 2010. A churn customer is defined as one who canceled his or her credit card service between May and October 2010, or who did not consume during 3 continuous months. According to the basic principles of attribute selection for customer churn prediction, 25 attributes were selected, among which 8 are continuous and 17 discrete. After simple data cleaning, we obtained 1244 samples in the target domain, in which 1151 samples are non-churn and only 104 are churn; the ratio is 11.06, so this dataset is also highly imbalanced. Obviously, it is hard to achieve satisfactory churn prediction performance with so few churn customer samples. Fortunately, there is a dataset of another credit card business in the customer database. It includes 1802 non-churn samples and 198 churn samples, and can be regarded as the source domain.

To determine whether the distributions of the source domain and the target domain differ in the two problems, we introduced the multivariate two-sample testing procedure proposed in [14]. Its null hypothesis is that there is no difference between the distributions of the source domain and the target domain. We set α = 0.05, and the test results are displayed in Table 1: the null hypothesis is rejected if |t̂| > |t̂'_{1000×(1−0.05)}|. As shown in Table 1, there is a significant difference between the distributions of the target domain and the source domain in both datasets.
TABLE I. TEST RESULTS OF THE MULTIVARIATE TWO-SAMPLE TEST ON THE TWO DATASETS

Datasets        |t̂|        |t̂'_950|
Churn           74.6221    2.0088
China-churn     54.1384    11.5816
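The test of [14] is used here only through its reported statistic. For readers who want a runnable analogue, the sketch below implements a generic classifier-based permutation two-sample test: the separability statistic and the 95th-percentile threshold mirror the |t̂| versus |t̂'_950| comparison above, but this is not necessarily the exact procedure of [14], and the logistic-regression scorer is our assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def separability(X1, X2):
    # Statistic: how well a classifier can tell the two samples apart
    # (~0.5 training accuracy suggests identical distributions)
    X = np.vstack([X1, X2])
    y = np.r_[np.zeros(len(X1)), np.ones(len(X2))]
    return LogisticRegression(max_iter=1000).fit(X, y).score(X, y)

def permutation_test(X1, X2, n_perm=1000, seed=0):
    # Null distribution: recompute the statistic under random relabelings
    rng = np.random.default_rng(seed)
    t_obs = separability(X1, X2)
    X, n1 = np.vstack([X1, X2]), len(X1)
    t_null = [separability(X[p[:n1]], X[p[n1:]])
              for p in (rng.permutation(len(X)) for _ in range(n_perm))]
    threshold = np.quantile(t_null, 0.95)   # analogue of |t'_950| at alpha = 0.05
    return t_obs, threshold, t_obs > threshold
```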
B. Experimental Setup
For the partition of Te and Tr in the target domain, we adopted random sampling without replacement: about 30% of the instances were sampled randomly from T to form the target test set Te, and the remaining data composed the target training set Tr. We adopted artificial neural networks to generate the base classifiers. Two main parameters may influence the performance of GDTE: the number of nearest neighbors K in the local area and the number of base classifiers N. Through repeated experiments we set K = 5 and N = 40, since the GDTE algorithm achieves satisfactory performance with these values. For the other ensemble models TCEM, TrBagg, TrAdaBoost and STMS, we also let the size of the base classifier pool be 40. Besides, none of TCEM, TrBagg, TrAdaBoost and STMS considers the impact of class imbalance on performance; therefore, to ensure a fair comparison, we also balanced the class distribution of their training data by the over-sampling technique before training the base classifiers. All the experiments were implemented on the Matlab 7.0 platform, and the classification result of each strategy is the average of 10 runs.

C. Evaluation Criteria
To evaluate the performance of the strategies, a confusion matrix is introduced (see Table 2), and four commonly used criteria are selected [15]:
1) Total accuracy = (D1 + D4) / (D1 + D2 + D3 + D4) × 100%;
2) Type I accuracy = D4 / (D3 + D4) × 100%;
3) Type II accuracy = D1 / (D1 + D2) × 100%;
4) The area under the receiver operating characteristic curve (AUC). The ROC curve is an important evaluation criterion for classification models on data with imbalanced class distributions; however, it is sometimes difficult to compare the ROC curves of different models directly, so the AUC is more convenient and popular.
TABLE II. EVALUATION MATRIX FOR CUSTOMER CHURN PREDICTION

                              Predicted negative   Predicted positive   Total
Actual negative (non-churn)   D1                   D2                   D1+D2
Actual positive (churn)       D3                   D4                   D3+D4
Total                         D1+D3                D2+D4                D1+D2+D3+D4
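The four criteria can be computed directly from the cells of Table II. The helper below is a straightforward transcription of the formulas in Section IV.C; roc_auc_score from scikit-learn is used for the AUC and assumes continuous classifier scores are available.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def churn_metrics(y_true, y_pred, y_score):
    # D1..D4 follow Table II; the positive class (1) is churn
    D1 = int(np.sum((y_true == 0) & (y_pred == 0)))   # non-churn predicted non-churn
    D2 = int(np.sum((y_true == 0) & (y_pred == 1)))   # non-churn predicted churn
    D3 = int(np.sum((y_true == 1) & (y_pred == 0)))   # churn predicted non-churn
    D4 = int(np.sum((y_true == 1) & (y_pred == 1)))   # churn predicted churn
    return {
        "total_accuracy":   (D1 + D4) / (D1 + D2 + D3 + D4),
        "type_I_accuracy":  D4 / (D3 + D4),           # accuracy on churners
        "type_II_accuracy": D1 / (D1 + D2),           # accuracy on non-churners
        "auc": roc_auc_score(y_true, y_score),        # needs continuous scores
    }
```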
D. Performance Comparison
Table 3 shows the evaluation criteria values of the five strategies on the "Churn" dataset (the maximum of each column is marked with *). GDTE has the highest AUC value and Type I accuracy. In contrast, the Type I accuracies of the other four strategies are lower than or just close to 0.5; that is, none of them handles the imbalanced class distribution well, even though the Total accuracy and Type II accuracy of TrAdaBoost and STMS are higher than those of GDTE. In customer churn prediction, what we care about most is the Type I accuracy and the AUC value. Thus, we can conclude that GDTE is more suitable for customer churn prediction with imbalanced class distributions, and it outperforms the other four strategies on the "Churn" dataset.

TABLE III. PERFORMANCE COMPARISON OF FIVE STRATEGIES ON THE "CHURN" DATASET
Strategies    Total accuracy   Type I accuracy   Type II accuracy   AUC
TCEM          0.6379           0.5257            0.6481             0.7375
STMS          0.9223*          0.1791            0.9928*            0.7505
TrAdaBoost    0.7502           0.3333            0.7982             0.7150
TrBagg        0.6017           0.4988            0.6225             0.7065
GDTE          0.6968           0.7501*           0.6921             0.8396*
Further, Table 4 shows the performance comparison of the five strategies on the "China-churn" dataset (column maxima again marked with *). The GDTE strategy performs best among all strategies. In terms of the ability to deal with imbalanced class distributions, the Type I accuracy and AUC value of GDTE are 0.8661 and 0.9379 respectively, whereas those of the other four strategies are low and unsatisfactory.

TABLE IV. PERFORMANCE COMPARISON OF FIVE STRATEGIES ON THE "CHINA-CHURN" DATASET
Strategies    Total accuracy   Type I accuracy   Type II accuracy   AUC
TCEM          0.8605           0.8214            0.8651             0.8973
STMS          0.9422*          0.7714            0.9602*            0.9007
TrAdaBoost    0.8171           0.6451            0.8346             0.8404
TrBagg        0.8095           0.7847            0.8119             0.8637
GDTE          0.9302           0.8661*           0.9378             0.9379*
V. CONCLUSIONS
In this study, we combine transfer learning, multiple classifier ensembles and resampling techniques with GMDH, and propose a GMDH based dynamic transfer ensemble (GDTE) strategy. Experiments were conducted on two customer churn prediction datasets, and the results show that GDTE outperforms the other strategies considered in this study.

ACKNOWLEDGMENT
This work is partly supported by the Natural Science Foundation of China under Grant Nos. 71101100 and 70731160635, the New Teachers' Fund for Doctor Stations, Ministry of Education under Grant No. 20110181120047, the Excellent Youth Fund of Sichuan University under Grant No. 2013SCU04A08, the Frontier and Cross-innovation Fund of Sichuan University under Grant No. skqy201352, the Soft Science Foundation of Sichuan Province under Grant No. 2013ZR0016, and the China Postdoctoral Science Foundation under Grant Nos. 2011M500418 and 2012T50148.
REFERENCES
[1] F. F. Reichheld and T. Teal, The Loyalty Effect: The Hidden Force Behind Growth, Profits and Lasting Value. Boston: Harvard Business School Press, 1996.
[2] S. A. Neslin, S. Gupta, W. Kamakura, J. Lu, and C. Mason, "Defection detection: Measuring and understanding the predictive accuracy of customer churn models," Journal of Marketing Research, vol. 43, 2006, pp. 204-211.
[3] L. Tong, Y. Chang, and S. Lin, "Determining the optimal re-sampling strategy for a classification model with imbalanced data using design of experiments and response surface methodologies," Expert Systems with Applications, vol. 38, 2011, pp. 4222-4227.
[4] C. Wei and I. Chiu, "Turning telecommunications call details to churn prediction: A data mining approach," Expert Systems with Applications, vol. 23, 2002, pp. 103-112.
[5] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, vol. 22, 2010, pp. 1345-1359.
[6] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, 1990, pp. 993-1001.
[7] A. G. Ivakhnenko, "The group method of data handling in prediction problems," Soviet Automatic Control c/c of Avtomatika, vol. 9, 1976, pp. 21-30.
[8] J. Xiao, C. Z. He, X. Y. Jiang, and D. H. Liu, "A dynamic classifier ensemble selection approach for noise data," Information Sciences, vol. 180, 2010, pp. 3402-3421.
[9] A. Ko, R. Sabourin, and A. Britto, "From dynamic classifier selection to dynamic ensemble selection," Pattern Recognition, vol. 41, 2008, pp. 1718-1731.
[10] T. Kamishima, M. Hamasaki, and S. Akaho, "TrBagg: A simple transfer learning method and its application to personalization in collaborative tagging," Proc. IEEE International Conference on Data Mining, IEEE Press, 2009, pp. 219-228.
[11] W. Y. Dai, Q. Yang, G. R. Xue, and Y. Yu, "Boosting for transfer learning," Proc. 24th International Conference on Machine Learning, ACM Press, 2007, pp. 193-200.
[12] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, 1996, pp. 123-140.
[13] A. P. Sarychev, "An averaged regularity criterion for the group method of data handling in the problem of searching for the best regression," Soviet Journal of Automation and Information Sciences c/c of Avtomatika, vol. 23, 1990, pp. 24-29.
[14] J. H. Friedman, "On multivariate goodness-of-fit and two-sample testing," Proc. PHYSTAT, SLAC, 2003, pp. 1-3.
[15] W. Verbeke, K. Dejaeger, D. Martens, J. Hur, and B. Baesens, "New insights into churn prediction in the telecommunication sector: A profit driven data mining approach," European Journal of Operational Research, vol. 218, 2011, pp. 211-229.