FACULTEIT ECONOMIE EN BEDRIJFSKUNDE
HOVENIERSBERG 24, B-9000 GENT
Tel.: 32-(0)9-264.34.61
Fax: 32-(0)9-264.35.92

WORKING PAPER

Incorporating sequential information into traditional classification models by using an element/position-sensitive SAM

Anita Prinzie 1 Prof. Dr. Dirk Van den Poel 2

February 2005
2005/292

1 Corresponding Author: Department of Marketing, Tel: +32 9 264 35 20, Fax: +32 9 264 42 79, Email: [email protected]
2 Email: [email protected]

D/2005/7012/10

Abstract. The inability to capture sequential patterns is a typical drawback of predictive classification methods. This shortcoming might be overcome by modeling sequential independent variables with sequence-analysis methods. Combining classification methods with sequence-analysis methods enables classification models to incorporate non-time varying as well as sequential independent variables. In this paper, we precede a classification model by an element/position-sensitive Sequence-Alignment Method (SAM), followed by the asymmetric, disjoint Taylor-Butina clustering algorithm, with the aim of distinguishing clusters with respect to the sequential dimension. We illustrate this procedure on a customer-attrition model serving as a decision-support system for customer retention at an International Financial-Services Provider (IFSP). The binary customer-churn classification model following the new approach significantly outperforms an attrition model that incorporates the sequential information directly into the classification method.

Keywords: sequence analysis, binary classification methods, Sequence-Alignment Method, asymmetric clustering, customer-relationship management, churn analysis


1 Introduction

In the past, traditional classification models like logistic regression have been applied successfully to the prediction of a dependent variable by a series of non-time varying independent variables [5]. When there are time-varying independent variables, these are typically included in the model by transforming them into non-time varying variables [3]. Unfortunately, this practice results in information loss, as the sequential patterns in the data are neglected. Hence, although traditional classification models are highly valid and robust for modeling non-time varying data, they are unable to capture sequential patterns. This shortcoming might be overcome by modeling time-varying independent variables with sequence-analysis methods. Unlike traditional classification methods, sequence-analysis methods were designed for modeling sequential information: they take sequences of data, i.e., ordered arrays, as their input rather than individual data points. Although rarely used in marketing, sequence analysis is commonly applied in disciplines like archeology [32], biology [38], computer science [39], economics [19], history [2], linguistics [24], psychology [7] and sociology [1].

Sequence-analysis methods can be categorized depending on whether the sequences are treated as a whole or step by step [2]. Step-by-step methods examine relationships among elements or states in the sequences. Time-series methods are used to study the dependence of an interval-measured sequence on its own past. When the variable of interest is categorical, Markov methods are appropriate; these calculate transition probabilities based on the transition between two events [37]. Transitions out of a prior category can be modeled by event-history methods, also known as duration methods, hazard methods, failure analysis and reliability analysis; here the central research question is the time until transition.

Whole-sequence methods use the entire sequence as the unit of analysis to discover similarities between sequences, resulting in typologies. The central issue addressed is whether there are patterns in the sequences, either over the whole sequences or within parts of them. There are two approaches to this pattern question. In the algebraic approach, each sequence is reduced to some simplest form and sequences with similar 'simplest forms' are gathered under one heading. In the metric approach, a similarity measure between the sequences is calculated, which is subsequently processed by clustering, scaling and other categorization methods to extract typical sequential patterns. Methods like optimal matching or optimal alignment are commonly applied within this metric approach. In an intermediate situation, local similarities are analyzed to find out the role of key subsequences embedded in longer sequences [49].


Given that traditional classification models are designed for modeling non-time varying independent variables and that sequence-analysis methods are well suited to model dynamic information, a combination of both methods unites the best of both worlds and allows for building predictive classification models that incorporate non-time varying as well as time-varying independent variables. One possible approach, among others, is to precede the traditional classification method by a sequence-analysis method that models the dynamic exogenous variables (cf. a serial instead of parallel combination of classifiers [26]). In this paper, we precede a logistic regression, as a traditional classification method, by a Sequence-Alignment Method (SAM), a whole-sequence method using the metric approach. The SAM analysis is used to model a time-varying independent variable. We identify how similar the customers are on the dynamic independent variable by calculating a similarity measure between each pair of customers, the SAM distances. These distances are further processed by a clustering algorithm to produce groups of customers that are relatively homogeneous with respect to the dynamic independent variable. As we cluster on a dimension influencing the dependent variable, the clusters are not only homogeneous in terms of the time-varying independent variable, but should also be homogeneous with respect to the dependent variable. This way, we make the implicit link between clustering and classification explicit; after all, clustering is in theory a special problem of classification associated with an equivalence relation defined over a set [36]. Including the cluster-membership information as dummies in the classification model not only allows for modeling the dynamic independent variable in an appropriate way, but should also improve predictive performance.

In this paper, we illustrate the new procedure, which combines a sequence-analysis method with a traditional classification method, by estimating a customer-attrition model for a large International Financial-Services Provider (from now on referred to as IFSP). This attrition model feeds the managerial decision process and helps refine the retention strategy by elucidating the profile of customers with a high defection risk. A traditional logistic regression is applied to predict whether a customer will churn or not. This logistic regression is preceded by an element- and position-sensitive Sequence-Alignment Method to incorporate a time-varying covariate.


We calculate the distance between each pair of customers on a sequential dimension, i.e., the evolution in relative account-balance total of the customer at the IFSP, and use these distances as input for a subsequent cluster analysis. The cluster-membership information is incorporated in the logistic regression by dummies. We hypothesize that the logistic-regression model with the time-varying independent variable included as cluster dummy variables will outperform the traditional logistic regression where the same sequential dimension is incorporated by creating as many non-time varying independent variables as there are time points on which the dimension is measured.

The remainder of this paper is structured as follows. In Section 2 we describe the different methods used. We discuss the basic principles of SAM and show how the cost allocation influences the mathematical properties of the resulting SAM distance measures, determining whether a symmetric or an asymmetric clustering algorithm is appropriate. We outline how a modification of Taylor's cluster-sampling algorithm [42] and Butina's clustering algorithm based on exclusion spheres [6] allows clustering on asymmetric SAM distances. In Section 3 we outline how the new procedure proposed in this paper is applied within a financial-services context to improve the prediction of churn behavior. Section 4 investigates whether the results confirm our hypothesis of improved predictive performance. We conclude with a discussion of the main findings and introduce some avenues for further research.

2 Methodology

2.1 Sequence-Alignment Method (SAM)

The Sequence-Alignment Method (SAM) was developed in computer science (text editing and voice recognition) and molecular biology (protein and nucleic-acid analysis). A common application in computer science is string correction or string editing [47]. The main use of sequence comparison in molecular biology is to detect the homology between macromolecules: if the distance between two macromolecules is small enough, one may conclude that they have a common evolutionary ancestor. Applications of sequence alignment in molecular biology use comparatively simple alphabets (the four nucleotide molecules or the twenty amino acids) but tend to have very long sequences [49]. Conversely, in marketing applications, sequences will mostly be shorter but drawn from a very large alphabet.


Besides SAM applications in computer science and molecular biology, there are applications in social science [1], transportation research [21] and speech processing [34]. Recently, SAM has been applied in marketing to discover visiting patterns of websites [18]. Sankoff & Kruskall [40], Waterman [48] and Gribskov & Devereux [14] are good references on the Sequence-Alignment Method.

SAM handles variable-length sequences and incorporates sequential information, i.e., the order in which the elements appear in a sequence, into its distance measure (unlike conventional position-based distance measures, like the Euclidean, Minkowski, city-block and Hamming distances). The original sequence-alignment method can be summarized as follows. Suppose we compare sequence a, called the source, having i elements, a = [a1, …, ai], with sequence b, i.e., the target, having j elements, b = [b1, …, bj]. In general, the distance or similarity between sequences a and b is expressed by the number of operations (i.e., the total amount of effort) necessary to convert sequence a into b. The SAM distance is represented by a score: the higher the score, the more effort it takes to equalize the sequences and the less similar they are. The elementary operations are insertions, deletions and substitutions (replacements). Deletion and insertion operations, often jointly referred to as indel, are applied to elements of the source (first) sequence in order to change the source into the target (second) sequence. A substitution is equivalent to a deletion followed by an insertion. Some advanced research involves other operations like swaps or transpositions (i.e., the interchange of adjacent elements in the sequence), compression (of two or more elements into one element) and expansion (of one element into two or more elements).

Every elementary operation is given a weight (i.e., a cost) greater than or equal to zero. It is common practice to make assumptions on the weights in order to satisfy the metric axioms (non-negativity, the zero property, the triangle inequality and symmetry) of a mathematical distance (e.g., equal weights for deletions and insertions to preserve the symmetry axiom) [40]. Weights may be tailored to reflect the importance of operations, the similarity of particular elements (cf. element sensitive), the position of elements in the sequence (cf. position sensitive), or the number/type of neighboring elements or gaps [49]. Different weights for insertion and deletion, as well as position-sensitive weights, result in SAM distances that are no longer symmetric: |ab| ≠ |ba|. The latter has implications for the clustering algorithm that can be used (cf. infra). Different meanings can be given to the word 'distance' in sequence comparison.


In this paper, we express the relatedness (similarity or distance) between customers on their evolution in relative account-balance total at the IFSP by calculating the weighted-Levenshtein [29] distance between each possible pair of customers (i.e., pairwise-sequence analysis). The weighted-Levenshtein distance defines dissimilarity as the smallest sum of operation-weighting values required to change sequence a into b. In this way a distance matrix is constructed, which is subsequently used as input for a cluster analysis.
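To make the cost mechanics concrete, the following is a minimal dynamic-programming sketch of a weighted-Levenshtein distance with element- and position-sensitive costs; the cost functions at the bottom are illustrative assumptions, not the weights used in this paper.

```python
from typing import Callable, Sequence

def weighted_levenshtein(
    a: Sequence[str], b: Sequence[str],
    ins_cost: Callable[[str, int], float],
    del_cost: Callable[[str, int], float],
    sub_cost: Callable[[str, str, int], float],
) -> float:
    """Smallest total operation cost to convert source a into target b.
    With unequal indel costs or position-sensitive weights the result is
    asymmetric: d(a, b) != d(b, a)."""
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]  # d[i][j]: cost a[:i] -> b[:j]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost(a[i - 1], i - 1)
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost(b[j - 1], j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            keep = 0.0 if a[i - 1] == b[j - 1] else sub_cost(a[i - 1], b[j - 1], i - 1)
            d[i][j] = min(d[i - 1][j] + del_cost(a[i - 1], i - 1),  # deletion
                          d[i][j - 1] + ins_cost(b[j - 1], j - 1),  # insertion
                          d[i - 1][j - 1] + keep)                   # substitution
    return d[n][m]

# Illustrative weights: deletions cost more than insertions and grow with
# position, so the distance is asymmetric; substitutions are element-sensitive.
ins = lambda e, pos: 1.0
dele = lambda e, pos: 1.5 + 0.1 * pos
sub = lambda e, f, pos: 0.5 * abs(int(e) - int(f))

print(weighted_levenshtein("012", "0312", ins, dele, sub))   # 1.0 (one insertion)
print(weighted_levenshtein("0312", "012", ins, dele, sub))   # 1.6 (one deletion): asymmetry
```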

2.2 Cluster Analysis of Weighted-Levenshtein SAM Distances

We cluster the customers on the weighted-Levenshtein SAM distances, which express how dissimilar they are on the sequential dimension. The cluster-membership information resulting from this cluster analysis is translated into cluster dummies, which represent the sequential dimension in a subsequent classification model. We hypothesize that a classification model including cluster indicators (operationalized as dummy variables) based on SAM distances will outperform a similar model where the same sequential dimension is incorporated by as many non-time varying independent variables as there are time points on which the dimension is measured. After all, these dummies are good indicators of the type of behavior the customer exhibits on the sequential dimension (i.e., the time-varying independent variable), as well as towards the dependent variable (cf. an explicit typology of customers on the time-varying covariate results in an implicit typology on the dependent variable).

A distance matrix holding the pairwise weighted-Levenshtein distances between customer sequences is used as the distance measure for clustering. As discussed earlier, depending on how the weights (i.e., costs) for SAM are set, the distances in the matrix are symmetric or asymmetric. Most common clustering methods employ symmetric, hierarchical algorithms such as Ward's, single-, complete-, average- or centroid linkage [15, 20, 25], non-hierarchical algorithms such as Jarvis-Patrick [21], or partitional algorithms such as k-means or hill-climbing. Such methods require symmetric measures, e.g., Tanimoto, Euclidean, Hamann or Ochiai, as their inputs. One drawback of these methods is that they cannot capture important asymmetric relationships. Nevertheless, there are many practical settings in which the underlying relation is asymmetric.


Asymmetric relationships are common in transportation research (cf. different distances between two cities A and B, |AB| ≠ |BA|, due to different routes, e.g., the Vehicle Routing Problem [43]), in text mining (cf. word associations: most people will relate 'data' to 'mining' more strongly than conversely [44]), in sociometric ratings (cf. a person i could express a stronger like or dislike rating towards person j than vice versa), in chemoinformatics (cf. compound A may fit into compound B while the reverse is not necessarily true) and, to a lesser extent, in marketing research (cf. brand-switching counts [10], 'first choice'-'second choice' connections [45] and asymmetric price effects between competing brands [41]). A good overview of models for asymmetric proximities is given by Zielman and Heiser [50].

Although many research settings involve asymmetric proximities, only a few clustering algorithms can handle asymmetric data. Most of these are based on a nearest-neighbor table (NNT). Krishna et al. [27] provide a clustering algorithm for asymmetric data (the CAARD algorithm, which closely resembles the Leader Clustering Algorithm (LCA) [16]) with applications to text mining. Ozawa [36] defines a hierarchical asymmetric clustering algorithm called Classic and applies it to the detection of gestalt clusters; his algorithm is based on an iteratively defined nested sequence of NNRs (Nearest-Neighbor Relations). MacCuish et al. [31] converted the Taylor-Butina exclusion-region grouping algorithms [6, 42] into a true clustering algorithm, which can be used for disjoint or non-disjoint (overlapping) as well as symmetric or asymmetric clustering. Although this algorithm was designed for clustering compounds (the chemoinformatics field, with applications like compound acquisition and lead optimization in high-throughput screening), in this paper it is employed to cluster customers on marketing-related information. More specifically, we apply the asymmetric, disjoint version of the algorithm to the asymmetric SAM distances obtained earlier. The asymmetric, disjoint Taylor-Butina algorithm is a five-step procedure [30], sketched in code below:

1. Create the threshold nearest-neighbor table using similarities in both directions.
2. Find true singletons, i.e., data points (in our case customers) with an empty nearest-neighbor list. These elements do not fall into any cluster.
3. Find the data point with the largest nearest-neighbor list. This point tends to be in the center of the k-th (cf. k clusters) most densely occupied region of the data space. The data point, together with all its neighbors within its exclusion region, constitutes a cluster; the data point itself becomes the representative data point for the cluster. Remove all elements in the cluster from all nearest-neighbor lists. This process can be seen as putting an 'exclusion sphere' around the newly formed cluster [6].
4. Repeat step 3 until no data points with a non-empty nearest-neighbor list remain.
5. Assign the remaining data points, i.e., false singletons, to the group that contains their most similar nearest neighbor, but identify them as 'false singletons'. These elements have neighbors at the given similarity-threshold criterion (e.g., all elements with a dissimilarity smaller than 0.3 are deemed similar), but a 'stronger' cluster representative, i.e., one with more neighbors in its list, excludes those neighbors (cf. the cluster criterion).
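A minimal sketch of these five steps on a dissimilarity matrix follows. This is our paraphrase of the procedure, not the Mesa Suite implementation, and step 5 is simplified to assignment to the cluster of the most similar representative.

```python
import numpy as np

def taylor_butina(dist: np.ndarray, threshold: float):
    """Asymmetric, disjoint Taylor-Butina clustering (sketch).
    dist: n x n, possibly asymmetric, dissimilarity matrix.
    Returns (labels, representatives); label -1 marks true singletons."""
    n = dist.shape[0]
    # Step 1: threshold nearest-neighbor table, requiring the dissimilarity
    # to fall below the threshold in both directions.
    nn = [set(j for j in range(n)
              if j != i and dist[i, j] < threshold and dist[j, i] < threshold)
          for i in range(n)]
    labels = np.full(n, -1)
    # Step 2: true singletons have an empty nearest-neighbor list.
    true_singletons = {i for i in range(n) if not nn[i]}
    reps = []
    unclustered = set(range(n)) - true_singletons
    # Step 4: repeat step 3 until all nearest-neighbor lists are empty.
    while any(nn[i] for i in unclustered):
        # Step 3: the point with the largest nearest-neighbor list becomes
        # the representative of the next cluster.
        center = max(unclustered, key=lambda i: len(nn[i]))
        members = ({center} | nn[center]) & unclustered
        for i in members:
            labels[i] = len(reps)
        reps.append(center)
        unclustered -= members
        for i in unclustered:  # 'exclusion sphere': drop clustered points
            nn[i] -= members
    # Step 5 (simplified): false singletons, i.e., leftovers that did have
    # neighbors at the threshold, join their most similar representative.
    for i in unclustered:
        labels[i] = int(np.argmin([dist[i, r] for r in reps])) if reps else -1
    return labels, reps
```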

Fig. 1. Asymmetric Taylor-Butina schematic (MacCuish et al., 2003): exclusion regions whose diameter is set by the threshold value (here, threshold = .15), showing a representative compound, a false singleton and a true singleton, with dissimilarity measured in both directions.

2.3 Incorporating Cluster-Membership Information in the Classification Model

After having applied SAM and cluster analysis using the asymmetric Taylor-Butina algorithm, we build a classification model to predict a binary target variable, in our application 'churn'. As the classification method, we use binary logistic regression. We build two churn models using the logistic-regression method. One model includes the sequential dimension as cluster dummies resulting from clustering the SAM distances (from now on referred to as LogSeq). The second model incorporates the sequential dimension in a traditional way, by as many non-time varying regressors as there are time points on which the dimension is measured (from now on referred to as LogNonseq).
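As a sketch of the difference between the two specifications, assume a hypothetical customer table with the four relbalance columns of Table 1, a Taylor-Butina cluster label and a churn flag (the data below are synthetic; apart from the relbalance element names, the column names are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the customer table.
df = pd.DataFrame({
    "relbalanceJanMar": rng.normal(size=n),
    "relbalanceMarJul": rng.normal(size=n),
    "relbalanceJulOct": rng.normal(size=n),
    "relbalanceOctDec": rng.normal(size=n),
    "cluster": rng.integers(0, 5, size=n),
    "churn": rng.integers(0, 2, size=n),
})

# LogNonseq: one non-time varying regressor per time point.
X_nonseq = df[["relbalanceJanMar", "relbalanceMarJul",
               "relbalanceJulOct", "relbalanceOctDec"]]
# LogSeq: the sequential dimension enters as cluster-membership dummies.
X_seq = pd.get_dummies(df["cluster"], prefix="cluster", drop_first=True)

log_nonseq = LogisticRegression(max_iter=1000).fit(X_nonseq, df["churn"])
log_seq = LogisticRegression(max_iter=1000).fit(X_seq, df["churn"])
```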


Both models are estimated on a training sample and subsequently validated on a hold-out sample containing customers that do not belong to the training sample. We compare the predictive performance of the LogSeq model with that of the LogNonseq model.

In order to test the performance of the LogSeq model on the hold-out sample, we need a procedure to assign the hold-out customers to the clusters identified on the training sample. We define five sequences per cluster in the training sample as representatives. By default, the grouping module of the Mesa Suite software package, which implements the Taylor-Butina clustering algorithm [30], returns only one representative for each identified cluster. We prefer to have more than one representative customer per cluster in order to improve the quality of the allocation of hold-out customers to the clusters identified on the training sample. Therefore, once we have found a good k-cluster solution on the training sample, we apply the Taylor-Butina algorithm to the cluster-specific SAM distances in order to obtain a five-cluster solution delivering five representatives for the given cluster. This way, each cluster has five representatives. Next, we calculate the SAM distances of the hold-out sequences towards these groups of five cluster representatives and vice versa. Each hold-out sequence is assigned to the cluster to which it has the smallest average distance (i.e., the smallest average distance towards the five cluster representatives). This cluster-membership information is transformed into cluster dummy variables.

The predictive performance of the classification models (in this case, logistic regressions) is assessed by the Area Under the receiver operating characteristic Curve (AUC). Unlike the Percentage Correctly Classified (PCC), this performance measure is independent of the chosen cut-off. The Receiver Operating Characteristic curve plots the hit percentage (events predicted to be events) on the vertical axis versus the percentage of false alarms (non-events predicted to be events) on the horizontal axis for all possible cut-off values [13]. The predictive accuracy of the logistic-regression models is expressed by the area under this ROC curve (AUC). The AUC statistic ranges from a lower limit of 0.5 for chance (null-model) performance to an upper limit of 1.0 for perfect performance [13]. We compare the predictive performance of the LogSeq model with the predictive accuracy of the LogNonseq model. We hypothesize that the LogSeq model will outperform the LogNonseq model.
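A minimal sketch of the hold-out assignment and the cut-off-free evaluation, assuming sam_distance is a weighted-Levenshtein function like the one in Section 2.1 and representatives[k] holds the five representative sequences of cluster k; averaging the two directions of the asymmetric distance is one plausible reading of "and vice versa":

```python
import numpy as np

def assign_to_cluster(seq, representatives, sam_distance):
    """Assign a hold-out sequence to the cluster with the smallest average
    SAM distance to that cluster's five representatives; both directions of
    the asymmetric distance are averaged (an assumption of this sketch)."""
    avg = [np.mean([(sam_distance(seq, r) + sam_distance(r, seq)) / 2.0
                    for r in reps])
           for reps in representatives]  # representatives[k]: 5 sequences
    return int(np.argmin(avg))

# Example (hypothetical names):
# cluster = assign_to_cluster(holdout_seq, representatives, weighted_levenshtein)
# Cut-off-independent accuracy on the hold-out sample:
# from sklearn.metrics import roc_auc_score
# auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
```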


3 A Financial-Services Application

We illustrate our new procedure, which combines sequence analysis with a traditional classification method, on a churn-prediction case to support the customer-retention decision system of a major International Financial-Services Provider (IFSP). Over the past two decades, the financial markets have become more competitive due to the mature nature of the sector on the one hand and deregulation on the other, resulting in diminishing profit margins and blurring distinctions between banks, insurers and brokerage firms (i.e., universal banking). Hence, nowadays a small number of large institutions offering a wide set of services dominate the financial-services industry. These developments stimulated bancassurance companies to implement Customer Relationship Management (CRM). Under this intense competitive pressure, companies realize the importance of retaining their current customers.

The substantive relevance of attrition modeling comes from the fact that an increase in retention rate of just one percentage point may result in substantial profit increases [46]. Successful customer retention allows organizations to focus more on the needs of their existing customers, thereby increasing the managerial insight into these customers' needs and hence decreasing servicing costs. Moreover, long-term customers buy more [12] and, if satisfied, might provide new referrals through positive word-of-mouth for the company; such customers also tend to be less sensitive to competitive marketing actions. Finally, losing customers leads to opportunity costs due to lost sales, and attracting new customers is five to six times more expensive than retaining them [4, 8]. For an overview of the literature on attrition analysis we refer to Van den Poel and Larivière [46]. Combining several techniques, as in this paper, to achieve improved attrition models has already been shown to be highly effective [28].

3.1 Customer Selection

We define a 'churned' customer as someone who closed all his or her accounts at the IFSP. We predict whether customers who were still customers on December 31st, 2002, will churn on all their accounts in the next year (i.e., 2003) or not. Several selection criteria are used to decide which customers to include in our analysis. Firstly, we only select customers who became customers from January 1st, 1992 onwards, because the information in the data warehouse before this date is less detailed. Secondly, we only select customers having at least three distinct purchase moments before January 2003. This constraint is imposed because we wish to focus the attrition analysis on the more valuable customers.


Given that most customers at the IFSP possess only one financial service, the selected customers clearly belong to the more valuable clients of the IFSP. Thirdly, we only keep customers who were still customers on December 31st, 2002 (cf. the prediction of the churn event in 2003). This eventually leaves 16,254 customers, among which 399 (2.45%) closed all their accounts in 2003. We randomly created a training and a hold-out sample of 8,127 customers each, containing 200 (2.46%) and 199 (2.45%) churners, respectively. There is no overlap between the training and hold-out samples.

3.2 Construction of the Sequential Dimension

As discussed earlier, we want to include a sequential covariate in a traditional classification model. One sequential dimension likely to influence the churn probability is the customers' evolution in account-balance total at the IFSP. We define this variable as the sum of the customers' total assets (i.e., total outstanding balance on short- and long-term credit accounts + total debit on current account) and total liabilities (i.e., total amount on savings and investment products + credit on current account + sum of monthly insurance fees). Although this account-balance total is a continuous dimension, it is registered in the data warehouse at discrete moments in time: at the end of the month for bank accounts and on a yearly basis for insurance products. We have reliable data for account-balance total from January 1st, 2002 onwards. We build sequences of relative differences in account-balance total (i.e., relbalance) rather than sequences of absolute account-balance totals, with the aim of facilitating the capture of overall trends in account-balance total. Each sequence contains four elements (see Table 1): relbalanceJanMar, relbalanceMarJul, relbalanceJulOct and relbalanceOctDec.


Table 1. Four elements of the relative account-balance total dimension (relbalance)

relbalanceJanMar: (account-balance total March 2002 − account-balance total January 2002) / account-balance total January 2002
relbalanceMarJul: (account-balance total July 2002 − account-balance total March 2002) / account-balance total March 2002
relbalanceJulOct: (account-balance total October 2002 − account-balance total July 2002) / account-balance total July 2002
relbalanceOctDec: (account-balance total December 2002 − account-balance total October 2002) / account-balance total December 2002
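A minimal sketch of the element construction from Table 1, assuming a DataFrame with one registered account-balance total per customer per measurement month (the column names jan, mar, jul, oct, dec are illustrative):

```python
import pandas as pd

def relbalance_sequence(balances: pd.DataFrame) -> pd.DataFrame:
    """Relative differences in account-balance total, following Table 1."""
    out = pd.DataFrame(index=balances.index)
    out["relbalanceJanMar"] = (balances["mar"] - balances["jan"]) / balances["jan"]
    out["relbalanceMarJul"] = (balances["jul"] - balances["mar"]) / balances["mar"]
    out["relbalanceJulOct"] = (balances["oct"] - balances["jul"]) / balances["jul"]
    # Denominator follows Table 1, which uses December here rather than October.
    out["relbalanceOctDec"] = (balances["dec"] - balances["oct"]) / balances["dec"]
    return out
```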

Besides observing the account-balance total at discrete moments in time, we converted the ratio-scaled relative account-balance total sequence into a categorical dimension. The latter is crucial to ensure that the SAM analysis will find similarities between the customers' sequences. Based on an investigation of the distribution of the relative account-balance total, nine categories are distinguished, each representing approximately an equal number of customers (cf. to enhance the discovery of similarities between customers):

Table 2. Values for the categorical relative account-balance total dimension and element-based costs

Element 0 1 2 3 4 5 6 7 8

Values of relbalance 0 - 0.5 < relbalance < 0 -2.5 < relbalance
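The exact category boundaries of Table 2 are cut off above; since the categories are stated to hold approximately equal numbers of customers, a hedged sketch of such a discretization uses quantile binning:

```python
import pandas as pd

def categorize(relbalance: pd.Series) -> pd.Series:
    """Bin a relbalance element into nine categories of approximately equal
    size; quantile binning approximates the paper's (partly unshown) cuts."""
    return pd.qcut(relbalance, q=9, labels=False)
```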