Proceedings of International Conference on Intelligent Systems & Data Processing (ICISD 2011) 24 - 25 January, IT Department, G H Patel College of Engineering & Technology, Gujarat, India
Customer Segmentation for Direct Marketing through Regionalisation Clustering in Spatial Database
Lokesh Kumar Sharma and P. V. V. S. Srinivas
Abstract— Direct mail is an important direct marketing tool that encompasses a wide variety of marketing materials. Customer segmentation plays a vital role in applying direct mailing strategies. Campaign cost is also an important parameter for direct mailing and direct marketing. Direct mailing cost can be minimized by identifying homogeneous, spatially dense customer segments. In this paper, we propose an application of a clustering technique for regionalisation of a spatial database to direct mailing. This technique uses spatial density and a correlation method to find spatially dense and non-spatially homogeneous customer segments. The quality of the resulting customer segments is sensitive to the radius of the neighbouring region. Therefore, a median-based heuristic approach is used to determine the radius of neighbouring regions. We applied this algorithm to find densely located and homogeneous customer segments using social variables such as ‘Buying Power’.
Index Terms— Spatial Cluster, Regionalisation, Direct Marketing, Direct Mail.
I. INTRODUCTION
Marketers use direct marketing campaigns to communicate a message to their customers through mail, the Internet, e-mail, telemarketing, and other direct channels in order to prevent churn (attrition) and to drive customer acquisition and purchase of add-on products. More specifically, acquisition campaigns aim at drawing new and potentially valuable customers away from the competition. Cross-/deep-/up-selling campaigns are implemented to sell additional products, more of the same product, or alternative but more profitable products to existing customers [2]. Finally, retention campaigns aim at preventing valuable customers from terminating their relationship with the organization. Direct mail encompasses a wide variety of marketing materials including postcards, catalogs, brochures and dimensional direct mail, with the intention to have an immediate impact on
L. K. Sharma is with the Department of Information Technology and MCA, Rungta College of Engineering and Technology, Bhilai, INDIA (phone: +91-788-6666666; e-mail: lksharmain@gmail.com). P. V. S. S. Srinivas is with the Department of Computer Science and Engineering, G. D. Rungta College of Engineering and Technology, Bhilai, INDIA (e-mail: [email protected]).
customers’ perception of a product or service that will trigger or drive purchases. Customer segmentation is required for effective implementation of direct mailing. Customer segmentation is the process of dividing the customer base into distinct and internally homogeneous groups in order to develop differentiated marketing strategies according to their characteristics. Customer segments can be extracted by applying a data mining technique such as clustering. Market researchers always want to minimize the campaign cost for direct mailing. Most market researchers use algorithms such as K-means, TwoStep, and the Kohonen network for customer segmentation [2][8]. These algorithms produce scattered, convex-shaped customer segments [10]. They do not guarantee dense and homogeneous customer segments [10]. Dense and homogeneous customer segments can minimize the direct mailing cost in the case of door-to-door marketing. For this purpose, a regionalisation clustering technique in a spatial database is used for direct mailing. Regionalisation is one such technique: a classification procedure applied to spatial objects with an area representation, which groups them into homogeneous contiguous regions. This technique applies spatial density and a correlation method to find spatially dense and non-spatially homogeneous customer segments. The density-based method uses spatial variables and helps to find dense clusters of arbitrary shape. The correlation method uses non-spatial social variable(s) such as ‘Buying Power’, ‘Percentage of families’, ‘House holds area’, ‘Stability’, etc., and identifies local subgroups of data objects sharing similar properties. The precise task is discussed in detail in the following sections. Section 2 presents related work on the regionalisation problem. In Section 3, we discuss concepts for a density and homogeneity based clustering approach. Section 4 discusses customer segmentation for direct mailing, and the experiment and results are presented in Section 5.
II. RELATED WORK ON REGIONALISATION
There are mainly four approaches to regionalisation. The first approach is a two-step algorithm. In the first step, a conventional clustering procedure is performed using non-spatial attributes. In the second step, objects in the same cluster with no spatial contact are split, forming different regions. This solution enables a quick evaluation of spatial dependence among objects. However, this method does not
capture the spatial adjacency condition directly, resulting in limited capacity to capture spatial patterns [3] [4].
The second approach considers both the geographical position and the non-spatial features of the objects. The clustering algorithm uses the coordinates of each area’s centroid as extra attributes, and measures similarity between objects as a weighted mean of nearness in the feature space and nearness in the geographical space. To work well, the algorithm needs a choice of weights that produces connected clusters. This weighted-mean approach is used by the regionalisation method of the SAGE system (Spatial Analysis in a GIS Environment) [5].
The third approach includes algorithms that use adjacency relations as constraints on the clustering procedure. The AZP (Automatic Zoning Procedure), proposed by Openshaw [6], is an example. This method starts with a random partition of n objects into k regions. Then, through trial and error, it seeks to reallocate objects over regions to minimize an objective function, subject to the adjacency constraint. Improvements to AZP led to the ZDES automatic zoning system [7]. Since the AZP algorithm is computationally expensive, it is useful to consider other techniques that use the adjacency relations of the objects and are computationally efficient.
These three approaches share a major disadvantage: the a priori regions can be inappropriate, that is, their borders need not coincide with the borders of the actual social areas [1]. This causes inhomogeneity, because the actual spatial distribution of social characteristics is not taken into account. Therefore, regions are not guaranteed to be socially homogeneous. This problem is commonly called the Modifiable Areal Unit Problem (MAUP) in spatial statistics. Nor is it guaranteed that the neighbourhoods of all living places of a region are inhabited and dense, because the spatial point distribution of living places is also not taken into account.
As a result, two neighbouring living places inside a region can be far away from each other [1].
The fourth approach is the density-based algorithm [9] [10]. In our previous study we proposed an efficient clustering technique for regionalisation of a spatial database (RCSDB) [1]. This algorithm combines ‘spatial density’ and a covariance-based method to inductively find spatially dense and non-spatially homogeneous clusters of arbitrary shape. RCSDB tackles the above-mentioned problems with a special clustering method that takes into account spatial point distributions as well as the distribution of several non-spatial characteristics. RCSDB classifies a database of geographical locations into homogeneous, planar and density-connected subsets called “regions”. It finds internally density-connected sets (that is, density-connected sets that are allowed to “touch” other clusters but not to “overlap” them). Furthermore, these sets must have a certain minimal homogeneity. This is measured by a normalised variance-covariance based parameter that takes into account local and global variances as well as the “extravagance” of a cluster. Furthermore, the homogeneity measure is outlier robust, and in order to increase the clustering quality, RCSDB follows ‘avoid and reconsider noise’ and ‘merge cluster’ heuristics. Therefore, in this study we also use RCSDB for direct mail marketing.
III. NOTATION OF REGIONALISATION AS A SPATIAL CLUSTERING PROBLEM
The notion of regionalisation clustering in [1] is defined as follows. Suppose a spatial database D of geo-referenced addresses (point data) is given. Let X = {X1, …, Xj, …, Xm} be a set of variables associated with D, so that each address oi ∈ D has the m-tuple (x1i, …, xji, …, xmi) of values.
Definition 1: Let CL = {C1, …, Ck} be a (not necessarily maximal) mutually exclusive set of nonempty subsets of D, denoting the result of a regionalisation clustering, so that each cluster is a regionalisation cluster, but not every possible regionalisation cluster is part of CL. The noise in D with respect to a given clustering CL is defined as the set of objects in D not belonging to any cluster in CL: noise = D \ (C1 ∪ … ∪ Ck). Let Npred be a reflexive and symmetric binary predicate on D meaning that two points are spatial neighbours. Let Card be a function returning the cardinality of a subset of D, and MinC be a minimum cardinality.
Definition 2 (internally directly density reachable, iddr()): An object p is internally directly density reachable from an object q with respect to Npred, MinC, and CL, written iddr(p, q), if
Npred(p, q) (neighbourhood condition)
Card({o ∈ D | Npred(o, q)}) > MinC (core object condition)
∀ o ∈ D: Npred(o, q) ⇒ ∃ Ci ∈ CL: o ∈ Ci ∧ q ∈ Ci (planarity condition)
This binary predicate is not symmetric and means that p is part of an inhabited and dense neighbourhood of q which entirely belongs to one cluster. Based on this predicate, we define “internally density reachable” idr() and “internally density connected” idc() accordingly. These definitions imply the ones in Sander et al. [10], but they do not follow from them, so idc(p, q) implies dc(p, q), but dc(p, q) does not imply idc(p, q). Furthermore, let H be a homogeneity predicate, meaning that a subset of D is homogeneous with respect to a variable Xj and a minimum homogeneity MinH (see Definition 4).
Definition 3: A regionalisation cluster Ci in D with respect to a set of variables X = {X1, …, Xm} is a nonempty subset of D satisfying the following formal requirements:
• For all addresses p, q from Ci, p ∈ Ci ∧ q ∈ Ci: p is internally density connected to q (internal density connectivity).
• The addresses of Ci are homogeneous with respect to each variable in X, so: ∀ Xj ∈ X: H({o ∈ D | o ∈ Ci}, Xj) (homogeneity).
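As an illustration of Definition 2, the following Python sketch checks the three conditions of iddr(p, q) on a toy data set. It is a minimal sketch under stated assumptions, not the RCSDB implementation: the neighbourhood predicate Npred is assumed here to be "Euclidean distance of at most radius", and the names npred, iddr, cluster_of and radius are hypothetical.

```python
from math import hypot

def npred(p, q, radius):
    """Assumed neighbourhood predicate: Euclidean distance <= radius.

    It is reflexive and symmetric, as Npred is required to be.
    """
    return hypot(p[0] - q[0], p[1] - q[1]) <= radius

def iddr(p, q, points, cluster_of, radius, min_c):
    """Return True if p is internally directly density reachable from q.

    points     -- all addresses in D as (x, y) tuples
    cluster_of -- dict mapping an address to its cluster label in CL
                  (addresses missing from the dict are noise)
    """
    # Neighbourhood condition: p must be a spatial neighbour of q.
    if not npred(p, q, radius):
        return False
    # Core object condition: q has more than min_c neighbours (q included).
    neighbours = [o for o in points if npred(o, q, radius)]
    if len(neighbours) <= min_c:
        return False
    # Planarity condition: all neighbours of q share q's cluster.
    label = cluster_of.get(q)
    if label is None:
        return False
    return all(cluster_of.get(o) == label for o in neighbours)

# Toy example: four close addresses and one distant address.
points = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
one_cluster = {(0, 0): "A", (1, 0): "A", (0, 1): "A", (1, 1): "A", (5, 5): "B"}
split_cluster = {**one_cluster, (1, 1): "B"}

print(iddr((1, 0), (0, 0), points, one_cluster, radius=1.5, min_c=3))
print(iddr((1, 0), (0, 0), points, split_cluster, radius=1.5, min_c=3))
```

The first call holds because all four neighbours of (0, 0) lie in cluster A. The second fails the planarity condition, since (1, 1) is a neighbour of (0, 0) but belongs to another cluster.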
Outlier-robust homogeneity can be measured by the following formula:

$$H_{comb}(C, X) = H_{NLC}(C, X)\,H_{NLC}(C, X) + \bigl(1 - H_{NLC}(C, X)\bigr)\,H_{NLV}(C, X)$$

$$H_{comb}(C, X) = \left(1 - \frac{Var_{local-local}}{Var_{local-global}}\right)\left(1 - \frac{Var_{local-local}}{Var_{local-global}}\right) + \frac{Var_{local-local}}{Var_{local-global}}\left(1 - \frac{Var_{local-local}}{Var_{global}}\right)$$
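The combined homogeneity above can be evaluated numerically as in the following Python sketch. It assumes that the three variance terms have already been computed (their estimation is described in [1] and is not reproduced here), that both denominators are positive, and that H_NLC and H_NLV are the two ratio terms read off the expanded form. The function name h_comb and its argument names are hypothetical.

```python
def h_comb(var_local_local, var_local_global, var_global):
    """Combined homogeneity H_comb from three precomputed variance terms.

    Assumes var_local_global > 0 and var_global > 0.
    """
    h_nlc = 1.0 - var_local_local / var_local_global  # H_NLC, read off the expanded form
    h_nlv = 1.0 - var_local_local / var_global        # H_NLV, read off the expanded form
    # H_comb = H_NLC * H_NLC + (1 - H_NLC) * H_NLV
    return h_nlc * h_nlc + (1.0 - h_nlc) * h_nlv

# Illustrative numbers only (not taken from the Bonn data).
value = h_comb(var_local_local=1.0, var_local_global=2.0, var_global=4.0)
min_h = 0.7
print(value, value > min_h)
```

With these illustrative inputs H_NLC is 0.5 and H_NLV is 0.75, so the cluster would fall below a minimum homogeneity of 0.7.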
Definition 4: Let MinH ∈ [0, 1] be a fixed normalized homogeneity minimum, for example 0.7. Let Qu and Ql be an upper and a lower quantile of a variable Xj in cluster C, for example Q80% and Q20%. Then we consider the cluster C of D to be homogeneous with respect to Xj, H(C, Xj, MinH), if Hcomb(C', Xj) > MinH, with C' = {oi ∈ C | Ql − 1.5*(Qu − Ql) < xji < Qu + 1.5*(Qu − Ql)}. That is, the predicate is true if the cluster shows a minimal homogeneity for an outlier-free subset of its values.
IV. CUSTOMER SEGMENTATION FOR DIRECT MAILING
The most common form of direct marketing is direct mail, in which advertisers send paper mail to all postal customers in an area or to all customers on a list. Direct mail is a popular marketing strategy because it can reach a huge number of people in a given geographic area and present detailed information and high-quality graphics about the products and services offered. Segmentation is an important task for direct mail. Direct mail also has drawbacks: it is a price-oriented medium and can be rather expensive, the cost per thousand is high in view of the highly targeted nature of its impact, and it wastes paper. In our opinion, these costs can be reduced by using RCSDB. RCSDB can create data-driven behavioral customer segments. It can analyze behavioral data, identify the natural groupings of customers, and suggest a solution founded on observed data patterns. It can also be used to develop segmentation schemes based on the current or expected/estimated value of the customers. RCSDB identifies customers who live near one another and show similar behavior. Therefore, the distribution cost of mail can be reduced.
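For a customer variable such as buying power, the outlier-free subset C' of Definition 4 can be obtained as in the following Python sketch. The quantile estimator is not specified in this paper, so linear interpolation between order statistics is an assumption made here, as are the function names quantile and outlier_free and the toy values. The input is assumed to be nonempty, as clusters are by Definition 1.

```python
from math import floor

def quantile(sorted_values, p):
    """Quantile by linear interpolation between order statistics (an assumption)."""
    pos = p * (len(sorted_values) - 1)
    lo = floor(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def outlier_free(values, p_low=0.2, p_high=0.8):
    """Keep values strictly inside (Ql - 1.5*(Qu - Ql), Qu + 1.5*(Qu - Ql))."""
    ordered = sorted(values)
    q_l = quantile(ordered, p_low)   # lower quantile Ql, e.g. Q20%
    q_u = quantile(ordered, p_high)  # upper quantile Qu, e.g. Q80%
    spread = q_u - q_l
    return [x for x in values if q_l - 1.5 * spread < x < q_u + 1.5 * spread]

# Toy buying-power values for one cluster; 100 is an obvious outlier.
buying_power = [10, 11, 12, 13, 14, 100]
print(outlier_free(buying_power))
```

Here the extreme value 100 is dropped, and the homogeneity of the cluster would then be evaluated on the remaining five values only.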
V. EXPERIMENT AND RESULT
The regionalisation of residential areas starts with deriving residential areas from addresses. We synthesize residential areas from atomic spatial building blocks, so that the clusters are spatially compact and homogeneous according to the chosen non-spatial characteristics. In our case the atomic building blocks are addresses of buildings, and the essential variables to characterize residential areas are relative location, housing structure and area usage. The relative location is a spatial characteristic that forces spatially dense circular shapes around a city centre. It is normally measured as distance to the city centre, and it is used to classify different residential areas according to urbanization and centrality. In our approach, we consider the spatial density instead, because spatially dense addresses can be considered to have the same relative location. Furthermore, residential areas are supposed to be circular but do not necessarily have to be, and in this way we can allow for arbitrary shapes. There are many different segmentation types based on the specific criteria or attributes used for segmentation [2]. The most widely used segmentations are value based, behavioral, propensity based, loyalty based, socio-demographic, life-stage, etc. Segmentation is used in strategic marketing to support multiple business tasks. The starting point should always be the particular business situation. According to the business situation, social variables can be chosen for RCSDB. In this experiment we have chosen ‘Buying Power’ to measure the homogeneity or customer behavior. Depending on the business requirement, it is also possible to choose more than one variable. Table I shows the result of customer segmentation for the city of Bonn, Germany, using the variable ‘Buying Power’. Figure 1 visualizes the customer segments on the Bonn city map.
TABLE I. QUALITY OF THE CUSTOMER SEGMENTS FOR BONN USING VARIABLE ‘BUYING POWER’.
Quality measure                   RCSDB    RCSDB with merging
Number of customer segments       95       85
% of noise                        7%       7%
Standard deviation                0.93     0.93
Weighted average homogeneity      0.75     0.75
Fig. 1. Result of Customer Segments implemented on Bonn City, depicting various segments.
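Two of the quality measures reported in Table I can be illustrated with a short Python sketch. The share of noise follows from the definition of noise as the addresses not assigned to any cluster. The weighting scheme behind the weighted average homogeneity is not spelled out in this paper, so weighting each segment by its number of addresses is an assumption made here. The function name segment_quality and the toy numbers are hypothetical and are not the Bonn data.

```python
def segment_quality(segments, n_addresses):
    """Summarise a segmentation.

    segments    -- list of (size, homogeneity) pairs, one per customer segment
    n_addresses -- total number of addresses in the database D
    Returns (noise share, size-weighted average homogeneity).
    """
    clustered = sum(size for size, _ in segments)
    # Noise: addresses of D that belong to no segment.
    noise_share = (n_addresses - clustered) / n_addresses
    # Assumed weighting: each segment counts in proportion to its size.
    weighted_h = sum(size * h for size, h in segments) / clustered
    return noise_share, weighted_h

# Toy input: two segments covering 90 of 100 addresses.
noise, avg_h = segment_quality([(60, 0.8), (30, 0.7)], n_addresses=100)
print(f"noise: {noise:.0%}, weighted average homogeneity: {avg_h:.2f}")
```

Under this assumption large segments dominate the average, which matches the intuition that a mailing campaign is mostly affected by the segments containing most customers.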
VI. CONCLUSION
Customers are the most important asset of an organization. There cannot be any business prospects without satisfied customers who remain loyal and develop their relationship with the organization. Customer-segment-based metric points are being developed in order to effect the shift from globalization to regionalisation in marketing strategy. All businesses need to know where their best customers are located, how much buying power they possess and how far any given customer must travel to the nearest sales or service point. The implementation for the city of Bonn can be repeated for other places and cities. Also, an automated procedure of customer
segmentation with RCSDB would allow many special-purpose regionalisations, depending on the kind of homogeneity one is interested in. It would be helpful for many applications such as geo-marketing and socio-economic and census data analysis.
REFERENCES
[1] L. K. Sharma, S. Scheider, W. Kloesgen and O. P. Vyas, “Efficient clustering technique for regionalisation of a spatial database”, Int. J. Business Intelligence and Data Mining, vol. 3, no. 1, pp. 66-81, 2008.
[2] K. Tsiptsis and A. Chorianopoulos, “Data Mining Techniques in CRM”, John Wiley & Sons, Ltd, 2009.
[3] C. D. Juan, R. Raul and S. Jordi, “Supervised Regionalization Methods: a Survey”, Research Institute of Applied Economics, 2006.
[4] R. M. Assuncao, M. C. Neves, G. Câmara and C. C. Freitas, “Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees”, Int. J. of Geographical Information Science, vol. 20, no. 7, pp. 797-811, August 2006.
[5] S. Wise, R. Haining and J. Ma, “Providing spatial statistical data analysis functionality for the GIS user: the SAGE project”, Int. J. of Geographical Information Science, vol. 15, no. 3, pp. 239-254, 2001.
[6] S. Openshaw and L. Rao, “Algorithms for reengineering 1991 census geography”, Environment and Planning, vol. 27, no. 3, pp. 425-446, 1995.
[7] S. Alvanides, S. Openshaw and P. Rees, “Designing your own geographies”, in The Census Data System, P. Rees, D. Martin and P. Williamson (Eds.), pp. 47-65, 2002.
[8] R. Xu and D. Wunsch, “Survey of Clustering Algorithms”, IEEE Trans. on Neural Networks, vol. 16, no. 3, pp. 645-678, May 2005.
[9] S. Brecheisen, H. P. Kriegel and M. Pfeifle, “Multi-Step Density-Based Clustering”, Knowledge and Information Systems, vol. 9, no. 3, pp. 284-308, 2006.
[10] J. Sander, M. Ester, H. P. Kriegel and X. Xu, “Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications”, Int. J. of Data Mining and Knowledge Discovery, vol. 2, pp. 169-194, 1998.