
Robotics and Computer-Integrated Manufacturing ∎ (∎∎∎∎) ∎∎∎–∎∎∎


Robotics and Computer-Integrated Manufacturing journal homepage: www.elsevier.com/locate/rcim

Customer segmentation in a large database of an online customized fashion business

Pedro Quelhas Brito a,*, Carlos Soares b,¹, Sérgio Almeida b, Ana Monte a, Michel Byvoet c

a LIAAD-INESC TEC, Faculdade de Economia, Universidade do Porto, Rua Dr. Roberto Frias, 4200-464 Porto, Portugal
b INESC TEC, Faculdade de Engenharia, Universidade do Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
c Bivolino, Wetenschapspark 1-Lab 9, Campuslaan 1, 3590 Diepenbeek, Belgium

Article info

Article history: Received 24 March 2014; received in revised form 29 October 2014; accepted 29 December 2014

Keywords: Customized manufacturing; Fashion industry; Segmentation; Clustering; Subgroup discovery

Abstract

Data mining (DM) techniques have been used to solve marketing and manufacturing problems in the fashion industry. These approaches are expected to be particularly important for highly customized industries because the diversity of products sold makes it harder to find clear patterns of customer preferences. The goal of this project was to investigate two different data mining approaches for customer segmentation: clustering and subgroup discovery. The models obtained produced six market segments and 49 rules that allowed a better understanding of customer preferences in a highly customized fashion manufacturer/e-tailor. The scope and limitations of these clustering DM techniques will lead to further methodological refinements. © 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Segmentation is a classical marketing strategy, justified and explained in every handbook devoted to business [1–3]. Both conceptually and practically, managers know that they cannot satisfy every customer entirely. Therefore, the rationale is to split customers into groups, referred to as market segments, and then target the marketing efforts to the most attractive segment. In this case attractiveness means profitability and sustainability. The segmentation task separates the market (i.e. the consumers) into several groups that are internally homogeneous and heterogeneous vis-à-vis the external members [4]. That process varies according to the segmentation method in use. However, it is important to stress that regardless of which method is used, the final choice is rarely automatic or fully data driven. Many decisions, including which and how many segments to identify, and their relative sizes, are mostly based on the judgment of the business manager, because the segmentation experts rarely have a clear grasp of such domain-dependent, qualitative dimensions [4,5,44].

The strategic application of market targeting is to ensure that we can anticipate consumer reactions to the marketing mix. To obtain a worthwhile return on investment in the communication effort as well as in R&D and product design, we must guarantee the efficiency of those marketing instruments. This only happens if those tools match the market segment profile and, as a result, the target segment responds as expected – for example, by appreciating the product, becoming an involved audience for the brand's advertising and consistently choosing the desired distribution channel [6].

The outcome of segmentation heavily depends on the input variables, which could be demographic, psychographic, geographic, life-style, etc. Nevertheless, when behavioral data are available concerning what customers purchase, the type of products they prefer, their total expenses, their buying frequency or whether or not they respond to sales promotions, then it is possible to implement a more refined segmentation [4,7,8].

Methodologically, the current estimation approaches essentially include multivariate analysis. The most common statistical clustering techniques can be classified into three major categories: non-overlapping hierarchical analysis, non-overlapping non-hierarchical analysis, and overlapping and fuzzy methods [4,9,10]. In the late 1990s, some artificial intelligence algorithms, i.e. neural networks, were also applied to solve clustering problems [11,12]. Applications to mobile e-commerce environments require specific methods, some of which combine traditional time series segmentation with adaptive data mining models or enhanced algorithms [42,45].

* Corresponding author at: Faculdade de Economia, Universidade do Porto, Rua Dr. Roberto Frias, 4200-464 Porto, Portugal.
¹ Part of this study was carried out while Carlos Soares was at LIAAD-INESC TEC, Faculdade de Economia, Universidade do Porto.

http://dx.doi.org/10.1016/j.rcim.2014.12.014
0736-5845/© 2015 Elsevier Ltd. All rights reserved.


Generally, the nature of the clustering problem and the type of data – quantitative or qualitative – will determine the method to be used. More recently, the size of the databases has become an issue. The data sets generated by online transactions can be complex both in nature (due to the quantity and variety of variables) and in size, so new methodological solutions have to be tested to cope with such large databases. The application of algorithms from the fields of data mining and artificial intelligence enables new insights to be gained or makes some analyses possible which would not be viable otherwise [13,14,43].

The research goals of this paper comprise both methodological orientations and managerial/marketing concerns. The former should contribute to defining a segmentation strategy. Overall, the first research objective consists in applying two different DM approaches to a segmentation problem and testing the extent to which they can complement each other by explaining different aspects of the market. We used the K-Medoids clustering (Study 1) and the CN2-SD subgroup discovery (Study 2) methods. The former is used to obtain market segments in a traditional sense, while the latter allows the characterization of subgroups of observations with rare distributions and, to the best of our knowledge, has not yet been used for this purpose.

The second research goal is essentially managerial. The benefits of the segmentation are twofold: (1) externally, it is expected to help the company to redefine its communication strategy, particularly with regard to its sales promotions; (2) internally, by matching the products to the customers' preferences, it will help to redefine product design, adjust the manufacturing process and speed up delivery. To achieve those benefits, it is crucial to obtain a characterization of the market segments based on preferred product attributes as well as on customer profile.
In the remainder of this paper, we will describe the research framework (Section 2), the methodologies used in the two studies (Section 3) and the results (Section 4), ending with a discussion of the conclusions (Section 5).

2. Research framework

Fashion manufacturers/retailers are closely linked to consumer lifestyle(s) [41]. Fashion is a very dynamic business, since it is necessarily innovative – new collections should be issued at least twice a year. Moreover, it is dependent on individual taste and subject to many influences, such as those of magazines, and movie and rock stars. The variety of products offered, together with the speed of response to changes in demand, can be crucial for the success of this business. To be able to compete successfully in a very segmented market, manufacturers and retailers have to develop products with appealing designs. Ultimately, the garment can be fully customized according to customer preferences, thus providing high added value [15]. This is certainly the case in the study addressed in this project, involving a fashion industry company which manufactures a specific garment – shirts – and, at the same time, runs an e-business, which means that it operates globally. Its customers can choose from a range of distinct garment attributes that are pre-defined on the web site.

In this context the designers' task is complex, particularly in areas of activity in which customer preferences change quickly, as is the case with the fashion industry. Hence, to enable companies to guide their strategies towards customer satisfaction, models that characterize customer preferences and purchasing behavior must be incorporated into their product design processes. Currently, the most common means of obtaining these models is by using data analysis methodologies. Typically, the more data that are available, the better the models are.

E-commerce is a business activity that potentially generates huge databases. A website platform offers a highly structured customer–business interface. Furthermore, since all visitor–website interaction can be monitored, the data collected provide a rich and complex account of that interaction. As mentioned above, large data sets are best tackled by data mining techniques. In DM, data collected from past customer interactions can be used to identify patterns and trends [16,17]. These patterns can be useful for understanding past and current customer preferences and behavior, as well as for forecasting what will happen in the future. This approach is typically used in areas such as marketing and sales. For instance, Gamberger and Lavrac [39] and Lavrac [19] used data mining techniques to obtain models to help decision support systems in marketing campaigns.

DM techniques have also been widely used in the textile industry. One example is Thomassey's study [15], which aims to ascertain common practices in this industry. Data mining is also used to provide knowledge that will assist in the evaluation of sales forecasting methods used by some companies [18]. However, to the best of our knowledge, DM has not been used for segmentation purposes to aid marketing strategies in highly customized textile industries.

3. Methodology

In this section, we first describe the database used in this project (Section 3.1) and then the two data mining (DM) techniques employed (Section 3.2).

3.1. Nature of the database

The data owner is bivolino.com, a manufacturer of custom-made shirts. Customers can access the Bivolino website or one of its affiliates and purchase a unique shirt by selecting from its many components, presented on menus with a large number of alternatives. This renders the production and logistics processes of the company very complex. It also makes it challenging for the company to achieve and maintain its profitability, efficiency and productivity goals while keeping up with the needs and trends of its customers.

According to the Bivolino managers, overweight customers constitute an important source of demand for this kind of fashion garment. Although the number of obese people in Europe has increased significantly in the last few years, many traditional brands have not adapted in order to cater for these customers. Thus, many obese customers prefer to purchase their clothes online from trusted companies that provide them with well-fitting garments. Bivolino.com has successfully developed a system that ensures this, and has therefore gained a significant share in that specific market.

The total number of customer orders available in our database is 10,775. Table 1 shows the type of variables used as segmentation criteria inputs in this research.

3.2. DM methods

In this section, we present the two algorithms used in this study, namely K-Medoids (Section 3.2.1) and CN2-SD (Section 3.2.2). Given that subgroup discovery is a lesser known DM task, we will discuss it in more detail below.

3.2.1. Study 1 – K-Medoids algorithm

Clustering can be defined, in operational terms, as follows: “Given a representation of n objects, find k groups based on a measure of similarity such that the similarities between objects in the same group are high while the similarities between objects in different groups are low” [20].



Table 1. Segmentation criteria – input variables.

| Product characteristics | Demographic and biometric (who they are) | Geographic (where they live) | Psychographic (how they behave) | Behavioral (why they buy) |
|---|---|---|---|---|
| Type of fabric: Sheffield, Kiwi_9, Greenwich, London4 | Gender: Men (91%), women (9%) | Country/Nationality: United Kingdom, France, Germany | Lifestyle: Activities – work; social events or entertainment. Purpose – work; fashion | Price sensitivity: Submit voucher (yes; no) |
| Fabric color; Collar type; Fabric structure: Herringbone, Twill; Cuff, placket, pocket | Age groups: [25–34], [35–44], [45–54]; Collar size; BMI (body mass index) | | | |
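As an illustration of the K-Medoids strategy used in Study 1 (pick medoids, assign objects, update medoids, repeat), the following is a minimal sketch in Python. It is not the Rapid Miner implementation used in the study. The attribute names, the toy orders and the Gower-style dissimilarity (range-scaled difference for numerical attributes, simple mismatch for categorical ones) are assumptions made for the example, since the paper only states that any dissimilarity function can be used on mixed data.

```python
import random

# Hypothetical shirt orders: two categorical attributes and one numerical one.
orders = [
    {"fit": "super slim", "pocket": "none", "collar_size": 37},
    {"fit": "super slim", "pocket": "none", "collar_size": 38},
    {"fit": "regular", "pocket": "none", "collar_size": 40},
    {"fit": "comfort", "pocket": "mitred", "collar_size": 44},
    {"fit": "comfort", "pocket": "mitred", "collar_size": 45},
    {"fit": "comfort", "pocket": "round", "collar_size": 46},
]
NUMERIC = {"collar_size"}  # assumption: everything else is categorical


def numeric_ranges(records, numeric):
    """Range (max - min) of each numerical attribute, used for scaling."""
    return {a: max(r[a] for r in records) - min(r[a] for r in records) for a in numeric}


def dissimilarity(a, b, ranges):
    """Gower-style dissimilarity averaged over all attributes (assumed measure)."""
    total = 0.0
    for attribute in a:
        if attribute in ranges:
            span = ranges[attribute] or 1.0  # guard against a zero range
            total += abs(a[attribute] - b[attribute]) / span
        else:
            total += 0.0 if a[attribute] == b[attribute] else 1.0
    return total / len(a)


def assign(records, medoids, ranges):
    """Step 2: attach every object to the medoid it is most similar to."""
    clusters = {m: [] for m in medoids}
    for i, record in enumerate(records):
        nearest = min(medoids, key=lambda m: dissimilarity(record, records[m], ranges))
        clusters[nearest].append(i)
    return clusters


def k_medoids(records, k, ranges, max_iter=100, seed=0):
    rng = random.Random(seed)
    medoids = rng.sample(range(len(records)), k)  # Step 1: arbitrary medoids
    for _ in range(max_iter):
        clusters = assign(records, medoids, ranges)
        updated = []
        for old, members in clusters.items():
            if not members:  # only possible with duplicate records
                updated.append(old)
                continue
            # Step 3: the member with the lowest total dissimilarity to the
            # other members becomes the new medoid of the cluster.
            updated.append(min(members, key=lambda c: sum(
                dissimilarity(records[c], records[j], ranges) for j in members)))
        if set(updated) == set(medoids):  # Step 4: stop when medoids are stable
            break
        medoids = updated
    return medoids, assign(records, medoids, ranges)


ranges = numeric_ranges(orders, NUMERIC)
medoids, clusters = k_medoids(orders, k=2, ranges=ranges)
```

Because every medoid is one of the observed orders, each cluster can be described by a real shirt configuration, which is the property the text contrasts with K-Means centroids.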

According to Velmurugan and Santhanam [21], the basic strategy of the K-Medoids clustering algorithm is to find k clusters in n objects by (1) arbitrarily selecting a representative object (the medoid) for each cluster; (2) associating each remaining object with the medoid to which it is most similar; (3) updating the medoids by choosing the most representative object in each of the k clusters; and (4) repeating Steps 2 and 3 until convergence or until a stopping criterion is met. The K-Medoids algorithm was used instead of the more common method, K-Means, because the latter can only be applied to numerical data. The K-Medoids method is based on the dissimilarities between pairs of objects, which can be obtained with any dissimilarity function. Therefore it can be applied to mixed data, such as the data available in this project. Additionally, it uses representative objects as reference points, while the ones obtained with the K-Means method may be unobservable. The algorithm takes k, the number of clusters, as the input parameter. Due to lack of space, we omit the details of the algorithm, which can easily be found in the literature (e.g. [21]). We have used the implementation available in the Rapid Miner data mining tool.²

3.2.2. Study 2 – CN2-SD algorithm

We commence by introducing the subgroup discovery task. Next we discuss existing applications. Finally, we describe the CN2-SD algorithm, which has been used in this project.

3.2.2.1. Subgroup discovery

Kloesgen [22] and Wrobel [23] present subgroup discovery as a data mining technique for discovering relations between different objects with respect to certain properties of a target variable. The subgroup discovery task aims to discover subgroups of the population that are statistically more interesting and unusual, i.e. with statistical distributions that show unique features with respect to the global distribution of the property (i.e. the target variable) under investigation [24].
Similar definitions have been presented by other authors (e.g. [25]). In this context, Gamberger [33] formalizes a rule (R) that describes a subgroup obtained by induction as follows:

$$R: \mathit{Cond} \rightarrow \mathit{Classvalue} \tag{1}$$

where Classvalue is the value of the target variable (i.e. the variable of interest) in the subgroup discovery task and Cond is a set of attributes that describe the statistical distribution of the subgroup in question. To illustrate this concept, let us assume a simplified version of the case study addressed in this work: we are seeking subgroups of customers relative to their weight category, which is either “normal” or “overweight.” Let us further assume that the distribution of weight categories in the population is 70/30. The rule R: has_monogram → overweight is interesting if the number of shirts sold with a monogram is significant (say, 5%) and the target variable distribution is quite different from the whole dataset (say, 40/60). From the business point of view, this means that overweight customers seem to have a preference for monograms.

Subgroup discovery algorithms can be classified into three groups [40]: algorithms based on classification, such as EXPLORA, MIDOS, SubgroupMiner, SD, CN2-SD and RSD; algorithms based on association, including APRIORI-SD, SD4TS, SD-MAP, DpSubgroup, Merge-SD and IMR; and evolutionary algorithms, such as SDIGA, MESDIF and NMEEF-SD.

Quality measures are a key factor in knowledge extraction, as they quantify the interest in the results obtained. In subgroup discovery, the quality measures assess the importance and interest of the obtained subgroups. Various measures can be used for this purpose. However, as the concepts of importance and interest are hard to define objectively, there is no consensus in the literature concerning which best suits subgroup discovery [19,22]. In this study, we use WRAcc, which is the measure of unusualness that the algorithm used, CN2-SD, tries to optimize. This measure is described later in this section.

² http://www.rapidminer.com.

3.2.2.2. Subgroup discovery applications

In marketing, the application of the EXPLORA algorithm was briefly mentioned by Kloesgen [26], in connection with an analysis of the German financial market, using data from different German institutions. To study this problem, it was necessary to use various preprocessing methods in order to obtain interesting subgroups. Gamberger [39] applied the SD algorithm for the market analysis of certain brands. Lavrac [19] also studied other marketing problems with the CN2-SD algorithm. In both studies, the goal was to find potential customers for different brands in the market. Flach [27] also studied this issue by using an earlier version of the CN2-SD algorithm. In a study related to road accidents, Kavsek [28,29] presented comparisons between the SubgroupMiner, CN2-SD and APRIORI-SD algorithms. Zelezny [30] applied the RSD algorithm to a database containing information about the traffic of calls in an organization. A total of four different versions of the RSD algorithm, with different combinations of quality measures, were used in this study. Another study, with the goal of minimizing problems due to voltage drops in power distribution, applied the CN2-SD algorithm in order to discover interesting and unusual subgroups [31].
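To make the monogram rule from the simplified example in Section 3.2.2.1 concrete, the sketch below scores it with WRAcc, taken here as p(Cond) · (p(Classvalue|Cond) − p(Classvalue)), the definition given in Section 3.2.2.3. The absolute counts are hypothetical; they were chosen only to reproduce the 70/30 population split, the 5% coverage and the 40/60 split inside the subgroup. The second rule is likewise invented to show the trade-off between generality and unusualness.

```python
def wracc(n_total, n_cond, n_cond_and_class, n_class):
    """Weighted relative accuracy of the rule Cond -> Classvalue from raw counts."""
    p_cond = n_cond / n_total                        # generality of the rule
    p_class_given_cond = n_cond_and_class / n_cond   # accuracy of the rule
    p_class = n_class / n_total                      # a priori class probability
    return p_cond * (p_class_given_cond - p_class)


# Hypothetical counts: 1000 shirts, 300 bought by overweight customers (70/30),
# 50 shirts with a monogram (5%), 30 of them bought by overweight customers (40/60).
monogram_rule = wracc(n_total=1000, n_cond=50, n_cond_and_class=30, n_class=300)

# A much broader hypothetical subgroup whose class distribution is close to the
# population's: 500 shirts, 160 of them bought by overweight customers (32%).
broad_rule = wracc(n_total=1000, n_cond=500, n_cond_and_class=160, n_class=300)
```

Under these assumed counts the monogram rule scores about 0.015 and the broad rule about 0.010, so the small but clearly unusual subgroup is ranked above the large subgroup that barely deviates from the population.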

Table 2. Clustering results based on product attributes.

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|
| 1282 (12%) | 4889 (45%) | 872 (8%) | 1570 (15%) | 500 (5%) | 1662 (15%) |
| Work_shirt; Fabric color: Multicolor; Fabric: Sheffield | Work_shirt; Fabric: Greenwich | Fashion_shirt; Fabric structure: Twill; Fabric: Red & Bordeaux | Work_shirt; Collar: Italian Semi-spread; Fabric: London_4 | Fashion_shirt; Collar/Cuff white: Yes; Fabric: Kiwi_9 | Fashion_shirt; Fabric: Miro_3; Fabric structure: Herringbone |

3.2.2.3. The CN2-SD algorithm

The CN2-SD algorithm is an adaptation of the CN2 algorithm [18,32] for subgroup discovery. Therefore, we describe the latter algorithm first. The CN2 algorithm tries to induce classification rules [18,32] in the form of “if Cond then Classvalue” (also represented as “Cond → Classvalue”), where the condition (Cond) refers to the attributes/variables with their values and Classvalue refers to the value, category or class of the target variable.

The main difference between classification and subgroup discovery, which justifies the differences between the algorithms for them, is essentially that classification is a predictive task, while subgroup discovery is a descriptive task. In classification, a good model should be able to assign the correct class value to a new example. In subgroup discovery, the main goal is to find interesting patterns in the training data. For instance, if two rules with different conditions and the same consequent cover an example, then one of them is redundant when the goal is prediction. However, for subgroup discovery purposes, they may both be interesting, since they may provide two different and equally interesting perspectives on the problem.

The implementation of CN2 consists of two parts: a low level part, which carries out a beam search in order to find a single rule with high discriminative power regarding the training data, and a high level task that performs a control procedure, which repeatedly executes the low level task to induce a set of rules, until a satisfactory model, i.e. a set of rules, is obtained. A rule is discriminative if it covers many examples of a single class and only a few of the others. It is quantified by the class entropy of the examples that it covers, i.e. the examples for which the condition (Cond) is verified:

$$\mathrm{Entropy}(\mathit{Cond} \rightarrow \mathit{Classvalue}) = -\sum_{i} p(\mathit{Classvalue}_i \mid \mathit{Cond}) \, \log_2 p(\mathit{Classvalue}_i \mid \mathit{Cond}) \tag{2}$$

where the sum runs over the values Classvalue_i of the target variable.

To prevent the high level task from finding the same rule repeatedly, the examples that are covered by a rule, i.e. that verify the condition in that rule, are removed.

The two most important changes in the adaptation of CN2 for subgroup discovery, referred to as CN2-SD, are the use of the WRAcc measure to assess the quality of the rules and the use of the weighted coverage algorithm. According to Wrobel [23], the Weighted Relative Accuracy heuristic (WRAcc) can be defined as

$$\mathrm{WRAcc}(\mathit{Cond} \rightarrow \mathit{Classvalue}) = p(\mathit{Cond}) \, \bigl( p(\mathit{Classvalue} \mid \mathit{Cond}) - p(\mathit{Classvalue}) \bigr) \tag{3}$$

This is used in CN2-SD to assess the quality of the rules instead of entropy. It combines two components: the measure of the generality of the rule or relative size of a subgroup, given by p(Cond); and the unusualness measure of the distribution or relative accuracy. The relative accuracy is given by the difference between the accuracy of the rule, p(Classvalue|Cond), and the a priori probability of class Classvalue, given by p(Classvalue). In practical terms, the aforementioned heuristic defines a rule as interesting if its accuracy improves in comparison to the a priori probability of the corresponding class. Given the example above, describing what an interesting subgroup is, WRAcc is clearly more useful for this task than accuracy.

Additionally, the removal of the examples that are covered by each new rule, as in CN2, is not suitable for a subgroup discovery task, for two reasons. Firstly, as explained earlier, an example may be part of several subgroups. Secondly, the rules extracted later in the learning process may not represent an interesting subgroup. In fact, they may cover examples that were removed earlier, and so the quality of the rule is incorrectly estimated. A solution to this problem is the weighted coverage algorithm [33]. Instead of removing the examples covered by a rule, this algorithm reduces their weight. This ensures that these examples are still taken into account, while their reduced importance leads the algorithm to pay more attention to examples that are not yet covered. We note that the use of the weighted coverage algorithm requires an adaptation of the WRAcc heuristic. More information can be found in the original CN2-SD paper [34].

Two variations of the CN2-SD algorithm have been proposed. The first can only be used with binary target variables [34], while the other is able to deal with categorical target variables [35]. In this project, we have used CN2-SD because it is the only subgroup discovery algorithm available in the selected tool, Rapid Miner.

4. Results

We begin by discussing the results of the first study (Section 4.1) and go on to discuss those of the second (Section 4.2).

4.1. Study 1

This study is divided into two steps. In the first step we undertook a clustering analysis based on product characteristics. In the second step the clustering analysis was extended by including customer characteristics in the data.

4.1.1. Clustering product characteristics

The aim of this step was to identify the most relevant fashion trends in shirts based on customer choices regarding the characteristics of the product. Table 2 summarizes the results of the clustering performed on 10,775 shirt orders described by 29 attributes (4 numerical/scale and 25 categorical). All the clusters obtained have medoids with at least one attribute that is clearly distinctive, i.e. that identifies each cluster in a unique manner. The number of clusters identified was k = 6. This value was chosen on the basis of some preliminary experiments. The medoids are described in Table 2 by the most “typical” value of each attribute in the corresponding cluster. For numerical attributes, the typical value is the mean, while for the categorical ones it is the most common value, i.e. the mode. The second line contains the number of observations and the corresponding percentage. The third line shows the distinctive features of the typical shirt in each cluster.

Overall, the six clusters can be arranged into two main groups. Clusters 1, 2 and 4, which comprise the first group, share the same type of fashion garment: work shirts, which represent 65% of the total number of shirt orders. Clusters 3, 5 and 6 (the second group) correspond to fashion shirts and represent 35% of total shirt orders. Analyzing these groups in more detail, we further observe the following traits: customer choices are conditioned by a certain formal business dress code; Cluster 2 represents the most common choices in terms of shirt attributes, and the binary³ attributes (e.g. has pocket, has monogram) assume the value “no” in the majority of cases.
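The “typical value” description used for Table 2 (mean for numerical attributes, mode for categorical ones) can be sketched as follows. This is an illustration only, not the output of Rapid Miner. The attribute names and the three orders are hypothetical, and which attributes count as numerical is passed in explicitly as an assumption.

```python
from collections import Counter
from statistics import mean


def typical_values(records, numeric_attributes):
    """Summarize a cluster: mean of numerical attributes, mode of the others."""
    summary = {}
    for attribute in records[0]:
        values = [record[attribute] for record in records]
        if attribute in numeric_attributes:
            summary[attribute] = mean(values)
        else:
            # Most common value; ties resolve to the value seen first.
            summary[attribute] = Counter(values).most_common(1)[0][0]
    return summary


# Hypothetical orders assigned to one cluster.
cluster_orders = [
    {"shirt_type": "work", "has_pocket": "no", "has_monogram": "no", "collar_size": 40},
    {"shirt_type": "work", "has_pocket": "no", "has_monogram": "yes", "collar_size": 41},
    {"shirt_type": "fashion", "has_pocket": "no", "has_monogram": "no", "collar_size": 42},
]
profile = typical_values(cluster_orders, numeric_attributes={"collar_size"})
```

For these three hypothetical orders the profile is a work shirt without pocket or monogram and a mean collar size of 41, mirroring how binary attributes take the value “no” when that is the majority choice in a cluster.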
In the second group, also taking into account customer age, we noticed that more mature men were more likely to show a preference for shirts with pockets than younger ones (see Fig. 1). These observations can be useful for the product designers because understanding customer preferences and how they evolve will assist in the development and design of products to satisfy their tastes.

³ Binary attributes in Rapid Miner's terminology are binomial attributes.

Fig. 1. Relation between age and shirt pocket. The x axis represents the type of pocket and the colors represent the age group. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.1.2. Clustering customer characteristics

In this second step of the clustering analysis, the data was extended with base variables concerning the following dimensions of the characterization of customers: demographic, biometric, geographic, psychographic and behavioral. Since 96.6% of total shirt orders relate to five countries – Belgium, Germany, France, the Netherlands and the United Kingdom – we excluded from analysis the residual data originating in other countries. In terms of gender, women accounted for only 9.6% of the orders, while in the case of 80% of these purchases, promotional coupons were redeemed.

In a similar way to Table 2, Table 3 depicts the attribute values that identify each cluster in a unique way. In this case, we can reorganize the clusters into 3 groups. They can be labeled as work shirts (Clusters 1, 4 and 5), party shirts (Clusters 2 and 3) and fashion shirts (Cluster 6). From this grouping, it becomes evident that young men (aged between 25 and 34) prefer slim-fitting shirts, while more mature men (aged between 45 and 54) prefer comfort fit shirts. Cluster 2 consisted mostly of customers from the female market segment. The characteristics of the shirts in this specific segment allow the company to differentiate between the product designs. For instance, they may include new distinctive features in order to meet the different expectations of men and women.

We also observe that it is important for the company to segment its market according to the age of its customers, since biometric changes exert considerable impact on fashion choices. The level of shirt tightness in relation to the body demonstrates the influence of biometric changes, which are not exclusively driven by age.
Young customers tend to prefer the “super slim fit”, while older customers opt for the “comfort fit”, although the “regular fit” remains the most common choice. This relation is illustrated in Fig. 2. Age difference also seems to explain customer preference for the shirt pocket. This applies to older customers, who are quite keen to choose a particular type of pocket.

The company can also segment the market geographically by country or post code, since customers from different areas have different requirements for fashion and clothing products, their choices often being influenced by social and cultural values.

Another possibility for segmenting the bivolino.com customers is their psychographic profile. Lifestyle and purpose, together with dress code usage, determine the choice of categories such as “work shirt”, “fashion shirt” and “party shirt” (configurator or collection type). Hence we can infer that their choice was probably motivated by professional requirements (e.g. Segments 1, 4 and 5), interest in fashion (e.g. Segment 6), or social event requirements (e.g. Segments 2 and 3).

Price sensitivity was another relevant behavioral segmentation variable, especially among females. This could be measured by gift voucher redemption in order to obtain an immediate discount during the online checkout payment. We concluded that women are more price-sensitive than men, given that more than 82% used a voucher for payment. They are generally more receptive to promotions of this type and more willing to experiment with new products than men. However, the reason for this could also be related to a particular promotion whereby the collection is presented in such a way as specifically to target the female segment. This would show that the outcome was positive, since that new segment responded massively to that incentive.

Table 3. Clustering results (distinct medoid attribute values).

| Cluster 1 | Cluster 2 | Cluster 3 | Cluster 4 | Cluster 5 | Cluster 6 |
|---|---|---|---|---|---|
| 2687 (25%) | 5116 (47%) | 2316 (21%) | 80 (1%) | 564 (5%) | 12 (1%) |
| Work_shirt | Party_shirt | Party_shirt | Work_shirt | Work_shirt | Fashion_shirt |

Distinct attribute values listed in the table: Collar group: <36; Gender: Women; Postal Code: sg49aq; Postal Code: co45bq; Postal Code: cv313nd; Country: Germany; Has voucher: Yes; Country: France; Affiliate: Bivolino.

Fig. 2. Relation between shirt fit and customer age. The x axis represents the age group and the colors represent the type of fit of the shirt. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4.2. Study 2

This study used a database of 7066 orders because it excluded orders from females, and only the five countries accounting for 96.6% of shirt orders were considered. The orders were characterized by 19 variables, where the target variable was the Body Mass Index (BMI) variable. The results were evaluated according to the following measures: size of the subgroups and deviation of the distribution of the subgroups in relation to the full dataset. Additionally, a more subjective analysis was undertaken to evaluate the usefulness of the rules for supporting the company's design and marketing processes. These criteria assess the ability of the rules to provide new knowledge that is not trivial or obvious.

We used the implementation of the CN2-SD algorithm available in Rapid Miner. The model obtained consisted of 54 rules. Each obtained rule describes a population subgroup whose distribution differs from the complete population distribution. To assess the unusualness of a rule, we calculated the difference between the proportion of examples in the subgroup that belonged to the class in the consequent of that rule and the proportion of the examples of the same class in all the orders. In addition, to better interpret and evaluate the results in terms of usefulness for the design of new products, we proceeded with the interpretation and analysis of the subgroups found, classifying them according to the following criteria:

i. Uninteresting subgroups verify one of the following conditions:
   ○ Obvious subgroup from the standpoint of common sense.
   ○ Subgroup whose class distribution has a small deviation in relation to the total distribution of orders.
   ○ Small subgroup, representing less than 0.5% of total orders.
ii. Interesting subgroup for marketing, since it provides useful information in this area, namely as regards the country, affiliate and configurator in which orders are placed. This type of information is less interesting to the design because it contains no new information about the characteristics of the ordered shirts.
iii. Interesting subgroups for design verify one of the following conditions:
   ○ Large subgroup, representing at least 30% of the total orders.
   ○ Subgroup representing at least 5% of total orders, and representing a high deviation in class probability in relation to the total distribution of orders.
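The evaluation measures and the three criteria above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure run in Rapid Miner: the toy orders, the function names, the thresholds `small_dev` and `high_dev` (the text does not quantify "small" or "high" deviation), the `obvious` flag (a human judgment) and the order in which the criteria are tested are all hypothetical. Deviation is computed as the difference of proportions described in the text.

```python
# Attributes that the text associates with marketing-oriented subgroups.
MARKETING_ATTRS = {"country", "affiliate", "configurator"}

def class_share(orders, target, label):
    """Proportion of orders whose target attribute equals label."""
    if not orders:
        return 0.0
    return sum(1 for o in orders if o[target] == label) / len(orders)

def evaluate_rule(orders, conditions, target, label):
    """Return (coverage, deviation) for the rule 'if conditions then label'.

    coverage  = share of all orders covered by the rule's conditions
    deviation = P(label | subgroup) - P(label | all orders)
    """
    subgroup = [o for o in orders
                if all(o.get(k) == v for k, v in conditions.items())]
    coverage = len(subgroup) / len(orders)
    deviation = (class_share(subgroup, target, label)
                 - class_share(orders, target, label))
    return coverage, deviation

def classify(coverage, deviation, attributes, obvious=False,
             small_dev=0.05, high_dev=0.20):
    """Apply criteria i-iii; thresholds and precedence are assumptions."""
    if obvious or abs(deviation) < small_dev or coverage < 0.005:
        return "uninteresting"
    if MARKETING_ATTRS & set(attributes):
        return "interesting for marketing"
    if coverage >= 0.30 or (coverage >= 0.05 and abs(deviation) >= high_dev):
        return "interesting for design"
    return "unclassified"

# Toy data (hypothetical): ten orders with a BMI class as target.
orders = (
    [{"country": "UK", "fit": "regular", "bmi_class": "obese"}] * 3
    + [{"country": "UK", "fit": "slim", "bmi_class": "normal weight"}]
    + [{"country": "BE", "fit": "slim", "bmi_class": "normal weight"}] * 4
    + [{"country": "BE", "fit": "regular", "bmi_class": "obese"}]
    + [{"country": "FR", "fit": "slim", "bmi_class": "overweight"}]
)

rule = {"country": "UK"}
coverage, deviation = evaluate_rule(orders, rule, "bmi_class", "obese")
verdict = classify(coverage, deviation, rule.keys())
print(coverage, round(deviation, 2), verdict)
```

On this toy data the rule covers 40% of the orders and raises the share of the class "obese" by 35 percentage points, so it is labelled as interesting for marketing because it is defined on the country.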

Table 4
Some rules obtained using the CN2-SD operator in Rapid Miner.

Model (by CN2-SD algorithm):

1. If country = UK then obese
2. If fit = regular then obese
4. If fit = regular and collar white = n and cuff white = n then normal weight
7. If affiliate = retailer 1 and cuff white = n and country = UK then morbidly obese
8. If back yoke contrast = y and collar size = <36 then obese
9. If cuff = round single and collar white = n and cuff white = n then overweight
14. If hem = curved hem and fabric = FabricID12186 then obese
15. If age = 35/44 then overweight
21. If placket = real front and affiliate = bivolino then obese
22. If weight = 45–75 then overweight
23. If fit = super slim fit then normal weight
33. If fit = regular and weight = 95–115 and collar white = n and cuff white = n then normal weight
36. If weight = 115–135 then morbidly obese
39. If hem = straight hem and fit = comfort fit and collar white = n then overweight
40. If collar = Italian semi-spread and affiliate = bivolino and pocket = no pocket and collar white = n and cuff white = n then obese
46. If height in cm = 200/210 and weight = 115–135 and collar white = n then overweight
47. If collar = classic point and height in cm = 190/200 and pocket = mitred and weight = 115–135 and placket = real front and cuff white = n then overweight
49. If height in cm = 160/170 and weight = 75–95 and placket = folded and hem = curved hem then obese
else normal weight

Classification columns: Uninteresting subgroup | Interesting subgroup for marketing | Interesting subgroup for design

Classification entries (in source order): Yes Obvious Population >50% Yes Population >43% Deviation < |15%| Population >30% Yes Yes Obvious Obvious Population >10%, Deviation > |55%| Obvious Population >5%, Deviation > |242%| Population >5%, Deviation < |98%| Population < 0.5% Population < 0.5% Population < 0.5%



Table 4 shows some of the rules of the model obtained with the CN2-SD algorithm. Although the rules found describe unusual distributions, as would be expected, not all of them provide interesting and useful knowledge. Some rules represent obvious subgroups, such as the subgroup of customers who choose regular fit shirts and tend to be obese (rule 2), or the subgroup in which the shirts are of the Super Slim Fit type, which tend to be bought by customers of normal weight (rule 23). However, there is also important knowledge for marketing, such as the subgroups suggesting that customers from the United Kingdom tend to be obese (rule 1), that UK customers who make their purchases at Retailer 1 tend to be morbidly obese (rule 7), or that customers aged between 35 and 44 tend to be overweight (rule 15). Finally, some rules provide very interesting knowledge for the design of new products: the subgroup of individuals with a collar size of less than 36 cm who choose a back yoke contrast tends to be obese (rule 8). This is therefore a very specific and interesting subgroup, whose members share a preference for that contrast. Another interesting subgroup chooses curved hems with a particular fabric, FabricID12186 (rule 14). This group of customers also tends to be obese. Perhaps this is the subgroup of greatest interest for design, due to its size, its deviation and the detail of the knowledge it represents.
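The consequents of the rules are BMI classes, so a short sketch of how such a class can be derived from weight and height may help when reading Table 4. The cut-offs below are the conventional bands (18.5, 25, 30 and 40 kg/m²) and are an assumption: this section does not state the thresholds actually used. Representing an interval such as 115–135 by its midpoint is also only an illustrative choice, and the function names are hypothetical.

```python
def bmi(weight_kg, height_cm):
    """Body Mass Index: weight in kg divided by squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)

# Assumed cut-offs as (upper bound, label); not taken from the paper.
BMI_BANDS = [
    (18.5, "underweight"),
    (25.0, "normal weight"),
    (30.0, "overweight"),
    (40.0, "obese"),
]

def bmi_class(value):
    """Map a BMI value to a class label using the assumed bands."""
    for upper, label in BMI_BANDS:
        if value < upper:
            return label
    return "morbidly obese"

def midpoint(low, high):
    """Representative value of an interval such as 115-135 kg."""
    return (low + high) / 2.0

# Rule 46: height 200/210 cm and weight 115-135 kg.
rule_46 = bmi_class(bmi(midpoint(115, 135), midpoint(200, 210)))
# Rule 49: height 160/170 cm and weight 75-95 kg.
rule_49 = bmi_class(bmi(midpoint(75, 95), midpoint(160, 170)))
print(rule_46)  # overweight
print(rule_49)  # obese
```

Under these assumptions, the interval midpoints of rules 46 and 49 fall in the classes that Table 4 gives as their consequents. Rules of this kind restate the definition of the target variable, which is consistent with such subgroups carrying little new knowledge.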

5. Conclusions and future work

In highly customized businesses such as Bivolino.com, which produces tailor-made shirts, the diversity of products sold makes it harder to find clear patterns of customer preferences. In this project we investigated two different data mining (DM) approaches for customer segmentation: clustering and subgroup discovery. The DM algorithms used, namely K-Medoids and CN2-SD, were valuable instruments that enabled us to better understand consumer tastes and preferences, thus allowing companies to be more efficient and responsive to customer requests and to gain a competitive advantage, particularly in highly customized fashion manufacturing. As hypothesized, these instruments provided different and complementary perspectives on the customers and the products they buy. The results proved useful both for product development and for marketing.

Although we have presented a single case study, we believe that the approach described can be useful for marketing in other areas. The types of variables used in the segmentation (product characteristics, demographic and biometric, geographic, psychographic and behavioral) are common in many other business areas, such as banking and automotive. Thus, the process can easily be adapted. The subgroup discovery approach can be applied to other areas, as long as there is a variable describing the customer that is of particular interest to the business. For instance, in the banking sector, this could be the number of years the person has been a customer of the institution while, in the automotive industry, it could be the income of the customer. The generality of the approach indicates that it could be interesting to develop a tool to support this kind of analysis, independently of the domain of application. In this case study, a diverse set of tools was used, which meant that some effort was required to combine them, in particular to move data from one tool to another.
From a DM perspective, this research gave rise to several challenges. The first one was related to the limitations of the method adopted. The K-Medoids clustering algorithm is less sensitive to outliers than K-Means, because each cluster is represented by a medoid (an actual object of the dataset) rather than by the mean, but it still requires the a priori definition of the number of clusters. Deciding the optimal number of clusters, k, is known to be a difficult task [36]. Common approaches to this problem consist in running the algorithm multiple times with


different parameter values (i.e., k), the best configuration obtained from all the runs being used as the final clustering. This method is extremely time-consuming. Moreover, the “best” solution, statistically speaking, is not necessarily the most feasible one to implement in real marketing practice. The interpretation of the outcome at each step is not straightforward, since the validation relies not only on the judgment of the data mining expert but also on that of the manager or other domain expert involved in the project. Another challenge was presented by the categorical nature of many of the variables used (data separable into mutually exclusive categories, such as the collar type). Most of the clustering algorithms available in Rapid Miner do not process categorical data. Nevertheless, the algorithm selected proved to be suitable for the problem under analysis. One final challenge is related to the computational effort required by the selected algorithms. The complexity of the implementations of the K-Medoids and CN2-SD algorithms available in Rapid Miner is quadratic in the number of objects [46,47]. This issue is particularly important in Big Data domains, which involve not only large amounts of data but also data that flows at high speed [48]. Although we have dealt with a relatively small amount of data in this project, the domain has the potential to become a Big Data application. Therefore, this issue should be addressed in the future. In particular, we plan to consider variations of the algorithms used that significantly reduce their complexity [46,47]. Given that Rapid Miner is an open source tool, these variations can be implemented within it.

From the application perspective, this study also raised some challenges. It confirmed that in this domain, as in many other fields such as manufacturing, finance or marketing, the development of DM projects requires a time-consuming trial-and-error strategy to prepare the data and to fine-tune the methods and DM techniques [37,38]. Moreover, it confirmed that the close involvement of the domain experts is essential for the success of the project. Taking into account the difficulties encountered during the investigation and the limitations these imposed on it, further research is needed in the following areas:

Developing heuristic approaches to find the optimal number of clusters.

Defining a distance measure that is specific to the business problem and in line with the evaluation criteria.

Applying and testing other subgroup discovery algorithms.

Transforming the data to enable the use of different algorithms, comparing the different results and deciding the optimal one.
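The first two research directions can be illustrated with a minimal sketch that clusters categorical records with a simple alternating k-medoids and compares several values of k. Everything in it is an assumption made for illustration and none of it reproduces the Rapid Miner operators: the simple matching distance, the deterministic farthest-first initialization, the use of the mean silhouette width as the selection heuristic, the toy shirt records and all function names.

```python
def matching_distance(a, b):
    """Share of attributes on which two categorical records differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def distance_matrix(records):
    """Full pairwise matrix: n * n entries, hence quadratic in n."""
    return [[matching_distance(a, b) for b in records] for a in records]

def init_medoids(dist, k):
    """Most central object first, then farthest-first (deterministic)."""
    n = len(dist)
    medoids = [min(range(n), key=lambda i: sum(dist[i]))]
    while len(medoids) < k:
        medoids.append(max(range(n),
                           key=lambda i: min(dist[i][m] for m in medoids)))
    return medoids

def assign(dist, medoids):
    """Label each object with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(dist, k, max_iter=100):
    """Alternate assignment and medoid update until the medoids are stable."""
    medoids = init_medoids(dist, k)
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if members:
                # New medoid: member with the smallest total distance to its cluster.
                updated.append(min(members,
                                   key=lambda m: sum(dist[m][j] for j in members)))
            else:
                updated.append(medoids[c])  # keep the medoid of an empty cluster
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(dist, medoids)

def silhouette(dist, labels):
    """Mean silhouette width; singleton clusters contribute 0."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for i, c in enumerate(labels):
        own = clusters[c]
        if len(own) == 1:
            continue
        a = sum(dist[i][j] for j in own) / (len(own) - 1)
        b = min(sum(dist[i][j] for j in other) / len(other)
                for key, other in clusters.items() if key != c)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(labels)

# Toy records (hypothetical): fit, collar, cuff, hem.
records = [
    ("slim", "classic", "round", "curved"),
    ("slim", "classic", "round", "curved"),
    ("slim", "classic", "round", "straight"),
    ("regular", "italian", "square", "straight"),
    ("regular", "italian", "square", "straight"),
    ("regular", "italian", "square", "curved"),
    ("comfort", "cutaway", "french", "straight"),
    ("comfort", "cutaway", "french", "straight"),
    ("comfort", "cutaway", "french", "curved"),
]

dist = distance_matrix(records)
scores = {k: silhouette(dist, k_medoids(dist, k)[1]) for k in range(2, 5)}
best_k = max(scores, key=scores.get)
medoids, labels = k_medoids(dist, best_k)
print(best_k, labels)
```

On these nine toy records the silhouette is highest for k = 3, which recovers the three groups of shirts. A business-specific distance, as suggested in the second item, would only require replacing `matching_distance`; the quadratic cost of the pairwise matrix is the scalability concern raised above.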

Acknowledgments

The research leading to these results has received funding from the European Union's Seventh Framework Programme (FP7/2007) Agreement no. [260169] (Project CoReNet), from the ERDF – European Regional Development Fund through the COMPETE Programme (operational programme for competitiveness), from QREN (2012/24708-NORTE-07-0124-FEDER-000057) (National Strategic Reference Framework) as part of projects PTXXI and CreativeRetail (Novas tecnologias e paradigmas da computação em ambientes inteligentes na criação de um produto inovador para o retalho – 2012/24708) supported by Fundo Europeu de Desenvolvimento Regional (FEDER), and also Projects “NORTE-07-0124-FEDER-000059” and “NORTE-07-0124-FEDER-000057”, financed by the North Portugal Regional Operational Programme (ON.2 – O Novo Norte), under the National Strategic Reference Framework (NSRF), through the European Regional Development




Fund (ERDF), and by national funds, through the Portuguese Funding Agency, Fundação para a Ciência e a Tecnologia (FCT).

References

[1] W.R. Smith, Product differentiation and market segmentation as alternative marketing strategies, J. Mark. 21 (1) (1956) 3–8.
[2] H.J. Claycamp, W.F. Massy, A theory of market segmentation, J. Mark. Res. 5 (4) (1968) 388–394.
[3] K.S. Moorthy, Market segmentation, self-selection, and product line design, Mark. Sci. 3 (4) (1984) 288–307.
[4] M. McDonald, I. Dunbar, Market Segmentation: How to Do It, How to Profit From It, Goodfellow Publishers Ltd., 1995.
[5] J.B.E.M. Steenkamp, F.T. Hofstede, International market segmentation: issues and perspectives, Intern. J. Res. Mark. 19 (2002) 185–213.
[6] D. Yankelovich, D. Meer, Rediscovering market segmentation, Harv. Bus. Rev. 84 (2) (2006) 122–131.
[7] W.A. Kamakura, G.J. Russell, A probabilistic choice model for market segmentation and elasticity structure, J. Mark. Res. 26 (1989) 379–390.
[8] L.G. Debo, L.B. Toktay, L.N.V. Wassenhove, Market segmentation and product technology selection for remanufacturable products, Manag. Sci. 51 (8) (2005) 1193–1205.
[9] R.J. Kuo, L.M. Ho, C.M. Hu, Integration of self-organizing feature map and K-means algorithm for market segmentation, Comput. Oper. Res. 29 (2002) 1475–1493.
[10] W. Desarbo, V. Ramaswamy, S.H. Cohen, Market segmentation with choice-based conjoint analysis, Mark. Lett. 6 (2) (1995) 137–147.
[11] C.Y. Tsai, C.C. Chiu, A purchase-based market segmentation methodology, Expert Syst. Appl. 27 (2004) 265–276.
[12] A. Vellido, P.J.G. Lisboa, K. Meehan, Segmentation of the on-line shopping market using neural networks, Expert Syst. Appl. 17 (1999) 303–314.
[13] A. Bhatnagar, S. Ghose, Segmenting consumers based on the benefits and risks of Internet shopping, J. Bus. Res. 57 (2004) 1352–1360.
[14] S.-h Liao, Y.-j Chen, Yi-t Lin, Mining customer knowledge to implement online shopping and home delivery for hypermarkets, Expert Syst. Appl. 38 (2011) 3982–3991.
[15] S. Thomassey, Sales forecasts in clothing industry: the key success factor of the supply chain management, Int. J. Prod. Econ. 128 (2010) 470–483.
[16] R. Delmater, M. Hancock, Data Mining Explained: A Manager's Guide to Customer-Centric Business Intelligence, Digital Press, 2001.
[17] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From data mining to knowledge discovery in databases, AI Magazine, American Association for Artificial Intelligence (Fall 1996) 37–54.
[18] P. Clark, T. Niblett, The CN2 induction algorithm, Mach. Learn. 3 (4) (1989) 261–283.
[19] N. Lavrac, B. Cestnik, D. Gamberger, P. Flach, Decision support through subgroup discovery: three case studies and the lessons learned, Mach. Learn. 17 (3) (1996) 37–51.
[20] A.K. Jain, Data clustering: 50 years beyond K-Means, Pattern Recognit. Lett. 31 (8) (2009) 651–666.
[21] T. Velmurugan, T. Santhanam, Computational complexity between K-Means and K-Medoids clustering algorithms for normal and uniform distributions of data points, J. Comput. Sci. 6 (3) (2010) 363–368.
[22] W. Kloesgen, Explora: a multipattern and multistrategy discovery assistant, in: Advances in Knowledge Discovery and Data Mining, American Association for Artificial Intelligence, Palo Alto, California, USA, 1996, pp. 249–271.
[23] S. Wrobel, An algorithm for multi-relational discovery of subgroups, in: Proceedings of the 1st European Symposium on Principles of Data Mining and Knowledge Discovery, vol. 1263, 1997, pp. 78–87.
[24] S. Wrobel, Relational Data Mining, Springer, Berlin, 2001.
[25] N. Lavrac, Subgroup discovery techniques and applications, in: Proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining, vol. 3518, 2005, pp. 2–14.
[26] W. Kloesgen, Applications and research problems of subgroup mining, in: Proceedings of the 11th International Symposium on Foundations of Intelligent Systems, 1999, pp. 1–15.
[27] P. Flach, D. Gamberger, Subgroup evaluation and decision support for a direct mailing marketing problem, in: Proceedings of the 12th European Conference on Machine Learning and 5th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2001, pp. 45–56.
[28] B. Kavsek, N. Lavrac, Analysis of example weighting in subgroup discovery by comparison of three algorithms on a real-life data set, in: Proceedings of the 15th European Conference on Machine Learning and 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, 2004, pp. 64–76.
[29] B. Kavsek, N. Lavrac, Using subgroup discovery to analyze the UK traffic data, Metodoloski Zv. 1 (1) (2004) 249–264.
[30] F. Zelezny, N. Lavrac, S. Dzeroski, Constraint-based relational subgroup discovery, in: Proceedings of the 2nd Workshop on Multi-relational Data Mining, 2003, pp. 135–150.
[31] V. Barrera, B. López, J. Meléndez, J. Sánchez, Voltage sag source location from extracted rules using subgroup discovery, Front. Artif. Intell. Appl. 184 (2008) 225–235.
[32] P. Clark, R. Boswell, Rule induction with CN2: some recent improvements, in: Proceedings of the Fifth European Working Session on Learning, Springer, 1991, pp. 151–163.
[33] D. Gamberger, N. Lavrac, Expert-guided subgroup discovery: methodology and application, J. Artif. Intell. Res. 17 (2002) 501–527.
[34] N. Lavrac, B. Kavsek, P. Flach, L. Todorovski, Subgroup discovery with CN2-SD, J. Mach. Learn. Res. 5 (2004) 153–188.
[35] N. Lavrac, P. Flach, B. Kavsek, L. Todorovski, Adapting classification rule induction to subgroup discovery, in: Proceedings of the Second IEEE International Conference on Data Mining, vol. 3518, 2002, pp. 266–273.
[36] A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review, ACM Comput. Surv. 31 (3) (1999) 264–323.
[37] W. Plinke, M. Kleinaltenkamp, Marketing advantages via new manufacturing technologies, Robot. Comput.-Integr. Manuf. 7 (1–2) (1990) 127–131.
[38] S.M. Mousavi, R. Tavakkoli-Moghaddam, B. Vahdani, H. Hashemi, M.J. Sanjari, A new support vector model-based imperialist competitive algorithm for time estimation in new product development projects, Robot. Comput.-Integr. Manuf. 29 (2013) 157–168.
[39] D. Gamberger, N. Lavrac, Generating actionable knowledge by expert-guided subgroup discovery, in: Proceedings of the 6th European Conference on Principles and Practice of Knowledge Discovery in Databases, vol. 2431, 2002, pp. 163–174.
[40] F. Herrera, C.J. Carmona, P. González, M.J. del Jesus, An overview on subgroup discovery: foundations and applications, Knowl. Inf. Syst. 29 (3) (2011) 495–525.
[41] M. Sundström, J. Balkow, J. Florhed, M. Tjernström, P. Wadenfors, Impulsive buying behaviour: the role of feelings when shopping for online fashion, in: 17th The European Association for Education and Research in Commercial Distribution, 2013; retrieved from 〈http://bada.hb.se/handle/2320/13004〉.
[42] T. Guo, Z. Yan, K. Aberer, An adaptive approach for online segmentation of multi-dimensional mobile data, in: Proceedings of the 11th International ACM Workshop on Data Engineering for Wireless and Mobile Access (MobiDE), 2012, pp. 7–14.
[43] R.S. Wu, P.H. Chou, Customer segmentation of multiple category data in e-commerce using a soft-clustering approach, Electron. Commer. Res. Appl. 10 (3) (2011) 331–341.
[44] F.M. Hsu, L.P. Lu, C.M. Lin, Segmenting customers by transaction data with concept hierarchy, Expert Syst. Appl. 39 (6) (2012) 6221–6228.
[45] X. Deng, An enhanced artificial bee colony approach for customer segmentation in mobile e-commerce environment, Int. J. Adv. Comput. Technol. 5 (1) (2013) 139–148.
[46] H.D. Park, C.H. Jun, A simple and fast algorithm for K-medoids clustering, Expert Syst. Appl. 36 (2) (2009) 3336–3341.
[47] J.R. Cano, F. Herrera, M. Lozano, S. García, Making CN2-SD subgroup discovery algorithm scalable to large size data sets using instance selection, Expert Syst. Appl. 35 (4) (2008) 1949–1965.
[48] F. Provost, T. Fawcett, Data science and its relationship to big data and data-driven decision making, Big Data 1 (1) (2013) 51–59.

