Int. J. Data Analysis Techniques and Strategies, Vol. 2, No. 1, 2010
Mining important association rules based on the RFMD technique

Yoones Asgharzadeh Sekhavat*
Industrial Engineering Department, Iran University of Science and Technology, Narmak, Tehran, Iran
Fax: +98–21–22039071
E-mail: [email protected]
*Corresponding author

Mohammad Fathian and Mohammad Reza Gholamian
Industrial Engineering Department, Iran University of Science and Technology, Narmak, Tehran, Iran
E-mail: [email protected]
E-mail: [email protected]

Somayeh Alizadeh
Industrial Engineering Department, Khaje Nasir Toosi University, Vanak, Tehran, Iran
E-mail: [email protected]

Abstract: The method of association rule mining has been used by marketers for many years to extract marketing rules from purchase transactions. Marketers and managers employ these rules in order to predict customer needs for future sales. Extracting effective rules is one of the major problems of marketers. Effective rules can help them to make better marketing decisions. On the other hand, the Recency, Frequency, Monetary value and Duration (RFMD) method is one of the popular methods used in market segmentation to identify profitable groups of customers. In this paper, a novel method is proposed that takes advantage of the RFMD method to extract effective association rules from profitable segments of purchase transactions. In other words, in the first step, raw data are classified based on the RFMD technique; and in the second step, effective association rules are extracted from sections with high RFMD values. The proposed method employs a new Maximum Frequent Itemset Extractor (MFIE) algorithm that outperforms the classic algorithm (Apriori) in extracting frequent itemsets from a large number of transactions. In addition, unlike most of the previous centralised methods, the proposed method is designed for extracting association rules from distributed databases.
Copyright © 2010 Inderscience Enterprises Ltd.
Y.A. Sekhavat, M. Fathian, M.R. Gholamian and S. Alizadeh

Keywords: association rules; recency, frequency, monetary value and duration; RFMD; maximum frequent itemset; data analysis; data mining; marketing.

Reference to this paper should be made as follows: Sekhavat, Y.A., Fathian, M., Gholamian, M.R. and Alizadeh, S. (2010) ‘Mining important association rules based on the RFMD technique’, Int. J. Data Analysis Techniques and Strategies, Vol. 2, No. 1, pp.1–21.

Biographical notes: Yoones Asgharzadeh Sekhavat received his BS degree in Computer Engineering from Amirkabir University of Technology, Tehran, Iran in 2006 and his MS degree in Information Technology from Iran University of Science and Technology, Tehran, Iran in 2009. His research interests are security, data privacy and data mining. He has published several papers in international journals and conferences.

Dr. Mohammad Fathian is an Associate Professor in the Department of Industrial Engineering of Iran University of Science and Technology, Tehran, Iran and received his MS and PhD degrees in Industrial Engineering from the same university. Dr. Fathian is working in the areas of information technology, e-commerce and knowledge management. He has more than 20 research papers and five books in the areas of industrial engineering and information technology.

Dr. Mohammad Reza Gholamian received his PhD degree in Industrial Engineering from Amirkabir University of Technology, Tehran, Iran in 2005. He is currently an Assistant Professor in the Department of Industrial Engineering at the Iran University of Science and Technology at Tehran, Iran. His research interests are intelligent systems and multicriteria decision-making. He has published 4 books, 22 papers for international conferences and 7 papers in international journals.

Dr. Somayeh Alizadeh received her BS degree in Computer Engineering from Sharif University of Technology, Tehran, Iran in 1997, and her MS and PhD degrees in Industrial Engineering from the Iran University of Science and Technology, Tehran, Iran in 2002 and 2008, respectively.
1 Introduction
A useful data mining technique is the association rule mining technique, which has been employed widely in marketing. When association rules are extracted from purchase transactions, each rule demonstrates association among purchased items, wherein, when some items are purchased in one transaction, the others are purchased too. Market basket analysis is one of the typical uses of association rules in marketing. Customer habits are extracted by finding associations among purchased items. The discovery of such associations not only can help in developing better marketing strategies, but can also be employed in catalogue design, cross-marketing and customer-behaviour analysis (Han and Kamber, 2006). Classic association rule mining techniques consider all of the purchase transactions in extracting association rules. However, this approach has two main drawbacks: First, extracted rules may have the same value and decision makers cannot decide which ones are more important. Second, considering all transactions in the rule mining process is very time consuming and sometimes impractical. These two potential problems have
motivated us to present a new approach for the rule mining process. In order to overcome these problems, we employ the Recency, Frequency, Monetary value and Duration (RFMD) technique to indicate profitable customers. The success of the RFMD technique among marketing professionals has promoted further investigations on the relevance of the four dimensions (recency, frequency, monetary value and duration) in forecasting repeat purchases. The first contribution of this paper is in employing the RFMD technique before the association rule mining process. In other words, in the first step, transactions are classified based on RFMD parameters; and in the second step, only transactions with a high level of RFMD values are entered into the rule mining process. This technique has two important advantages: First, extracted rules have different values and, as a result, they have different levels of importance for decision makers. Second, only effective rules are extracted and less important rules are ignored. As a second contribution, we propose a new Maximum Frequent Itemset Extractor (MFIE) algorithm which is an improved version of Apriori, a basic classic algorithm for extracting maximum frequent itemsets (Han and Kamber, 2006). Extracting maximum frequent itemsets and generating association rules from maximum frequent itemsets are two important steps in generating association rules, and MFIE is employed in the first step of this process. Experiments show that MFIE outperforms Apriori in extracting frequent itemsets from a large number of transactions. Generally, classic association rule mining techniques are designed for a single database. Traditionally, each branch is locally responsible for managing its customers and transactions, and because of the existence of multiple databases, marketers have problems extracting overall association rules from distributed databases over different branches (Adhikari and Rao, 2008). 
The distribution of databases among different branches makes it necessary to study rule mining in distributed databases. Extracting association rules from distributed databases is important for two reasons:
1 with the growth of data, databases are distributed and each branch has its own database
2 most of the popular data mining tools are not able to process large amounts of data.
As a third contribution, the proposed algorithm supports the extraction of association rules from distributed databases. This paper continues as follows: In Sections 2 and 3, background theory and related works are described. Section 4 describes the proposed architecture and presents the new algorithm for extracting maximum frequent itemsets. Finally, in Section 5, we analyse the efficiency of the proposed method.
2 Background theory
2.1 Association rule mining

An association rule has the form X ⇒ Y, where X and Y are sets of items and X ∩ Y = ∅. Each transaction T is a set of items that may contain X, Y or both of them, and association rules are extracted from transactions. X is known as the antecedent and Y as the consequent (Yen and Lee, 2006). When an itemset i exists in a transaction, the transaction is said to support i. The support of i is the ratio of the number of transactions that support i to the total number of transactions. Itemset i is called a frequent itemset when the support of i is greater than or equal to the minimum support (a decimal number between 0 and 1) (Han and Kamber, 2006). The minimum support is a threshold for deciding whether an itemset is frequent or not. The ratio of the support of the itemset X ∪ Y to the support of the itemset X is defined as the confidence of the rule X ⇒ Y. The rule X ⇒ Y is known as an association rule when it satisfies the specified minimum support and minimum confidence (Yen and Lee, 2006).
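The definitions of support and confidence can be illustrated with a minimal Python sketch. The function names `support` and `confidence` and the four sample transactions are assumptions made for illustration only; they are not part of the proposed method.

```python
from typing import Iterable, List, Set


def support(itemset: Iterable[str], transactions: List[Set[str]]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    wanted = frozenset(itemset)
    hits = sum(1 for t in transactions if wanted <= t)
    return hits / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str],
               transactions: List[Set[str]]) -> float:
    """Confidence of X => Y, i.e. support(X union Y) / support(X)."""
    x, y = frozenset(x), frozenset(y)
    return support(x | y, transactions) / support(x, transactions)


# Hypothetical transactions, not taken from the paper.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

sup_xy = support({"bread", "milk"}, transactions)
conf_xy = confidence({"bread"}, {"milk"}, transactions)
print(f"support = {sup_xy:.2f}, confidence = {conf_xy:.2f}")
```

With a minimum support of 0.5 and a minimum confidence of 0.6, the rule {bread} ⇒ {milk} would qualify as an association rule in this toy example.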
2.2 RFMD technique

The RFMD technique is one of the popular methods in market segmentation. Customers who have purchased recently (recency), customers who purchase many times (frequency), customers who spend more money (monetary value) and customers who spend more time on a seller’s website (duration) are usually the best prospects for more advertisements (McCarty and Hastak, 2007). On the other hand, those who have not purchased recently, purchase only occasionally and do not spend large amounts of money and time with a company are less likely to purchase in the future. Based on this technique, managers and marketers can identify profitable customers and focus their activities on these customers. RFMD parameters are behavioural. Generally, behavioural patterns provide more knowledge of each customer’s actual spending preferences than other segmentation variables. Some marketers believe that behavioural measures provide information on how customers think and shop (Hughes, 2000; Morwitz and Schmittlein, 1998).

The RFMD technique is utilised in several ways. In the scoring approach to RFMD, weights are assigned to the RFMD parameters and the weighted score for each customer is calculated (Drozdenko and Drake, 2002). Higher scores for a segment represent a higher potential of its customers to purchase. The weights can be derived from the experience of marketers with the RFMD parameters. The second common method is called independent RFMD. In this method, customers are sorted separately based on each RFMD value and each sorted list is evaluated independently. Because of the relationships between RFMD parameters, the overall analysis of results in this technique is very difficult for marketers (Ha, 2007). Finally, in the cellular RFMD technique, customers are first clustered into same-size sections based on recency. In the next step, customers in each section are divided into new same-size sections based on frequency of purchase. Steps 3 and 4 repeat this division based on monetary value and duration. As a result, each cell is defined by different values of recency, frequency, monetary value and duration. According to the importance of each parameter to the company, these values can be employed by decision makers in formulating appropriate decisions for each segment of customers.
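The cellular RFMD technique described above can be sketched as a nested, equal-size split. The following Python sketch is only an illustration under stated assumptions: the field names, the helper `split_equal`, the choice of two sections per parameter and the 16 synthetic customers are all hypothetical, and recency is assumed to be a value where smaller means more recent.

```python
from typing import Dict, List, Tuple

Customer = Dict[str, float]


def split_equal(ranked: List[Customer], parts: int) -> List[List[Customer]]:
    """Split a ranked list into `parts` contiguous, near-equal sections."""
    size, extra = divmod(len(ranked), parts)
    chunks, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        chunks.append(ranked[start:end])
        start = end
    return chunks


def cellular_rfmd(customers: List[Customer],
                  parts: int = 2) -> Dict[Tuple[int, ...], List[Customer]]:
    """Divide customers by R, then F, then M, then D into nested cells."""
    # Assumption: low recency value = recent purchase, so recency is ascending.
    order = [("recency", False), ("frequency", True),
             ("monetary", True), ("duration", True)]
    cells: Dict[Tuple[int, ...], List[Customer]] = {(): list(customers)}
    for field, descending in order:
        refined: Dict[Tuple[int, ...], List[Customer]] = {}
        for code, members in cells.items():
            ranked = sorted(members, key=lambda c: c[field], reverse=descending)
            for rank, chunk in enumerate(split_equal(ranked, parts), start=1):
                refined[code + (rank,)] = chunk
        cells = refined
    return cells


# Synthetic customers, not taken from the paper.
customers = [
    {"id": i, "recency": i, "frequency": (7 * i) % 16,
     "monetary": 100 * ((5 * i) % 16), "duration": (3 * i) % 16}
    for i in range(16)
]
cells = cellular_rfmd(customers, parts=2)
print(len(cells), "cells; best cell:", cells[(1, 1, 1, 1)])
```

Each cell is keyed by a tuple such as (1, 1, 1, 1), where 1 denotes the best section for the corresponding parameter.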
3 Related works
Recent research on data mining and marketing methods has demonstrated that combining data mining methods and market segmentation models can achieve better results. In 2000, Kohonen used Self-Organising Maps (SOMs) in market segmentation and reported an improvement in the turnover of a duty free shop, and Quinlan used C4.5 classification trees (Alencar et al., 2006). Some papers have used the three parameters Recency, Frequency and Monetary value (RFM) and ignored the parameter of
Duration (D). For example, Ansari used an extended version of RFM in evaluating customer churn rates and profitability. Suh et al. (1999) combined RFM analysis with neural networks and logistic regression to evaluate the profitability of direct marketing. In the same way, Baesens et al. (2002) combined RFM analysis with Bayesian neural networks, and introduced a method to forecast purchases in European companies. Ghazanfari et al. (2008) proposed a clustering technique for country segmentation based on RFM variables. Wang et al. (2005) presented a method based on association rules and market segmentation to predict the value of future customers. They employed association rule mining to extract marketing rules in order to build a prediction model for customer value.

Association rule mining from multiple databases has recently been recognised as an important research topic in the Knowledge Discovery and Data Mining (KDD) community (Adhikari and Rao, 2008). Liu et al. (2001) proposed a method that searches only the relevant databases in mining association rules from distributed databases. In this study, the measure of relevance is extracting regularities for specific attributes. Yin and Han (2005) presented a novel technique for extracting association rules from relational heterogeneous databases. Their proposed method required two database scans to extract frequent itemsets. In the first scan, the transactions D are divided into N nonoverlapping partitions. Han proved that any itemset i that is potentially frequent in D must be a frequent itemset in at least one of the partitions. In the second scan, the final support of each candidate is determined in order to find global frequent itemsets (Han and Kamber, 2006). Wu et al. (2005) presented a weighting method to synthesise high-frequency association rules from distributed databases. Aronis et al. (1997) proposed a model that uses spreading activation to enable inductive learning from multiple tables in multiple databases spread around the network. Zhang et al. (2003) proposed a local pattern analysis to synthesise global patterns in multiple databases. Five years later, Adhikari and Rao (2008) extended this model to synthesise global patterns from local patterns in distributed databases.

Different ideas have been presented to increase the efficiency of association rule mining algorithms. Bagui et al. (2009) presented a dependency-based association rule mining method that reduces the execution time of the rule mining process by decreasing the number of full database scans. Ashrafi et al. (2007) decreased the execution time of association rule mining algorithms by removing redundant rules during the rule mining process. However, since most current methods generate repetitive frequent itemsets, they have problems in mining association rules from a large number of transactions.
4 Proposed method
This paper presents a new association rule mining method that employs the RFMD technique to extract important and effective rules from distributed databases. The RFMD technique can indicate profitable and important sections of purchase transactions. Association rules that are extracted from these important segments of customers are significantly more valuable than those that are extracted without considering the segmentation parameters recency, frequency, monetary value and duration.
Figure 1 Architecture of the proposed method (see online version for colours)

The overall architecture of the proposed method is shown in Figure 1. There are five important modules in this architecture:
1 Database Preparer (DP)
2 Sorter and Selector (SS)
3 Frequent Itemset Extractor (FIE)
4 Rule Generator (RG)
5 Rule Aggregator and Pruner (RAP).
As shown in this figure, in the first step, DP prepares data for association rule mining and creates an appropriate data structure for the data mining process. In the next step, SS creates four sorted collections based on RFMD parameters and selects only the first quarter of each sorted collection. FIE receives sorted collections and creates maximum frequent itemsets for each sorted collection. This module employs the MFIE algorithm (which is described later in Section 4.3) in order to extract frequent itemsets. Then, RG generates association rules from frequent itemsets and calculates the significance of each rule. Finally, RAP aggregates generated rules and prunes repetitive rules.
4.1 Database Preparation (DP) module

The first module used in this architecture is DP. Data preprocessing and preparing data for data mining is the first step of all data mining processes. Data preparation methods like data cleaning, data transformation, data integration and data reduction can be used to produce such collections. Based on the first level of normalisation in database design, and in order to prevent repeated values, usually customer information, order information, and product information (items) are stored separately in different tables. Recency and monetary values are usually available in the orders table. Frequency can be obtained from the orders table by extracting the number of orders for a specific customer in a specified period. Finally, duration is available in the log files of the web server (start time and end time of each session is stored by the web server in log files). The database preparation module combines this information and generates the data structure shown in Figure 2.

Figure 2 Structure of data after data preparation
The output of this module is the Prepared Data Collection (PDC). In order to clarify the proposed method, sample data are used to describe the details of each module. The sample data, which are shown in Table 1, are employed in Sections 4.2 to 4.6.

Table 1 Sample data to track the proposed method

CustomerId  TransactionId  Items      Recency  Frequency  Monetary value  Duration
1           1              {A,B}      11       1          3000            3
1           2              {A,B,C,D}  8        2          6000            6
1           3              {C,D}      3        9          2000            2
2           4              {B,C}      2        4          1000            8
2           5              {B,C,D}    6        7          4000            4
3           6              {A,B,C,D}  12       8          3000            6
3           7              {C,D,E}    9        1          5000            1
3           8              {A,B,D}    4        4          8000            9
4           9              {B,C,D,E}  5        6          7000            5
5           10             {A,B,C,D}  1        3          4000            2
5           11             {A,B}      7        6          2000            4
5           12             {B,C}      10       3          5000            2
4.2 Sorter and Selector (SS) module

This module is responsible for creating four sorted collections based on recency, frequency, monetary value and duration. Only the top quarter of each sorted collection is selected. In the sorted collection based on recency, the PDC is sorted in ascending order of the recency value, so that in the Recency Data Collection (RDC) the most recent purchases are listed at the top and the oldest are listed at the bottom. The second sorted collection is the Frequency Data Collection (FDC), where transactions are sorted in descending order based on the number of interactions in a specified period. The Monetary value Data Collection (MDC) is the third sorted collection, in descending order based on the monetary value. Finally, the Duration Data Collection (DDC) is a sorted collection in descending order based on the time that customers spend browsing the seller’s website. Based on the sample data shown in Table 1, four sorted data collections are created, which are shown in Tables 2, 3, 4 and 5 respectively. Since there are 12 records in the sample data in Table 1, only the top quarter of these records (three records) enter each collection.

Table 2 Recency Data Collection (RDC)

CustomerId  TransactionId  Items      Recency  Frequency  Monetary value  Duration
5           10             {A,B,C,D}  1        3          4000            2
2           4              {B,C}      2        4          1000            8
1           3              {C,D}      3        9          2000            2

Table 3 Frequency Data Collection (FDC)

CustomerId  TransactionId  Items      Recency  Frequency  Monetary value  Duration
1           3              {C,D}      3        9          2000            2
3           6              {A,B,C,D}  12       8          3000            6
2           5              {B,C,D}    6        7          4000            4

Table 4 Monetary value Data Collection (MDC)

CustomerId  TransactionId  Items      Recency  Frequency  Monetary value  Duration
3           8              {A,B,D}    4        4          8000            9
4           9              {B,C,D,E}  5        6          7000            5
1           2              {A,B,C,D}  8        2          6000            6

Table 5 Duration Data Collection (DDC)

CustomerId  TransactionId  Items      Recency  Frequency  Monetary value  Duration
3           8              {A,B,D}    4        4          8000            9
2           4              {B,C}      2        4          1000            8
1           2              {A,B,C,D}  8        2          6000            6
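
The sorting and selection performed by the SS module can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Record` type, the field names and the `top_quarter` helper are assumptions, and ties (such as the two records with duration 6) are assumed to be broken by the original order of Table 1.

```python
from typing import FrozenSet, List, NamedTuple


class Record(NamedTuple):
    customer_id: int
    transaction_id: int
    items: FrozenSet[str]
    recency: int
    frequency: int
    monetary: int
    duration: int


# Prepared Data Collection (PDC), copied from Table 1.
PDC = [
    Record(1, 1, frozenset("AB"), 11, 1, 3000, 3),
    Record(1, 2, frozenset("ABCD"), 8, 2, 6000, 6),
    Record(1, 3, frozenset("CD"), 3, 9, 2000, 2),
    Record(2, 4, frozenset("BC"), 2, 4, 1000, 8),
    Record(2, 5, frozenset("BCD"), 6, 7, 4000, 4),
    Record(3, 6, frozenset("ABCD"), 12, 8, 3000, 6),
    Record(3, 7, frozenset("CDE"), 9, 1, 5000, 1),
    Record(3, 8, frozenset("ABD"), 4, 4, 8000, 9),
    Record(4, 9, frozenset("BCDE"), 5, 6, 7000, 5),
    Record(5, 10, frozenset("ABCD"), 1, 3, 4000, 2),
    Record(5, 11, frozenset("AB"), 7, 6, 2000, 4),
    Record(5, 12, frozenset("BC"), 10, 3, 5000, 2),
]


def top_quarter(records: List[Record], field: str, descending: bool) -> List[Record]:
    """Sort by one RFMD field and keep the first quarter of the records."""
    # sorted() is stable, so tied records keep their original relative order.
    ranked = sorted(records, key=lambda r: getattr(r, field), reverse=descending)
    return ranked[: len(records) // 4]


RDC = top_quarter(PDC, "recency", descending=False)   # most recent first
FDC = top_quarter(PDC, "frequency", descending=True)
MDC = top_quarter(PDC, "monetary", descending=True)
DDC = top_quarter(PDC, "duration", descending=True)

for name, coll in [("RDC", RDC), ("FDC", FDC), ("MDC", MDC), ("DDC", DDC)]:
    print(name, [r.transaction_id for r in coll])
```

Under these assumptions the selected transaction identifiers agree with Tables 2 to 5.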
4.3 Frequent Itemset Extractor (FIE)

In this stage, the four sorted collections (RDC, FDC, MDC and DDC) enter the FIE module. As described earlier, extracting the maximum frequent itemsets is a preliminary step in generating association rules. The extracted itemsets are stored separately in the Recency Itemset (RIS), Frequency Itemset (FIS), Monetary value Itemset (MIS) and Duration Itemset (DIS). This module is responsible for extracting frequent itemsets based on a specified minimum support (min_sup). FIE uses the MFIE algorithm to extract maximum frequent itemsets from transactions. MFIE is built on the Apriori property and is an improved version of the Apriori algorithm (a classic, basic algorithm used to extract maximum frequent itemsets). According to the Apriori property, “all nonempty subsets of a frequent itemset must also be frequent” (Han and Kamber, 2006). The outputs of both Apriori and MFIE are frequent itemsets. MFIE extracts frequent itemsets spirally and, in each run, frequent subitemsets are created. Calculating the support of itemsets and joining subitemsets are the two main functions in MFIE. MFIE implements both functions with binary operations, which decreases its execution time. The overall steps of MFIE are shown in Figure 3.

Figure 3 Notations and steps of the MFIE algorithm

MFIE is developed with three main procedures that are shown in Figures 4, 5 and 6.

Figure 4 Main procedure of MFIE that extracts maximum frequent itemsets

Figure 5 Procedure for indicating whether the given item has minimum support or not

Figure 6 Procedure for calculating the joining of given itemsets
In this section, the execution of MFIE on FDC (the sorted transaction collection based on frequency, which is shown in Table 3) is detailed. After executing Steps 1 and 2 of the algorithm, we have the collection in Table 6.

Table 6 Results after execution of Steps 1 and 2

arr_Transactions_Collection: {{00110},{11110},{01110}}

arr_Transaction  Items      TransactionId
00110            {C,D}      3
11110            {A,B,C,D}  6
01110            {B,C,D}    5
In Step 3, the support of each item i in arr_Transactions_Collection is calculated. Usually, business experts inside an organisation are responsible for specifying the minimum support. The minimum support used in our sample is 0.65. The supports of items B (01000), C (00100) and D (00010) are greater than this minimum support and these items are added to arr_Previous_Itemsets.

In Step 4, the matching itemset for each itemset in arr_Previous_Itemsets is identified. For this purpose, items are scanned in a loop. If there are N items in arr_Previous_Itemsets, for each item with index i, the algorithm scans the indices from (i + 1) to (i + N – 1), taken modulo N, to find the matching itemset. For example, if there are 5 items (i = 1 to 5), for the first index (i = 1), itemsets with the indices 2, 3, 4, 5 are checked. For the second item (i = 2), items 3, 4, 5, 1 are checked and so on. When the support of a new itemset (which is created by joining the selected itemset and the matching itemset) is greater than or equal to the minimum support, it is added to arr_New_Itemsets and the search stops for that itemset.

In the first run of Step 4, three itemsets are available in arr_Previous_Itemsets: {(01000), (00100) and (00010)}. For the itemset with index 1 ({01000}), the algorithm starts from index 2 to find a matching itemset such that the support of their joint set is greater than or equal to the minimum support. The itemset with index 2 ({00100}) is selected and the joint set (newItemset = {01000} OR {00100} = {01100}) is added to arr_New_Itemsets. For the itemset with index 2 ({00100}), the algorithm starts from index 3 to find the matching itemset. The itemset with index 3 is selected as the matching itemset for the itemset with index 2 and their joint set (newItemset = {00100} OR {00010} = {00110}) is added to arr_New_Itemsets. For the itemset with index 3, the algorithm starts from index 1 to find the matching itemset. For this itemset, the itemset with index 1 is selected as the matching itemset and their joint set (newItemset = {00010} OR {01000} = {01010}) is added to arr_New_Itemsets.
At the end of the first run of Step 4, arr_New_Itemsets has three itemsets, so Step 4 is run again with these as the new value of arr_Previous_Itemsets ({{01100}, {00110}, {01010}}). The selection process is shown in Figure 7 for three runs of Step 4.

Figure 7 Three runs of Step 4 (see online version for colours)
In the second run of Step 4, the new itemset ({01110} = joint ({01100}, {00110})) is added to arr_New_Itemsets. For the other itemsets, nothing is added to arr_New_Itemsets (because adding repetitive itemsets is prohibited). At the end of this run, arr_New_Itemsets has only one itemset {{01110}} and, as a result, it is returned as a maximum frequent itemset ({{01110}} represents {B,C,D}). By executing MFIE on RDC, FDC, MDC and DDC, the frequent itemsets (RIS, FIS, MIS and DIS) shown in Table 7 are produced. The efficiency analysis of MFIE is presented in Section 5.

Table 7 Frequent itemsets generated by FIE (see online version for colours)

Set name               RIS              FIS      MIS                  DIS
Binary representation  {01100},{00110}  {01110}  {11010},{01110}      {11010}
Items                  {B,C},{C,D}      {B,C,D}  {A,B,D},{B,C,D}      {A,B,D}
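
The binary operations that MFIE relies on (a bitwise AND to test whether a transaction supports an itemset and a bitwise OR to join two itemsets) can be sketched as follows. This sketch is not the MFIE procedure of Figures 4 to 6, whose listings are not reproduced here: the search loop is a plain level-wise enumeration substituted for illustration, and all function names are assumptions. Because it enumerates exhaustively, it may return itemsets that the spiral matching strategy of MFIE does not reach on other collections.

```python
from typing import List, Set

ITEMS = "ABCDE"  # item order used for the 5-bit encoding, A = 10000


def encode(items: str) -> int:
    """Encode an itemset such as 'CD' as a bitmask (here 0b00110)."""
    mask = 0
    for item in items:
        mask |= 1 << (len(ITEMS) - 1 - ITEMS.index(item))
    return mask


def decode(mask: int) -> Set[str]:
    """Return the set of item names whose bit is set in `mask`."""
    return {item for pos, item in enumerate(ITEMS)
            if mask & (1 << (len(ITEMS) - 1 - pos))}


def has_minimum_support(itemset: int, transactions: List[int], min_sup: float) -> bool:
    """A transaction supports the itemset when (transaction AND itemset) == itemset."""
    hits = sum(1 for t in transactions if t & itemset == itemset)
    return hits / len(transactions) >= min_sup


def join(a: int, b: int) -> int:
    """Joining two itemsets is a bitwise OR of their masks."""
    return a | b


def maximal_frequent_itemsets(transactions: List[int], min_sup: float) -> Set[int]:
    """Level-wise enumeration (illustrative stand-in, not the MFIE search)."""
    singles = [1 << k for k in range(len(ITEMS))
               if has_minimum_support(1 << k, transactions, min_sup)]
    frequent, level = set(singles), set(singles)
    while level:
        grown = set()
        for a in level:
            for s in singles:
                cand = join(a, s)
                if cand != a and cand not in frequent and \
                        has_minimum_support(cand, transactions, min_sup):
                    grown.add(cand)
        frequent |= grown
        level = grown
    # Keep only itemsets that are not contained in a larger frequent itemset.
    return {f for f in frequent
            if not any(f != g and f & g == f for g in frequent)}


FDC = [encode("CD"), encode("ABCD"), encode("BCD")]  # Table 3 / Table 6
RDC = [encode("ABCD"), encode("BC"), encode("CD")]   # Table 2
FIS = maximal_frequent_itemsets(FDC, min_sup=0.65)
RIS = maximal_frequent_itemsets(RDC, min_sup=0.65)
print([format(m, "05b") for m in sorted(FIS)], [format(m, "05b") for m in sorted(RIS)])
```

For FDC and RDC with a minimum support of 0.65, this enumeration yields the same FIS and RIS as Table 7.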
4.4 Rule Generator (RG) module

Association rule mining is a two-step process. In the first step, frequent itemsets are extracted; and in the second step, rules are generated from frequent itemsets. In the rule generation step, all candidate rules that can be formed from each frequent itemset are checked to determine which of them have a confidence greater than or equal to the minimum confidence. The minimum confidence is a threshold specified by experts inside an organisation and it is a decimal number between 0 and 1. For example, for the frequent itemset {A, B, C}, we check the confidence of (A ⇒ B, C), (A, B ⇒ C), (A, C ⇒ B), (B ⇒ A, C), (B, C ⇒ A), (C ⇒ A, B). This module is responsible for generating rules from frequent itemsets (RIS, FIS, MIS and DIS) based on min_Confidence. The min_Confidence used in our sample is 0.7. Algorithms such as K-optimal pattern (Webb and Zhang, 2005) can be used to extract rules from frequent itemsets. This algorithm extracts rules in the form of Antecedent_Items and Consequent_Items. Antecedent_Items is the set of items in the antecedent of a rule, and Consequent_Items is the set of items in its consequent.
Also, this module calculates the significance of the rules based on the type, support and confidence of rules. The type of a rule (R, F, M or D) indicates from which collection it is extracted. For Rule_i, the significance of the rule is calculated as follows:

$$\mathrm{RuleSignificance}_i = \mathrm{RuleType}_i \times \mathrm{Support}_i \times \mathrm{Confidence}_i$$

where the RuleType value is (R = α, F = β, M = δ, D = λ). Rule type is a parameter specified by experts of an organisation. According to the importance of R, F, M and D for that business, they may assign different values. For example, when recency has more importance for that business than monetary value, they assign a higher value for R than for M. In our sample, we assume R = 7, F = 5, M = 4 and D = 2. This means that R has the maximum importance for that business and F, M and D are in the next levels. Through this parameter, the opinions of experts enter the rule significance formula. After calculating the significance of rules, they are stored in four new rules collections (RRC, FRC, MRC and DRC). In our sample, the rules created from RIS, FIS, MIS and DIS are shown in Table 8.

Table 8 Rules collections created by RG

Rule collection  RuleId  Antecedent_Items  Consequent_Items  RuleSignificance
RRC              1       {B}               {C}               4.62
RRC              2       {D}               {C}               4.62
FRC              1       {B,C}             {D}               3.3
FRC              2       {B,D}             {C}               3.3
FRC              3       {B}               {C,D}             3.3
MRC              1       {A,B}             {D}               2.64
MRC              2       {A,D}             {B}               2.64
MRC              3       {A}               {B,D}             2.64
MRC              4       {B,C}             {D}               2.64
MRC              5       {C,D}             {B}               2.64
MRC              6       {C}               {B,D}             2.64
DRC              1       {A,B}             {D}               1.32
DRC              2       {A,D}             {B}               1.32
DRC              3       {B,D}             {A}               1.32
DRC              4       {A}               {B,D}             1.32
DRC              5       {D}               {A,B}             1.32
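
Rule generation and the significance formula can be sketched for the frequency collection as follows. The sketch enumerates every non-empty proper subset of a frequent itemset as a candidate antecedent; it does not implement the K-optimal pattern algorithm, and the function names are assumptions. With unrounded support (2/3), the significance of the FRC rules evaluates to about 3.33, whereas Table 8 lists 3.3, which is consistent with a support value of 0.66.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def generate_rules(frequent_itemset: FrozenSet[str],
                   transactions: List[FrozenSet[str]],
                   rule_type_weight: float,
                   min_confidence: float) -> List[Rule]:
    """Return (antecedent, consequent, significance) for each confident rule."""
    rules: List[Rule] = []
    items = sorted(frequent_itemset)
    sup_xy = support(frozenset(items), transactions)
    for size in range(1, len(items)):
        for antecedent in combinations(items, size):
            x = frozenset(antecedent)
            y = frozenset(items) - x
            conf = sup_xy / support(x, transactions)
            if conf >= min_confidence:
                # RuleSignificance = RuleType * Support * Confidence
                rules.append((x, y, rule_type_weight * sup_xy * conf))
    return rules


FDC = [frozenset("CD"), frozenset("ABCD"), frozenset("BCD")]  # Table 3
FRC = generate_rules(frozenset("BCD"), FDC, rule_type_weight=5, min_confidence=0.7)
for x, y, sig in FRC:
    print(sorted(x), "=>", sorted(y), round(sig, 2))
```

The three rules returned for FDC have the same antecedents and consequents as the FRC rows of Table 8.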
4.5 Rule Aggregator and Pruner (RAP)

In this stage, the four rules collections (RRC, FRC, MRC and DRC) are aggregated and pruned. The RAP module is used for this purpose. As shown in Figure 1, the rules collections enter RAP and this module aggregates them and stores them in a single repository. The second part of this module prunes the rules. Since rules are extracted from different collections of transactions, the RG module may create some repetitive rules. Two rules, R1 and R2, are repetitive when the antecedent of R1 is equal to the antecedent of R2 and the consequent of R1 is equal to the consequent of R2. In this stage, RAP is responsible for pruning rules by merging repetitive rules. When rules are merged, the significance of the merged rule is the sum of the significance of the individual rules:

$$\mathrm{RuleSignificance}_i = \sum (\text{RuleSignificance of repetitive rules})$$

The Final rules Collection (FC) is shown in Table 9. As shown in this table, three repetitive rules are merged. This is the collection that stores all effective association rules, and their importance is indicated by their significance. Since these association rules are extracted from profitable sections of transactions with high R, F, M and D values, they can be used by experts to make better and more effective decisions.

Table 9 Final rules collection created by RAP

RuleId  Antecedent_Items  Consequent_Items  RuleSignificance
1       {B}               {C}               4.62
2       {D}               {C}               4.62
3       {B,C}             {D}               5.94
4       {B,D}             {C}               3.3
5       {B}               {C,D}             3.3
6       {A,B}             {D}               3.96
7       {A,D}             {B}               3.96
8       {A}               {B,D}             3.96
9       {C,D}             {B}               2.64
10      {C}               {B,D}             2.64
11      {B,D}             {A}               1.32
12      {A}               {B,D}             1.32
13      {D}               {A,B}             1.32
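
The merging of repetitive rules can be sketched with a dictionary keyed by the (antecedent, consequent) pair. The function name `aggregate_and_prune` is an assumption, and only a subset of the rules in Table 8 is used to keep the example short.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[str, str, float]  # (antecedent items, consequent items, significance)
Key = Tuple[FrozenSet[str], FrozenSet[str]]


def aggregate_and_prune(*collections: List[Rule]) -> Dict[Key, float]:
    """Merge rules with equal antecedent and consequent, summing significance."""
    merged: Dict[Key, float] = defaultdict(float)
    for rules in collections:
        for antecedent, consequent, significance in rules:
            merged[(frozenset(antecedent), frozenset(consequent))] += significance
    return dict(merged)


# A subset of Table 8.
FRC = [("BC", "D", 3.3), ("BD", "C", 3.3), ("B", "CD", 3.3)]
MRC = [("AB", "D", 2.64), ("AD", "B", 2.64), ("BC", "D", 2.64)]
DRC = [("AB", "D", 1.32), ("AD", "B", 1.32)]

FC = aggregate_and_prune(FRC, MRC, DRC)
for (x, y), sig in FC.items():
    print(sorted(x), "=>", sorted(y), round(sig, 2))
```

For example, {B,C} ⇒ {D} appears in both FRC and MRC, so its merged significance is 3.3 + 2.64 = 5.94, as in Table 9.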
4.6 Generating the Total Rules Collection (TRC)

In the final step, the final rules collections (FCs) from the different branches are aggregated in order to form the Total Rules Collection (TRC). The Final Rule Aggregator and Pruner (FRAP) is used for this purpose (Figure 8). The numbers of transactions behind the different FCs are not necessarily equal and, because of this difference, the significance of the rules should be normalised. Let FC_i and NT_i (i = 1, …, k) be the final rules collection and the number of transactions corresponding to the database of each branch (DB_i), respectively. Then, the new significance value for each rule in FC_i is:

$$\mathrm{NewSignificance}_i = \frac{\mathrm{PreviousSignificance}_i \times NT_i}{\sum_{j=1}^{k} NT_j}$$

Figure 8 Aggregation of local rules collections (see online version for colours)

Like RAP, FRAP is responsible for pruning the rules inside the aggregated rules by merging the repetitive rules. In this step, when rules are merged, the significance of the merged rule is the sum of the significance of the individual rules:

$$\mathrm{RuleSignificance}_i = \sum (\text{RuleSignificance of repetitive rules})$$
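
The normalisation and merging performed in this final step can be sketched as follows. The function name `build_trc`, the two branches, their rules and their transaction counts are hypothetical values chosen only to make the arithmetic easy to follow.

```python
from typing import Dict, List, Tuple

RuleKey = Tuple[str, str]            # (antecedent items, consequent items)
Branch = Tuple[Dict[RuleKey, float], int]  # (final rules collection, NT_i)


def build_trc(branches: List[Branch]) -> Dict[RuleKey, float]:
    """Weight each branch's significance by NT_i / sum(NT), then merge duplicates."""
    total_transactions = sum(nt for _, nt in branches)
    trc: Dict[RuleKey, float] = {}
    for rules, nt in branches:
        for key, significance in rules.items():
            normalised = significance * nt / total_transactions
            trc[key] = trc.get(key, 0.0) + normalised
    return trc


# Hypothetical branches, not taken from the paper.
branch_1: Branch = ({("B", "C"): 4.0}, 100)
branch_2: Branch = ({("B", "C"): 2.0, ("A", "D"): 3.0}, 300)

TRC = build_trc([branch_1, branch_2])
print(TRC)
```

Here the rule B ⇒ C receives 4.0 × 100/400 from the first branch and 2.0 × 300/400 from the second, giving a merged significance of 2.5.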
5 Efficiency analysis

In this section, in order to analyse the efficiency of MFIE, a comparison between MFIE and Apriori (the classic algorithm for extracting maximum frequent itemsets) is presented. Generally, finding the support of itemsets in transactions and calculating the joint set of itemsets are the two major procedures in extracting maximum frequent itemsets. Both procedures are computationally heavy, and improvements in them can enhance the efficiency of algorithms. The execution time of maximum frequent itemset algorithms is strongly dependent on four important parameters:
1 number of transactions in the database (D)
2 average size of transactions (T)
3 total number of items (N)
4 average size of maximum frequent itemsets (I).
In order to analyse the efficiency of the proposed algorithm (MFIE) and the previous classic algorithm (Apriori), we consider both theoretical analysis and real experiments.
5.1 Theoretical analysis

In this section, we take into account the 12 basic arithmetic operations that are used in MFIE and Apriori. These basic operations are listed in Table 10.
Table 10 List of basic operations (see online version for colours)

| Operation | Identifier |
|---|---|
| Insert an item to array collection | A1 |
| Assign one item of collection to an integer number | A2 |
| Multiplication of two integer numbers | A3 |
| Summation of two integer numbers | A4 |
| Assign an integer number to a variable | A5 |
| Comparison between two integer numbers | A6 |
| Checking whether an item exists in a collection or not | A7 |
| Copy collection of integer items to another empty collection | A8 |
| Binary AND operation between two binary collections | A9 |
| Binary OR operation between two binary collections | A10 |
| Comparison between two collections | A11 |
| Join two collections | A12 |
MFIE has three main procedures:

P1 FindMaximumFrequentItemSetCollection()
P2 HasMinimumSupport()
P3 Join()

where P1 is the main procedure, which uses P2 and P3. The execution time of MFIE in terms of the basic arithmetic operations is:

$$E(\text{P3}) = T \cdot A_{10}$$

$$E(\text{P2}) = A_5 + D \, (T \cdot A_9 + A_{11} + A_5) + A_3 + A_6$$

$$E(\text{P1}) = A_2 + N \, (E(\text{P2}) + A_1 + 2A_2) + I \left( A_5 + \frac{I}{2} \left( A_5 + \frac{I}{2} \, (A_4 + A_5 + A_6 + A_7 + E(\text{P3}) + E(\text{P2}) + A_7 + A_1) \right) \right)$$

As a result, the execution time of MFIE is:

$$E(\text{MFIE}) = A_2 + N \, \big( (A_5 + D \, (T \cdot A_9 + A_{11} + A_5) + A_3 + A_6) + A_1 + 2A_2 \big) + I \left( A_5 + \frac{I}{2} \left( A_5 + \frac{I}{2} \, \big( A_4 + A_5 + A_6 + A_7 + T \cdot A_{10} + (A_5 + D \, (T \cdot A_9 + A_{11} + A_5) + A_3 + A_6) + A_7 + A_1 \big) \right) \right)$$

Since $A_1$ to $A_{12}$ are constants, we simplify the execution time of MFIE by considering only D, T, N and I. The simplified form of the execution time of MFIE is:

$$E(\text{MFIE}) = k_0 + k_1 N + k_2 I + k_3 I^2 + k_4 N D + k_5 I T + k_6 D I + k_7 D I^2 + k_8 T I^2 + k_9 N D T + k_{10} D T I + k_{11} D I^2 T$$

where $k_0$ to $k_{11}$ are constant coefficients.
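To make the cost expressions above concrete, the following Python sketch evaluates E(P3), E(P2) and E(P1) for given values of D, T, N and I. Two assumptions are made for illustration only: every basic operation A1 to A12 is given a unit cost of 1 unless stated otherwise, and the last term of E(P1) is read with the nesting I * (A5 + (I/2) * (A5 + (I/2) * (...))). The function name is hypothetical.

```python
def mfie_cost(D, T, N, I, A=None):
    """Evaluate the operation-count expression E(P1) for MFIE.

    A maps an operation index (1..12, as in Table 10) to its cost.
    Unit costs are an assumption made only for illustration.
    """
    if A is None:
        A = {i: 1 for i in range(1, 13)}
    e_p3 = T * A[10]
    e_p2 = A[5] + D * (T * A[9] + A[11] + A[5]) + A[3] + A[6]
    inner = A[4] + A[5] + A[6] + A[7] + e_p3 + e_p2 + A[7] + A[1]
    # Assumed nesting: I * (A5 + (I/2) * (A5 + (I/2) * inner)).
    e_p1 = (A[2] + N * (e_p2 + A[1] + 2 * A[2])
            + I * (A[5] + (I / 2) * (A[5] + (I / 2) * inner)))
    return e_p1


small = mfie_cost(D=1, T=1, N=1, I=2)
```

With real measurements the unit costs would be replaced by the measured cost of each basic operation; the sketch only shows how the four parameters enter the expression.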
On the other hand, to compute the execution time of Apriori, we consider three procedures:

K1 FindfrequentItemset()
K2 Apriori_Gen()
K3 has_infrequent_subset()

where K1 uses K2 and K2 uses K3. These procedures are shown in Figures 9, 10 and 11.

Figure 9 K1: Main procedure of Apriori
Figure 10 K2: Procedure of generating frequent itemset
Figure 11 K3: Procedure of deciding whether an itemset has an infrequent subset or not
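For readers without access to the figures, the three procedures can be sketched in Python as below. This follows the usual textbook description of Apriori and is not a transcription of Figures 9 to 11; in particular, the join step here simply unions pairs of frequent (k-1)-itemsets rather than using a prefix-based join, and the function names and sample transactions are illustrative.

```python
from itertools import combinations


def has_infrequent_subset(candidate, frequent_prev):
    """K3: True if some (k-1)-subset of the candidate is not frequent."""
    k = len(candidate)
    return any(frozenset(subset) not in frequent_prev
               for subset in combinations(candidate, k - 1))


def apriori_gen(frequent_prev, k):
    """K2: join frequent (k-1)-itemsets, then prune by the Apriori property."""
    candidates = set()
    for a in frequent_prev:
        for b in frequent_prev:
            union = a | b
            if len(union) == k and not has_infrequent_subset(union, frequent_prev):
                candidates.add(union)
    return candidates


def find_frequent_itemsets(transactions, min_count):
    """K1: level-wise search returning {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        count = sum(1 for t in transactions if item in t)
        if count >= min_count:
            current[frozenset([item])] = count
    frequent = dict(current)
    k = 2
    while current:
        candidates = apriori_gen(set(current), k)
        current = {}
        for c in candidates:
            # Each candidate requires a full pass over the transactions.
            count = sum(1 for t in transactions if c <= t)
            if count >= min_count:
                current[c] = count
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions over the items A, B, C and D.
sample = [{"A", "B", "D"}, {"B", "C", "D"}, {"B", "C"}, {"A", "B", "C", "D"}]
frequent = find_frequent_itemsets(sample, min_count=2)
```

The repeated passes over the transactions in K1 and the subset checks in K3 are the parts of the algorithm that the cost analysis in this section accounts for.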
The execution time of Apriori in terms of the basic arithmetic operations shown in Table 10 is:

$$E(\text{K3}) = (I - 1) \cdot I$$

$$E(\text{K2}) = I^3 \, (A_2 + A_{10} + A_{12} + E(\text{K3}) + A_1)$$

$$E(\text{K1}) = I \, \big( (A_5 + E(\text{K2})) + D \, (T \cdot I + I \cdot A_4) + (A_2 + A_6) \big)$$

$$E(\text{Apriori}) = I \, \Big( \big( A_5 + I^3 \, (A_2 + A_{10} + A_{12} + (I - 1) \cdot I + A_1) \big) + D \, (T \cdot I + I \cdot A_4) + (A_2 + A_6) \Big)$$
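The expressions for K3, K2 and K1 can be evaluated in the same way as a small Python sketch. As before, unit costs for the basic operations are an assumption made only for illustration, and the function name is hypothetical.

```python
def apriori_cost(D, T, I, A=None):
    """Evaluate the operation-count expression E(K1) for Apriori.

    A maps an operation index (1..12, as in Table 10) to its cost.
    Unit costs are an assumption made only for illustration.
    """
    if A is None:
        A = {i: 1 for i in range(1, 13)}
    e_k3 = (I - 1) * I
    e_k2 = I * I * I * (A[2] + A[10] + A[12] + e_k3 + A[1])
    e_k1 = I * ((A[5] + e_k2) + D * (T * I + I * A[4]) + (A[2] + A[6]))
    return e_k1


small_apriori = apriori_cost(D=1, T=1, I=2)
```

Because E(K2) is multiplied by I again in E(K1), the candidate-generation term grows quickly with the average size of the maximum frequent itemsets.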
As with MFIE, the simplified form of the execution time of Apriori is:

$$E(\text{Apriori}) = k_0 + k_1 I^6 + k_2 I^5 + k_3 I^4 + k_4 D T + k_5 I D$$
5.2 Experiments and results

E(Apriori) and E(MFIE) are polynomials in four independent variables (D, T, I and N). From these polynomials alone it is not easy to decide which algorithm outperforms the other, nor can they be compared with regular analytical methods. To obtain a better comparison between MFIE and Apriori, we implemented both algorithms and conducted three experiments to compare their execution times on different types of input data.

All experiments were executed on a 2.0 GHz Duo Core2 Pentium processor with 1 GB of memory, using VC++.Net (version 8). The experiments were based on four sample data sets created with the test data generator from the IBM Almaden Quest research group. This generator produces a sample data set from the given parameters (D, T and I). The results of the experiments are shown in Figures 12, 13 and 14.

We studied the execution time while varying the number of transactions (D), the average size of transactions (T) and the average size of frequent itemsets (I). As shown in Figure 12, if the number of transactions (D) increases while the other two parameters (T and I) remain constant, the execution time of both MFIE and Apriori increases. However, in this experiment, MFIE performs better for large numbers of transactions.

If the average length of the transactions (T) increases while the other two parameters (D and I) remain constant, the execution time of both MFIE and Apriori increases. As shown in Figure 13, MFIE and Apriori behave similarly when the average size of maximum frequent itemsets is smaller than eight, but MFIE performs better for larger transactions.

If the average size of frequent itemsets (I) increases while the other two parameters (D and T) remain constant, the execution time of both MFIE and Apriori increases. The experimental results show that MFIE performs better than Apriori when the average size of maximum frequent itemsets is large. This behaviour is shown in Figure 14.

Figure 12 Experiment results on T(20–40)I6D50k (see online version for colours)
Figure 13 Experiment results on T20I(1–10)D10k (see online version for colours)
Figure 14 Experiment results on T20I6D(10k–100k) (see online version for colours)
These three experiments demonstrate that the advantage of the proposed algorithm (MFIE) over Apriori grows with the number of calculations, that is, as the number of transactions (D), the average size of transactions (T) or the average size of frequent itemsets (I) increases. We believe this is due to the use of binary operations in calculating the support of itemsets and finding joined items.
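The kind of binary operations referred to here can be illustrated with a short Python sketch. It is not the authors' MFIE implementation: it only shows, under the assumption that each transaction and itemset is encoded as an integer bitmask, how support can be counted with a binary AND and how two itemsets can be joined with a binary OR. All names and the sample data are hypothetical.

```python
def to_mask(items, index):
    """Encode a set of items as an integer bitmask using an item -> bit index."""
    mask = 0
    for item in items:
        mask |= 1 << index[item]
    return mask


def support_count(itemset_mask, transaction_masks):
    """Count transactions that contain the itemset: (t AND s) == s."""
    return sum(1 for t in transaction_masks
               if (t & itemset_mask) == itemset_mask)


def join(mask_a, mask_b):
    """Join two itemsets with a single binary OR."""
    return mask_a | mask_b


# Hypothetical items and transactions.
index = {"A": 0, "B": 1, "C": 2, "D": 3}
transactions = [{"A", "B", "D"}, {"B", "C", "D"}, {"B", "C"}, {"A", "B", "C", "D"}]
masks = [to_mask(t, index) for t in transactions]

bd = join(to_mask({"B"}, index), to_mask({"D"}, index))
bd_support = support_count(bd, masks)
```

Each containment test is a single AND followed by a comparison, regardless of how many items the itemset holds, which is consistent with the explanation given above for the observed speed-up.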
6 Conclusions and future works
In this paper, we presented a novel method that combines RFMD analysis with the association rule mining technique to extract effective rules from distributed databases. In extracting maximum frequent itemsets – the first step in generating association rules – a novel binary algorithm is proposed that uses binary operations to compute the
support of items and create joined itemsets. Because association rules are extracted from significant sections of purchase transactions, the extracted rules are more effective. In addition, as the experiments illustrate, the proposed algorithm is more efficient at extracting frequent itemsets when the number of transactions, the average size of transactions or the size of frequent itemsets is large. With the presented method, different RFMD values can be used for each local branch according to the importance of each parameter. Finally, the proposed method supports extracting rules from distributed databases and so can be used by large markets with many local branches.

We believe that combining market segmentation techniques with association rule mining can advance marketing research. Based on this idea, future work will apply other market segmentation techniques to extracting effective rules from purchase transactions. In addition, we aim to present new methods of creating association rules without generating frequent itemsets. To achieve this goal, we will apply neural networks to estimate and predict the support count of items without computing them, which could improve the efficiency of association rule mining significantly.