
Soft Comput (2013) 17:1227–1239 DOI 10.1007/s00500-012-0973-7

FOUNDATIONS

Deriving support threshold values and membership functions using the multiple-level cluster-based master–slave IFG approach

Mojtaba Asadollahpour Chamazi · Behrouz Minaei Bidgoli · Mahdi Nasiri

Published online: 2 February 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract Today, the development of e-commerce has provided many transaction databases with useful information for investigators exploring dependencies among the items. In data mining, the dependencies among different items can be shown using an association rule. The new fuzzy-genetic (FG) approach is designed to mine fuzzy association rules from a quantitative transaction database. Three important advantages are associated with using the FG approach: (1) association rules can be extracted from a transaction database with quantitative values; (2) extracting proper membership functions and support threshold values with the genetic algorithm has a positive effect on the results of the mining process; (3) association rules expressed in a fuzzy representation are more understandable for humans. In this paper, we design a comprehensive and fast algorithm that mines level-crossing fuzzy association rules on multiple concept levels and learns the support threshold values and membership functions using the cluster-based master–slave integrated FG approach. Mining fuzzy association rules on multiple concept levels helps find more important, useful, accurate, and practical information.

Communicated by V. Loia.

M. A. Chamazi (✉)
Computer Society of Iran, Tehran, Iran
e-mail: [email protected]

B. M. Bidgoli · M. Nasiri
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran
e-mail: [email protected]

M. Nasiri
e-mail: [email protected]

Keywords Level-crossing fuzzy association rules · Multiple-level IFG mining · Cluster-based master–slave technique

1 Introduction

Today, as e-commerce has developed, many transaction databases have become accessible in which the dependencies among the items may provide useful information. In data mining, an association rule can be used to show dependencies among the items. An association rule is an implication of the form $X \Rightarrow Y$, where $X$ and $Y$ are sets of items called itemsets, and $X \cap Y = \emptyset$ (Agrawal and Srikant 1994). "Confidence" and "support" are two measures used to appraise the quality of a rule. The rule $X \Rightarrow Y$ holds in the transaction set $D$ with confidence $c$ if $c\%$ of the transactions in $D$ that contain $X$ also contain $Y$. The rule $X \Rightarrow Y$ has support $s$ in the transaction set $D$ if $s\%$ of the transactions in $D$ contain $X \cup Y$ (Agrawal and Srikant 1994). Agrawal and Srikant (1994), Agrawal et al. (1993), Savasere et al. (1995), Houtsma and Swami (1995), Hong et al. (1999), and Hong and Chen (1999) have proposed algorithms for mining association rules in large databases with a single support threshold. Using a single support threshold value for the whole database assumes that all data items are of the same nature and/or have similar frequencies (Dunham et al. 2011). In reality, some items may be very frequent while others may rarely appear. However, the latter may be more informative and more interesting than the former (Dunham et al. 2011). To this end, Lee et al. (2004, 2008), Liu et al. (1999), Wang et al. (2000), Cai et al. (1998), and Shu Yue et al. (2000) have proposed algorithms in which multiple support threshold values are used for mining association rules from items in the database. Having the user determine multiple support threshold values is, of course, difficult.

Many previous studies focused on mining association rules from transaction databases with Boolean attributes (Agrawal and Srikant 1994; Qodmanan et al. 2011; Hadian et al. 2010). However, real-world transaction databases usually contain quantitative attributes. Hence, several algorithms (Charu et al. 1998; Srikant and Agrawal 1996; Nasiri et al. 2010; Nasiri et al. 2011; Moslehi et al. 2011a; Moslehi et al. 2011b) were proposed that discover quantitative association rules from a transaction database containing quantitative attributes.

In recent years, researchers have frequently used fuzzy set theory in intelligent systems because of its simplicity and similarity to human reasoning (Lee et al. 2008; Hong et al. 2003, 2006; Chen et al. 2006; Alcalá-Fdez et al. 2009). Fuzzy sets extend the types of relationships that can be represented between items in a database, facilitate the use of linguistic terms to interpret rules, and avoid unnatural boundaries when attribute domains are partitioned (Alcalá-Fdez et al. 2009). Some researchers (Hong et al. 1999, 2003; Hong and Chen 1999; Lee et al. 2004, 2008; Shu Yue et al. 2000) have designed fuzzy association rule mining (FARM) algorithms for managing quantitative data. Most FARM algorithms assume that the membership functions are predefined, although the membership functions affect rule discovery. As a result, fuzzy-genetic (FG) approaches (Hong et al. 2004, 2006, 2009; Alcalá-Fdez et al. 2009; Chen and Hong 2007; Chen et al. 2006, 2007a, b, 2008, 2009; Kaya and Alhajj 2005, 2006; Kaya 2006) have been proposed for genetically learning the membership functions (and support threshold values) necessary for mining association rules from quantitative transactions.
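The support and confidence measures defined at the start of this section can be illustrated with a minimal Python sketch. The toy transactions and the function names `support` and `confidence` are illustrative assumptions, not part of the original paper; the measures are returned as fractions rather than percentages.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of the transactions containing X that also contain Y."""
    both = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent)
    return both / base if base > 0 else 0.0


# Hypothetical Boolean transactions (item names are made up).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk"},
]

rule_support = support(transactions, {"milk", "bread"})
rule_confidence = confidence(transactions, {"milk"}, {"bread"})
```

Here the rule milk ⇒ bread is contained in two of the four transactions, and two of the three transactions with milk also contain bread.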
Three important advantages are associated with using the FG approach: (1) association rules can be extracted from a transaction database with quantitative values; (2) extracting correct membership functions and support threshold values with the genetic algorithm has a positive effect on the results of the mining process; (3) association rules expressed in a fuzzy representation are more understandable for humans.

The integrated fuzzy-genetic (IFG) mining approaches (Hong et al. 2006, 2009; Chen et al. 2006, 2007b, 2008; Kaya and Alhajj 2005; Kaya 2006) are designed to develop an initial population of chromosomes and manipulate that population until an optimal general solution is obtained. Each chromosome in the population includes the membership functions of all itemsets. The IFG approaches are simple and easy to use and have few constraints with respect to their fitness functions (Hong et al. 2009). In earlier IFG approaches, the database had to be scanned at least once during the fitness evaluation of each chromosome of the population, which is a time-consuming process. Hence, researchers have proposed a master–slave parallel IFG algorithm (Hong et al. 2005) and cluster-based sequential IFG algorithms (Chen et al. 2006, 2008). In the master–slave parallel algorithm (Hong et al. 2005), the master processor uses a single population, just as a simple genetic algorithm does, and distributes the fitness evaluation tasks to the slave processors (Hong et al. 2005). In the cluster-based sequential algorithms (Chen et al. 2006, 2008), accuracy is maintained while the runtime is reduced significantly by using k-means clustering. The number of database scans during the fitness evaluation of the chromosomes is reduced to the number of clusters. Although this leads to faster fitness evaluation, evaluating the representative chromosomes of the clusters sequentially is still time-consuming if the database is large. We can thus integrate the master–slave parallel technique with the clustering technique to further reduce the runtime of the algorithm in evaluating the chromosomes. However, implementing this method requires a multiple-processor system.

Many of the mentioned approaches mine association rules only from a single concept level. However, mining association rules on multiple concept levels may lead to finding more informative and refined knowledge from data (Han and Fu 1995). Although Hong et al. (2003), Lee et al. (2008), and Chen et al. (2011) proposed algorithms that discover multi-level fuzzy association rules from quantitative transaction databases, Hong et al.'s algorithm (2003) uses a single predefined support threshold value and predefined membership functions; Lee et al.'s algorithm (2008) uses predefined support threshold values for primitive items and predefined membership functions; and Chen et al.'s algorithm (2011) uses a single predefined support threshold value and the same derived membership functions for all items in a category.
In this paper, we propose a comprehensive and fast algorithm based on the IFG approach that discovers level-crossing fuzzy association rules on multiple concept levels from a quantitative transaction database and learns the support threshold values and membership functions using a cluster-based master–slave technique. Our proposed algorithm is a combination and modification of the algorithms proposed in Lee et al. (2008), Hong et al. (2003, 2005), Chen et al. (2007b, 2008, 2009, 2011), Han and Fu (1995), and Lozano et al. (2004). In the proposed approach, for each level, the items on that level are first encoded (Sect. 2.1) with a sequence of digits and asterisks for simplicity. Then a population of chromosomes is generated (Sect. 2.2), each of which is a possible solution and holds the support threshold values and membership functions of all items on that level. At the next step, chromosomes are evaluated (Sect. 2.3) so that they can be selected (Sect. 2.5) two by two for crossover (Sect. 2.6). During the evaluation of chromosomes, the runtime can be greatly reduced using the cluster-based master–slave technique (Sect. 2.4). Crossing the chromosomes two by two (Sect. 2.6) generates offspring chromosomes that inherit data from their parents. Then the mutation operation (Sect. 2.7) changes some of the information in the offspring chromosomes. In this way, the genetic algorithm manipulates the chromosomes until the global optimal solution of that level is achieved. At the next step, level-crossing large itemsets are formed using the final support threshold values and membership functions of the global optimal solution. Finally, level-crossing fuzzy association rules on multiple concept levels are extracted from the mined level-crossing large itemsets. Experimental results for this algorithm are presented in Sect. 5.

2 Basic concepts

2.1 Predefined taxonomy and encoding its items

Han and Fu proposed an algorithm (Han and Fu 1995) for mining level-crossing association rules at multiple concept levels in which the relations among itemsets are represented as a predefined taxonomy. Given the position of the items in the predefined taxonomy, they encoded items with a sequence of digits and asterisks (where each digit represents the number of a branch). For example, for the predefined taxonomy shown in Fig. 4, the "BakingSoda" item and the "Zamen BakingSoda" item are encoded with "*1" and "12", respectively. Then the names of the items in the transaction database are replaced by these codes. The algorithms proposed by Lee et al. (2008) and Hong et al. (2003) also use Han and Fu's (1995) approach. In the same way, our proposed algorithm receives the predefined taxonomy as an input and encodes the items according to the Han and Fu approach. Afterwards, the names of the items in the database are replaced by these codes, and the algorithm works with the codes.

2.2 Chromosome and population

In the genetic algorithm, each chromosome represents an individual in the population. Each individual is an initial solution for the problem to be solved. The genetic algorithm manipulates the individuals until an optimal and proper solution for the problem is obtained. Different methods are used to encode the initial solution of the problem in a chromosome. In this study, each fixed-length chromosome in the population is represented as a string of real numbers. This string of real numbers includes the following two parts (Chen et al. 2007b, 2008, 2009): (part I) the membership functions and (part II) the support threshold values of all items on a certain level. To simplify the task, we consider the membership functions of the items to be isosceles trapezoids. Figure 1 shows the membership functions of $I_j^l$, where $I_j^l$ is the $j$th item on level $l$, and $R_j^{lk}$ is the $k$th fuzzy region of item $I_j^l$. As can be seen in Fig. 1, each membership function can be encoded in the chromosome with an ordered triplet (center, SmallerBase, (LargerBase − SmallerBase)/2), or $(c_j^{lk}, b_j^{lk}, d_j^{lk})$. Figure 2 shows a chromosome representation for items on level $l$ in which any item has $m$ fuzzy regions and one support threshold value $\alpha_j^l$. As can be seen in Fig. 2, the length of each chromosome is equal to $n + \left(3 \times \sum_{j=1}^{n} m_j^l\right)$, where $n$ is the number of items at level $l$ and $m_j^l$ is the number of fuzzy regions of $I_j^l$. Figure 3 shows the initial population of chromosomes on level $l$ developed by the proposed multiple-level cluster-based master–slave IFG algorithm.
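The two-part chromosome layout and its length formula can be sketched in Python as follows. This is only an illustration under stated assumptions: the flat ordering (all membership triplets first, then the thresholds), the sampling ranges for the bases, and the names `random_chromosome` and `chromosome_length` are hypothetical and are not taken from the paper or from Fig. 2.

```python
import random


def chromosome_length(regions_per_item):
    """n + 3 * sum_j m_j: one threshold per item, one (c, b, d) triplet per region."""
    return len(regions_per_item) + 3 * sum(regions_per_item)


def random_chromosome(regions_per_item, max_quantity, item_fraction, rng):
    """Build one real-coded chromosome for a single level.

    regions_per_item: m_j, number of fuzzy regions of each item
    max_quantity:     largest quantity of each item (assumed sampling range)
    item_fraction:    fraction of transactions containing each item
    """
    membership_part = []
    for m, q_max in zip(regions_per_item, max_quantity):
        # Centers are kept in ascending order within an item.
        centers = sorted(rng.uniform(0, q_max) for _ in range(m))
        for c in centers:
            b = rng.uniform(0, q_max / m)  # smaller base (assumed range)
            d = rng.uniform(0, q_max / m)  # (larger base - smaller base) / 2
            membership_part.extend([c, b, d])
    # Part II: one threshold per item, drawn between 0 and the item's fraction.
    support_part = [rng.uniform(0, frac) for frac in item_fraction]
    return membership_part + support_part


rng = random.Random(42)
regions = [2, 3, 2]  # hypothetical items with 2, 3 and 2 fuzzy regions
chromosome = random_chromosome(regions, [10, 20, 5], [0.6, 0.3, 0.9], rng)
```

For these three hypothetical items the chromosome holds 3 thresholds and 7 triplets, that is, 3 + 3 × 7 = 24 real numbers.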

Fig. 1 Membership functions of $I_j^l$ (axis labels: membership functions, quantity)

Fig. 2 A chromosome representation in a population

Fig. 3 The initial population including p chromosomes
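A membership function stored as the triplet (center, SmallerBase, (LargerBase − SmallerBase)/2) can be evaluated as in the following sketch. It assumes the usual reading of an isosceles trapezoid, with membership 1 on the smaller base and a linear decrease to 0 at the ends of the larger base; the function name `trapezoid_membership` is hypothetical.

```python
def trapezoid_membership(x, c, b, d):
    """Membership of quantity x in an isosceles trapezoid encoded as (c, b, d).

    c: center, b: smaller (top) base, d: (larger base - smaller base) / 2,
    i.e. the horizontal width of each sloping side.
    """
    dist = abs(x - c)
    half_top = b / 2.0
    if dist <= half_top:
        return 1.0  # on the flat top
    if d > 0 and dist < half_top + d:
        return (half_top + d - dist) / d  # on a sloping side
    return 0.0  # outside the larger base


# Hypothetical region with center 10, smaller base 4 and slope width 3.
values = [trapezoid_membership(x, 10, 4, 3) for x in (10, 12, 13.5, 15, 20)]
```

With these assumed numbers the region covers quantities from 5 to 15, and a quantity of 13.5 lies halfway down the right slope.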

2.3 Evaluation of chromosomes

The fitness function defined in Hong et al. (2006) and Chen et al. (2007b, 2008) is used to evaluate each chromosome. The fitness value of chromosome $C_q^l$ is defined as (Hong et al. 2006; Chen et al. 2007b):

$$f(C_q^l) = \frac{RS(C_q^l)}{suitability(C_q^l)},$$

where $RS(C_q^l)$ is the requirement satisfaction and is defined as $RS(C_q^l) = \sum_{j=1}^{n} RS(C_{qj}^l)$, where $RS(C_{qj}^l)$ represents the closeness of the number of derived linguistic large 1-itemsets for item $I_j^l$ $(|L_1^{jl}|)$ in chromosome $C_q^l$ to the required number of large 1-itemsets that a user wants to get from item $I_j^l$ $(RNL_j^l)$ and is defined as:

$$RS(C_{qj}^l) = \begin{cases} \dfrac{|L_1^{jl}|}{RNL_j^l} & \text{if } |L_1^{jl}| \le RNL_j^l \\[2ex] \dfrac{RNL_j^l}{|L_1^{jl}|} & \text{if } RNL_j^l < |L_1^{jl}| \end{cases}$$

RNL is used to reflect the user's preference for the derived knowledge and is defined as:

$$RNL_j^l = \left\lfloor m_j^l \times p_{RNL} \right\rfloor$$

where $m_j^l$ is the number of fuzzy regions of $I_j^l$ and $p_{RNL}$ is the predefined percentage that reflects the user's preference for the required number of large 1-itemsets. $suitability(C_q^l)$ represents the shape suitability of $C_q^l$ and is defined as:

$$suitability(C_q^l) = overlap\_factor(C_q^l) + coverage\_factor(C_q^l).$$

The overlap factor is used to avoid redundant shapes, and is defined as:

$$overlap\_factor(C_q^l) = \sum_{j=1}^{n} \sum_{k<i} \left[ \max\left( \frac{overlap(R_j^{lk}, R_j^{li})}{\min\left(\frac{b_j^{lk}}{2} + d_j^{lk},\ \frac{b_j^{li}}{2} + d_j^{li}\right)},\ 1 \right) - 1 \right],$$

where $n$ is the number of items, $overlap(R_j^{lk}, R_j^{li})$ is the overlap length of $R_j^{lk}$ and $R_j^{li}$, and $l$ is the level number of $I_j$ in the predefined taxonomy. The coverage factor is used to avoid shapes that are too far apart, and is defined as:

$$coverage\_factor(C_q^l) = \sum_{j=1}^{n} \frac{1}{\dfrac{range(R_j^{l1}, R_j^{l2}, \ldots, R_j^{lm})}{\max(I_j^l)}},$$

where $range(R_j^{l1}, R_j^{l2}, \ldots, R_j^{lm})$ is the coverage range of the fuzzy regions, $m$ is the number of fuzzy regions for $I_j^l$, and $\max(I_j^l)$ is the maximum quantity of $I_j^l$ in the transactions.

2.4 Using the cluster-based master–slave technique

As mentioned above, in earlier IFG approaches the database had to be scanned at least once during the evaluation of each chromosome in the population, which was a time-consuming process. Hence, researchers have proposed a master–slave parallel IFG algorithm (Hong et al. 2005) and cluster-based sequential IFG algorithms (Chen et al. 2006, 2008). In the master–slave parallel algorithm (Hong et al. 2005), the master processor uses a single population, just as a simple genetic algorithm does, and distributes the fitness evaluation tasks to the slave processors (Hong et al. 2005). In the cluster-based sequential algorithms (Chen et al. 2006, 2008), accuracy is maintained while the runtime is reduced significantly by using k-means clustering. The number of database scans during the chromosome evaluation is reduced to the number of clusters. Although this method leads to faster fitness evaluation, evaluating the representative chromosomes of the clusters sequentially is still time-consuming if the database is large. In this paper, we integrate the master–slave parallel technique with the clustering technique to further reduce the runtime of the algorithm in evaluating the chromosomes. However, implementing this method requires a multiple-processor system.

The new method proceeds as follows. In the multiple-processor system, one processor is designated as the master and the other processors are designated as slave processors. After clustering the chromosomes and determining the representative chromosome in each cluster according to the schema stated in Sect. 3 (Step 5.1–Step 5.2), the master processor sends the representative chromosomes to the slave processors (Sect. 3, Step 5.3). The slave processors simultaneously calculate the requirement satisfaction of their representative chromosomes according to the schema stated in Sect. 3 (Step 5.4) and return the results to the master processor (Step 5.5). After receiving the requirement satisfaction of the representative chromosome of each cluster, the master processor calculates the fitness value of the chromosomes of that cluster according to the schema stated in Sect. 3 (Step 5.6).

2.5 Selection technique

This technique converts the fitness value of each chromosome to the selection probability of that chromosome (Bavi and Salehi 2008). Depending on the type of problem, various techniques can be used to select chromosomes in each generation. We use the roulette-wheel selection technique in this paper.
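How the fitness values of Sect. 2.3 are obtained for a whole cluster (Sect. 2.4) and then turned into selection probabilities can be sketched as below. The sketch takes the number of large 1-itemsets per item and the suitability values as given inputs instead of computing them from a database; the function names, the toy numbers, and the choice that an item with no large 1-itemset contributes 0 are assumptions for illustration.

```python
import math


def requirement_satisfaction(large_counts, regions_per_item, p_rnl):
    """RS of a chromosome: sum over items of the closeness of |L1| to RNL."""
    total = 0.0
    for n_large, m in zip(large_counts, regions_per_item):
        rnl = math.floor(m * p_rnl)  # RNL_j = floor(m_j * pRNL)
        if n_large == 0:
            continue  # assumption: no large 1-itemset contributes 0
        total += n_large / rnl if n_large <= rnl else rnl / n_large
    return total


def cluster_fitness(rs_of_representative, suitabilities):
    """Each member reuses the representative's RS with its own suitability."""
    return [rs_of_representative / s for s in suitabilities]


def selection_probabilities(fitness_values):
    """Share of the roulette wheel allocated to each chromosome."""
    total = sum(fitness_values)
    return [f / total for f in fitness_values]


# Hypothetical cluster: two items with 3 fuzzy regions each, pRNL = 0.8.
rs = requirement_satisfaction([2, 3], [3, 3], 0.8)
fitness = cluster_fitness(rs, [2.0, 2.5])  # assumed suitability values
probabilities = selection_probabilities(fitness)
```

In this example RNL is 2 for both items, so the first item contributes 2/2 and the second 2/3, and the member with the smaller suitability value receives the larger share of the wheel.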
In this technique, the fitness values of the chromosomes in a population are placed end to end to form a distance from zero to the total sum of the fitness values. The fitness value of each chromosome determines the interval allocated to that chromosome within this distance. This distance is treated as the circumference of a wheel. A random number between zero and the total sum of the fitness values is then drawn, and the chromosome whose interval contains this number is selected (Bavi and Salehi 2008).

2.6 Crossover operator

The crossover operator crosses two parent chromosomes to produce offspring chromosomes. Each new offspring chromosome inherits part of its information from its parent chromosomes. The parent-centric BLX-α (PBX-α) crossover operator (Lozano et al. 2004) is used in this algorithm. Suppose that we intend to cross the following real-coded parent chromosomes with the PBX-α operator (Lozano et al. 2004): $C^{p1} = (c_1, \ldots, c_z)$ and $C^{p2} = (c'_1, \ldots, c'_z)$; $c_h, c'_h \in [a_h, b_h] \subset \mathbb{R}$, $h = 1, \ldots, z$. Then the following two offspring chromosomes are produced:

1. $C^{o1} = (c_1^{o1}, \ldots, c_z^{o1})$, where $c_h^{o1}$ is a randomly (uniformly) chosen number from the interval $[l_h^1, u_h^1]$, with $l_h^1 = \max\{a_h, c_h - I_h\}$, $u_h^1 = \min\{b_h, c_h + I_h\}$, and $I_h = |c_h - c'_h| \cdot \alpha$.

2. $C^{o2} = (c_1^{o2}, \ldots, c_z^{o2})$, where $c_h^{o2}$ is a randomly (uniformly) chosen number from the interval $[l_h^2, u_h^2]$, with $l_h^2 = \max\{a_h, c'_h - I_h\}$ and $u_h^2 = \min\{b_h, c'_h + I_h\}$.

In this paper, the value of parameter α is set to the constant value 0.3.

2.7 Mutation operator

This operator makes it possible to regenerate good chromosomes that were previously lost. Likewise, regardless of how the initial population is scattered over the problem space, this operator makes it possible to search every point of the problem space (Bavi and Salehi 2008). In this study, a one-point mutation operator is used to mutate the chromosomes. In this type of mutation operator, one gene of the chromosome is selected randomly and is mutated by adding a random value ε to it within the allowed region. If the changed gene is the center of a membership function, the order of the centers of the membership functions should be checked after the mutation. This operator has been used in Hong et al. (2004, 2005, 2006) and Chen et al. (2006, 2007a, b, 2008, 2009, 2011).
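The three operators of Sects. 2.5 to 2.7 can be sketched together as follows. This is a minimal illustration, not the authors' implementation: the function names, the `max_step` bound on the mutation value ε, and the clamping of a mutated gene to its allowed interval are assumptions, and the check on the order of the centers after mutation is left out.

```python
import random


def roulette_select(fitness_values, rng):
    """Return the index whose interval on the wheel contains a random pick."""
    total = sum(fitness_values)
    pick = rng.uniform(0, total)
    running = 0.0
    for index, f in enumerate(fitness_values):
        running += f
        if pick <= running:
            return index
    return len(fitness_values) - 1  # guard against rounding at the far end


def pbx_alpha(parent1, parent2, bounds, alpha, rng):
    """PBX-alpha: each offspring gene is drawn around the gene of one parent."""
    child1, child2 = [], []
    for c, c2, (a, b) in zip(parent1, parent2, bounds):
        spread = abs(c - c2) * alpha  # I_h = |c_h - c'_h| * alpha
        child1.append(rng.uniform(max(a, c - spread), min(b, c + spread)))
        child2.append(rng.uniform(max(a, c2 - spread), min(b, c2 + spread)))
    return child1, child2


def one_point_mutation(chromosome, bounds, max_step, rng):
    """Add a random value to one randomly chosen gene, kept inside its bounds."""
    mutated = list(chromosome)
    h = rng.randrange(len(mutated))
    a, b = bounds[h]
    epsilon = rng.uniform(-max_step, max_step)  # assumed range for epsilon
    mutated[h] = min(b, max(a, mutated[h] + epsilon))
    return mutated


rng = random.Random(0)
bounds = [(0.0, 10.0), (0.0, 10.0)]
p1, p2 = [1.0, 5.0], [2.0, 9.0]
o1, o2 = pbx_alpha(p1, p2, bounds, 0.3, rng)
m1 = one_point_mutation(o1, bounds, 0.5, rng)
```

With α = 0.3 the first gene of each offspring can move at most 0.3 away from the corresponding parent gene and the second gene at most 1.2, which keeps every offspring close to one parent.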

3 The proposed algorithm

Here we present our proposed algorithm, which combines and modifies the algorithms proposed in Lee et al. (2008), Hong et al. (2003, 2005), Chen et al. (2007b, 2008, 2009, 2011), Han and Fu (1995), and Lozano et al. (2004).

Input: $t$ quantitative transaction data, a predefined taxonomy with the items assigned their own number of linguistic terms, a confidence threshold value $\lambda$, a parameter $c$ for k-means clustering, a population size $P$, a crossover rate $r_c$, a mutation rate $r_m$, and a percentage of the required number of large 1-itemsets $p_{RNL}$.

Output: A set of level-crossing fuzzy association rules on multiple concept levels with the associated membership functions (MFs) and support threshold values.

Step 1 According to the schema stated in Sect. 2.1, encode the item names in the transaction database with the master processor.

Step 2 Set $l = 1$ and $z = 1$ with the master processor, where $l$ represents the number of the level being processed and $z$ represents the number of regions stored in the current large itemsets.

Step 3 For each transaction $D_i$ $(1 \le i \le t)$, with the master processor, group the items with the same first $l$ digits, and add the amounts of the items in the same groups in $D_i$.

Step 4 According to the schema stated in Sect. 2.2, randomly generate an initial population of $P$ chromosomes with the master processor. Each chromosome in the population is composed of possible support threshold values and membership functions for all $n$ items on level number $l$. The possible support threshold value of an item is a randomly chosen value from the range between 0 and the fraction of the transactions that contain that item.

Step 5 Calculate the fitness value of each chromosome with the following substeps:

Step 5.1 With the master processor, partition the chromosomes into $c$ clusters with the k-means clustering method, using three factors (Chen et al. 2008) as the similarity criteria: (1) the coverage factor, (2) the overlap factor, and (3) the support factor (the average support threshold value of the items in the chromosome).

Step 5.2 In each cluster, with the master processor, take the chromosome nearest to the cluster center as the representative chromosome.

Step 5.3 Distribute the representative chromosomes from the master processor to the slave processors.
Step 5.4 Perform the following substeps to calculate the requirement satisfaction $RS(C_q^l)$ for each representative chromosome with each corresponding slave processor:

Step 5.4.1 For each transaction datum $D_i$ $(1 \le i \le t)$ and for each item $I_j^l$ $(1 \le j \le n)$, transform the quantitative value $v_{ij}^l$ of $I_j^l$ into a fuzzy set $f_{ij}^l$ represented as $\left(\frac{f_{ij}^{l1}}{R_j^{l1}} + \cdots + \frac{f_{ij}^{lm}}{R_j^{lm}}\right)$ using the corresponding membership functions represented by the representative chromosome. Here $m$ is the number of linguistic terms for $I_j^l$, $R_j^{lh}$ is the $h$th term of $I_j^l$ $(1 \le h \le m)$, $v_{ij}^l$ is the amount of the $j$th group $I_j^l$ for $D_i$ at level $l$, and $f_{ij}^{lh}$ is $v_{ij}^l$'s fuzzy membership value in region $R_j^{lh}$.

Step 5.4.2 Calculate the scalar cardinality $count_j^{lh}$ of each fuzzy region $R_j^{lh}$ in the transaction data as

$$count_j^{lh} = \sum_{i=1}^{t} f_{ij}^{lh}.$$

Step 5.4.3 If the value $count_j^{lh}$ of $R_j^{lh}$ is larger than or equal to the corresponding support threshold value represented by the representative chromosome, put $R_j^{lh}$ into the large 1-itemset for level $l$ $(L_1^l)$. That is,

$$L_1^l = \left\{ R_j^{lh} \mid count_j^{lh} \ge \tau_j^l,\ R_j^{lh} \in C_1^l \right\}.$$

Step 5.4.4 For each representative chromosome, set the requirement satisfaction $RS(C_q^l)$ with the formulas in Sect. 2.3.

Step 5.5 Send the requirement satisfaction $RS(C_q^l)$ of each representative chromosome from each slave processor to the master processor.

Step 5.6 With the master processor, calculate the fitness value of each chromosome using the requirement satisfaction of its representative chromosome and its own suitability value with the formulas in Sect. 2.3.

Step 6 Generate the next population with the master processor:

Step 6.1 Select the parent chromosomes two by two using the roulette-wheel selection technique.

Step 6.2 Cross each pair using the PBX-α crossover operator.

Step 6.3 Mutate the offspring chromosomes using the one-point mutation operator.

Step 7 If the termination criterion is not satisfied, go to Step 5; otherwise, do the next step.

Step 8 According to the schema stated in Step 5.4.1–Step 5.4.3, with the master processor, generate $L_1^l$ using the support threshold values and the membership functions of all items in the best chromosome of the last generation.

Step 9 If $L_z^l$ is null, perform the next step; otherwise, go to Step 11.

Step 10 Set $l = l + 1$ with the master processor. If $l$ is larger than the last level number of the taxonomy, go to Step 14 to generate the interesting level-crossing fuzzy association rules on multiple concept levels; otherwise, go to Step 3.

Step 11 In the mining process, assign the support threshold of an itemset as the maximum of the support threshold values of the items contained in the itemset. With the master processor, generate the candidate set $C_{z+1}^l$ using the following substeps:

Step 11.1 If $z = 1$, generate the candidate set $C_{z+1}^l$ from $L_1^1, L_1^2, \ldots, L_1^l$ to find level-crossing large itemsets; otherwise, join $L_z^l$ with itself in a way similar to that in the Apriori algorithm. That is, the algorithm joins two itemsets of $L_z^l$ if $(z - 1)$ of their regions are the same and the remaining one is different. To reduce the number of candidate itemsets generated, each newly formed $(z + 1)$-itemset may be pruned in the following five cases. If the new $(z + 1)$-itemsets are not pruned in these cases, additional candidate itemsets are formed that turn out not to be large once their support is calculated. This takes up additional memory and requires more time. Additional description can be found in Lee et al. (2008):

Case 1 If any of its subsets is not large.

Case 2 If the fuzzy regions contained in it have the same item name.

Case 3 If its regions possess the hierarchical relation in the given item taxonomy.

Case 4 If the support value of any fuzzy region in the itemset is smaller than the maximum of the support threshold values of the items.

Case 5 If its count value is smaller than the maximum of the support thresholds of the items included in it.

Step 12 $C_{z+1}^l$ is a superset of $L_{z+1}^l$. With the master processor, form $L_{z+1}^l$ by the following substeps:

Step 12.1 Calculate the scalar cardinality of each $(z + 1)$-itemset $A$ with regions $(a_1, a_2, \ldots, a_{z+1})$ in $C_{z+1}^l$ in all the transaction data as

$$count_A = \sum_{i=1}^{t} f_{iA},$$

where $f_{iA}$ is the fuzzy value of $A$ in $D_i$ and is calculated as

$$f_{iA} = \min_{k=1}^{z+1} f_{i a_k},$$

where $f_{i a_k}$ is the membership value of region $a_k$ in $D_i$.

Step 12.2 Check whether the value $count_A$ is larger than or equal to the threshold $\tau_A$, which is the maximum of the support threshold values of the regions contained in it. If $A$ satisfies the threshold, put it into $L_{z+1}^l$.

Step 13 Set $z = z + 1$ with the master processor; go to Step 9.

Step 14 With the master processor, generate the interesting level-crossing fuzzy association rules with the following substeps:

Step 14.1 Generate all level-crossing fuzzy association rules on multiple concept levels for each level-crossing large $s$-itemset $A$ with regions $(a_1, a_2, \ldots, a_s)$, $s \ge 2$, in the following way:

$$X \Rightarrow Y \quad (X, Y \subset A,\ X \cap Y = \emptyset).$$

Step 14.2 Calculate the confidence values of all level-crossing fuzzy association rules with

$$\frac{\sum_{i=1}^{t} f_{iA}}{\sum_{i=1}^{t} f_{iX}}.$$

Step 14.3 Check the confidence values of all level-crossing fuzzy association rules to determine their interestingness. Rules with confidence values larger than or equal to the predefined confidence value $\lambda$ are considered interesting.

4 Comparing the proposed approach with similar fuzzy data-mining approaches

As shown in Table 1, our proposed approach is comprehensive and uses the cluster-based master–slave technique in evaluating chromosomes, in addition to including the important characteristics of the previous fuzzy data-mining approaches. Next, we provide a brief explanation regarding the features listed in Table 1, and then, we present the table.

4.1 Quantitative database

In real-world applications, transaction databases usually include quantitative attributes. Therefore, algorithms that can process quantitative databases are required to mine association rules from the databases.

4.2 Using multiple support threshold values

Using a single support threshold value for the entire database assumes that all items in the data are of the same nature and/or have similar frequencies (Dunham et al. 2001). In reality, some items may be very frequent while others may rarely appear; however, the latter may be more informative and more interesting than the former (Dunham et al. 2001). To this end, a mining algorithm should use multiple support threshold values for mining association rules from items in the database.
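How item-specific thresholds enter the mining steps of Sect. 3 (Steps 5.4.2, 5.4.3, 11, 12, and 14.2) can be sketched as follows. The region names, membership values, and thresholds are made up for illustration. The sketch also assumes that each scalar cardinality is divided by the number of transactions before it is compared with a threshold expressed as a fraction; the paper does not state this normalization explicitly.

```python
def fuzzy_support(memberships):
    """Scalar cardinality of a region divided by the number of transactions."""
    return sum(memberships) / len(memberships)


def itemset_memberships(*regions):
    """Fuzzy value of an itemset in each transaction: minimum over its regions."""
    return [min(values) for values in zip(*regions)]


def itemset_threshold(*thresholds):
    """Threshold of an itemset: maximum of the thresholds of its items."""
    return max(thresholds)


# Hypothetical membership values of two fuzzy regions in four transactions.
milk_low = [0.8, 0.2, 0.0, 1.0]
bread_high = [0.6, 0.9, 0.5, 0.0]
# Hypothetical item-specific support thresholds (fractions).
tau_milk, tau_bread = 0.3, 0.45

milk_is_large = fuzzy_support(milk_low) >= tau_milk
bread_is_large = fuzzy_support(bread_high) >= tau_bread

pair = itemset_memberships(milk_low, bread_high)
pair_support = fuzzy_support(pair)
pair_is_large = pair_support >= itemset_threshold(tau_milk, tau_bread)

# Confidence of milk.Low => bread.High: count of the pair over count of X.
pair_confidence = sum(pair) / sum(milk_low)
```

Both regions pass their own thresholds, but the pair has a support of only 0.2 and is compared with the larger of the two thresholds (0.45), so it is not large.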


Table 1 Comparing the proposed approach with similar fuzzy data-mining approaches

Features compared (columns): multiple-level fuzzy mining; using multiple support thresholds; optimizing support threshold values; optimizing membership functions; considering the user's preference; using the master–slave technique; using the cluster-based master–slave technique; processing quantitative transactional database.

Approaches compared (rows): Lee et al.'s approach (2008); Alcalá-Fdez et al.'s approach (2009); Hong et al.'s approach (2003); Hong et al.'s approach (2005); Chen et al.'s approach (2007b); Chen et al.'s approach (2011); proposed approach.

[Check marks per cell are not recoverable from the extracted layout.]

4.3 Optimization of membership functions

5 Case study

The appropriate membership functions can have a great impact on the final mining results; however, determining appropriate membership functions for all items is an expensive and time-consuming process. Consequently, appropriate membership functions should be learned using the genetic algorithm.

Using C# language, we implemented our algorithm in Microsoft Visual Studio 2010 installed on a PC equipped with an Intel Core i5 2.4 GHz (2,410 M) processor and CACHE 3.0 MB and RAM 4 GB DDR3. Our transaction dataset is derived from an actual quantitative transaction dataset of a supermarket that included 12,000 transactions. There are 24 types of items on level 2 in this transaction dataset, each of which has appeared in a multi-level form in the transactions. In each transaction in this dataset, the number of purchases of each item is mentioned beside the name of that item. In addition, an item cannot appear twice in a transaction.

4.4 Optimization of support threshold values In the proposed approach, for each item, its own support threshold value is used; however, determining the values of all appropriate support threshold values is difficult. Thus, the support threshold values need to be learned using the genetic algorithm.

5.1 Initialization of the input parameters of the algorithm

4.5 Consideration of the user’s preferences By applying this criterion, the knowledge obtained from the mining process will be influenced by the user’s preferences. 4.6 Using the cluster-based master–slave technique Using the cluster-based master–slave technique, the runtime of the algorithm can be reduced without reducing the quality of the solution.

123

The hierarchy tree of items is also taken from an actual sample. To easily understand and analyze the algorithm results, the hierarchy tree was considered only at two conceptual levels. This tree is shown in Fig. 4. In the hierarchy tree, at the first level, the name of the items are shown, and at the second level, the item brands are represented. In each transaction in the dataset, beside the name of an item, the brand is also saved. For example, regarding the item ‘‘1&1 Verjuice’’, ‘‘Verjuice’’ is the item

[Fig. 5a plot: average fitness value of level-1 chromosomes (vertical axis, 0 to 1) over successive generations; legend: number of C for clustering = 4, and number of C for clustering = - (no clustering)]

Fig. 4 The predefined taxonomy for the items

5.2 Experimental results and analysis

5.2.1 Using the cluster-based master–slave technique to reduce the runtime

In genetic algorithms, the fitness value of a chromosome represents the quality of the solution presented in that chromosome. Figure 5a, b shows the curves related to the


name, and “1&1” is the name of the company that produced the item. This item is saved in a transaction under the expression “1&1 Verjuice”.

The population size in a genetic algorithm is usually between 10 and 100 chromosomes. Here, we used a population of 40 chromosomes with 4 clusters. In addition, by setting the mutation rate to 0.01, the crossover rate to 0.8, and the a rate to 0.3, we observed that the algorithm moved toward the global optimum at an appropriate speed and with gentle, non-decreasing growth. The value of the confidence threshold (k) is a matter of taste that depends on how interesting the user requires the mined rules to be; we set it to 0.1. The coefficient of the required number of large 1-itemsets (pRNL) also varies from 0 to 1, depending on whether the user prefers fewer or more rules. Since we intend here to move the algorithm toward chromosomes that generate more rules, this coefficient was set to 0.8. According to the distribution of the quantitative values of items in the transaction dataset, the number of fuzzy regions for the items was set to 2 on level 1 and 3 on level 2.

Common criteria for terminating a genetic algorithm include “runtime of algorithm”, “a certain amount of fitness values of chromosomes”, or “a certain number of generated generations”. Here, to better analyze the quality of the chromosomes over successive generations, we used the criterion “achieving a certain number of generations” and ran 1,500 generations. Beyond this number of generations, no significant change was observed in the quality of the membership functions and support threshold values of items or in the fitness values of the chromosomes.
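The input parameters listed above can be collected in one place, as in the following Python sketch. The values are the ones quoted in the text; the class name `GASettings`, the field names, and the sanity checks in `problems` are hypothetical and are not taken from the authors' C# implementation. The meaning of the "a rate" is not explained in this section, so it is stored without any range check.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GASettings:
    """Experiment settings quoted in the text; field names are hypothetical."""
    population_size: int = 40
    num_clusters: int = 4
    mutation_rate: float = 0.01
    crossover_rate: float = 0.8
    a_rate: float = 0.3                # the "a rate" from the text, stored as given
    confidence_threshold: float = 0.1
    p_rnl: float = 0.8                 # required-number-of-large-1-itemsets coefficient
    regions_level1: int = 2            # fuzzy regions per item on level 1
    regions_level2: int = 3            # fuzzy regions per item on level 2
    generations: int = 1500            # termination criterion

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the settings look sane."""
        issues = []
        if not 10 <= self.population_size <= 100:
            issues.append("population size outside the usual 10..100 range")
        if self.num_clusters > self.population_size:
            issues.append("more clusters than chromosomes")
        for name in ("mutation_rate", "crossover_rate",
                     "confidence_threshold", "p_rnl"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                issues.append(f"{name} must lie in [0, 1]")
        if self.generations <= 0:
            issues.append("generations must be positive")
        return issues


settings = GASettings()
print(settings.problems())
```

Keeping the settings immutable (`frozen=True`) is a design choice of this sketch: a run's parameters cannot change silently between generations, and a variant experiment is created as a new object, for example `GASettings(p_rnl=0.5)`.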

[Fig. 5b plot: average fitness value of level-2 chromosomes (vertical axis, 0 to 1) versus generations (horizontal axis, 100 to 1,500); legend: number of C for clustering = 4, and number of C for clustering = - (no clustering)]

Fig. 5 a Average fitness value of level-1 chromosomes with/without using the master–slave technique in evaluating them. b Average fitness value of level-2 chromosomes with/without using the master–slave technique in evaluating them

average fitness value of chromosomes over consecutive generations for level-1 and level-2 chromosomes. The red curve corresponds to evaluating the chromosomes with the cluster-based master–slave technique, and the blue curve to evaluating them without it. Comparing the two curves, from about the 1,400th generation on they get very close and both converge to one. This means that the quality of the solution in both modes becomes nearly the same after 1,400 generations. Turning to the algorithm runtime in Table 2, the runtime with the cluster-based master–slave technique was 5.80 min, compared with 91.68 min without it, a reduction by a factor of about 15.8. Thus, we conclude that using the cluster-based master–slave approach in evaluating the chromosomes greatly decreases the algorithm runtime without reducing the quality of the solution.

5.2.2 Optimizing support threshold values and membership functions

To show how the membership functions and support threshold values of the items are reshaped and how their values change after being optimized by the algorithm, we randomly selected the item “1&1 Verjuice” from the dataset items. Figure 6a, b shows the initial membership functions and support threshold value of the item “1&1


Table 2 Relationship between the use of the cluster-based master–slave technique in evaluating chromosomes and the runtime

                                                   The number of C   Runtime (min)
With the cluster-based master–slave technique      4                 5.80
Without the cluster-based master–slave technique   -                 91.68
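The runtime comparison in Table 2 can be checked with a few lines of arithmetic. The following Python sketch uses only the two runtimes reported in the table; the variable names are hypothetical.

```python
# Runtimes in minutes, as reported in Table 2.
runtime_with_min = 5.80      # with the cluster-based master-slave technique
runtime_without_min = 91.68  # without it

# Speedup factor: how many times faster the run with the technique is.
speedup = runtime_without_min / runtime_with_min

# Fraction of the original runtime that is saved.
saved_fraction = 1 - runtime_with_min / runtime_without_min

print(f"speedup: {speedup:.1f}x")
print(f"runtime saved: {saved_fraction:.1%}")
```

The ratio of about 15.8 matches the factor quoted in the text; expressed as a saving, this is roughly 94 % of the runtime without the technique.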

Fig. 7 a Final membership functions and support threshold value of one of the items on level 1 (* Verjuice). b Final membership functions and support threshold value of one of the items on level 2 (1&1 Verjuice)

Fig. 6 a Initial membership functions and support threshold value of one of the items on level 1 (* Verjuice). b Initial membership functions and support threshold value of one of the items on level 2 (1&1 Verjuice)

Verjuice” on level 1 (* Verjuice) and level 2 (1&1 Verjuice). Figure 7a, b shows the final membership functions and support threshold value of the same item on level 1 and level 2 after the algorithm is executed. As the final membership functions of the item “1&1 Verjuice” on level 1 and level 2 show, because of the coverage and overlap factors in the denominator of the fitness function, the final membership functions are not separated from one another, and there is no unreasonable redundancy. Given the high value of pRNL, the algorithm moves toward chromosomes in which the membership functions and support threshold values of the items are optimized so that they generate additional large 1-itemsets and, thus, more rules.

5.2.3 Level-crossing fuzzy association rules

The proposed algorithm not only generates the membership functions and final support threshold values of the items


but also mines level-crossing fuzzy association rules on multiple concept levels. Table 3 lists several examples of mined level-crossing fuzzy association rules along with their confidence. Since the items in the hierarchy tree have two levels, the mined rules include items on level 1 and/or level 2. For example, the rule “*.BakingSoda.R2, *.Egg.R1 ==>> *.Milk.R1” is a fuzzy association rule that includes only items on level 1; the rule “1&1.Verjuice.R2, Dorsa.Egg.R3, Golha.BakingSoda.R1, Hedayati.Rice.R1, … ==>> Haraz.Doogh.R1” includes only items on level 2; and the rule “Dorsa.Egg.R3, *.Milk.R2, … ==>> Kosar.Honey.R2, *.Halva.R1, …” includes items on both level 1 and level 2. In these rules, the mark * beside the name of an item indicates that the data for that item on that level (the brand) are not specified. In the rules presented in Table 3, “R” followed by a digit is the fuzzy expression of the quantitative value of that item in the dataset, and the digit indicates the fuzzy region to which the value belongs. For example, in a supermarket, if a good can be purchased in a low, medium, or large amount, the purchased quantity can be represented with three fuzzy regions: R1 for a small amount, R2 for a medium amount, and R3 for a large amount. If a good is purchased only in a low or a high amount, the purchased quantity can be represented with two fuzzy regions, R1 and R2.
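The rule notation described above can be made concrete with a small parser. The following Python sketch splits a rule into its antecedent and consequent items and renders each item in words. It assumes that the rule arrow is written as `==>>`, that a `*` brand marks a level-1 item, and that the region labels follow the two-region (level 1) and three-region (level 2) setup used in the experiments; the names `FuzzyItem`, `parse_item`, and `parse_rule` are hypothetical, and the elided items ("…") of the printed rule are simply left out.

```python
from typing import List, NamedTuple, Tuple

# Linguistic labels per concept level: two regions on level 1, three on level 2.
LABELS = {
    1: {"R1": "low", "R2": "high"},
    2: {"R1": "low", "R2": "medium", "R3": "high"},
}


class FuzzyItem(NamedTuple):
    brand: str   # "*" when the brand (level-2 information) is not specified
    name: str
    region: str
    level: int

    def describe(self) -> str:
        amount = LABELS[self.level][self.region]
        if self.level == 1:
            return f"{amount} amount of {self.name}"
        return f"{amount} amount of {self.name} (brand {self.brand})"


def parse_item(token: str) -> FuzzyItem:
    """Parse 'Brand.Name.Rk' or '*.Name.Rk' into a FuzzyItem."""
    brand, name, region = (part.strip() for part in token.strip().rsplit(".", 2))
    level = 1 if brand == "*" else 2
    return FuzzyItem(brand, name, region, level)


def parse_rule(rule: str, arrow: str = "==>>") -> Tuple[List[FuzzyItem], List[FuzzyItem]]:
    """Split a rule string into antecedent and consequent item lists."""
    lhs, rhs = rule.split(arrow)
    return ([parse_item(t) for t in lhs.split(",")],
            [parse_item(t) for t in rhs.split(",")])


antecedent, consequent = parse_rule(
    "Dorsa.Egg.R3, *.Milk.R2 ==>> Kosar.Honey.R2, *.Halva.R1")
print("IF", "; ".join(i.describe() for i in antecedent))
print("THEN", "; ".join(i.describe() for i in consequent))
```

Note that the same region index can carry a different label on each level: `R2` is the high region for a level-1 item with two regions but the medium region for a level-2 item with three.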

Table 3 Mined level-crossing fuzzy association rules on multiple concept levels with the proposed algorithm

Rules                                                                                          Confidence
*.Verjuice.R1 ==>> *.BakingSoda.R2, *.Rice.R1                                                  0.25
*.BakingSoda.R2, *.Egg.R1 ==>> *.Milk.R1                                                       0.44
⋮
1&1.Verjuice.R2, Dorsa.Egg.R3, Golha.BakingSoda.R1, Hedayati.Rice.R1, … ==>> Haraz.Doogh.R1    0.82
Tarom.Rice.R2, OroumAda.Verjuice.R3, … ==>> Kaleh.Curd.R1, Golha.Soya.R1, …                    0.65
⋮
Dorsa.Egg.R3, *.Milk.R2, … ==>> Kosar.Honey.R2, *.Halva.R1, …                                  0.56

[Fig. 8 plot: average fitness value of level-1 chromosomes (vertical axis, 0 to 1) versus generations (horizontal axis, 100 to 1,500); curves: the proposed approach and Chen et al.’s approach]

Fig. 8 The average fitness value along with different numbers of generations

The rule “Dorsa.Egg.R3, *.Milk.R2, … ==>> Kosar.Honey.R2, *.Halva.R1, …”, which includes items on level 1 and level 2, states that “if someone buys a high amount of an ‘Egg’ product under the brand name ‘Dorsa’, a high amount of ‘Milk’, etc., then with 56 % confidence, he will also buy a medium amount of a ‘Honey’ product under the brand name ‘Kosar’, a low amount of a ‘Halva’ product, etc.” By processing multi-level items in transaction datasets, more accurate, important, and applicable knowledge can be mined; moreover, association rules expressed in a fuzzy representation are easier to understand.

5.3 Comparison with other approaches

In this section, we compare the experimental results of the proposed approach with those of other approaches in terms of “the solution quality”, “the number of mined level-crossing fuzzy association rules”, and “the algorithm runtime”. These experimental results were obtained on the transaction dataset previously described in this paper, and can be generalized to other quantitative transaction datasets.

5.3.1 The quality of the obtained solution

In this section, we compare the quality of the solution obtained from the chromosomes in the proposed approach with that obtained from the chromosomes in Chen et al.’s (2008) approach. We chose Chen et al.’s (2008) approach for comparison because we used the fitness function defined in their approach (Chen et al. 2008) to evaluate each chromosome. For a better comparison, in Chen et al.’s (2008) approach, we changed the fitness function so that

the horizontal asymptote of the curve of the average fitness value of chromosomes equals one. As shown in Fig. 8, from the 400th generation on, the curve of the average fitness value of chromosomes in the proposed algorithm for items on level 1 rises above the curve of Chen et al.’s (2008) algorithm. Thus, the solution obtained from the chromosomes in this paper has a higher quality than the solution obtained from the chromosomes in Chen et al.’s (2008) study. In fact, using a more appropriate crossover operator and optimal coding of the proposed algorithm, we optimized the membership functions and support threshold values of the items so that they have a better quality than those of Chen et al.’s (2008) approach. The interestingness of the rules generated by these membership functions and support threshold values leads to higher user satisfaction than with Chen et al.’s (2008) approach.

5.3.2 The number of level-crossing fuzzy association rules

Since our proposed approach is designed to mine level-crossing fuzzy association rules on multiple concept levels, it should be compared, in terms of the number of mined level-crossing fuzzy association rules, with algorithms that mine such rules on multiple concept levels. Hence, in this section, we compare our algorithm, with pRNL = 0.8, with the algorithms presented by Lee et al. (2008) and Chen et al. (2011), which assume predefined appropriate support threshold values. The curves in Fig. 9 show the number of level-crossing fuzzy association rules mined by our approach and by Lee et al.’s (2008) and Chen et al.’s (2011) approaches for various confidence threshold values. As can be seen, the curve of our approach lies above the curves of Lee et al.’s (2008) and Chen et al.’s (2011) approaches. However, the decrease in the number of mined


(2008), Hong et al. (2003, 2005), Chen et al. (2007b, 2008, 2009, 2011), Han and Fu (1995), and Lozano et al. (2004); the approach discovers level-crossing fuzzy association rules from a quantitative transaction database while learning support threshold values and membership functions using the cluster-based master–slave technique. Our comprehensive algorithm has the following properties:

[Fig. 9 plot: number of fuzzy rules (vertical axis, 0 to 800) versus minimum confidence (horizontal axis, 0.1 to 0.9)]



Fig. 9 The number of mined level-crossing fuzzy association rules along with various confidence threshold values
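The downward trend in Fig. 9 follows directly from how a confidence threshold works: a rule is kept only if its confidence reaches the threshold, so raising the threshold can never add rules. The following Python sketch illustrates this with the five confidence values listed in Table 3. These five rules are only the printed examples, not the full mined rule set, so the counts do not reproduce the curves of Fig. 9; the function name `rules_kept` is hypothetical.

```python
from typing import Dict, List

# Confidence values of the five example rules printed in Table 3.
table3_confidences = [0.25, 0.44, 0.82, 0.65, 0.56]

# Minimum-confidence values on the horizontal axis of Fig. 9.
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]


def rules_kept(confidences: List[float], cutoffs: List[float]) -> Dict[float, int]:
    """Number of rules whose confidence reaches each minimum-confidence cutoff."""
    return {t: sum(c >= t for c in confidences) for t in cutoffs}


counts = rules_kept(table3_confidences, thresholds)
for threshold, kept in counts.items():
    print(f"min confidence {threshold:.1f}: {kept} rule(s) kept")
```

For these five rules the count falls from 5 at a threshold of 0.1 to 0 at 0.9, and it never increases from one threshold to the next.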



[Fig. 10 plot: runtime in minutes (vertical axis, 0 to 100) versus percentage of transactions (horizontal axis, 0% to 100%)]

Using the genetic algorithm, our algorithm learns the membership functions and support threshold values of items on multiple concept levels.

In the mining process, our algorithm uses the IFG strategy to discover level-crossing association rules in a fuzzy representation from a quantitative transaction database.

In evaluating the chromosomes, our algorithm uses the cluster-based master–slave technique to reduce the algorithm’s runtime.

Fig. 10 The runtime along with various percentages of transactions

rules in all three approaches as the confidence threshold value increases is expected, because a higher confidence threshold retains only the more interesting and useful rules.

5.3.3 Runtime

Figure 10 shows the curves corresponding to the runtime of the proposed algorithm and Chen et al.’s (2011) algorithm for mining level-crossing fuzzy association rules from the dataset described above. We compare our algorithm with Chen et al.’s (2011) algorithm because, like our approach, it is a multiple-level genetic-fuzzy approach. As seen in this figure, as the number of transactions increases, the runtime of the proposed approach remains far lower than that of Chen et al.’s (2011) approach. In fact, with optimal coding of the proposed algorithm and the use of the cluster-based master–slave technique in evaluating the chromosomes, we reduced the runtime of our approach below that of other similar approaches.

6 Conclusion

In this paper, we proposed a comprehensive multiple-level cluster-based master–slave IFG mining approach by combining and modifying the algorithms proposed in Lee et al.


References

Aggarwal CC, Zheng S, Yu PS (1998) Online algorithms for finding profile association rules. In: Proceedings of the ACM CIKM conference, pp 86–95

Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: The 20th international conference on very large data bases, pp 487–499

Agrawal R, Imielinski T, Swami A (1993) Mining association rules between sets of items in large databases. In: ACM SIGMOD conference

Alcalá-Fdez J, Alcalá R, Gacto MJ, Herrera F (2009) Learning the membership function contexts for mining fuzzy association rules by using genetic algorithms. Fuzzy Sets Syst 160(7):905–921

Bavi O, Salehi M (2008) Genetic algorithms and optimization of composite structures. Abed and Mehregan Ghalam

Cai C, Fu A, Cheng C, Kwong W (1998) Mining association rules with weighted items. In: International database engineering and applications symposium, pp 68–77

Chen C-H, Hong T-P, Tseng VS (2006) A cluster-based fuzzy-genetic mining approach for association rules and membership functions. In: IEEE international conference on fuzzy systems, pp 1411–1416

Chen C-H, Hong T-P, Tseng VS (2007a) A modified approach to speed up genetic-fuzzy data mining with divide-and-conquer strategy. In: IEEE congress on evolutionary computation, pp 1–6

Chen C-H, Hong T-P, Tseng VS, Lee C-S (2007b) A genetic-fuzzy mining approach for items with multiple minimum supports. In: Fuzzy systems IEEE international conference, pp 1–6

Chen C-H, Hong T-P, Tseng VS (2008) A cluster-based genetic-fuzzy mining approach for items with multiple minimum supports. Advances in Knowledge Discovery and Data Mining, Lecture Notes in Computer Science 5012:864–869

Chen C-H, Hong T-P, Tseng VS (2009) An improved approach to find membership functions and multiple minimum supports in fuzzy data mining. Expert Syst Appl 36:10016–10024

Chen C-H, Hong T-P, Lee YC (2011) A multiple-level genetic-fuzzy mining algorithm. In: IEEE international conference on fuzzy systems

Dunham MH, Xiao Y, Grue L, Hossain Z (2011) A survey of association rules. Technical Report, Southern Methodist University

Hadian A, Nasiri M, Minaei-Bidgoli B (2010) Clustering based multiobjective rule mining using genetic algorithm. International Journal of Digital Content Technology and its Applications 4(1):37–42

Han J, Fu Y (1995) Discovery of multiple-level association rules from large databases. The international conference on very large databases 118:420–431

Hong T-P, Chen J-B (1999) Finding relevant attributes and membership functions. Fuzzy Sets Syst 103:389–404

Hong T-P, Kuo C-S, Chi S-C (1999) Mining association rules from quantitative data. Intell Data Anal 3(5):363–376

Hong TP, Lin KY, Chien BC (2003) Mining fuzzy multiple-level association rules from quantitative data. Appl Intell 18(1):79–90

Hong T-P, Chen C-H, Wu Y-L (2004) Using divide-and-conquer GA strategy in fuzzy data mining. In: Proceedings of ninth international symposium on computers and communications, ISCC 2004, vol 1, pp 116–121

Hong T-P, Lee YC, Wu MT (2005) Using master-slave parallel architecture for GA-fuzzy data mining. In: The 2005 IEEE international conference on systems, man, and cybernetics, pp 3232–3237

Hong T-P, Chen C-H, Wu Y-L (2006) A GA-based fuzzy mining approach to achieve a trade-off between number of rules and suitability of membership functions. Soft Comput 10(11):1091–1101

Hong T-P, Chen C-H, Tseng VS (2009) Genetic-fuzzy data mining techniques. Encyclopedia of complexity and systems science, pp 4145–4160

Houtsma MA, Swami AN (1995) Set-oriented mining for association rules in relational databases. In: The eleventh international conference on data engineering, IEEE Computer Society, pp 25–33

Kaya M (2006) Multi-objective genetic algorithm based approaches for mining optimized fuzzy association rules. Soft Comput 10:578–586

Kaya M, Alhajj R (2005) Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets Syst 152(3):587–601

Kaya M, Alhajj R (2006) Utilizing genetic algorithms to optimize membership functions for fuzzy weighted association rules mining. Appl Intell 24:7–15

Lee Y-C, Hong T-P, Lin W-Y (2004) Mining fuzzy association rules with multiple minimum supports using maximum constraints. Knowl Based Intell Inf Eng Syst 3214:1283–1290

Lee Y-C, Hong T-P, Wa T-C (2008) Multi-level fuzzy mining with multiple minimum supports. Expert Syst Appl 34:459–468

Liu B, Hsu W, Ma Y (1999) Mining association rules with multiple minimum supports. In: The fifth ACM SIGKDD international conference on knowledge discovery and data mining, pp 337–341

Lozano M, Herrera F, Krasnogor N, Molina D (2004) Real-coded memetic algorithms with crossover hill-climbing. Evol Comput 12(3):273–302

Moslehi P, Minaei B, Nasiri M, Fazel EN (2011a) Mining frequent ranges of numeric attributes via ant colony optimization for continuous domain without specifying minimum support. International Journal of Computer Science 8(5):111–116

Moslehi P, Bidgoli BM, Nasiri M, Fazel EN (2011b) Mining frequent ranges of numeric attributes via ant colony optimization for continuous domains without specifying minimum support. International Journal of Computer Science Issues 8(5):1

Nasiri M, Taghavi LS, Minaee B (2010) Multi-objective rule mining using simulated annealing algorithm. Journal of Convergence Information Technology 5(1)

Nasiri M, Taghavi LS (2011) Numeric multi-objective rule mining using simulated annealing algorithm. ijorlu 1:37–48

Qodmanan HR, Nasiri M, Minaei-Bidgoli B (2011) Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence. Expert Systems with Applications 38(1):288–298

Savasere A, Omiecinski E, Navathe SB (1995) An efficient algorithm for mining association rules in large databases. In: The 21st international conference on very large data bases, pp 432–444

Shu Yue J, Tsang E, Yeung D, Shi D (2000) Mining fuzzy association rules with weighted items. IEEE Int Conf Syst Man Cybern 3:1906–1911

Srikant R, Agrawal R (1996) Mining quantitative association rules in large relational tables. ACM SIGMOD, pp 1–12

Wang K, He Y, Han J (2000) Mining frequent itemsets using support constraints. In: The 26th international conference on very large data, pp 43–52
