FACISME: Fuzzy Associative Classification using Iterative Scaling and Maximum Entropy by Ashish Mangalampalli, Vikram Pudi

in IEEE Intl Conference on Fuzzy Systems (FUZZ-IEEE)

Barcelona, Spain

Report No: IIIT/TR/2010/37

Centre for Data Engineering, International Institute of Information Technology, Hyderabad - 500 032, INDIA. July 2010

FACISME: Fuzzy Associative Classification using Iterative Scaling and Maximum Entropy
Ashish Mangalampalli 1, Member, IEEE and Vikram Pudi 1, Member, IEEE

1 Centre for Data Engineering (CDE), International Institute of Information Technology (IIIT), Hyderabad - 500032, India. [email protected], [email protected]

Abstract. All associative classifiers developed till now are crisp in nature, and thus use sharp partitioning to transform numerical attributes to binary ones like “Income = [100K and above]”. On the other hand, the novel fuzzy associative classification algorithm called FACISME, which we propose in this paper, uses fuzzy logic to convert numerical attributes to fuzzy attributes, like “Income = High”, thus maintaining the integrity of information conveyed by such numerical attributes. Moreover, FACISME is based on maximum entropy, and uses iterative scaling, both of which lend a very strong theoretical foundation to the algorithm. Entropy is one of the best measures of information, and maximum-entropy-based algorithms do not assume independence of parameters in the classification process. Thus, FACISME provides very good accuracy, and can work with all types of datasets (irrespective of size and type of attributes – numerical or binary) and domains.

I. INTRODUCTION

Recent studies in classification have proposed ways to exploit the paradigm of association rule mining for the problem of classification. A new classification approach called associative classification has gained a lot of popularity of late because of its accuracy, which can be attributed to its ability to mine huge amounts of data in order to build a classifier. It integrates association rule mining (ARM) [1] and classification by using ARM algorithms, such as Apriori [1]. These methods mine high-quality association rules and build classifiers based on them. Associative classifiers have several advantages:

• Frequent itemsets capture all the dominant relationships between items in a dataset.
• Efficient itemset mining algorithms exist.
• As these classifiers deal only with statistically significant associations, the classification framework is robust.

However, there are two major problems with associative classification as it stands right now:

• Most existing studies in the associative classification paradigm either do not provide any theoretical justification behind their approaches, or assume independence between some parameters in the domain [2], [3].
• Crisp associative classification is not suitable, both theoretically and practically, for datasets and domains which make heavy use of numerical attributes, or have a mix of numerical and binary attributes.

Thus, we have come up with a new fuzzy associative classification algorithm called FACISME. It uses maximum entropy and iterative scaling to build a theoretically sound classifier which is meant for accurate and efficient performance on any kind of dataset (irrespective of size and type of attributes – numerical or binary). The maximum entropy principle is well accepted in the statistics community. It states that, given a collection of known facts about a probability distribution, we should choose a model for this distribution that is consistent with all the facts but otherwise is as uniform as possible. Hence, the chosen model for FACISME does not assume any independence between its parameters that is not reflected in the given facts. In this paper, we use the Generalized Iterative Scaling (GIS) algorithm [14] to compute the maximum entropy model.

Crisp associative classification algorithms can mine only binary attributes, and require that any numerical attributes be converted to binary attributes. Until now, the general method for this conversion has been to use sharp partitions, and to fit the values of the numerical attribute into these sharp partitions. Assume we have three such sharp partitions for the attribute Income, namely up to 20K, 20K-100K, and 100K and above. Income = 50K would fit in the second partition, but so would Income = 99K. Thus, using sharp partitions:

• Introduces uncertainty, especially at the boundaries of sharp partitions, leading to loss of information.
• Moreover, small changes in the selection of intervals may lead to very different results, which can be misleading.
• The intervals also do not generally have clear semantics associated with them.

A better way to solve this problem is to have attribute values belong to such partitions or sets with some membership value in the interval [0, 1], instead of belonging totally or not at all to a particular partition or range, and to have transactions with a given attribute represented to a certain extent (in the range [0, 1]). In this way, crisp binary attributes are replaced by fuzzy ones [4]. Thus, we need to use fuzzy methods, by which quantitative values for numerical attributes are converted to fuzzy binary values [5], [6] instead of crisp ranges. The use of fuzzy methods depends on the nature of the dataset, because in a few cases (like datasets with all binary attributes) fuzzy methods may not be required.

Generation of fuzzy frequent itemsets using an appropriate fuzzy ARM algorithm, and then building a fuzzy associative classifier using these itemsets, is not a straightforward process. First, we need to convert the crisp dataset, containing crisp binary and numerical attributes, into a fuzzy dataset, containing crisp binary and fuzzy binary attributes. This involves substantial pre-processing using appropriate techniques, like the one described in [7], [8]. Conversion of numerical attributes to fuzzy binary attributes is a part of such pre-processing. Second, crisp ARM and associative classification algorithms calculate the frequency of an itemset just by looking for its presence or absence in a transaction of the dataset, but fuzzy ARM and associative classification algorithms need to take into account the fuzzy membership of an itemset in a particular transaction of the dataset, in addition to its presence or absence. Thus, given a dataset, using FACISME to build a fuzzy classifier based on this dataset is a very complex and involved process, far removed in many major as well as minor aspects from the process of building a crisp classifier, based on the same dataset, using a conventional associative classification algorithm.

Our main contributions in this work are as follows:

• We have developed FACISME, which integrates maximum-entropy-based associative classification with fuzzy logic.



• Because of the use of maximum entropy, FACISME has a very strong theoretical foundation, and does not assume independence of parameters in the classification process, thus providing very good accuracy. Moreover, maximum entropy models are interesting because of their ability to combine many different kinds of information.



• And, this accuracy is easily extensible over any kind of datasets (irrespective of size and type of attributes – numerical or binary) and domains, through the use of fuzzy logic, by creating a fuzzy associative classifier.

Section 2 describes maximum entropy models, and section 3 describes the fuzzy pre-processing technique and fuzzy parameters used in FACISME. The actual FACISME algorithm is explained in detail in section 4. In section 5 we give an overview of the related work. And, in section 6 we describe our experimental results, before concluding in section 7.

II. MAXIMUM ENTROPY MODELS

Let K = {k1, k2, …, kn} be the set of all items that can appear in a transaction. Conditional maximum entropy models (the term “conditional” is generally dropped in normal usage) are of the form [9]:



Pλ(y|x) = exp(Σi λi fi(x, y)) / Σy' exp(Σi λi fi(x, y'))        (1)

where X = {x1, x2, …, x(2^n − 1)} is an input vector representing the set of all possible transactions, with each transaction containing one or more items from K; y is an output class such that y ∈ Y = {y1, y2, …, yl}, the set of all l classes; fi are the indicator functions or feature values that are true if a particular property of (x, y) is true; and λi is a weight for the indicator fi.

Let S = {s1, s2, …, s|S|} be the union of the fuzzy frequent itemsets extracted from each class y. The fuzzy training dataset E (derived from the original training dataset D) is a set of transactions, each of which is labeled with one of the l classes. This transformation of the original crisp dataset to its fuzzy version is done through the fuzzy pre-processing described in section 3.

Maximum entropy models have several valuable properties. The most important is constraint satisfaction. For a given fi, we can count how many times fi was observed in the data:

observed[i] = Σj fi(xj, yj)        (2)

For a model Pλ' with parameters λ', we can also compute how many times the model expects fi to occur:

expected[i] = Σj,y Pλ'(y|xj) fi(xj, y)        (3)
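To make eqs. 1-3 concrete, the following minimal Python sketch (illustrative only; the feature set, λ weights, and toy transactions are hypothetical, not taken from the paper) evaluates Pλ(y|x) and the observed and expected counts of each indicator function.

import math

# Hypothetical indicator features: fi(x, y) = 1 if itemset i is contained in x and y equals the class tied to i
features = [({"a"}, "c1"), ({"a", "b"}, "c2")]
lambdas = [0.7, 1.2]                       # hypothetical weights λi
classes = ["c1", "c2"]

def f(i, x, y):
    itemset, cls = features[i]
    return 1.0 if itemset <= x and y == cls else 0.0

def p(y, x):
    # eq. 1: Pλ(y|x) = exp(Σi λi fi(x, y)) / Σy' exp(Σi λi fi(x, y'))
    score = lambda yy: math.exp(sum(l * f(i, x, yy) for i, l in enumerate(lambdas)))
    return score(y) / sum(score(yy) for yy in classes)

# Toy labelled transactions (xj, yj)
data = [({"a"}, "c1"), ({"a", "b"}, "c2"), ({"b"}, "c1")]
observed = [sum(f(i, xj, yj) for xj, yj in data) for i in range(len(features))]     # eq. 2
expected = [sum(p(y, xj) * f(i, xj, y) for xj, _ in data for y in classes)
            for i in range(len(features))]                                          # eq. 3
print(observed, expected)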

The itemsets in S are parameters to model each class, such that each fuzzy itemset s together with (expected[i] | y) in class y forms a constraint that needs to be satisfied by the statistical model for that particular class. Thus, for each class y we have a set of constraints C[y] = {sq | sq ∈ S}. The probability distribution that we build for class y must satisfy these constraints. However, there could be multiple distributions satisfying these constraints, and we use the maximum entropy principle [9] to select the distribution with the highest entropy.

Maximum entropy models have the property that expected[i] = observed[i] for all constraints i. An additional property is that, of models in the form of eq. 1, the maximum entropy model maximizes the probability of the training data. Yet another property is that maximum entropy models are as close as possible to the uniform distribution, subject to constraint satisfaction.

III. FUZZY PRE-PROCESSING AND FUZZY MEASURES

In this section, we describe the fuzzy pre-processing methodology and fuzzy measures that are used for the actual fuzzy ARM and fuzzy associative classification processes.

The assumption made in mining association rules and in building associative classifiers is that attributes are categorical. But that is rarely the case, as many attributes are quantitative. To model such a scenario with crisp methods, we use discrete ranges (e.g. up to 25, 25-60, 60 and above), and try to fit the values of the numerical attribute Age into these ranges. For example, Age = 35 would fit in the range 25-60, but so would Age = 59. Thus, using ranges introduces uncertainty, especially at the boundaries of ranges, leading to loss of information.

The alternative is to use fuzzy partitions such as Young, Middle-aged and Old, and then ascertain the fuzzy membership µ (range [0, 1]) of each crisp numerical value in these fuzzy partitions. Thus, Age = 35 may have µ = 0.6 for the fuzzy partition Middle-aged, µ = 0.3 for Young, and µ = 0.1 for Old. And Age = 59 may have µ = 0.3 for Middle-aged, µ = 0.1 for Young, and µ = 0.3 for Old. Thus, by using fuzzy partitions, we preserve the information encapsulated in the numerical attribute, and are also able to convert it to a categorical attribute, albeit a fuzzy one. Therefore, many fuzzy sets can be defined on the domain of each quantitative attribute, and the original dataset is transformed into an extended one with attribute values having fuzzy memberships in the interval [0, 1].

Each membership function µ can be constructed manually by an expert in that domain. This is an expert-driven approach (fig. 1). Unfortunately, most real-life datasets are very large (with thousands and sometimes even millions of records) and can contain many quantitative attributes. Thus, it is humanly impossible for an expert to create fuzzy partitions for each attribute and then convert each crisp numeric value to a fuzzy value using these fuzzy partitions.
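For illustration, an expert-defined partition of the kind shown in fig. 1 can be written as a piecewise-linear membership function; the sketch below is a hypothetical hand-crafted partitioning of Age (the breakpoints are assumptions, not values from the paper).

def triangular(a, b, c):
    # Piecewise-linear membership rising from a to b and falling from b to c
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

# Hypothetical expert-chosen breakpoints for the attribute Age
young = lambda x: 1.0 if x <= 20 else max(0.0, (40 - x) / 20.0)   # left shoulder
middle_aged = triangular(25, 42, 60)
old = lambda x: 1.0 if x >= 70 else max(0.0, (x - 50) / 20.0)     # right shoulder
print(young(35), round(middle_aged(35), 2), old(35))              # memberships of Age = 35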

Fig. 1. Fuzzy partitions (piecewise linear) created using an expert-driven approach.

The alternative is to automate the creation of the fuzzy partitions, and to do this fuzzy clustering can be used. We use a data-driven pre-processing approach which automates the creation of fuzzy partitions for numerical attributes [10], and the subsequent conversion of a crisp dataset D to a fuzzy dataset E. This approach requires very minimal manual intervention even for very huge datasets. Moreover, with an appropriate value of the fuzziness parameter m in eq. 4 (such as m ≈ 2), we can get fuzzy partitions which are very close to linear, as illustrated in fig. 2.

Fig. 2. Fuzzy partition (Gaussian-like) created using a data-driven approach.

The fuzzy sets in most real-life datasets are Gaussian-like (as opposed to triangular or trapezoidal) due to the varied and heterogeneous nature of the datasets, and our pre-processing technique is able to generate such Gaussian-like fuzzy datasets from any real-life dataset. Numerical data present in most real-life datasets translate into Gaussian-like fuzzy sets, wherein a particular data point can belong to two or more fuzzy sets simultaneously. This simultaneous membership of a data point in two or more fuzzy sets can affect the quality and accuracy of the fuzzy association rules, and of the fuzzy classifier generated using these data points and fuzzy sets.

A. Pre-processing Methodology

This pre-processing approach consists of two phases:

• Generation of fuzzy partitions for numerical attributes.
• Conversion of the crisp dataset into a fuzzy dataset using a standard way of fuzzy data representation.

As part of pre-processing, we have used fuzzy c-means (FCM) clustering [11] in order to create fuzzy partitions from the dataset, such that every data point belongs to every cluster to a certain degree µ in the range [0, 1]. The algorithm tries to minimize the objective function:

Jm = Σi Σj µij^m ||xi − cj||²        (4)

where m is any real number such that 1 < m < ∞, µij is the degree of membership of xi in the cluster j, xi is the ith d-dimensional measured data point, cj is the d-dimensional center of the cluster, and ||*|| is any norm expressing the similarity between any measured data point and the center. The fuzziness parameter m is an arbitrary real number (m > 1).

In the first phase, we apply one-dimensional FCM clustering on each of the numeric attributes to obtain the corresponding fuzzy partitions, with each value of any numeric attribute being uniquely identified by its membership function µ in these fuzzy partitions. One needs to select an appropriate value of k (the number of one-dimensional clusters), and then label the resulting clusters according to the nature of the attribute.

In the second phase, if an attribute is quantitative, then the pre-processing methodology converts each crisp record in the dataset D to multiple fuzzy records based on the number of fuzzy partitions defined for the attribute. Doing so has the potential of generating fuzzy records in a combinatorially explosive manner. To deal with this problem, we have fixed a lower threshold for the membership function µ (= 0.3 most of the time) to keep the number of fuzzy records generated under control. If an attribute is a binary attribute, then we output each record appended with a membership function µ = 1, indicating this value has full membership in the binary attribute. The final fuzzy dataset E is used to extract fuzzy frequent itemsets and to build the fuzzy associative classifier as described in section 4.
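The sketch below is a minimal, self-contained Python illustration of the first phase (one-dimensional FCM under the objective of eq. 4) and of the µ ≥ 0.3 filtering used in the second phase; the function name, the Age values, and the partition labels are hypothetical.

import numpy as np

def fcm_1d(values, k=3, m=2.0, n_iter=100, tol=1e-5, seed=0):
    # Minimal one-dimensional fuzzy c-means: returns the sorted cluster centres and
    # the membership of every value in each fuzzy partition (each row sums to 1).
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float).reshape(-1, 1)        # N x 1
    u = rng.random((len(x), k))
    u /= u.sum(axis=1, keepdims=True)                         # random initial memberships
    for _ in range(n_iter):
        um = u ** m
        centres = (um.T @ x).ravel() / um.sum(axis=0)         # weighted cluster means
        dist = np.abs(x - centres) + 1e-12                    # N x k distances
        inv = dist ** (-2.0 / (m - 1.0))
        u_new = inv / inv.sum(axis=1, keepdims=True)          # updated memberships
        if np.abs(u_new - u).max() < tol:
            u = u_new
            break
        u = u_new
    order = np.argsort(centres)
    return centres[order], u[:, order]

# Hypothetical Age values and three partitions (labels assigned after sorting centres)
ages = [22, 35, 47, 59, 63, 71, 18, 40]
centres, memberships = fcm_1d(ages, k=3)
labels = ("Young", "Middle-aged", "Old")
# Second phase (sketch): keep only the partitions whose membership exceeds the 0.3 threshold
print({name: round(float(mu), 2) for name, mu in zip(labels, memberships[1]) if mu >= 0.3})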

B. Fuzzy Association Rules, Fuzzy Associative Classification, and Fuzzy Measures

During the fuzzy ARM and fuzzy associative classification process, a number of fuzzy partitions are defined on each quantitative attribute, as a result of which the original dataset is transformed into an extended one with attribute values in the interval [0, 1]. In order to process this extended (fuzzy) dataset, we need new measures (analogous to crisp support and confidence), which are defined in terms of t-norms [10], [12], [13]. The generation of fuzzy association rules is directly impacted by the fuzzy measures we use.

Equations 5 and 6 respectively define a t-norm and the cardinality of a fuzzy set in a finite universe D [10], [12]. Fuzzy sets A and B in D are mapped as D → [0, 1], with A(x) and B(x) being the degrees to which attributes A and B are present in a transaction x respectively. Thus, using fuzzy partitions A and B and a t-norm, we can define fuzzy support (eq. 7) and fuzzy confidence (eq. 8) [10], [12]. The more generally used t-norms are listed in Table I, with the TM t-norm being the most popular one; we too use this t-norm in our algorithm.

A ∩T B(x) = T(A(x), B(x))        (5)

| A | = Σ(x ∈ D) A(x)        (6)

supp(A ⟹ B) = Σ(x ∈ D) (A ∩T B)(x) / |D|        (7)

conf(A ⟹ B) = Σ(x ∈ D) (A ∩T B)(x) / Σ(x ∈ D) A(x)        (8)

Table I. t-norms in fuzzy sets
TM(x, y) = min(x, y)
TP(x, y) = xy
TW(x, y) = max(x + y − 1, 0)
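As an illustration of eqs. 5-8 with the TM t-norm, the following sketch computes fuzzy support and confidence over a toy fuzzy dataset (the attribute names and membership degrees are hypothetical).

# Hypothetical fuzzy dataset: each transaction maps a fuzzy attribute to its membership degree
transactions = [
    {"Income=High": 0.8, "Age=Middle-aged": 0.6},
    {"Income=High": 0.3, "Age=Middle-aged": 0.9},
    {"Income=High": 0.0, "Age=Middle-aged": 0.4},
]

def t_min(a, b):                            # the TM t-norm from Table I
    return min(a, b)

def fuzzy_support(antecedent, consequent, data, tnorm=t_min):
    # eq. 7: sum of t-norm memberships, normalised by the dataset size
    inter = sum(tnorm(t.get(antecedent, 0.0), t.get(consequent, 0.0)) for t in data)
    return inter / len(data)

def fuzzy_confidence(antecedent, consequent, data, tnorm=t_min):
    # eq. 8: t-norm sum divided by the total membership of the antecedent
    inter = sum(tnorm(t.get(antecedent, 0.0), t.get(consequent, 0.0)) for t in data)
    ante = sum(t.get(antecedent, 0.0) for t in data)
    return inter / ante if ante else 0.0

print(fuzzy_support("Income=High", "Age=Middle-aged", transactions))     # 0.3
print(fuzzy_confidence("Income=High", "Age=Middle-aged", transactions))  # ~0.818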

IV. FACISME AND FUZZY ASSOCIATIVE CLASSIFICATION

In this section we describe how FACISME is used to build a fuzzy associative classifier, and how this classifier is used for actual classification. Before the classifier training can ensue, we need to extract the fuzzy frequent itemsets from the fuzzy dataset E. This can be done using any popular fuzzy ARM algorithm, most of which are fuzzy adaptations of Apriori. In this case, we use the fuzzy ARM algorithm that we developed in [8], which is meant for fast and efficient generation of fuzzy frequent itemsets and fuzzy association rules.

In FACISME, the training phase involves finding the set of constraints, i.e. S (frequent itemsets extracted using fuzzy ARM), and computing λ values for all the classes. These λ values indicate the weights of the fuzzy frequent itemsets in each class y. The computed λ values for each class are stored, and are used in the actual classification phase. We use the classical GIS algorithm [14] to deal with maximum entropy models. In GIS, at each iteration, a step is taken in a direction that increases the likelihood of the training data, with the step size being neither too large nor too small. The likelihood of the training data increases at each iteration and eventually converges to the global optimum. GIS converges such that, for each fi, expected[i] = observed[i]; whenever they are not equal, we can move them closer. To avoid very small probability and likelihood values, GIS is generally used in its logarithmic form. In this form, we add log(observed[i]/expected[i]) to λi, but scaled by a slowing factor f# (eq. 9), equal to the largest total indicator value:

f# = maxj,y Σi fi(xj, y)        (9)

δi = log(observed[i] / expected[i]) / f#        (10)

λi += δi        (11)

At each iteration, GIS computes the update δi (eq. 10), which is then added to λi (eq. 11). The algorithm stops when there is no significant change in the λi values. This solution is globally optimal, as it has been proved that the search space of λi values over which we are searching for the final solution is concave, leading to the conclusion that every locally optimal solution is globally optimal [14].

A. Working of FACISME

Equations 12, 13, and 14 are actively used by FACISME in determining the final λi values (in each iteration) in conjunction with equations 9, 10, and 11.

expected[i] += (fi(xj, y) × e^s[j, y]) / z        (12)

z = Σy e^s[j, y]        (13)

s[j, y] += λi × fi(xj, y)        (14)

The resultant pseudo-code for a single iteration of FACISME is shown in fig. 3, and is described as follows. First, it initializes all λi values to 1 and all expected values to 0 (fig. 3, lines 1-3). Then, iterating over all possible transactions that can be derived by using items present in E, and for each class y, we calculate s[j, y], taking into account the λi values from the previous iteration (lines 4-10). The presence of each frequent (constraint) itemset i ∈ S in a particular transaction xj is indicated by fi(xj, y) = 1. Based on the s[j, y] and z values calculated for each class y, we calculate the expected[i] values (lines 11-16). If the current iteration is the first iteration, line 16 of fig. 3 is invoked to calculate expected[i]; otherwise, eq. 12 (line 14) is used. Last, for each frequent (constraint) itemset i ∈ S, we calculate δi and λi (lines 17-19). This λi is used in the next iteration. We continue iterating over this procedure until expected[i] ≅ observed[i] for each i, i.e. till convergence is achieved.


1)  if first iteration
2)      expected[0...F] = 0
3)      λi = 1; where i = 1 to |S|
4)  for each training instance j   // all possible transactions using items in E
5)      if not first iteration
6)          for each output class y
7)              s[j, y] = 0
8)              for each i such that fi(xj, y) ≠ 0
9)                  s[j, y] += λi × fi(xj, y)
10)         z = Σy e^s[j, y]
11)     for each output class y
12)         for each i such that fi(xj, y) ≠ 0
13)             if not first iteration
14)                 expected[i] += (fi(xj, y) × e^s[j, y]) / z
15)             else if first iteration
16)                 expected[i] += 1 / |X|
17) for each i
18)     δi = log(observed[i]/expected[i]) / f#
19)     λi += δi

Fig. 3. Pseudo-code for one iteration of FACISME
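The following Python sketch mirrors the GIS updates of eqs. 9-14 and fig. 3 (it simplifies fig. 3 by recomputing expected[i] directly in every iteration rather than using the first-iteration shortcut of line 16; the function name and the toy constraint counts in the usage example are hypothetical, and the counts are fractional because fuzzy memberships are summed).

import math

def gis_train(F, classes, observed, n_transactions, n_iter=50, tol=1e-4):
    # F[(j, y)] is a dict {i: fi(xj, y)} holding the non-zero indicator values
    # for transaction j and class y; observed[i] are the counts of eq. 2.
    n_feat = len(observed)
    # slowing factor f# of eq. 9: largest total indicator value over all (j, y)
    f_sharp = max(sum(fs.values()) for fs in F.values())
    lam = [1.0] * n_feat                                    # fig. 3, line 3
    for _ in range(n_iter):
        expected = [0.0] * n_feat
        for j in range(n_transactions):
            s = {y: sum(lam[i] * v for i, v in F.get((j, y), {}).items())
                 for y in classes}                          # eq. 14
            z = sum(math.exp(s[y]) for y in classes)        # eq. 13
            for y in classes:
                for i, v in F.get((j, y), {}).items():
                    expected[i] += v * math.exp(s[y]) / z   # eq. 12
        deltas = [math.log(observed[i] / expected[i]) / f_sharp
                  for i in range(n_feat)]                   # eq. 10
        lam = [lam[i] + d for i, d in enumerate(deltas)]    # eq. 11
        if max(abs(d) for d in deltas) < tol:               # λi values stable
            break
    return lam

# Hypothetical toy run: 2 transactions, 2 classes, 2 constraint itemsets
F = {(0, "c1"): {0: 1.0}, (0, "c2"): {1: 1.0},
     (1, "c1"): {0: 1.0, 1: 1.0}, (1, "c2"): {}}
print(gis_train(F, ["c1", "c2"], observed=[1.4, 0.8], n_transactions=2))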

B. Actual Classification using FACISME

Before a given crisp transaction x, containing crisp items, can be classified using a classifier trained by FACISME, we need to run the same pre-processing steps (described in section 3) that we used on the original training dataset before training. We use the fuzzy partitions obtained from pre-processing to transform the crisp attributes present in x to fuzzy attributes. This transformation process leads to the generation of one or more fuzzy transactions based on the number of fuzzy partitions generated for each crisp numerical attribute. The fuzzy transactions are represented by the set X' = {x1', x2', …, xr'}.

Our objective is to find the class which best classifies the set X'. For a fuzzy transaction x' ∈ X', we first extract all the frequent itemsets ∈ S that are subsets of x'. These itemsets are the features of x'. Then, we compute the entropy (equations 15 and 16) for each class y.

entropy = Σi (−log pi) × pi        (15)

entropy = Σi e^λi × µ × fuzzy_support        (16)

Eq. 15 is the standard equation for entropy. This equation has been transformed into eq. 16 in order to handle entropy in the fuzzy associative classification context: (−log pi) in eq. 15 is equivalent to e^λi in eq. 16 (recall that λi is in logarithmic form), and pi in eq. 15 is equivalent to µ × fuzzy_support. Because FACISME involves fuzzy logic, the fuzzy membership value µ of each fuzzy transaction has to be taken into consideration. Likewise, the fuzzy support (as described in section 3.B) of each frequent itemset (constraint) also needs to be used.

To calculate the entropy, first we find the fuzzy membership µ of the current fuzzy transaction using a suitable t-norm (Table I); in this case, we use the TM t-norm. Next, for every ith itemset ∈ S which is present in the current fuzzy transaction, we extract its fuzzy support (calculated during the fuzzy ARM process) and its λi value (calculated during the training phase of FACISME). Using these three values, we calculate the entropy for each fuzzy transaction x' ∈ X', and determine the best class for each x' based on normalized values of entropy. The overall best class for the set X' is determined by the normalized values of the number of times each class y is best among the fuzzy transactions ∈ X'. We also maintain the total entropy for each class y, so that, based on the normalized total entropy, we can select the best class for the whole set X' if more than one class turns out to be best in the same number of fuzzy transactions ∈ X'.
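As an illustration of eq. 16, the sketch below scores one fuzzy transaction against each class; the per-class itemsets, fuzzy supports, λ values, and the membership µ (assumed already computed with the TM t-norm as described above) are hypothetical placeholders.

import math

# Hypothetical per-class model: frequent itemsets mapped to (fuzzy_support, λi)
model = {
    "c1": {frozenset({"Income=High"}): (0.40, 0.9),
           frozenset({"Income=High", "Age=Middle-aged"}): (0.25, 1.4)},
    "c2": {frozenset({"Income=High"}): (0.10, 0.2)},
}

def class_scores(fuzzy_transaction, mu, model):
    # eq. 16: entropy-style score of one fuzzy transaction x' for each class y
    items = frozenset(fuzzy_transaction)
    scores = {}
    for y, itemsets in model.items():
        total = 0.0
        for s, (fuzzy_support, lam) in itemsets.items():
            if s <= items:                                   # itemset i is a subset of x'
                total += math.exp(lam) * mu * fuzzy_support  # e^λi × µ × fuzzy_support
        scores[y] = total
    return scores

print(class_scores({"Income=High", "Age=Middle-aged"}, mu=0.6, model=model))
# the best class for this x' is the one with the highest normalized score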

C. Space and Time Complexities of FACISME

For GIS, the set of all possible transactions X is stored as a sparse matrix of all non-zero indicator functions for each instance j and output y. GIS requires X, the λ values of size |S|, as well as the expected and observed arrays, which are also of size |S|. Thus, GIS requires space O(|X| + |S|), and since |X| must be at least as large as |S|, this is O(|X|).

Regarding time complexity, every time eqs. 10 and 11 are applied (fig. 3, lines 17-19), expected[i] has to be re-calculated. This step takes O(|X|), and executing it for each i takes O(|X| × |S|). If the algorithm requires m iterations for the distribution to converge, the time complexity of GIS is O(m × |X| × |S|).

D. Non-closed Itemsets and FACISME

The maximum entropy model as applied to fuzzy associative classification fails in some cases when a frequent itemset, present in the constraint set S, is “not closed” [15]. An itemset su is not closed if and only if it has the same frequency as one of its supersets, i.e. expected[su] ≠ 0, and ∃ sv ∈ S s.t. su ⊂ sv ∧ expected[su] = expected[sv]. Thus, a major disadvantage of the maximum entropy model is that there is no solution when the system of constraints contains non-closed itemsets. Hence, in cases when the system of constraints has non-closed constraints, the exact solution does not exist, and the model parameters will not converge under the GIS algorithm. This is elaborated in [16]. Hence, a modified form of the maximum entropy model is used which can accommodate non-closed itemsets. Let S' be the set of closed constraints in S. The non-closed constraints in S are only used to determine whether expected values are 0 or not. Thus, only those constraints ∈ S' are used for the actual calculation and final classifier building, as their expected values are > 0. Multiple iterations of GIS are then run using S' until convergence is achieved.
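A simple way to flag the non-closed constraints described above is to check each itemset against its supersets for equal (expected) frequency; the sketch below is illustrative only, with hypothetical mined frequencies.

def non_closed(itemset_freqs):
    # Flags itemsets that have the same frequency as one of their proper
    # supersets, i.e. the "not closed" constraints described above.
    flagged = []
    for s, freq in itemset_freqs.items():
        for t, tfreq in itemset_freqs.items():
            if s < t and freq == tfreq:      # s is a proper subset of t
                flagged.append(s)
                break
    return flagged

# Hypothetical mined (fuzzy) frequencies, keyed by frozensets of items
freqs = {frozenset({"a"}): 4.0,
         frozenset({"a", "b"}): 4.0,         # same frequency as {a}, so {a} is not closed
         frozenset({"b"}): 6.0}
S_prime = [s for s in freqs if s not in non_closed(freqs)]   # closed constraints S'
print(S_prime)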

V. RELATED WORK

CBA [17] was one of the first associative classifiers, and focuses on mining a special subset of association rules, called class association rules (CARs). CMAR [18] considers multiple rules at the time of classification using a weighted χ2 measure. CPAR [19] takes a totally different approach to associative classification in that it does not directly use the rules generated by ARM, but only uses the frequent itemsets and their respective counts to build a classifier using a FOIL-like technique called PRM. Even before the advent of associative classifiers, many non-associative classifiers were proposed, prominent among which are C4.5, a widely known decision tree classifier [22], and Ripper [23]. The principle of maximum entropy has been widely used for a variety of natural language tasks [20]. ACME [16] uses it for binary data classification with mined frequent itemsets as constraints, discusses the failure of the maximum entropy model to accommodate non-closed constraints, and proposes solutions to this problem. In fact, closed itemsets were first defined by Zaki in [15].

Thabtah [21] provides a very detailed survey of current associative classification algorithms. The author compares most of the well-known algorithms, like CBA, CMAR, and CPAR among others, highlighting their pros and cons, and describing briefly how exactly they work. He lays more emphasis on pruning techniques, rule ranking, and prediction methods used in various classifiers, and also provides valuable insights into the future directions which should be undertaken in associative classification. In [6] Hüllermeier provides a detailed overview of fuzzy methods as applied to data mining. De Cock et al. [12] and Chen et al. [5] discuss in detail fuzzy implicators and t-norms for discovering fuzzy association rules, and also talk in depth about new measures of rule quality which would help the overall fuzzy ARM process. In fact, Dubois et al. [13] make a very detailed analysis of t-norms and implicators with respect to fuzzy partitions, and provide a firm theoretical basis for their conclusions. Verlinde et al. [10] describe very briefly a pre-processing technique to obtain fuzzy attributes from numerical attributes using FCM [11]. FR3 [24] is a fuzzy extension of RIPPER (as the base learner in a round-robin scheme) and of the recently introduced R3 learner. FURIA [25] extends the well-known RIPPER algorithm, and learns fuzzy rules instead of conventional rules and unordered rule sets instead of rule lists. The structural learning algorithm in vague environment (SLAVE) makes use of genetic algorithms to learn a fuzzy classifier [26], [27].

VI. PERFORMANCE STUDY

In this section, we describe the performance study that has been used for testing FACISME on three widely used, disparate UCI Machine Learning (ML) datasets, namely iris, breast, and pima. These three datasets differ from each other on various facets (like number of attributes, type of attributes – numerical or binary, size of dataset, and type of dataset – dense or sparse). This makes any experimental analysis performed simultaneously on all three datasets reliable, and ensures that it covers a wide range of aspects on which classification accuracy and performance can be tested. The performance of FACISME has been compared with that of other state-of-the-art classifiers, and the results of the

same are detailed in section 6.A. Specifically, the classifiers used for comparisons are:

• Classification based on Predictive Association Rules (CPAR) [19]
• Classification based on Multiple Rules (CMAR) [18]
• Classification based on Associations (CBA) [17]
• C4.5 [22]
• Ripper [23]
• Fuzzy Round-Robin Ripper (FR3) [24]
• Fuzzy Unordered Rule Induction Algorithm (FURIA) [25]
• Structural Learning Algorithm in a Vague Environment (SLAVE) [26], [27]

For the experiments on FACISME, we downloaded the raw datasets from the UCI ML website. These datasets then underwent the fuzzy pre-processing described in section 3 before being used to generate fuzzy frequent itemsets. To generate the fuzzy frequent itemsets used by FACISME, we used the fuzzy ARM algorithm described in [8]. The minimum support used for the fuzzy ARM process in order to build the fuzzy classifier was 0.2 for all three datasets. All the experiments were performed on an AMD 2600+ Sempron PC with 512 MB main memory. All the approaches have been implemented by their authors, with the parameters that they used for testing; for CPAR, the best five rules are used for prediction. In all the experiments (of FACISME and the other associative classification algorithms), accuracy is measured using 10-fold cross-validation. We implemented FACISME using Java on Windows XP. The results of the remaining algorithms (i.e. excepting FACISME) are taken from [16], [18], [24], [25].

A. Experimental Results

As mentioned above, three standard UCI-ML datasets have been used to illustrate the efficacy of FACISME in terms of accuracy and simplicity of use. Figures 4, 5, and 6 illustrate the accuracies obtained by the various classifiers on each of the three UCI-ML datasets. Fig. 4 shows the accuracy obtained by each classifier on the iris dataset: FACISME performs the best in terms of accuracy as compared to the other classifiers. Fig. 5 shows the accuracies of all the classifiers on the breast (breast cancer) dataset; FACISME performs nearly as well as the most accurate classifier for this dataset. Finally, fig. 6 depicts the results of the experimental analysis on the pima dataset; here too, FACISME performs the best as compared to the other classifiers. Thus, the basic inference from this experimental analysis is that FACISME consistently performs very well in terms of accuracy, and is even the best in two cases.

Each of the classifiers (excepting FACISME) used in this experimental analysis is crisp in nature, and thus uses sharp partitioning to convert numerical attributes to binary attributes. Normally 5-6 sharp partitions are used, but these can extend up to 10 or more depending on the nature of the

numerical attribute. For example, the attribute Age, which is generally in the range 0-100, can be divided into six sharp partitions, namely 0-15, 16-30, 31-45, 46-60, 61-75, and 76 and above. In the case of FACISME, fuzzy sets are used instead of sharp partitions in order to convert numerical attributes to fuzzy partitions. The attribute Age can be divided into three fuzzy partitions, namely Young, Middle-aged, and Old, with each value of Age belonging to each of the three fuzzy partitions with some membership value µ. Thus, the number of sharp partitions needed to handle a numerical attribute is far greater than the number of fuzzy partitions required to do the same. Moreover, this leads to better understanding by the user, as there are a few well-defined fuzzy partitions with linguistic names and meanings like Young and Old, as opposed to sharp partitions, which are less intuitive and to which a user cannot relate immediately.

For the current analysis we have used two fuzzy partitions for all numerical attributes throughout all the datasets. As described in section 4.C, the time complexity of FACISME is quite high, because of which the time required to train a fuzzy classifier using FACISME is also high. Hence, we have used two fuzzy partitions so as to limit the training time. But even with just two fuzzy partitions we have achieved quite high accuracies, as described above. More importantly, even with this restriction, we achieved results similar to (and sometimes better than) those of the other state-of-the-art classifiers used in this experimental analysis. The reason for these high accuracies is that FACISME is based on the theoretically sound and well-established maximum entropy framework.

VII. CONCLUSIONS

In this paper we have proposed a new classifier based on the paradigms of association rule mining and fuzzy logic. Though recent classifiers involving association rules have shown their techniques to be accurate, their approaches to building a classifier either manifest unobserved statistical relationships in the dataset, or have no statistical basis at all. FACISME uses the well-known maximum entropy principle and iterative scaling to build a statistically robust theoretical model for classification. Fuzzy logic gives FACISME the capability to deal accurately and efficiently with any type of dataset and domain, and in any kind of environment, which is not necessarily true for traditional crisp classifiers. Thus, using maximum entropy and fuzzy logic, FACISME provides very good accuracy, and can work with all types of data (irrespective of size and type of attributes – numerical or binary).

REFERENCES

[1] Agrawal, R. and Srikant, R.: Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499. Morgan Kaufmann, Santiago, Chile (1994).
[2] Clark, P. and Niblett, T.: Bayesian network classifiers. Machine Learning, 29, 131-163 (1997).
[3] Meretakis, D. and Wuthrich, B.: Extending naive-bayes classifiers using long itemsets. In Proceedings of International Conference on Knowledge Discovery and Data Mining, pp. 165-174. ACM, New York, NY (1999).
[4] Zadeh, L.A.: Fuzzy sets. Inf. Control, 8, 338–358 (1965).
[5] Chen, G., Yan, P., Kerre, E.E.: Computationally Efficient Mining for Fuzzy Implication-Based Association Rules in Quantitative Databases. International Journal of General Systems, 33, 163-182 (2004).
[6] Hüllermeier, E.: Fuzzy methods in machine learning and data mining: Status and prospects. Fuzzy Sets and Systems, 156, 387-406 (2005).
[7] Mangalampalli, A., Pudi, V.: Fuzzy Logic-based Pre-processing for Fuzzy Association Rule Mining. Technical Report IIIT/TR/2008/127, International Institute of Information Technology (2008).
[8] Mangalampalli, A., Pudi, V.: Fuzzy Association Rule Mining Algorithm for Fast and Efficient Performance on Very Large Datasets. In Proceedings of IEEE International Conference on Fuzzy Systems, pp. 1163-1168. IEEE Computational Intelligence Society, Piscataway, NJ (2009).
[9] Good, I.: Maximum entropy for hypothesis formulation, especially for multidimensional contingency tables. Annals of Mathematical Statistics, 34, 911-934 (1963).
[10] Verlinde, H., De Cock, M., Boute, R.: Fuzzy Versus Quantitative Association Rules: A Fair Data-Driven Comparison. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 36, 679-683 (2006).
[11] Dunn, J.C.: A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact, Well Separated Clusters. J. Cyber., 3, 32-57 (1974).
[12] De Cock, M., Cornelis, C., and Kerre, E.E.: Elicitation of fuzzy association rules from positive and negative examples. Fuzzy Sets and Systems, 149, 73–85 (2005).
[13] Dubois, D., Hüllermeier, E., Prade, H.: A systematic approach to the assessment of fuzzy association rules. Data Min. Knowl. Discov., 13, 167-192 (2006).
[14] Darroch, J.N. and Ratcliff, D.: Generalized iterative scaling for log-linear models. The Annals of Mathematical Statistics, 43, 1470–1480 (1972).
[15] Zaki, M.J.: Generating non-redundant association rules. In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining, pp. 34-43. ACM, New York, NY (2000).
[16] Thonangi, R. and Pudi, V.: ACME: An Associative Classifier Based on Maximum Entropy Principle. In Proceedings of the 16th International Conference on Algorithmic Learning Theory, pp. 122-134. Springer, Singapore (2005).
[17] Liu, B., Hsu, W., and Ma, Y.: Integrating classification and association rule mining. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, pp. 80-86. AAAI Press, New York, NY (1998).
[18] Li, W., Han, J., and Pei, J.: CMAR: Accurate and efficient classification based on multiple class-association rules. In Proceedings of IEEE International Conference on Data Mining, pp. 369-376. IEEE Computer Society, San Jose, CA (2001).
[19] Yin, X. and Han, J.: CPAR: Classification based on Predictive Association Rules. In Proceedings of the 3rd SIAM International Conference on Data Mining, pp. 331-335. SIAM, San Francisco, CA (2003).
[20] Beeferman, D., Berger, A., and Lafferty, J.: Statistical models for text segmentation. Machine Learning, 34(1-3), 177-210 (1999).
[21] Thabtah, F.A.: A review of associative classification mining. Knowledge Engineering Review, 22, 37-65 (2007).
[22] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco (1993).
[23] Cohen, W.: Fast effective rule induction. In Proceedings of the 12th International Conference on Machine Learning, pp. 115-123. Morgan Kaufmann, Tahoe City, CA (1995).
[24] Hühn, J.C. and Hüllermeier, E.: FR3: A Fuzzy Rule Learner for Inducing Reliable Classifiers. IEEE Transactions on Fuzzy Systems, 17, 138-149 (2009).
[25] Hühn, J.C. and Hüllermeier, E.: FURIA: an algorithm for unordered fuzzy rule induction. Data Mining and Knowledge Discovery, 19, 293-319 (2009).
[26] Gonzalez, A. and Perez, R.: SLAVE: A genetic learning system based on an iterative approach. IEEE Transactions on Fuzzy Systems, 7, 176-191 (1999).
[27] Gonzalez, A. and Perez, R.: Selection of relevant features in a fuzzy genetic learning algorithm. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 31, 417–425 (2001).

Fig. 4. Experimental results on iris dataset

Fig. 5. Experimental results on breast (breast cancer) dataset

Fig. 6. Experimental results on pima dataset
