International Journal of Approximate Reasoning 54 (2013) 184–195
Fuzzy similarity-based nearest-neighbour classification as alternatives to their fuzzy-rough parallels

Yanpeng Qu a,b, Qiang Shen b,∗, Neil Mac Parthaláin b, Changjing Shang b, Wei Wu a

a School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, PR China
b Department of Computer Science, Aberystwyth University, Aberystwyth, Ceredigion, SY23 3DB, UK
ARTICLE INFO

Article history:
Received 7 September 2011
Revised 15 May 2012
Accepted 18 June 2012
Available online 15 July 2012

ABSTRACT
Fuzzy-rough sets have enjoyed much attention in recent years as an effective way in which to extend rough set theory such that it can deal with real-valued data. More recently, fuzzy-rough sets have been employed for the task of classification. This has led to the development of approaches such as fuzzy-rough nearest-neighbour (FRNN) and its extension based on vaguely-quantified rough sets (VQNN). These methods perform well, and experimental evaluation demonstrates that VQNN in particular is very effective for dealing with data in the presence of noise. In this paper, the underlying mechanisms of FRNN and VQNN are investigated and analysed. The theoretical proof and empirical evaluation show that the resulting classification of FRNN and VQNN depends only upon the highest similarity and greatest summation of the similarities of each class, respectively. This fact is exploited in order to formulate the novel methods proposed in this paper: similarity nearest-neighbour (SNN) and aggregated-similarity nearest-neighbour (ASNN). These two novel approaches are equivalent to FRNN and VQNN, but do not employ the concepts or framework of fuzzy-rough sets. Instead, only fuzzy similarity is used. Experimental evaluation confirms the observation that these new methods maintain the classification performance of the existing advanced fuzzy-rough nearest-neighbour-based classifiers. In addition, the underlying mathematical foundation is simplified.

© 2012 Elsevier Inc. All rights reserved.
Keywords: Fuzzy-rough sets; Similarity function; Nearest neighbour; Classification
1. Introduction

Fuzzy-rough set theory [12,24,36] is a hybridisation of rough sets [28,32,35] and fuzzy sets [27,39], which is capable of conjunctively dealing with vagueness and uncertainty in data. As a hybridisation of fuzzy sets and rough sets, fuzzy-rough sets not only inherit the domain independence of rough sets, but also address the inability of rough set theory to deal with real-valued data. That is, fuzzy-rough sets provide a means to deal with discrete or real-valued noisy data (or a mixture of both) without the need for user-supplied domain-specific thresholding information. As such, this technique can be applied to regression as well as classification tasks. No additional information is required, and the fuzzy partitions for each feature are generated automatically from the data [19].

The k-nearest-neighbour (kNN) algorithm [13] is a well-known non-linear classification technique for classifying objects based on the k closest training examples in the feature space. kNN is a type of instance-based learning, which assigns a test object to the decision class that is most common among its k nearest neighbours, i.e., the k training objects that are closest to the test object. Fuzzy-rough nearest-neighbour (FRNN) is an extension to the kNN algorithm which employs fuzzy-rough set theory [18]. FRNN uses a single test object's nearest neighbours to construct the lower and upper approximation of each decision class. It then computes the membership of the test object to each of these approximations. The approach is very flexible, as there are many ways in which to define the fuzzy partitions. These include the traditional implicator/t-norm

∗ Corresponding author. E-mail address:
[email protected] (Q. Shen). http://dx.doi.org/10.1016/j.ijar.2012.06.008
based model [30], as well as the vaguely quantified rough set (VQRS) measure [9], which is robust in the presence of noisy data.

The work in this paper focuses on analysing the relationship between the fuzzy-rough approaches described in [9,18]. The aim of this study is not to propose an entirely new nearest-neighbour classification approach, but rather to analyse the underlying mechanisms of the aforementioned methods and to demonstrate that they can be implemented without the complex constructs of fuzzy-rough sets. The proofs in this paper show that the resulting classification for FRNN depends solely upon the greatest similarity between the test data object and the data objects in the training data. For VQNN, it is shown that the summation of the similarities between the test data object and the data objects of each class within the k nearest neighbours is employed as the qualifier in the decision-making process. Based on this, two algorithms, namely similarity nearest-neighbour (SNN) and aggregated-similarity nearest-neighbour (ASNN), are proposed as the equivalent simplifications of FRNN and VQNN respectively. In fact, compared to their fuzzy-rough-based counterparts, these novel approaches achieve the same classification accuracy but employ a simpler underlying classification model. The relationship between the fuzzy-rough-based approaches and their equivalent similarity-based counterparts also helps in understanding the influence of the number of nearest neighbours and the effect this has on classification. In addition, the reasons for the robustness of FRNN and VQNN in the presence of noisy datasets are also clarified. The experimental evaluation and theoretical analysis demonstrate that SNN and ASNN are more efficient than FRNN and VQNN.

The remainder of this paper is structured as follows. The theoretical background is presented in Section 2 with a short review of fuzzy-rough nearest-neighbour classification.
In Section 3, a detailed theoretical analysis is made to explore the underlying mechanisms of FRNN and VQNN. Based on this, the similarity-based nearest-neighbour approaches are proposed in Section 4. The new similarity-based classifiers are compared with others, including the equivalent fuzzy-rough nearest-neighbour techniques, in an experimental evaluation in Section 5. Section 6 concludes the paper with a short discussion of potential further work.

2. Theoretical background

2.1. Rough sets

The work on rough set theory (RST) [28,32,35] provides a methodology that can be employed to extract knowledge from a domain in a concise way: it is able to minimise information loss whilst reducing the amount of information involved. Central to rough set theory is the concept of indiscernibility. Let I = (U, A) be an information system, where U is a non-empty finite set of objects (the universe) and A is a non-empty finite set of attributes such that a : U → Va for every a ∈ A. Va is the set of values that attribute a may take. For any P ⊆ A, there exists an associated equivalence relation IND(P):

IND(P) = {(x, y) ∈ U² | ∀a ∈ P, a(x) = a(y)}.    (1)
The partition generated by IND(P) is denoted U/IND(P), abbreviated to U/P, and is calculated as follows:

U/IND(P) = ⊗{a ∈ P : U/IND({a})},    (2)

where

U/IND({a}) = {{x | a(x) = b, x ∈ U} | b ∈ Va},    (3)

A ⊗ B = {X ∩ Y | ∀X ∈ A, ∀Y ∈ B, X ∩ Y ≠ ∅}.    (4)
If (x, y) ∈ IND(P), then x and y are indiscernible by attributes from P. The equivalence classes of the P-indiscernibility relation are denoted [x]P. Let X ⊆ U. X can be approximated using only the information contained in P by constructing the P-lower and P-upper approximations of X [37]:

P̲X = {x | [x]P ⊆ X},    (5)

P̄X = {x | [x]P ∩ X ≠ ∅}.    (6)

The tuple ⟨P̲X, P̄X⟩ is called a rough set.
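For illustration, the crisp approximations of Eqs. (5) and (6) can be computed directly from the partition induced by IND(P). The following is a minimal sketch; all object identifiers, attribute names and values below are illustrative, not taken from the paper.

```python
# A minimal sketch of Eqs. (5) and (6) on a toy information system.
# All object ids, attribute names and values below are illustrative.

def equivalence_classes(universe, attrs, value):
    """Partition `universe` by IND(P): objects with equal values on `attrs`."""
    classes = {}
    for x in universe:
        key = tuple(value(x, a) for a in attrs)
        classes.setdefault(key, set()).add(x)
    return list(classes.values())

def lower_upper(universe, attrs, value, X):
    """P-lower and P-upper approximations of X, per Eqs. (5) and (6)."""
    lower, upper = set(), set()
    for eq in equivalence_classes(universe, attrs, value):
        if eq <= X:        # [x]_P ⊆ X
            lower |= eq
        if eq & X:         # [x]_P ∩ X ≠ ∅
            upper |= eq
    return lower, upper

# Toy data: four objects described by one attribute 'colour'
data = {0: 'r', 1: 'r', 2: 'g', 3: 'b'}
lo, up = lower_upper({0, 1, 2, 3}, ['colour'], lambda x, a: data[x], {0, 2})
# lower = {2}: only the class {2} lies wholly inside X = {0, 2};
# upper = {0, 1, 2}: the classes {0, 1} and {2} both intersect X
```

Note how the class {0, 1} is split by X, so it contributes to the upper but not the lower approximation — exactly the boundary region of the rough set.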
2.2. Hybridisation of rough sets and fuzzy sets

The process described in the previous section, although useful, can only operate effectively on datasets containing discrete values. As most datasets contain real-valued attributes, a subjective judgement or threshold must therefore be employed in order for RST to operate on such data. The imposition of such a subjective threshold is, however, contrary to the concept of domain independence of RST. An appropriate way of handling the problem of real-valued data is the use of fuzzy-rough sets (FRS) [12,24,36]. FRS offer a high degree of flexibility in enabling the vagueness and imprecision present in real-valued data to be simultaneously modelled effectively.
Definitions for the fuzzy lower and upper approximations can be found in [20,30], where a T-transitive fuzzy similarity relation is used to approximate a fuzzy concept X:

μ_RP↓X(x) = inf_{y∈U} I(μ_RP(x, y), μ_X(y)),    (7)

μ_RP↑X(x) = sup_{y∈U} T(μ_RP(x, y), μ_X(y)).    (8)

Here, I is a fuzzy implicator and T is a t-norm. RP is the fuzzy similarity relation induced by the subset of features P:

μ_RP(x, y) = T_{a∈P}{μ_Ra(x, y)}.    (9)

μ_Ra(x, y) is the degree to which objects x and y are similar for feature a, and may be defined in many ways, for example:

μ_Ra(x, y) = 1 − |a(x) − a(y)| / (a_max − a_min),    (10)

μ_Ra(x, y) = max( min( (a(y) − (a(x) − σ_a))/σ_a, ((a(x) + σ_a) − a(y))/σ_a ), 0 ),    (11)
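The per-feature similarities (10) and (11), and their composition (9), can be sketched as follows. The feature range, the value of σ_a and the choice of the minimum t-norm below are illustrative assumptions, not prescribed by the paper.

```python
# Hedged sketch of the per-feature similarities (10) and (11) and their
# t-norm composition (9). The feature range, the value of sigma and the
# choice of the minimum t-norm are illustrative assumptions.

def sim_range(ax, ay, amax, amin):
    """Eq. (10): 1 - |a(x) - a(y)| / (amax - amin)."""
    return 1.0 - abs(ax - ay) / (amax - amin)

def sim_var(ax, ay, sigma):
    """Eq. (11): max(min((a(y)-(a(x)-sigma))/sigma, ((a(x)+sigma)-a(y))/sigma), 0)."""
    return max(min((ay - (ax - sigma)) / sigma, ((ax + sigma) - ay) / sigma), 0.0)

def compose_min(feature_sims):
    """Eq. (9) instantiated with the minimum t-norm."""
    return min(feature_sims)

# Two objects compared on one feature ranging over [0, 10]:
s10 = sim_range(2.0, 4.0, 10.0, 0.0)      # 0.8
s11 = sim_var(2.0, 4.0, 4.0)              # 0.5
overall = compose_min([s10, s11])         # 0.5
```

Any t-norm could replace `min` in `compose_min`; the worked example later in the paper uses the Łukasiewicz t-norm instead.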
where σ_a² is the statistical variance of feature a. As these relations do not necessarily display T-transitivity, the fuzzy transitive closure would have to be computed for each feature [8]. However, T-transitivity is not in fact required for fuzzy-rough sets: fuzzy tolerance relations [10] can instead be employed to construct fuzzy-rough sets [17]. This technique is adopted in this paper also.

Note that formulas (7) and (8) are quite sensitive to noisy values (just like their crisp counterparts). Thus, the concept of vaguely quantified rough set (VQRS) has been introduced [9]. Following this approach, given a pair (Qu, Ql) of fuzzy quantifiers, each an increasing [0, 1] → [0, 1] mapping, the upper and lower approximations of X by RP are redefined by

μ_RP↑^Qu X(x) = Qu( |RP(x, y) ∩ X| / |RP(x, y)| ) = Qu( Σ_{y∈U} T(μ_RP(x, y), μ_X(y)) / Σ_{y∈U} μ_RP(x, y) ),    (12)

μ_RP↓^Ql X(x) = Ql( |RP(x, y) ∩ X| / |RP(x, y)| ) = Ql( Σ_{y∈U} T(μ_RP(x, y), μ_X(y)) / Σ_{y∈U} μ_RP(x, y) ).    (13)
The fuzzy set intersection is defined by the t-norm min and the fuzzy set cardinality by the summation operation. In contrast to the fuzzy-rough versions, Eqs. (7) and (8), the VQRS upper and lower approximations do not extend the classical rough set approximations, in the sense that when X and R are crisp, Eqs. (12) and (13) may still be fuzzy.

2.3. Fuzzy-rough nearest-neighbour algorithm

A number of approaches have been developed for building fuzzy-rough nearest-neighbour (FRNN) classifiers [3,31]. However, the initial work on the use of the fuzzy upper and lower approximation concepts to determine class membership was proposed in [18]. The algorithm in [18] is shown in Fig. 1. It works by iteratively examining each of the decision classes in the training data. It calculates the membership of the test data object under consideration to the upper and lower approximation of each class. The average of these values is then compared with the highest existing value (τ). If the average of the approximation membership values for the currently considered class is higher, then τ is updated with this value and the class label is assigned to this test object. If not, the algorithm continues to iterate through all of the remaining decision classes. Classification accuracy is calculated by comparing the predicted output class assignment with the actual class label of each of the test data objects. The intuitive rationale behind this algorithm is that the lower and upper approximations of the decision classes (calculated by means of the nearest neighbours of a test object y) provide useful indicators in predicting the actual membership of a test object to any given class.
Fig. 1. Fuzzy-rough nearest-neighbour algorithm.
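The decision rule of Fig. 1 can be sketched as follows. This is a minimal illustration assuming crisp decision classes, the Kleene–Dienes implicator I(x, y) = max(1 − x, y) (the implicator used later in Section 5) and the minimum t-norm; `sim(test, x)` stands for any fuzzy similarity relation such as Eq. (9).

```python
# A minimal sketch of the FRNN decision rule in Fig. 1, assuming crisp
# decision classes, the Kleene-Dienes implicator I(x,y) = max(1-x, y)
# and the minimum t-norm. `sim(test, x)` is any fuzzy similarity in [0,1].

def frnn_classify(test, train, sim):
    """train: list of (object, label) pairs. Returns the predicted label."""
    best_label, tau = None, -1.0
    for concept in {lab for _, lab in train}:
        # (similarity, crisp membership of x to the concept) for each x
        members = [(sim(test, x), 1.0 if lab == concept else 0.0)
                   for x, lab in train]
        lower = min(max(1.0 - s, m) for s, m in members)   # Eq. (7)
        upper = max(min(s, m) for s, m in members)         # Eq. (8)
        if (lower + upper) / 2.0 > tau:                    # Fig. 1's qualifier
            tau, best_label = (lower + upper) / 2.0, concept
    return best_label

# e.g. 1-D objects compared with similarity 1 - |u - v|:
pred = frnn_classify(0.1, [(0.0, 'a'), (1.0, 'b')], lambda u, v: 1 - abs(u - v))
```

Here the test object 0.1 is far more similar to the 'a' exemplar, so the averaged approximation memberships favour class 'a'.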
An extension of FRNN is the vaguely quantified rough set-based nearest-neighbour (VQNN) [9], which employs (12) and (13) to determine the class membership of test objects. The underlying learning mechanism is very similar to that of FRNN and is hence omitted here. However, it is important to note that the use of k in FRNN is not required for prediction in principle, whereas for VQNN this may not be true, as the specification of k has a direct impact on the performance of the classification algorithm. The reasons for this are explained in the next section.

3. Theoretical analysis of FRNN and VQNN

In this section, the underlying mechanisms of the FRNN and VQNN algorithms are theoretically analysed. The foundation for this work lies in the investigation of: (a) the effect of the selected value of k on the classification accuracy of FRNN, and (b) the ability of the VQNN algorithm to operate effectively in the presence of noise.

Lemma 1. For a given y and a set X = {x_i | i = 1, ..., n}, x_i ∈ R,

inf_X I(x_i, y) = I(max_X{x_i}, y),    (14)

sup_X T(x_i, y) = T(max_X{x_i}, y).    (15)
Proof. According to the properties of implicators and t-norms, for a given y, I(x, y) is monotonically decreasing with respect to x, and T(x, y) is monotonically increasing with respect to x. Thus, inf_X I(x_i, y) = I(max_X{x_i}, y) and sup_X T(x_i, y) = T(max_X{x_i}, y).

Theorem 1. For a given set U, by FRNN, an object y will be classified into a class which has the greatest similarity to y, i.e., y will belong to class A, where ∃x* ∈ A s.t. μ_RP(x*, y) = max_{x∈U}{μ_RP(x, y)}.

Proof. Let the class containing x* be class A. Because for any a ∈ R, I(a, 1) = 1 and T(a, 0) = 0, for the decision concept X = A the lower and upper approximations reduce to:

μ_RP↓A(y) = min( inf_{x∈A} I(μ_RP(x, y), 1), inf_{x∈U−A} I(μ_RP(x, y), 0) ) = inf_{x∈U−A} I(μ_RP(x, y), 0),    (16)

μ_RP↑A(y) = max( sup_{x∈A} T(μ_RP(x, y), 1), sup_{x∈U−A} T(μ_RP(x, y), 0) ) = sup_{x∈A} T(μ_RP(x, y), 1).    (17)

By Lemma 1, Eqs. (16) and (17) can be further reduced to:

μ_RP↓A(y) = I( max_{x∈U−A}{μ_RP(x, y)}, 0 ),    (18)

μ_RP↑A(y) = T( max_{x∈A}{μ_RP(x, y)}, 1 ) = T(μ_RP(x*, y), 1).    (19)
Similarly, for the decision concept X = B, where B is an arbitrary class different from A:

μ_RP↓B(y) = inf_{x∈U−B} I(μ_RP(x, y), 0),    (20)

μ_RP↑B(y) = sup_{x∈B} T(μ_RP(x, y), 1).    (21)

By Lemma 1, Eqs. (20) and (21) can be further reduced to:

μ_RP↓B(y) = I( max_{x∈U−B}{μ_RP(x, y)}, 0 ) = I(μ_RP(x*, y), 0),    (22)

μ_RP↑B(y) = T( max_{x∈B}{μ_RP(x, y)}, 1 ).    (23)

Because

μ_RP(x*, y) ≥ max_{x∈U−A}{μ_RP(x, y)},    (24)

μ_RP(x*, y) ≥ max_{x∈B}{μ_RP(x, y)},    (25)

it follows from the properties of implicators and t-norms that

μ_RP↓A(y) ≥ μ_RP↓B(y),    (26)
μ_RP↑A(y) ≥ μ_RP↑B(y),    (27)

and then

( μ_RP↓A(y) + μ_RP↑A(y) ) / 2 ≥ ( μ_RP↓B(y) + μ_RP↑B(y) ) / 2.    (28)

Because of the arbitrariness of class B, following the method of FRNN, y will be classified into class A.
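Theorem 1 can also be checked numerically. With the Kleene–Dienes implicator and the minimum t-norm (illustrative choices, as are the similarity values below), the FRNN qualifier (lower + upper)/2 picks exactly the class containing the single most similar object.

```python
# Numeric check of Theorem 1: the FRNN qualifier (lower+upper)/2, computed
# with the Kleene-Dienes implicator max(1-s, m) and minimum t-norm min(s, m),
# selects the class holding the greatest single similarity to y.
# The similarity values below are invented for illustration.

sims = {'A': [0.9, 0.2], 'B': [0.6, 0.5]}   # similarities of y to each class's objects

def frnn_qualifier(concept, sims):
    pairs = [(s, 1.0 if c == concept else 0.0)
             for c, ss in sims.items() for s in ss]
    lower = min(max(1.0 - s, m) for s, m in pairs)   # Eq. (18)/(22)
    upper = max(min(s, m) for s, m in pairs)         # Eq. (19)/(23)
    return (lower + upper) / 2.0

scores = {c: frnn_qualifier(c, sims) for c in sims}
best = max(scores, key=scores.get)
# best is 'A', the class containing the greatest single similarity (0.9),
# even though class 'B' has the larger similarity sum (1.1 vs 1.1... per-object)
```

Note that the per-object similarity 0.9 alone decides the outcome here, which is precisely why the choice of k is immaterial for FRNN.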
As described previously, FRNN uses the average of both the lower and upper approximation memberships in order to decide to which class the test data object belongs. However, Theorem 1 proves that the resulting classification of FRNN is decided simply by the greatest similarity amongst all classes. In other words, the test data object will always be classified into one of the classes which has the greatest similarity. In this case, no matter what the value of k is, the greatest similarity within the nearest neighbours is also the greatest for the whole dataset. This is the reason why the specification of the number of nearest neighbours k has no effect on the final classification for FRNN.

Theorem 2. For a given set U, by VQNN, an object y will be classified into a class which has the greatest summation of similarities to y, i.e., y will belong to class X*, where Σ_{x∈X*} μ_RP(x, y) = max_{X∈C}{ Σ_{x∈X} μ_RP(x, y) }, and C is the set of decision classes.

Proof. According to the definition of VQNN, for the decision concept X = X*, the lower and upper approximations are

μ_RP↑^Qu X*(y) = Qu( Σ_{x∈U} T(μ_RP(x, y), μ_X*(x)) / Σ_{x∈U} μ_RP(x, y) )
= Qu( ( Σ_{x∈X*} T(μ_RP(x, y), μ_X*(x)) + Σ_{x∉X*} T(μ_RP(x, y), μ_X*(x)) ) / Σ_{x∈U} μ_RP(x, y) ),    (29)
μ_RP↓^Ql X*(y) = Ql( Σ_{x∈U} T(μ_RP(x, y), μ_X*(x)) / Σ_{x∈U} μ_RP(x, y) )
= Ql( ( Σ_{x∈X*} T(μ_RP(x, y), μ_X*(x)) + Σ_{x∉X*} T(μ_RP(x, y), μ_X*(x)) ) / Σ_{x∈U} μ_RP(x, y) ).    (30)

Because T(a, 0) = 0 and T(a, 1) = a, Eqs. (29) and (30) can be simplified as

μ_RP↑^Qu X*(y) = Qu( Σ_{x∈X*} μ_RP(x, y) / Σ_{x∈U} μ_RP(x, y) ),    (31)

μ_RP↓^Ql X*(y) = Ql( Σ_{x∈X*} μ_RP(x, y) / Σ_{x∈U} μ_RP(x, y) ).    (32)
Similarly, for a distinct decision concept X = Z, the lower and upper approximations can respectively be denoted by

μ_RP↑^Qu Z(y) = Qu( Σ_{x∈Z} μ_RP(x, y) / Σ_{x∈U} μ_RP(x, y) ),    (33)

μ_RP↓^Ql Z(y) = Ql( Σ_{x∈Z} μ_RP(x, y) / Σ_{x∈U} μ_RP(x, y) ).    (34)

Following the definition of VQNN, the fuzzy quantifiers in VQNN are increasing [0, 1] → [0, 1] mappings. In this case, because

Σ_{x∈X*} μ_RP(x, y) ≥ Σ_{x∈Z} μ_RP(x, y),    (35)

it can be derived that

μ_RP↑^Qu X*(y) ≥ μ_RP↑^Qu Z(y),    (36)

μ_RP↓^Ql X*(y) ≥ μ_RP↓^Ql Z(y).    (37)

Also,

( μ_RP↑^Qu X*(y) + μ_RP↓^Ql X*(y) ) / 2 ≥ ( μ_RP↑^Qu Z(y) + μ_RP↓^Ql Z(y) ) / 2.    (38)

This concludes that y will always be classified into class X*.
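The reduction above can be checked numerically: since Qu and Ql are increasing, the VQRS memberships of (31)–(34) rank the decision classes exactly as the raw per-class similarity sums do. In the sketch below, the quantifier and similarity values are illustrative stand-ins, not the experimental configuration.

```python
# Numeric check of Theorem 2's reduction: with any increasing quantifier Q,
# the VQRS memberships (31)-(34) order the classes exactly as the raw
# per-class similarity sums. Q and the similarity values are illustrative.

def vqrs_membership(class_sims, all_sims, Q):
    return Q(sum(class_sims) / sum(all_sims))

Q = lambda p: p ** 2                     # any increasing [0,1] -> [0,1] map
sims = {'yes': [0.2, 0.4], 'no': [0.7, 0.1]}
all_sims = sims['yes'] + sims['no']

scores = {c: vqrs_membership(s, all_sims, Q) for c, s in sims.items()}
sums = {c: sum(s) for c, s in sims.items()}
# both rankings put 'no' first (summed similarity 0.8 vs 0.6)
```

Because the denominator Σ_{x∈U} μ_RP(x, y) is common to every class, only the numerator — the class's similarity sum — can change the ranking.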
In Theorem 2, it is proven that the resulting classification of VQNN is decided only by the greatest summation of all the similarities for each class within the k nearest neighbours. In other words, the test data object will be classified into one of the classes which has the greatest summation of similarities within the k nearest neighbours. This explains why VQNN performs better than FRNN for classification in the presence of noisy data. In fact, because of the existence of noise in datasets, even if a class has the highest similarity, this may not be sufficient to make a reliable decision. Note that the summation of the similarities of each class within the k nearest neighbours weakens the impact of noisy data. In so doing, it becomes clear that the choice of k is influential in deciding the resulting classification. If k is too small or too large, there will be insufficient non-noisy data, or conversely too much noisy data, within the nearest neighbours.

4. Similarity-based nearest-neighbour classification

Based on the findings in the last section, two novel similarity-based nearest-neighbour methods are proposed. Importantly, these approaches do not rely on the concepts or framework of fuzzy-rough sets, but are equivalent to the FRNN and VQNN methods respectively.

4.1. Similarity nearest-neighbour

By exploiting the fact analysed in Theorem 1, a simpler equivalent algorithm for FRNN can be formulated. This new algorithm, termed similarity nearest-neighbour (SNN), is outlined in Fig. 2. In contrast to FRNN, the greatest similarity of expression (9) is employed as the decision qualifier for class assignment, rather than the average of the lower approximation (7) and upper approximation (8) membership values. In the case of SNN, not only is the number of conditions required in order to make the decision reduced from two to one, but the complexity of calculating the decision qualifier is also reduced.
This is because the calculation of the lower and upper approximations is no longer necessary. Given the pseudocode for the FRNN algorithm (as shown in Fig. 1), there are two loops: one to iterate through the classes, and another to iterate through the neighbours, which in the worst case is the entire universe U. Therefore, the complexity of FRNN is O(|C| · |U|). In contrast, the complexity of SNN is O(|U|), as there is only one loop through the data. According to Theorem 1, the resulting classifications are the same for both SNN and FRNN; however, as demonstrated above, SNN is more computationally efficient than FRNN.

SNN – a worked example

In order to illustrate the concepts of the SNN algorithm, a small worked example is presented here. This example employs a dataset with 3 real-valued conditional attributes (a, b and c) and a single crisp discrete-valued decision attribute (q) as the training data, shown in Table 1. A further dataset, shown in Table 2, containing 2 objects, is used as the test data to be classified, again with the same number of conditional and decision attributes. Referring to the SNN algorithm described in the previous section, the first step is to calculate the similarities for all decision classes. In Table 1 there are four objects that belong to one of two classes: yes and no.
Fig. 2. The similarity nearest-neighbour algorithm.

Table 1
Example training data.

Object | a    | b    | c    | q
1      | −0.4 | 0.2  | −0.5 | Yes
2      | −0.4 | 0.1  | −0.1 | No
3      | 0.2  | −0.3 | 0    | No
4      | 0.2  | 0    | 0    | Yes

Table 2
Example test data.

Object | a    | b    | c    | q
t1     | 0.3  | −0.3 | 0    | No
t2     | −0.3 | 0.3  | −0.3 | Yes
Using the Łukasiewicz t-norm (max(x + y − 1, 0)) [7] and the fuzzy similarity measure defined in (10), the similarity of each test object to all of the objects in the training data is computed. For instance, consider the test object t1:

μ_R{P}(t1, 1) = T(μ_R{a}(t1, 1), μ_R{b}(t1, 1), μ_R{c}(t1, 1)) = 0,
μ_R{P}(t1, 2) = T(μ_R{a}(t1, 2), μ_R{b}(t1, 2), μ_R{c}(t1, 2)) = 0,
μ_R{P}(t1, 3) = T(μ_R{a}(t1, 3), μ_R{b}(t1, 3), μ_R{c}(t1, 3)) = 0.83,
μ_R{P}(t1, 4) = T(μ_R{a}(t1, 4), μ_R{b}(t1, 4), μ_R{c}(t1, 4)) = 0.23.

With reference once again to the SNN algorithm in Fig. 2, it can be seen that the greatest similarity value for test object t1 is to (training) object 3. Because the class label of object 3 is no, the algorithm will therefore classify t1 as belonging to the class no. The procedure can be repeated for test object t2, resulting in the following similarities:

μ_R{P}(t2, 1) = T(μ_R{a}(t2, 1), μ_R{b}(t2, 1), μ_R{c}(t2, 1)) = 0.23,
μ_R{P}(t2, 2) = T(μ_R{a}(t2, 2), μ_R{b}(t2, 2), μ_R{c}(t2, 2)) = 0.03,
μ_R{P}(t2, 3) = T(μ_R{a}(t2, 3), μ_R{b}(t2, 3), μ_R{c}(t2, 3)) = 0,
μ_R{P}(t2, 4) = T(μ_R{a}(t2, 4), μ_R{b}(t2, 4), μ_R{c}(t2, 4)) = 0.

In this case, the greatest similarity value for test object t2 is to object 1. Because the class label of object 1 is yes, t2 will be classified as belonging to the class yes.
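The similarity values above can be reproduced as follows, assuming (as the numbers imply) that a_max − a_min in Eq. (10) is taken per feature over the training data, and that negative similarities arising from test values outside the training range are clipped to 0 before the t-norm is applied.

```python
# Reproduction of the SNN worked example. Assumptions: the range in Eq. (10)
# is taken per feature over the training data, and negative similarities
# (out-of-range test values) are clipped to 0 before the t-norm.

train = [((-0.4,  0.2, -0.5), 'yes'),
         ((-0.4,  0.1, -0.1), 'no'),
         (( 0.2, -0.3,  0.0), 'no'),
         (( 0.2,  0.0,  0.0), 'yes')]

ranges = [0.6, 0.5, 0.5]      # amax - amin of a, b, c over the training data

def luk(x, y):                # Lukasiewicz t-norm
    return max(x + y - 1.0, 0.0)

def sim(t, x):
    feats = [max(0.0, 1.0 - abs(ti - xi) / r)
             for ti, xi, r in zip(t, x, ranges)]
    s = feats[0]
    for f in feats[1:]:
        s = luk(s, f)
    return s

t1, t2 = (0.3, -0.3, 0.0), (-0.3, 0.3, -0.3)
sims_t1 = [round(sim(t1, x), 2) for x, _ in train]     # [0.0, 0.0, 0.83, 0.23]
sims_t2 = [round(sim(t2, x), 2) for x, _ in train]     # [0.23, 0.03, 0.0, 0.0]

# SNN assigns each test object the label of its most similar training object
pred_t1 = max(train, key=lambda o: sim(t1, o[0]))[1]   # object 3 -> 'no'
pred_t2 = max(train, key=lambda o: sim(t2, o[0]))[1]   # object 1 -> 'yes'
```

Note that only a single pass over the training data is needed per test object, matching the O(|U|) complexity argued for SNN above.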
4.2. Aggregated-similarity nearest-neighbour

In a similar way to that described for FRNN, vaguely quantified nearest-neighbour (VQNN) also has an equivalent fuzzy-similarity simplification. Here, rather than the average of the upper and lower approximations as employed in VQNN, the summation of the similarities of each class over the k nearest neighbours can be employed as the decision qualifier. Using this foundation, a novel algorithm is proposed, termed aggregated-similarity nearest-neighbour (ASNN), and outlined in Fig. 3. At the highest level, both VQNN and ASNN iterate through two loops. Thus, the complexity of the VQNN algorithm and that of the ASNN algorithm may appear to be the same: O(|C| · |U|). However, considering the extra time required to calculate the lower and upper approximations in VQNN, the actual time complexity of VQNN will be greater than that of ASNN. Therefore, in implementation, ASNN will be more efficient than VQNN for practical applications.

ASNN – a worked example

Using the same example data as presented in Section 4.1, the basic concepts of the ASNN approach and algorithm are illustrated below. For t1, the aggregated memberships for each concept are:

μ_R{P}(t1, yes) = μ_R{P}(t1, 1) + μ_R{P}(t1, 4) = 0 + 0.23 = 0.23,
μ_R{P}(t1, no) = μ_R{P}(t1, 2) + μ_R{P}(t1, 3) = 0 + 0.83 = 0.83.

Referring to the ASNN algorithm in Fig. 3, it can be seen that the summation of similarities for test object t1 for the class label no is higher than that for the class yes. The algorithm will therefore classify t1 as belonging to the class no. For t2, the same procedure results in:
Fig. 3. The aggregated-similarity nearest-neighbour algorithm.
μ_R{P}(t2, yes) = μ_R{P}(t2, 1) + μ_R{P}(t2, 4) = 0.23 + 0 = 0.23,
μ_R{P}(t2, no) = μ_R{P}(t2, 2) + μ_R{P}(t2, 3) = 0.03 + 0 = 0.03.

According to the ASNN algorithm, t2 will be classified as belonging to the class yes.

It is important to point out that the two proposed fuzzy nearest-neighbour classifiers, SNN and ASNN, are equivalent to the FRNN and VQNN methods respectively, in the sense that they achieve the same classification accuracy with different models. However, for other indicators, such as the root mean squared error (RMSE) and the Area-Under-the-ROC-Curve (AUC) metric [2] (as demonstrated later in Section 5), these equivalence relationships may not hold. As introduced in [34], assuming the number of classes is n, for a given instance, the model of the classifier can be defined as a vector (p1, p2, ..., pn) over the classes. Here, pi refers to the probability that a particular prediction is in the ith class. The actual outcome for that instance will be one of the possible classes; similarly, it is expressed as a vector (a1, a2, ..., an), whose ith component, where i is the actual class, is 1, and all other components are 0. As analysed previously, a given instance will be classified into the same class (e.g., class j) by FRNN/VQNN and SNN/ASNN. This means that for class j, the corresponding predicted probabilities, pj^f (achieved by FRNN/VQNN) and pj^s (achieved by SNN/ASNN), are both the greatest component in the respective predicted probability vector. However, the values of pj^f and pj^s may not be the same, because the decision qualifiers of FRNN/VQNN and SNN/ASNN are different. Therefore, given the definition of RMSE:

RMSE = √( ((p1 − a1)² + · · · + (pn − an)²) / n ),    (39)

although the two classification decisions are identical, the two corresponding predicted probability vectors, (p1^f, p2^f, ..., pn^f) and (p1^s, p2^s, ..., pn^s), may be different. In this case, the values of the two corresponding RMSEs may also differ. Furthermore, as the models of FRNN/VQNN and SNN/ASNN are different, the values of other corresponding indicator metrics may not be the same either. This result will be observed in the experiments.

5. Experimental evaluation

This section presents an experimental evaluation of the proposed novel similarity-based nearest-neighbour algorithms (SNN and ASNN). The evaluation itself is divided into two parts. The first compares the novel methods with their fuzzy-rough counterparts; the classification accuracy and RMSE results are presented here to support the previous theoretical analysis. The comparison of SNN and ASNN with other fuzzy nearest-neighbour methods is carried out in the second part. Once again, the performance of the novel methods with different values of k can also be observed here.

5.1. Experimental set-up

Seventeen benchmark datasets obtained from [4,23] are used for this evaluation. These datasets contain between 120 and 19,020 objects, with numbers of features ranging from 6 to 649, as shown in Table 3.

Table 3
Evaluation datasets.

Dataset     | Objects | Attributes
Arrhythmia  | 452     | 279
Ecoli       | 336     | 7
Glass       | 214     | 9
Handwritten | 1593    | 256
Heart       | 270     | 13
Isolet      | 7797    | 617
Liver       | 345     | 6
Magic       | 19020   | 10
Multifeat   | 2000    | 649
Olitos      | 120     | 25
Optdigits   | 5620    | 64
Sonar       | 208     | 60
Water 2     | 390     | 38
Water 3     | 390     | 38
Waveform    | 5000    | 40
Wine        | 178     | 14
Wisconsin   | 683     | 9

For the evaluation described here, SNN and FRNN employ the common relation given in Eq. (10) as the similarity relation. For FRNN, the Kleene–Dienes implicator [11,22] is used, defined by I(x, y) = max(1 − x, y). In [18], the VQNN method is implemented with the quantifiers Ql = Q(0.1,0.6) and Qu = Q(0.2,1.0) (which is the default), according to the general formula:
Q(α,β)(x) =
  0,                            x ≤ α,
  2(x − α)² / (β − α)²,         α ≤ x ≤ (α + β)/2,
  1 − 2(x − β)² / (β − α)²,     (α + β)/2 ≤ x ≤ β,
  1,                            β ≤ x.    (40)
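The quantifier of Eq. (40) is straightforward to implement; the instance below is Qu = Q(0.2,1.0), the upper quantifier used in the experimental set-up described above.

```python
# The smooth quantifier of Eq. (40); Q_(0.2,1.0) is the upper quantifier Qu
# named in the experimental set-up above.

def quantifier(alpha, beta):
    def Q(x):
        if x <= alpha:
            return 0.0
        if x <= (alpha + beta) / 2.0:
            return 2.0 * (x - alpha) ** 2 / (beta - alpha) ** 2
        if x <= beta:
            return 1.0 - 2.0 * (x - beta) ** 2 / (beta - alpha) ** 2
        return 1.0
    return Q

Qu = quantifier(0.2, 1.0)
# Qu rises smoothly from 0 at x = 0.2 to 1 at x = 1.0,
# passing through 0.5 at the midpoint x = 0.6
```

The two quadratic branches meet at the midpoint (α + β)/2, giving an S-shaped, everywhere-continuous increasing map — the "vague" counterpart of a crisp threshold.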
In the interest of comparing the proposed methods to the existing work, VQNN is also configured in this way in this paper.

Stratified 10×10-fold cross-validation (10-FCV) is employed for data validation. In 10-FCV, the original dataset is partitioned into 10 subsets. Of these 10 subsets, a single subset is retained as the testing data for the classifier, and the remaining 9 subsets are used for training. The cross-validation process is then repeated 10 times (the number of folds). The 10 sets of results are then aggregated to produce a single classifier estimation. The advantage of 10-FCV over random sub-sampling is that all objects are used for both training and testing, and each object is used for testing only once per fold. The stratification of the data prior to its division into folds ensures that each class label has (as far as possible) equal representation in all folds, thus helping to alleviate bias/variance problems.

In order to investigate the level of 'fit' of the generated classifiers, the root mean squared error (RMSE) measure is used. RMSE is the square root of the variance of the residuals. It indicates the absolute fit of a classifier to the data and how close the observed data objects are to the classification outcomes. Note that RMSE is an absolute measure: as the square root of the variance, it can be viewed as the standard deviation of the unexplained variance, with lower values indicating better fit. RMSE is a good measure of how accurately the classifier performs, and is a generally accepted criterion for assessing fit.

In addition to RMSE, conventional classification accuracy is also used to assess the performance of the learnt classifiers. Classification accuracy often does not provide a realistic or comprehensive view of performance, especially in the presence of class imbalance. As an alternative means of evaluating the performance of classifier learners, Receiver Operating Characteristic (ROC) curves provide an additional perspective.
ROC curves (when used in conjunction with the aforementioned metrics) help to provide a better overall assessment of the performance of learners. ROC analysis can be employed in order to assess the predictive ability of a learning algorithm by using the Area-Under-the-ROC-Curve (AUC) metric [2]. AUC is a statistically consistent measure and, when compared to classification accuracy, is more discriminative [16]. In [14] it is proven that AUC is equivalent to the Wilcoxon signed-ranks test, which is a non-parametric alternative to the paired t-test [26].

5.2. Part 1 – comparison of the novel methods and equivalent fuzzy-rough methods

In this section, the new SNN and ASNN algorithms are compared with their equivalent fuzzy-rough counterparts. For this comparison, the seven largest datasets from Table 3 have been chosen. The reason for selecting these datasets is that they offer a better comparison in terms of classifier complexity assessment. The results are shown in Tables 4 and 5. As well as classification accuracy, the level of fit (RMSE) and the CPU runtime are also included to assist in assessing the performance. In order to clarify the contribution of the novel algorithms for real-life datasets, the execution time is subjected to a statistical evaluation (paired t-test with a significance level of 0.05). These statistical results are also included in both tables. The symbol 'v' indicates that the corresponding execution time of the existing FRNN or VQNN method is statistically larger than that required by SNN or ASNN, respectively. The absence of any symbol indicates that there is no statistically significant difference.

Table 4
Comparison between FRNN and SNN.

Dataset     | FRNN Accy. (%) | FRNN RMSE | FRNN Time (s) | SNN Accy. (%) | SNN RMSE | SNN Time (s)
Arrhythmia  | 54.67 | 0.22 | 0.12   | 54.67 | 0.21 | 0.11
Handwritten | 91.27 | 0.18 | 1.62v  | 91.27 | 0.11 | 1.45
Isolet      | 87.13 | 0.13 | 78.13  | 87.13 | 0.08 | 76.95
Magic       | 81.35 | 0.40 | 23.36v | 81.35 | 0.34 | 22.62
Multifeat   | 97.57 | 0.09 | 5.95v  | 97.57 | 0.06 | 4.79
Optdigits   | 97.95 | 0.09 | 4.81   | 97.95 | 0.06 | 4.75
Waveform    | 73.77 | 0.39 | 3.23   | 73.77 | 0.29 | 3.18

Table 5
Comparison between VQNN and ASNN.

Dataset     | VQNN Accy. (%) | VQNN RMSE | VQNN Time (s) | ASNN Accy. (%) | ASNN RMSE | ASNN Time (s)
Arrhythmia  | 60.40 | 0.21 | 0.12   | 60.40 | 0.19 | 0.11
Handwritten | 91.27 | 0.18 | 1.56v  | 91.27 | 0.12 | 1.47
Isolet      | 89.24 | 0.13 | 80.07v | 89.24 | 0.09 | 71.10
Magic       | 84.40 | 0.40 | 23.18  | 84.40 | 0.35 | 22.98
Multifeat   | 97.95 | 0.09 | 5.51v  | 97.95 | 0.06 | 4.85
Optdigits   | 97.97 | 0.09 | 4.86v  | 97.97 | 0.06 | 4.39
Waveform    | 81.55 | 0.38 | 3.14   | 81.55 | 0.29 | 3.11
Table 6
Classification accuracy and statistical comparison for SNN.

Dataset     SNN(10)       SNN           FNN(10)        FNN            FRNN-O(10)     FRNN-O
Ecoli       80.57(6.36)   80.57(6.36)   86.55(5.93)v   42.56(1.18)*   77.95(26.67)   66.12(22.70)
Glass       73.54(8.56)   73.54(8.56)   68.57(9.62)    62.54(8.01)*   71.70(10.15)   71.75(10.18)
Heart       76.63(7.38)   76.63(7.38)   66.11(7.89)*   61.63(4.78)*   66.00(8.13)*   66.04(8.14)*
Liver       62.81(8.88)   62.81(8.88)   69.52(7.26)    58.01(1.47)    62.37(6.92)    62.43(6.95)
Olitos      78.67(9.65)   78.67(9.65)   63.25(12.48)*  44.00(3.76)*   67.58(11.60)*  67.67(11.62)*
Sonar       85.25(7.61)   85.25(7.61)   73.21(9.50)*   53.43(1.80)*   85.06(7.53)    84.30(7.82)
Water2      84.38(4.57)   84.38(4.57)   77.97(2.66)*   80.00(1.03)*   79.79(5.81)*   79.79(5.81)*
Water3      79.82(4.44)   79.82(4.44)   74.64(3.77)*   73.59(1.18)*   73.21(5.39)*   73.21(5.39)*
Wine        97.47(3.42)   97.47(3.42)   96.40(4.06)    93.54(4.93)    95.62(4.54)    95.68(4.42)
Wisconsin   96.38(2.25)   96.38(2.25)   97.20(2.12)    68.87(2.12)*   96.00(2.20)    96.02(2.27)
Summary
(v/ /*)                   (0/10/0)      (1/4/5)        (0/2/8)        (0/6/4)        (0/6/4)
Table 7
Classification accuracy and statistical comparison for ASNN.

Dataset     ASNN(10)      ASNN           FNN(10)        FNN            FRNN-O(10)     FRNN-O
Ecoli       86.85(5.88)   42.56(1.18)*   86.55(5.93)    42.56(1.18)*   77.95(26.67)   66.12(22.70)*
Glass       68.95(8.63)   37.93(4.68)*   68.57(9.62)    62.54(8.01)*   71.70(10.15)   71.75(10.18)
Heart       82.19(6.02)   65.41(4.66)*   66.11(7.89)*   61.63(4.78)*   66.00(8.13)*   66.04(8.14)*
Liver       66.26(8.12)   57.98(0.84)*   69.52(7.26)    58.01(1.47)*   62.37(6.92)    62.43(6.95)
Olitos      80.75(9.07)   41.67(0.00)*   63.25(12.48)*  44.00(3.76)*   67.58(11.60)*  67.67(11.62)*
Sonar       79.38(8.72)   53.38(1.63)*   73.21(9.50)*   53.43(1.80)*   85.06(7.53)v   84.30(7.82)v
Water2      85.15(4.03)   80.00(1.03)*   77.97(2.66)*   80.00(1.03)*   79.79(5.81)*   79.79(5.81)*
Water3      81.28(3.45)   73.59(1.18)*   74.64(3.77)*   73.59(1.18)*   73.21(5.39)*   73.21(5.39)*
Wine        97.14(3.79)   63.85(6.44)*   96.40(4.06)    93.54(4.93)    95.62(4.54)    95.68(4.42)
Wisconsin   96.69(2.15)   74.70(3.27)*   97.20(2.12)    68.87(2.12)*   96.00(2.20)    96.02(2.27)
Summary
(v/ /*)                   (0/0/10)       (0/5/5)        (0/1/9)        (1/5/4)        (1/4/5)
It can be seen from these results that, for each of the seven datasets, the similarity-based nearest-neighbour algorithms achieve classification accuracies identical to those of their equivalent fuzzy-rough counterparts, while the fuzzy similarity-based approaches take less time. As discussed previously, this is because the calculation of the lower and upper approximations is not required for SNN and ASNN. In particular, for the handwritten, magic and multifeat datasets, the execution time for SNN is statistically lower than that of FRNN. Also, for the handwritten, isolet, multifeat and optdigits datasets, ASNN runs in statistically less time than VQNN. For the remaining datasets, the CPU runtime consumed by SNN and ASNN is statistically equivalent to that of their fuzzy-rough counterparts. These results demonstrate that, for the seven benchmark datasets considered, SNN/ASNN is statistically more efficient than FRNN/VQNN. For extremely large real-world datasets, the efficiency gain of the similarity-based approaches is expected to be even more pronounced. Furthermore, the RMSE values of SNN and ASNN differ from those of FRNN and VQNN. This further demonstrates that, although the final classifications are the same, the underlying models of the classification systems are different.

5.3. Part 2 – comparison with fuzzy nearest-neighbour approaches

This section presents the results of a comparison of SNN and ASNN with two fuzzy nearest-neighbour classification methods: FNN (standard fuzzy nearest-neighbours) [21] and the fuzzy ownership algorithm, FRNN-O [31]. In this experimentation, each approach is run twice: the first time with k = 10, and the second time with k set to the full set of training objects. Again, the results are generated using 10-FCV. Experimental results in terms of classification accuracy and AUC are given below.

5.3.1. Classification accuracy

The results are shown in Tables 6 and 7, where the average classification accuracy and standard deviation for each of the methods are recorded. In addition to classification accuracy, a statistical evaluation was also performed, with the results integrated in these two tables. In particular, a paired t-test with a significance level of 0.05 has been carried out over the 10 runs. The baseline references for the tests are the results obtained by SNN and ASNN with k = 10 (denoted SNN(10) and ASNN(10)), respectively. This statistical analysis is performed in order to ensure that the experimental results are not obtained by chance. The statistical significance results are summarised in the final line of each of the two tables, showing the counts of statistically better (v), equivalent and worse (*) results for each method in comparison with SNN(10) and ASNN(10), respectively. For example, in Table 6, (1/4/5) in the FNN(10) column indicates that this method performs better than SNN(10) on one dataset, equivalently to SNN(10) on four datasets, and worse than SNN(10) on five datasets.

Table 6 confirms what is shown theoretically: for all datasets, SNN remains unaffected by changes in k and, generally, yields the best and most consistent results. FRNN-O is slightly affected by changing k, but its classification accuracies are lower than those of SNN. For the ecoli dataset, FNN(10) obtains a statistically better result. As can be seen from Table 7, for ASNN and FNN, increasing the value of k can have a significant effect on classification accuracy. This
Table 8
AUC for SNN/ASNN.

Dataset     SNN(10)  SNN    ASNN(10)  ASNN   FNN(10)  FNN    FRNN-O(10)  FRNN-O
Ecoli       0.98     0.98   0.98      0.99   0.97     0.50   0.93        0.94
Glass       0.87     0.83   0.91      0.85   0.81     0.79   0.91        0.91
Heart       0.85     0.85   0.90      0.92   0.65     0.57   0.78        0.78
Liver       0.66     0.66   0.69      0.71   0.67     0.50   0.65        0.65
Olitos      0.79     0.91   0.94      0.92   0.74     0.52   0.84        0.85
Sonar       0.95     0.95   0.92      0.86   0.72     0.50   0.93        0.95
Water2      0.88     0.86   0.93      0.93   0.50     0.50   0.76        0.75
Water3      0.85     0.88   0.90      0.92   0.56     0.50   0.74        0.73
Wine        1.00     1.00   1.00      1.00   0.99     0.98   1.00        1.00
Wisconsin   0.99     0.99   0.99      0.99   0.97     0.56   0.99        0.99
is most clearly observed in the results for the ecoli, liver, olitos, sonar and wisconsin datasets, where there are clear downward trends. This is because, when k exceeds a certain value for each dataset, the impact of noise overwhelms the contribution of the underlying non-noisy data. In particular, all of the results obtained by ASNN with k set to the full training set are statistically worse. Overall, however, ASNN(10) is statistically the best performer for all datasets except sonar, for which FRNN-O performs best.

5.3.2. Area under the ROC curve

In Table 8, the results regarding the area under the ROC curve (AUC) are presented. High values of AUC are indicative of good performance. ASNN provides the best performance consistently across different choices of k for most datasets; although ASNN has lower AUC values than FRNN-O for the sonar dataset, SNN gives a comparable result there. It is interesting to note that, while the classification accuracies of SNN(10) and SNN are the same, the corresponding AUC values differ. This is because the underlying model obtained by a nearest-neighbour-based classifier is affected by the choice of the value of k. For certain datasets, e.g., liver, water2 and water3, FNN achieves AUC values of around 0.50, which means its results are almost random. Taking the corresponding classification accuracies into account, SNN and ASNN outperform the other methods.

6. Conclusion

This paper has analysed the classification behaviour of two fuzzy-rough nearest-neighbour classifiers [18] and presented two novel fuzzy similarity-based parallels for them. The work has explored the theoretical aspects and reinforced them with empirical analysis. According to the proofs given in this paper, the classification produced by the fuzzy-rough nearest-neighbour (FRNN) [18] approach is decided only by the greatest similarity between the objects in the training dataset and the test object.
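The two similarity-based decision rules identified by this analysis can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the helper names are hypothetical, and the mean complement-of-distance similarity relation (over attribute values scaled to [0, 1]) is an assumed choice, not necessarily the one used in the paper.

```python
def fuzzy_similarity(x, y):
    """Mean per-attribute fuzzy similarity between two objects whose
    attribute values lie in [0, 1] (one simple, assumed choice of relation)."""
    return sum(1 - abs(a - b) for a, b in zip(x, y)) / len(x)

def snn(test, train, labels):
    """SNN: assign the class of the single most similar training object
    (the rule to which FRNN reduces, per the proofs in the paper)."""
    best = max(range(len(train)), key=lambda i: fuzzy_similarity(test, train[i]))
    return labels[best]

def asnn(test, train, labels, k=10):
    """ASNN: among the k most similar training objects, assign the class
    with the greatest aggregated (summed) similarity (the rule to which
    VQNN reduces)."""
    sims = [(fuzzy_similarity(test, x), y) for x, y in zip(train, labels)]
    sims.sort(reverse=True)
    totals = {}
    for s, y in sims[:k]:
        totals[y] = totals.get(y, 0.0) + s
    return max(totals, key=totals.get)
```

Note that neither rule requires the fuzzy-rough lower or upper approximations, which is the source of the efficiency gain reported in Tables 4 and 5.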
For the vaguely quantified nearest-neighbour (VQNN) [9] approach, the classification depends only on the highest summation of the similarities for each class within the k nearest neighbours. To demonstrate the validity of the theoretical analysis, several experiments have been carried out on benchmark datasets. The experimental results show that the similarity-based nearest-neighbour algorithms are more efficient than their respective fuzzy-rough counterparts (FRNN and VQNN). In addition, because SNN and ASNN are equivalent to FRNN and VQNN, they also retain the advantages of those methods: in the case of VQNN and ASNN, the ability to deal with noisy data; for FRNN and SNN, insensitivity to the value of the parameter k.

Topics for further research include a more comprehensive study of how the novel similarity-based framework could be used for other tasks such as attribute reduction [1,15,19,25,38]. As part of a more comprehensive empirical investigation into the performance of SNN/ASNN, a series of experiments on complex real-world applications [29,33] will be considered for future work. Also, a proposal has recently been made to develop techniques for efficient information aggregation and unsupervised feature selection that exploit the concept of nearest-neighbour-based data reliability [5,6]. An investigation into how such work may be combined with the present similarity-based classification methods, in order to reinforce the potential of both approaches, remains active research.

Acknowledgement

The authors are grateful to the Area Editor and Reviewers whose comments have helped to improve this paper significantly.

References

[1] J. Basak, R.K. De, S.K. Pal, Unsupervised feature selection using neuro-fuzzy approach, Pattern Recognition Letters 19 (13) (1998) 997–1006.
[2] J.R. Beck, E.K. Schultz, The use of relative operating characteristic (ROC) curves in test performance evaluation, Archives of Pathology & Laboratory Medicine 110 (1986) 13–20.
[3] H. Bian, L. Mazlack, Fuzzy-rough nearest-neighbor classification approach, in: Proceedings of the 22nd International Conference of the North American Fuzzy Information Processing Society, 2003, pp. 500–505.
[4] C.L. Blake, C.J. Merz, UCI repository of machine learning databases, University of California, Irvine, School of Information and Computer Sciences, 1998.
[5] T. Boongoen, Q. Shen, Nearest-neighbor guided evaluation of data reliability and its applications, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40 (6) (2010) 1622–1633.
[6] T. Boongoen, C. Shang, N. Iam-On, Q. Shen, Extending data reliability measure to a filter approach for soft subspace clustering, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 41 (6) (2011) 1705–1714.
[7] L. Borkowski (Ed.), Selected Works by Jan Łukasiewicz, North-Holland Pub. Co., Amsterdam, 1970, pp. 87–88.
[8] M. De Cock, C. Cornelis, E.E. Kerre, Fuzzy rough sets: the forgotten step, IEEE Transactions on Fuzzy Systems 15 (1) (2007) 121–130.
[9] C. Cornelis, M. De Cock, A.M. Radzikowska, Vaguely quantified rough sets, Lecture Notes in Artificial Intelligence 4482 (2007) 87–94.
[10] M. Das, M.K. Chakraborty, T.K. Ghoshal, Fuzzy tolerance relation, fuzzy tolerance space and basis, Fuzzy Sets and Systems 97 (1998) 361–369.
[11] S.P. Dienes, On an implication function in many-valued systems of logic, Journal of Symbolic Logic 14 (2) (1949) 95–97.
[12] D. Dubois, H. Prade, Putting rough sets and fuzzy sets together, Intelligent Decision Support (1992) 203–232.
[13] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., John Wiley and Sons, New York, 2001.
[14] D. Hand, R. Till, A simple generalisation of the area under the ROC curve for multiple class classification problems, Machine Learning 45 (2001) 171–186.
[15] Q. Hu, D. Yu, Z. Xie, J. Liu, Fuzzy probabilistic approximation spaces and their information measures, IEEE Transactions on Fuzzy Systems 14 (2) (2006) 191–201.
[16] J. Huang, C. Ling, Using AUC and accuracy in evaluating learning algorithms, IEEE Transactions on Knowledge and Data Engineering 17 (3) (2005) 299–310.
[17] R. Jensen, C. Cornelis, Fuzzy-rough instance selection, in: Proceedings of the 19th International Conference on Fuzzy Systems, 2010, pp. 1776–1782.
[18] R. Jensen, C. Cornelis, Fuzzy-rough nearest neighbour classification and prediction, Theoretical Computer Science 412 (42) (2011) 5871–5884.
[19] R. Jensen, Q. Shen, Computational Intelligence and Feature Selection: Rough and Fuzzy Approaches, IEEE Press and Wiley & Sons, 2008.
[20] R. Jensen, Q. Shen, New approaches to fuzzy-rough feature selection, IEEE Transactions on Fuzzy Systems 17 (4) (2009) 824–838.
[21] J.M. Keller, M.R. Gray, J.A. Givens, A fuzzy k-nearest neighbor algorithm, IEEE Transactions on Systems, Man, and Cybernetics 15 (4) (1985) 580–585.
[22] S.C. Kleene, Introduction to Metamathematics, Van Nostrand, New York, 1952.
[23] S. Lanteria, C. Armanino, R. Leardia, G. Modi, Chemometric analysis of Tuscan olive oil, Chemometrics and Intelligent Laboratory Systems 5 (4) (1989) 343–354.
[24] P. Maji, S.K. Pal, Fuzzy-rough sets for information measures and selection of relevant genes from microarray data, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 40 (3) (2010) 741–752.
[25] P. Maji, S.K. Pal, Feature selection using f-information measures in fuzzy approximation space, IEEE Transactions on Knowledge and Data Engineering 22 (6) (2010) 854–867.
[26] S.J. Mason, N.E. Graham, Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: statistical significance and interpretation, Quarterly Journal of the Royal Meteorological Society 128 (2002) 2145–2166.
[27] S. Mitra, S.K. Pal, Fuzzy sets in pattern recognition and machine intelligence, Fuzzy Sets and Systems 156 (3) (2005) 381–386.
[28] Z. Pawlak, Rough Sets: Theoretical Aspects of Reasoning About Data, Kluwer Academic Publishing, 1991.
[29] Y. Qu, C. Shang, W. Wu, Q. Shen, Evolutionary fuzzy extreme learning machine for mammographic risk analysis, International Journal of Fuzzy Systems 13 (4) (2011) 282–291.
[30] A.M. Radzikowska, E.E. Kerre, A comparative study of fuzzy rough sets, Fuzzy Sets and Systems 126 (2) (2002) 137–155.
[31] M. Sarkar, Fuzzy-rough nearest neighbors algorithm, Fuzzy Sets and Systems 158 (2007) 2123–2152.
[32] D. Sen, S.K. Pal, Generalized rough sets, entropy and image ambiguity measures, IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 39 (1) (2009) 117–128.
[33] C. Shang, D. Barnes, Q. Shen, Facilitating efficient Mars terrain image classification with fuzzy-rough feature selection, International Journal of Hybrid Intelligent Systems 8 (1) (2011) 3–13.
[34] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools with Java Implementations, Morgan Kaufmann, San Francisco, 2000.
[35] Y.Y. Yao, Two views of the theory of rough sets in finite universes, International Journal of Approximate Reasoning 15 (4) (1996) 291–317.
[36] Y.Y. Yao, A comparative study of fuzzy sets and rough sets, Information Sciences 109 (1–4) (1998) 227–242.
[37] Y.Y. Yao, Relational interpretations of neighborhood operators and rough set approximation operators, Information Sciences 111 (1–4) (1998) 239–259.
[38] Y.Y. Yao, Y. Zhao, Attribute reduction in decision-theoretic rough set models, Information Sciences 178 (17) (2008) 3356–3373.
[39] L.A. Zadeh, Fuzzy sets, Information and Control 8 (1965) 338–353.