Journal of Interdisciplinary Mathematics Vol. ? (?), No. ?, pp. ?–? DOI : xxxxxx


Fuzzy Distance Measure and Fuzzy Clustering Algorithm

Ismat Beg *
Centre for Mathematics and Statistical Sciences, Lahore School of Economics, Lahore, Pakistan

Tabasam Rashid †
Department of Science and Humanities, National University of Computer and Emerging Sciences, Lahore Campus, Pakistan

Abstract
A modified dissimilarity measure for fuzzy data is proposed and applied to real data with mixed feature variables of symbolic and fuzzy type. We also propose an improved version of the fuzzy relational clustering algorithm. Numerical examples and a comparison between the mixed-variable fuzzy c-means and the fuzzy relational clustering algorithm are given.

Keywords: Algorithm; dissimilarity matrix; fuzzy clustering; fuzzy data; mixed-variable data; symbolic data.

1. Introduction

Cluster analysis is one of the important techniques in pattern recognition and data analysis. Bellman et al. [1] and Ruspini [10] initiated research on cluster analysis using fuzzy sets. Fuzzy clustering methods can be divided into two fields: one involves objective functions defined through distances, while the other is based on fuzzy relations; for further details see [17, 1, 2]. Fuzzy data can be presented as feature vectors, so that distances can be calculated between fuzzy feature vectors.

*E-mail: [email protected]
†E-mail: [email protected]


Data is often symbolic or fuzzy, and classical fuzzy clustering algorithms cannot be used for such mixed-variable data. Recently, Liao [9] proposed coding and classification approaches to part family formation under a fuzzy environment, constructing a coding structure for fuzzy data by treating a fuzzy feature as a linguistic variable. The fuzzy c-means (FCM) clustering algorithms are well-known methods [5, 14]; FCM is used for data sets in $\Re^d$. Yang and Ko [13] proposed a distance measure for LR-type fuzzy data. Yang et al. [15] presented the mixed-variable fuzzy c-means (MVFCM) for feature vectors of mixed data type, and Yang et al. [14] applied the MVFCM proposed in [15]. Hung et al. [7] described two problems with the Yang et al. [15] distance measure for fuzzy data and proposed a new distance measure to resolve them. In this paper, we describe three problems with the Hung et al. [7] distance measure for fuzzy data. To solve these problems, we propose a new distance measure and construct a dissimilarity matrix. We also describe a problem with the fuzzy relational data clustering (FRC) algorithm proposed by Davé and Sen [4], which is also used by Hung et al. [7], and propose an improved version of their FRC algorithm. Based on the partition entropy (PE) and partition coefficient (PC) [2], we obtain the optimal partition. In Section 2, we review the Yang et al. [15] distance for symbolic data. In Section 3, the Hung et al. [7] distance for fuzzy data is discussed. In Section 4, we propose a modified distance measure for fuzzy data. In Section 5, we study the FRC algorithm given by Davé and Sen [4]. We introduce the improved version of the FRC algorithm in Section 6. Section 7 applies the improved FRC algorithm, using the proposed distance measure, to the real data used by Yang et al. [15].
We also compare the results of the MVFCM and FRC algorithms, applied with the Yang et al. [15] distance measure and the proposed distance measure, respectively. Finally, conclusions are stated in Section 8.

2. Distance measure between mixed-variable feature vectors

Let F be a feature vector with components $F_1, \ldots, F_d$. For any two feature vectors $A = (A_1, \ldots, A_d)$ and $B = (B_1, \ldots, B_d)$, the distance between A and B is given by

$$D(A, B) = \sum_{k=1}^{d} D(A_k, B_k),$$


where $D(A_k, B_k)$ is the distance between the kth feature components, based on the feature type. Symbolic features consist of quantitative (absolute, ratio and interval) and qualitative (ordinal, nominal and combinational) values. Gowda and Diday's [5, 6] distance measure between symbolic features is defined on the basis of position, span and content. However, the Gowda and Diday distance measure has drawbacks. To overcome them, Yang et al. [15] gave a modified distance measure as follows.

The distance measure between quantitative interval-type kth features $A_k$ and $B_k$ is defined by the following three components: (1) $D_p(A_k, B_k)$, due to position; (2) $D_s(A_k, B_k)$, due to span; (3) $D_c(A_k, B_k)$, due to content. Let

$a_l^k$ = lower limit of interval $A_k$;
$a_u^k$ = upper limit of interval $A_k$;
$b_l^k$ = lower limit of interval $B_k$;
$b_u^k$ = upper limit of interval $B_k$;
$\text{inters}^k$ = length of the intersection of $A_k$ and $B_k$;
$\ell_s^k$ = span length of $A_k$ and $B_k$ $= |\max(a_u^k, b_u^k) - \min(a_l^k, b_l^k)|$;
$U = \max_k(a_u^k, b_u^k) - \min_k(a_l^k, b_l^k)$ = the difference between the highest and lowest values over all quantitative interval-type features;
$\ell_a^k = |a_u^k - a_l^k|$;
$\ell_b^k = |b_u^k - b_l^k|$.

The three distance components are then given as follows:

$$D_p(A_k, B_k) = \frac{|(a_l^k + a_u^k)/2 - (b_l^k + b_u^k)/2|}{U},$$

$$D_s(A_k, B_k) = \frac{|\ell_a^k - \ell_b^k|}{U + \ell_a^k + \ell_b^k - \text{inters}^k},$$

$$D_c(A_k, B_k) = \frac{|\ell_a^k + \ell_b^k - 2\,\text{inters}^k|}{U + \ell_a^k + \ell_b^k - \text{inters}^k}.$$

Thus, the distance measure $D_{qt}(A_k, B_k)$ between quantitative interval-type features $A_k$ and $B_k$ is defined as follows:




$$D_{qt}(A_k, B_k) = D_p(A_k, B_k) + D_s(A_k, B_k) + D_c(A_k, B_k). \qquad (1)$$
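As a quick illustration of the three components above, here is a small sketch in Python (the function name `dqt` and the convention of passing the global range `U` explicitly are our own):

```python
def dqt(A, B, U):
    """Yang et al. [15] distance between two quantitative intervals.

    A = (al, au), B = (bl, bu) are interval limits; U is the range of this
    feature over ALL objects (highest minus lowest value), supplied by the
    caller. A real number x enters as the degenerate interval (x, x).
    """
    al, au = A
    bl, bu = B
    la, lb = au - al, bu - bl                      # interval lengths
    inters = max(0.0, min(au, bu) - max(al, bl))   # overlap length
    denom = U + la + lb - inters
    Dp = abs((al + au) / 2 - (bl + bu) / 2) / U    # position
    Ds = abs(la - lb) / denom                      # span
    Dc = abs(la + lb - 2 * inters) / denom         # content
    return Dp + Ds + Dc
```

For two real numbers the span and content terms vanish; for instance $D_{qt}(6, 10) = 0.4$ with $U = 10$, as in Example 2 below.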

For qualitative-type $A_k$ and $B_k$, the distance component due to position is absent, and the term $U$ is absent too. Thus, the two remaining components that contribute to the distance are those due to span and content. Let

$\ell_a^k$ = length of $A_k$ = the number of elements in $A_k$;
$\ell_b^k$ = length of $B_k$ = the number of elements in $B_k$;
$\ell_s^k$ = span length of $A_k$ and $B_k$ = the number of elements in the union of $A_k$ and $B_k$;
$\text{inters}^k$ = the number of elements in the intersection of $A_k$ and $B_k$.

The two distance components are then defined as follows:

$$D_s(A_k, B_k) = \frac{|\ell_a^k - \ell_b^k|}{\ell_s^k},$$

$$D_c(A_k, B_k) = \frac{|\ell_a^k + \ell_b^k - 2\,\text{inters}^k|}{\ell_s^k}.$$

Thus, the distance measure $D_{ql}(A_k, B_k)$ between qualitative-type $A_k$ and $B_k$ is defined as

$$D_{ql}(A_k, B_k) = D_s(A_k, B_k) + D_c(A_k, B_k).$$
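The qualitative-type distance can be sketched the same way, with Python sets standing in for the feature values (the function name `dql` is ours):

```python
def dql(A, B):
    """Yang et al. [15] distance between two qualitative (set-valued) features."""
    A, B = set(A), set(B)
    la, lb = len(A), len(B)                 # lengths of A_k and B_k
    span = len(A | B)                       # span length: size of the union
    inters = len(A & B)                     # size of the intersection
    Ds = abs(la - lb) / span                # span component
    Dc = abs(la + lb - 2 * inters) / span   # content component
    return Ds + Dc
```

For the colour sets used in Section 7, {W, S, D, R, B} versus {W, S, R, G}, this gives 0.1667 + 0.5 = 0.6667, matching the worked computation there.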

3. Fuzzy feature components

Fuzzy data is a data type carrying imprecision or uncertainty due to ambiguity. It is generally more convenient and useful to describe fuzzy data by LR-type fuzzy numbers [11]. Zimmermann [17, Subsection 5.3.2] defined LR-type fuzzy numbers, and Hung et al. [7] defined the distance for LR-type trapezoidal fuzzy numbers (TFNs) as follows. Let L (and R) be decreasing shape functions from $\Re^+ = [0, \infty)$ to $[0, 1]$ with $L(0) = 1$; $L(x) < 1$ for all $x > 0$; $L(x) > 0$ for all $x < 1$; and $L(1) = 0$ (or $L(x) > 0$ for all $x$ and $L(+\infty) = 0$). An LR-type TFN $X$ has the following membership function:

$$\mu_X(x) = \begin{cases} L\!\left(\dfrac{m_1 - x}{\alpha}\right) & \text{for } x \le m_1, \\ 1 & \text{for } m_1 \le x \le m_2, \\ R\!\left(\dfrac{x - m_2}{\beta}\right) & \text{for } x \ge m_2, \end{cases}$$

where $\alpha > 0$ and $\beta > 0$ are called the left and right spreads, respectively. Symbolically, $X$ is denoted by $(m_1, m_2, \alpha, \beta)_{LR}$. The LR-type TFN is very general and allows one to represent different types of information. For example, the LR-type TFN $X = (m, m, 0, 0)_{LR}$ with $m \in \Re = (-\infty, \infty)$ denotes a real number, and $X = (a, b, 0, 0)_{LR}$ with $a, b \in \Re$ and $a < b$ denotes an interval. The LR-type TFN $(m_1, m_2, \alpha, \beta)_{LR}$ divides into three parts: the left part $[m_1 - \alpha, m_1]$, the middle part $[m_1, m_2]$ and the right part $[m_2, m_2 + \beta]$.

Let $A_k = (m_{1a}^k, m_{2a}^k, \alpha_a^k, \beta_a^k)_{LR}$ and $B_k = (m_{1b}^k, m_{2b}^k, \alpha_b^k, \beta_b^k)_{LR}$ be two fuzzy feature components. The distance between $A_k$ and $B_k$ is defined as the sum of the distances between these three parts due to position, span and content. Let

$U_L$ = the difference between the highest and lowest values of the left part over all feature components $= \max_k\{m_{1a}^k, m_{1b}^k\} - \min_k\{m_{1a}^k - \alpha_a^k,\, m_{1b}^k - \alpha_b^k\}$;
$\text{inters}_L$ = the length of the intersection of the left parts of $A_k$ and $B_k$;
$U_T$ = the difference between the highest and lowest values of the middle part over all feature components $= \max_k\{m_{2a}^k, m_{2b}^k\} - \min_k\{m_{1a}^k, m_{1b}^k\}$;
$\text{inters}_T$ = the length of the intersection of the middle parts of $A_k$ and $B_k$;
$U_R$ = the difference between the highest and lowest values of the right part over all feature components $= \max_k\{m_{2a}^k + \beta_a^k,\, m_{2b}^k + \beta_b^k\} - \min_k\{m_{2a}^k, m_{2b}^k\}$;
$\text{inters}_R$ = the length of the intersection of the right parts of $A_k$ and $B_k$.

Then the distance between $A_k$ and $B_k$ is defined as follows [7]:

$$D_{MLR}(A_k, B_k) = TD_p + TD_s + TD_c + LD_p + LD_s + LD_c + RD_p + RD_s + RD_c, \qquad (2)$$


where

$$TD_p = \frac{|a_1 - b_1|}{U_T},$$
$$TD_s = \frac{|a_2 - b_2|}{U_T + a_2 + b_2 - \text{inters}_T},$$
$$TD_c = \frac{|a_2 + b_2 - 2\,\text{inters}_T|}{U_T + a_2 + b_2 - \text{inters}_T},$$
$$LD_p = \left|\left(\frac{2a_1 - a_2}{2} - l a_3\right) - \left(\frac{2b_1 - b_2}{2} - l b_3\right)\right| \Big/ U_L,$$
$$LD_s = \frac{|a_3 - b_3|}{U_L + a_3 + b_3 - \text{inters}_L},$$
$$LD_c = \frac{|a_3 + b_3 - 2\,\text{inters}_L|}{U_L + a_3 + b_3 - \text{inters}_L},$$
$$RD_p = \left|\left(\frac{2a_1 + a_2}{2} + r a_4\right) - \left(\frac{2b_1 + b_2}{2} + r b_4\right)\right| \Big/ U_R,$$
$$RD_s = \frac{|a_4 - b_4|}{U_R + a_4 + b_4 - \text{inters}_R},$$
$$RD_c = \frac{|a_4 + b_4 - 2\,\text{inters}_R|}{U_R + a_4 + b_4 - \text{inters}_R},$$

where

$$a_1 = (m_{1a}^k + m_{2a}^k)/2, \quad a_2 = m_{2a}^k - m_{1a}^k, \quad a_3 = \alpha_a^k, \quad a_4 = \beta_a^k,$$
$$b_1 = (m_{1b}^k + m_{2b}^k)/2, \quad b_2 = m_{2b}^k - m_{1b}^k, \quad b_3 = \alpha_b^k, \quad b_4 = \beta_b^k,$$
$$l = \int_0^1 L^{-1}(w)\,dw \quad \text{and} \quad r = \int_0^1 R^{-1}(w)\,dw.$$

For an LR-type TFN $X = (m_1, m_2, \alpha, \beta)_{LR}$, if $L(x) = R(x) = 1 - x$ then $X$ is called a TFN, denoted by $X = (m_1, m_2, \alpha, \beta)_T$, i.e.

$$\mu_X(x) = \begin{cases} 1 - \dfrac{m_1 - x}{\alpha} & \text{for } x \le m_1 \ (\alpha > 0), \\ 1 & \text{for } m_1 \le x \le m_2, \\ 1 - \dfrac{x - m_2}{\beta} & \text{for } x \ge m_2 \ (\beta > 0). \end{cases}$$
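For concreteness, the TFN membership function can be written directly (a sketch; the handling of zero spreads, treating $(m, m, 0, 0)_T$ as a crisp number, is our own convention):

```python
def tfn_membership(x, m1, m2, alpha, beta):
    """Membership mu_X(x) of the TFN X = (m1, m2, alpha, beta)_T,
    i.e. the LR-type fuzzy number with L(t) = R(t) = 1 - t."""
    if x < m1:
        # left part: 1 - (m1 - x)/alpha, clipped at 0
        return max(0.0, 1 - (m1 - x) / alpha) if alpha > 0 else 0.0
    if x <= m2:
        return 1.0  # middle part: the core [m1, m2]
    # right part: 1 - (x - m2)/beta, clipped at 0
    return max(0.0, 1 - (x - m2) / beta) if beta > 0 else 0.0
```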


Obviously, $l = 1/2$ and $r = 1/2$ when calculating $D_{MLR}$ in Eq. (2) for two TFNs $A = (m_{1a}, m_{2a}, \alpha_a, \beta_a)_T$ and $B = (m_{1b}, m_{2b}, \alpha_b, \beta_b)_T$. Among LR-type TFNs, the TFNs are the most commonly used.

We find that the Hung et al. [7] distance measure $D_{MLR}$ described in Section 3 has three drawbacks. The first is that $D_{MLR}$ depends on the presentation of the data. The second is that the distance measure $D_{qt}$ or $D_{ql}$ is dominated by $D_{MLR}$. The third is that we reach the $0/0$ form while calculating $D_{MLR}$. We use the following examples to illustrate these three drawbacks.

Example 1. [7, Example 1] Let $A = (0.48, 0.56, 0, 0)_T$ and $B = (0.16, 0.23, 0, 0)_T$. Then $D_{MLR}(A, B) = 3.103$ while $D_{qt}(A, B) = 1.103$. This result illustrates that the distance measure $D_{MLR}$ depends on the presentation of the data.

Example 2. Let the nine data A, B, C, D, E, F, G, H and I have two feature components, of quantitative and fuzzy types. The corresponding data are:

A = (6, (4, 8, 2, 2)_T), B = (10, (8, 12, 2, 2)_T), C = (11, (9, 13, 2, 2)_T),
D = (12, (10, 14, 2, 2)_T), E = (16, (14, 18, 2, 2)_T), F = (6, (4, 8, 0, 0)_T),
G = (13, (9, 14, 0, 0)_T), H = (6, (4, 8, 0, 2)_T), I = (14, (9, 14, 2, 0)_T).

We have

D(A, B) = D_{qt}(6, 10) + D_{MLR}((4, 8, 2, 2)_T, (8, 12, 2, 2)_T) = 0.4 + 1.849 = 2.249,
D(A, C) = D_{qt}(6, 11) + D_{MLR}((4, 8, 2, 2)_T, (9, 13, 2, 2)_T) = 0.5 + 2.054 = 2.554,
D(A, D) = D_{qt}(6, 12) + D_{MLR}((4, 8, 2, 2)_T, (10, 14, 2, 2)_T) = 0.6 + 2.291 = 2.891,
D(A, E) = D_{qt}(6, 16) + D_{MLR}((4, 8, 2, 2)_T, (14, 18, 2, 2)_T) = 1 + 3.244 = 4.244,
D(F, G) = D_{qt}(6, 13) + D_{MLR}((4, 8, 0, 0)_T, (9, 14, 0, 0)_T) = 0.7 + 1.7443 = 2.4443,
D(H, I) = D_{qt}(6, 14) + D_{MLR}((4, 8, 0, 2)_T, (9, 14, 2, 0)_T) = 0.8 + 2.1491 = 2.9491.

The second-component distance between H and I should be less than that between F and G, since the second components of H and I are geometrically closer to each other than those of F and G.


Example 3. Let the two data A and B have one feature component of fuzzy type: $A = (0, 3, 0, 1)_T$ and $B = (0, 9, 0, 4)_T$. We have $D(A, B) = D_{MLR}((0, 3, 0, 1)_T, (0, 9, 0, 4)_T)$. In computing this distance we obtain the $0/0$ form in $LD_p$, $LD_s$ and $LD_c$.

Example 4. Let the two data C and D have one feature component of fuzzy type: $C = (2, 0, 5, 0)_T$ and $D = (2, 0, 4, 0)_T$. We have $D(C, D) = D_{MLR}((2, 0, 5, 0)_T, (2, 0, 4, 0)_T)$. In computing this distance we obtain the $0/0$ form in $RD_p$, $RD_s$ and $RD_c$.

4. New distance measure

Let F be a feature vector with components $F_1, \ldots, F_d$, and suppose there are s feature vectors $A_1, A_2, \ldots, A_s$. For any two feature vectors $A_y = (A_{y1}, \ldots, A_{yd})$ and $A_z = (A_{z1}, \ldots, A_{zd})$, the distance between $A_y$ and $A_z$ is given by

$$D(A_y, A_z) = \sum_{k=1}^{d} D(A_{yk}, A_{zk}),$$

where $D(A_{yk}, A_{zk})$ is the distance between the kth feature components, based on the feature type, and $1 \le y \le s$, $1 \le z \le s$. Since there are symbolic and fuzzy feature components in a d-tuple feature vector, the distance $D(A_y, A_z)$ is the sum of the distances for the individual components. To overcome the drawbacks of Hung et al. [7] described in Section 3, we propose a new distance measure between LR-type TFNs. Let $A_{yk} = (a_y^k, b_y^k, \alpha_y^k, \beta_y^k)_{LR}$ and $A_{zk} = (a_z^k, b_z^k, \alpha_z^k, \beta_z^k)_{LR}$ be two fuzzy feature


vectors. The distance between $A_{yk}$ and $A_{zk}$ is defined through components due to position, span and content. Let

$$U_k = \max\{b_1^k + \beta_1^k, \ldots, b_s^k + \beta_s^k\} - \min\{a_1^k - \alpha_1^k, \ldots, a_s^k - \alpha_s^k\};$$

$\text{inters}_k$ = the length of the intersection of $[a_y^k - \alpha_y^k,\, b_y^k + \beta_y^k]$ and $[a_z^k - \alpha_z^k,\, b_z^k + \beta_z^k]$.

Then the distance between $A_{yk}$ and $A_{zk}$ is defined as follows:

$$D_{tab}(A_{yk}, A_{zk}) = tD_p(A_{yk}, A_{zk}) + tD_s(A_{yk}, A_{zk}) + tD_c(A_{yk}, A_{zk}), \qquad (3)$$

where

$$tD_p(A_{yk}, A_{zk}) = \frac{|(a_y^k - l\alpha_y^k + b_y^k + r\beta_y^k)/2 - (a_z^k - l\alpha_z^k + b_z^k + r\beta_z^k)/2|}{U_k},$$

$$tD_s(A_{yk}, A_{zk}) = \frac{\big|\,|b_y^k + \beta_y^k - a_y^k + \alpha_y^k| - |b_z^k + \beta_z^k - a_z^k + \alpha_z^k|\,\big|}{U_k + |b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| - \text{inters}_k},$$

$$tD_c(A_{yk}, A_{zk}) = \frac{\big|\,|b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| - 2\,\text{inters}_k\,\big|}{U_k + |b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| - \text{inters}_k}.$$
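A direct transcription of Eq. (3) into Python reads as follows (the function name `dtab` and the convention of passing $U_k$ explicitly are ours; $l = r = 1/2$ are the defaults for TFNs):

```python
def dtab(A, B, U, l=0.5, r=0.5):
    """Proposed distance, Eq. (3), between LR-type TFNs A and B, each
    given as (a, b, alpha, beta); U is U_k computed over all s objects
    of the feature, supplied by the caller."""
    ay, by, aly, bey = A
    az, bz, alz, bez = B
    # defuzzified centres used in the position term tD_p
    cy = (ay - l * aly + by + r * bey) / 2
    cz = (az - l * alz + bz + r * bez) / 2
    # support lengths |b + beta - a + alpha| and their overlap inters_k
    ly = (by + bey) - (ay - aly)
    lz = (bz + bez) - (az - alz)
    inters = max(0.0, min(by + bey, bz + bez) - max(ay - aly, az - alz))
    denom = U + ly + lz - inters
    tDp = abs(cy - cz) / U          # position
    tDs = abs(ly - lz) / denom      # span
    tDc = abs(ly + lz - 2 * inters) / denom  # content
    return tDp + tDs + tDc
```

With $U_k = 18$ for the second feature of Example 2, this reproduces the component values 0.4889, 0.5357 and 0.6759 quoted when Example 2 is revisited below.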

It is natural to ask: is the defined distance measure $D_{tab}(A_{yk}, A_{zk})$ reasonable? We answer this question in Proposition 1.

Proposition 1. The distance measure $D_{tab}(A_{yk}, A_{zk})$ between fuzzy feature components $A_{yk}$ and $A_{zk}$ satisfies the following properties:
(P1) $D_{tab}(A_{yk}, A_{zk}) = D_{tab}(A_{zk}, A_{yk})$;
(P2) $A_{yk} = A_{zk}$ if and only if $D_{tab}(A_{yk}, A_{zk}) = 0$.

Proof. It is easy to see that $D_{tab}(A_{yk}, A_{zk})$ satisfies (P1); we therefore only prove (P2). By Eq. (3), we have

$$D_{tab}(A_{yk}, A_{zk}) = 0$$
$$\Leftrightarrow tD_p = tD_s = tD_c = 0$$
$$\Leftrightarrow a_y^k - l\alpha_y^k + b_y^k + r\beta_y^k = a_z^k - l\alpha_z^k + b_z^k + r\beta_z^k,$$
$$\qquad b_y^k + \beta_y^k - a_y^k + \alpha_y^k = b_z^k + \beta_z^k - a_z^k + \alpha_z^k,$$
$$\qquad |b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| = 2\,\text{inters}_k$$
$$\Leftrightarrow a_y^k = a_z^k,\; b_y^k = b_z^k,\; \alpha_y^k = \alpha_z^k,\; \beta_y^k = \beta_z^k$$
$$\Leftrightarrow A_{yk} = A_{zk}.$$

Thus (P2) is proven.

Next, we reconsider Examples 1, 2, 3 and 4 using the new distance measure $D_{tab}$.

Revisiting Example 1. Let $A = (0.48, 0.56, 0, 0)_T$ and $B = (0.16, 0.23, 0, 0)_T$. Then $D_{tab}(A, B) = 1.103\, (= D_{qt}(A, B))$. This result illustrates that the proposed distance measure $D_{tab}$ does not depend on the presentation of the data.

It is natural to ask: does the new distance measure satisfy $D_{tab}(A_{yk}, A_{zk}) = D_{qt}(A, B)$ when $\alpha_y^k = \alpha_z^k = 0$ and $\beta_y^k = \beta_z^k = 0$? We answer this question in Proposition 2.

Proposition 2. $D_{tab}(A_{yk}, A_{zk}) = D_{qt}(A, B)$ for $\alpha_y^k = \alpha_z^k = 0$ and $\beta_y^k = \beta_z^k = 0$ in $A_{yk}$ and $A_{zk}$.

Proof. By Eq. (3), we have

$$tD_p(A_{yk}, A_{zk}) = \frac{|(a_y^k - l\alpha_y^k + b_y^k + r\beta_y^k)/2 - (a_z^k - l\alpha_z^k + b_z^k + r\beta_z^k)/2|}{U_k},$$

$$tD_s(A_{yk}, A_{zk}) = \frac{\big|\,|b_y^k + \beta_y^k - a_y^k + \alpha_y^k| - |b_z^k + \beta_z^k - a_z^k + \alpha_z^k|\,\big|}{U_k + |b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| - \text{inters}_k},$$

$$tD_c(A_{yk}, A_{zk}) = \frac{\big|\,|b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| - 2\,\text{inters}_k\,\big|}{U_k + |b_y^k + \beta_y^k - a_y^k + \alpha_y^k| + |b_z^k + \beta_z^k - a_z^k + \alpha_z^k| - \text{inters}_k}.$$

Putting $\alpha_y^k = \alpha_z^k = 0$ and $\beta_y^k = \beta_z^k = 0$ in $tD_p(A_{yk}, A_{zk})$, $tD_s(A_{yk}, A_{zk})$ and $tD_c(A_{yk}, A_{zk})$, we have

$$tD_p(A_{yk}, A_{zk}) = \frac{|(a_y^k + b_y^k)/2 - (a_z^k + b_z^k)/2|}{U_k},$$

$$tD_s(A_{yk}, A_{zk}) = \frac{\big|\,|b_y^k - a_y^k| - |b_z^k - a_z^k|\,\big|}{U_k + |b_y^k - a_y^k| + |b_z^k - a_z^k| - \text{inters}_k},$$

$$tD_c(A_{yk}, A_{zk}) = \frac{\big|\,|b_y^k - a_y^k| + |b_z^k - a_z^k| - 2\,\text{inters}_k\,\big|}{U_k + |b_y^k - a_y^k| + |b_z^k - a_z^k| - \text{inters}_k}.$$

Now, by Section 2, we have

$$tD_p(A_{yk}, A_{zk}) = D_p(A_k, B_k),$$
$$tD_s(A_{yk}, A_{zk}) = D_s(A_k, B_k),$$
$$tD_c(A_{yk}, A_{zk}) = D_c(A_k, B_k).$$

Thus $D_{tab}(A_{yk}, A_{zk}) = D_{qt}(A, B)$ for $\alpha_y^k = \alpha_z^k = 0$ and $\beta_y^k = \beta_z^k = 0$ in $A_{yk}$ and $A_{zk}$.

Revisiting Example 2. Consider the nine data A, B, C, D, E, F, G, H and I of Example 2. We have

D(A, B) = D_{qt}(6, 10) + D_{tab}((4, 8, 2, 2)_T, (8, 12, 2, 2)_T) = 0.4 + 0.4889 = 0.8889,
D(A, C) = D_{qt}(6, 11) + D_{tab}((4, 8, 2, 2)_T, (9, 13, 2, 2)_T) = 0.5 + 0.6004 = 1.1004,
D(A, D) = D_{qt}(6, 12) + D_{tab}((4, 8, 2, 2)_T, (10, 14, 2, 2)_T) = 0.6 + 0.7083 = 1.3083,
D(A, E) = D_{qt}(6, 16) + D_{tab}((4, 8, 2, 2)_T, (14, 18, 2, 2)_T) = 1 + 1.0261 = 2.0261,




D(F, G) = D_{qt}(6, 13) + D_{tab}((4, 8, 0, 0)_T, (9, 14, 0, 0)_T) = 0.7 + 0.6759 = 1.3759,
D(H, I) = D_{qt}(6, 14) + D_{tab}((4, 8, 0, 2)_T, (9, 14, 2, 0)_T) = 0.8 + 0.5357 = 1.3357.

The second-component distance between H and I is now less than that between F and G, as the second components of H and I are geometrically closer to each other than those of F and G.

Revisiting Example 3. Let the two data A and B have one feature component of fuzzy type: $A = (0, 3, 0, 1)_T$ and $B = (0, 9, 0, 4)_T$. We have $D(A, B) = D_{tab}((0, 3, 0, 1)_T, (0, 9, 0, 4)_T) = 0.9808$.

Revisiting Example 4. Let the two data C and D have one feature component of fuzzy type: $C = (2, 0, 5, 0)_T$ and $D = (2, 0, 4, 0)_T$. We have $D(C, D) = D_{tab}((2, 0, 5, 0)_T, (2, 0, 4, 0)_T) = 0.4167$.

The above examples indicate that the proposed distance measure $D_{tab}$ between two LR-type TFNs indeed corrects the problems with the Hung et al. [7] distance.

5. Fuzzy relational data clustering algorithm

In 2002, Davé and Sen [4] proposed a clustering algorithm, called the fuzzy relational data clustering (FRC) algorithm, which is a generalization of FANNY [8, Chapter 4]. The advantages of the FRC algorithm are faster convergence, robustness with respect to outliers, and ease of handling all kinds of relational data, including data based on non-Euclidean distance measures. The FRC algorithm [4] is described as follows. For n objects, the relational data is usually described by an $n \times n$ dissimilarity matrix $(R_{ij})_{n \times n}$, where $R_{ij} \ge 0$, $R_{ii} = 0$ and $R_{ij} = R_{ji}$ for $i, j = 1, \ldots, n$.


The functional for FRC is

$$F_{FRC} = \sum_{i=1}^{c} \frac{\sum_{j=1}^{n} \sum_{k=1}^{n} u_{ik}^m u_{ij}^m R_{jk}}{2 \sum_{t=1}^{n} u_{it}^m},$$

subject to the conditions

$$\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \ldots, n,$$
$$u_{ik} \ge 0, \quad i = 1, 2, \ldots, c; \; k = 1, 2, \ldots, n,$$

with $m > 1$ and $c$ a positive integer greater than 1. Applying the Lagrange multiplier method, the memberships are obtained as follows:

$$u_{ik} = \frac{a_{ik}^{-1/(m-1)}}{\sum_{w=1}^{c} a_{wk}^{-1/(m-1)}}, \qquad (4)$$

where

$$a_{ik} = \frac{m \sum_{j=1}^{n} u_{ij}^m R_{jk}}{\sum_{j=1}^{n} u_{ij}^m} - \frac{m \sum_{h=1}^{n} \sum_{j=1}^{n} u_{ij}^m u_{ih}^m R_{jh}}{2 \left( \sum_{j=1}^{n} u_{ij}^m \right)^2}. \qquad (5)$$

Thus, we have the FRC clustering algorithm as follows:

FRC algorithm
S1: Fix $m > 1$ and $2 \le c \le n - 1$, and set $\ell = 1$. Give any $\varepsilon > 0$ and initial memberships $u_{ik}^{(0)}$, $1 \le i \le c$, $1 \le k \le n$.
S2: Compute $a_{ik}^{(\ell)}$ with $u_{ik}^{(\ell-1)}$ by Eq. (5).
S3: Update $u_{ik}^{(\ell)}$ with $a_{ik}^{(\ell)}$ by Eq. (4).


S4: Compare $u_{ik}^{(\ell)}$ to $u_{ik}^{(\ell-1)}$ in a convenient norm $\| u^{(\ell)} - u^{(\ell-1)} \|$. IF $\| u^{(\ell)} - u^{(\ell-1)} \| < \varepsilon$, THEN stop; ELSE set $\ell = \ell + 1$ and return to S2.

We find that the initial conditions of the Davé and Sen [4] FRC algorithm described in this section have two drawbacks. The first is that if there exists $i$ such that $\sum_{k=1}^{n} u_{ik} = 0$, then $a_{ik}$ in Eq. (5) takes the $0/0$ form. The second is that if $u_{ik} = u_{jk}$ for all $k = 1, 2, \ldots, n$ and some pair $i \ne j$, $i, j = 1, 2, \ldots, c$, then after applying the FRC clustering algorithm we still get $u_{ik} = u_{jk}$, from which we cannot obtain partition clusters. The first drawback needs little explanation, as it is obvious. The second drawback can be observed by using the initial memberships $u_{ik} = 0.5$ for all $i$ and $k$ with $c = 2$. To remove these drawbacks, we extend the initial conditions of the FRC algorithm [4]. The improved version of the FRC algorithm, with the extended initial conditions, is given in Section 6.

Note that the proposed distance $D_{tab}$ is a non-Euclidean distance measure. It is well known that distance measures and dissimilarity measures are dual concepts. On the other hand, the proposed distance measure $D_{tab}$ is less complicated than $D_{MLR}$, and the mixed-variable fuzzy c-means (MVFCM) proposed by Yang et al. [15] is not suitable for the proposed distance measure $D_{tab}$. For these reasons, the FRC algorithm is chosen to apply the proposed distance measure $D_{tab}$ to a real data example.

6. Improved version of FRC algorithm

The improved version of Davé and Sen's [4] FRC algorithm is as follows. For n objects, the relational data is usually described by an $n \times n$ dissimilarity matrix $(R_{ij})_{n \times n}$, where

$$R_{ij} \ge 0, \quad R_{ii} = 0 \quad \text{and} \quad R_{ij} = R_{ji}, \quad \text{for } i, j = 1, \ldots, n.$$


The functional for FRC is

$$F_{FRC} = \sum_{i=1}^{c} \frac{\sum_{j=1}^{n} \sum_{k=1}^{n} u_{ik}^m u_{ij}^m R_{jk}}{2 \sum_{t=1}^{n} u_{it}^m},$$

subject to the conditions

$$\sum_{i=1}^{c} u_{ik} = 1, \quad k = 1, 2, \ldots, n;$$
$$u_{ik} \ge 0, \quad i = 1, 2, \ldots, c; \; k = 1, 2, \ldots, n;$$
$$\sum_{k=1}^{n} u_{ik} > 0 \quad \text{for all } i;$$

and, for each pair $i \ne j$, $i, j = 1, 2, \ldots, c$, there exists $k$ such that $u_{ik} \ne u_{jk}$, with $m > 1$ and $c$ a positive integer greater than 1. Applying the Lagrange multiplier method, the memberships are obtained as follows:

$$u_{ik} = \frac{a_{ik}^{-1/(m-1)}}{\sum_{w=1}^{c} a_{wk}^{-1/(m-1)}}, \qquad (6)$$

where

$$a_{ik} = \frac{m \sum_{j=1}^{n} u_{ij}^m R_{jk}}{\sum_{j=1}^{n} u_{ij}^m} - \frac{m \sum_{h=1}^{n} \sum_{j=1}^{n} u_{ij}^m u_{ih}^m R_{jh}}{2 \left( \sum_{j=1}^{n} u_{ij}^m \right)^2}. \qquad (7)$$

Thus, the improved FRC clustering algorithm is described as follows:

FRC algorithm
S1: Fix $m > 1$ and $2 \le c \le n - 1$, and set $\ell = 1$. Give any $\varepsilon > 0$ and initial memberships $u_{ik}^{(0)}$, $1 \le i \le c$, $1 \le k \le n$, satisfying the above conditions.


S2: Compute $a_{ik}^{(\ell)}$ with $u_{ik}^{(\ell-1)}$ by Eq. (7).
S3: Update $u_{ik}^{(\ell)}$ with $a_{ik}^{(\ell)}$ by Eq. (6).
S4: Compare $u_{ik}^{(\ell)}$ to $u_{ik}^{(\ell-1)}$ in a convenient norm $\| u^{(\ell)} - u^{(\ell-1)} \|$. IF $\| u^{(\ell)} - u^{(\ell-1)} \| < \varepsilon$, THEN stop; ELSE set $\ell = \ell + 1$ and return to S2.
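A compact sketch of the improved FRC iteration, Eqs. (6)-(7), follows. The default initialization, the small positive floor guarding $a_{ik}^{-1/(m-1)}$, and the function name are our own choices; the floor is only a numerical guard and is not part of the algorithm in [4]:

```python
def frc(R, c, m=2.0, eps=1e-3, max_iter=100, u=None):
    """Improved FRC on an n x n dissimilarity matrix R (list of lists).
    Returns a c x n membership matrix u whose columns sum to 1."""
    n = len(R)
    if u is None:
        # initial memberships with distinct, strictly positive rows,
        # as required by the extended initial conditions of Section 6
        u = [[0.75 if (k * c) // n == i else 0.25 / (c - 1)
              for k in range(n)] for i in range(c)]
    for _ in range(max_iter):
        a = [[0.0] * n for _ in range(c)]
        for i in range(c):
            w = [u[i][j] ** m for j in range(n)]      # u_ij^m
            s = sum(w)
            quad = sum(w[j] * w[h] * R[j][h]
                       for j in range(n) for h in range(n))
            for k in range(n):
                # Eq. (7)
                a[i][k] = (m * sum(w[j] * R[j][k] for j in range(n)) / s
                           - m * quad / (2 * s * s))
        new_u = [[0.0] * n for _ in range(c)]
        for k in range(n):
            # Eq. (6); floor a_ik so the negative power stays well defined
            inv = [max(a[i][k], 1e-12) ** (-1.0 / (m - 1)) for i in range(c)]
            t = sum(inv)
            for i in range(c):
                new_u[i][k] = inv[i] / t
        shift = max(abs(new_u[i][k] - u[i][k])
                    for i in range(c) for k in range(n))
        u = new_u
        if shift < eps:   # step S4
            break
    return u
```

On a toy matrix with two well-separated blocks the memberships become nearly crisp; with identical initial rows ($u_{ik} = 0.5$ everywhere for $c = 2$) the update would instead reproduce identical rows forever, which is exactly the second drawback noted in Section 5.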

7. A real data study

The data of the following example are from [15].

Example 5. In this example, we use real data. There are 10 brands of automobiles from four companies, Ford, Toyota, China-Motor and Yulon-Motor, in Taiwan. The data set is shown in Table 1. Each brand has six feature components: company, exhaust, price, color, comfort and safety. In the color feature, the notations W = white, B = blue, G = green, P = purple, S = silver, D = dark, R = red, Gr = grey and Go = golden are used. Among the feature components, company and color are symbolic data, exhaust and price are real data, and safety and comfort are fuzzy data. We use the proposed dissimilarity measure described in Section 4 to illustrate the dissimilarity calculation between object one, Virage, and object six, Tercel, as follows:

D(Virage, Tercel) = D(China-Motor, Toyota) + D(1.8, 1.5) + D(63.9, 45.8)
  + D({W, S, D, R, B}, {W, S, R, G}) + D((10, 10, 2, 2)_T, (2, 6, 0, 2)_T)
  + D((9, 9, 3, 3)_T, (6, 6, 3, 3)_T),

D(China-Motor, Toyota) = D_{ql}(China-Motor, Toyota)
  = D_s(China-Motor, Toyota) + D_c(China-Motor, Toyota)
  = (|1 − 1|/2) + (|1 + 1 − 2 × 0|/2) = 1,

D(1.8, 1.5) = D_{tab}(1.8, 1.5) = tD_p(1.8, 1.5) + tD_s(1.8, 1.5) + tD_c(1.8, 1.5)
  = 0.4286 + 0 + 0 = 0.4286,


D(63.9, 45.8) = D_{tab}(63.9, 45.8) = tD_p(63.9, 45.8) + tD_s(63.9, 45.8) + tD_c(63.9, 45.8)
  = 0.5262 + 0 + 0 = 0.5262,

D({W, S, D, R, B}, {W, S, R, G}) = D_{ql}({W, S, D, R, B}, {W, S, R, G})
  = D_s({W, S, D, R, B}, {W, S, R, G}) + D_c({W, S, D, R, B}, {W, S, R, G})
  = 0.1667 + 0.5000 = 0.6667,

D((10, 10, 2, 2)_T, (2, 6, 0, 2)_T) = D_{tab}((10, 10, 2, 2)_T, (2, 6, 0, 2)_T)
  = tD_p + tD_s + tD_c = 0.4583 + 0.0909 + 0.4545 = 1.0038,

D((9, 9, 3, 3)_T, (6, 6, 3, 3)_T) = D_{tab}((9, 9, 3, 3)_T, (6, 6, 3, 3)_T)
  = tD_p + tD_s + tD_c = 0.1765 + 0 + 0.2308 = 0.4072.

Hence

D(Virage, Tercel) = 1 + 0.4286 + 0.5262 + 0.6667 + 1.0038 + 0.4072 = 4.0325.
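The whole computation above can be checked mechanically. In the sketch below, `dql` and `dtab` re-implement the measures of Sections 2 and 4; the per-feature ranges `U` (0.7 for exhaust, 34.4 for price, 12 for comfort, 17 for safety) are not printed in the text and are inferred here from the worked component values, so they should be read as assumptions:

```python
def dql(A, B):
    # qualitative (set-valued) distance of Section 2
    A, B = set(A), set(B)
    span = len(A | B)
    inters = len(A & B)
    return (abs(len(A) - len(B)) / span
            + abs(len(A) + len(B) - 2 * inters) / span)

def dtab(A, B, U, l=0.5, r=0.5):
    # proposed TFN distance, Eq. (3); a real number x enters as (x, x, 0, 0)
    ay, by, aly, bey = A
    az, bz, alz, bez = B
    cy = (ay - l * aly + by + r * bey) / 2
    cz = (az - l * alz + bz + r * bez) / 2
    ly = (by + bey) - (ay - aly)
    lz = (bz + bez) - (az - alz)
    inters = max(0.0, min(by + bey, bz + bez) - max(ay - aly, az - alz))
    denom = U + ly + lz - inters
    return (abs(cy - cz) / U + abs(ly - lz) / denom
            + abs(ly + lz - 2 * inters) / denom)

d = (dql({"China-Motor"}, {"Toyota"})                       # company
     + dtab((1.8, 1.8, 0, 0), (1.5, 1.5, 0, 0), 0.7)        # exhaust
     + dtab((63.9, 63.9, 0, 0), (45.8, 45.8, 0, 0), 34.4)   # price
     + dql({"W", "S", "D", "R", "B"}, {"W", "S", "R", "G"})  # color
     + dtab((10, 10, 2, 2), (2, 6, 0, 2), 12)               # comfort
     + dtab((9, 9, 3, 3), (6, 6, 3, 3), 17))                # safety
```

The sum comes to about 4.0324; the quoted 4.0325 arises from rounding each component to four decimals before adding.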

Table 2 shows the dissimilarity matrix for all ten automobiles. Using this table, we run the FRC algorithm with $m = 2$ and $\varepsilon = 0.001$ from $c = 2$ to $c = 5$. The clustering results and the validity index values of the partition entropy (PE) and partition coefficient (PC) [2] are shown in Tables 3-6. Based on the PE and PC validity index values, the optimal cluster number is $c = 2$, attaining the maximum PC and minimum PE values.
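The two validity indices used to pick the cluster number can be sketched as follows (these are the standard definitions from Bezdek [2]; the function names are ours). PC is maximal and PE minimal for a crisp partition:

```python
import math

def partition_coefficient(u):
    """PC(u) = (1/n) * sum_i sum_k u_ik^2; larger values indicate a
    crisper (better) c-partition."""
    n = len(u[0])
    return sum(x * x for row in u for x in row) / n

def partition_entropy(u):
    """PE(u) = -(1/n) * sum_i sum_k u_ik * ln(u_ik); smaller values
    indicate a crisper (better) c-partition."""
    n = len(u[0])
    return -sum(x * math.log(x) for row in u for x in row if x > 0) / n
```

For the two-cluster partition of Table 3 the paper reports PC = 0.5812 and PE = 0.6090, the maximum PC and minimum PE over $c = 2, \ldots, 5$.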

Table 1. Data set of ten automobiles.

Table 2. Dissimilarity matrix for ten automobiles from Table 1.


Table 3. FRC memberships of two part families for ten automobiles from Table 2.
PC = 0.5812, PE = 0.6090. Clusters: {Virage, Galant, M2000, Corolla, Premio G2.0, Cerfiro} and {New Lancer, Tierra Activa, Tercel, March}.

Table 4. FRC memberships of three part families for ten automobiles from Table 2.
PC = 0.4351, PE = 0.9518. Clusters: {Galant, Corolla, Premio G2.0, Cerfiro}, {Tierra Activa, Tercel, March} and {Virage, New Lancer, M2000}.


Table 5. FRC memberships of four part families for ten automobiles from Table 2.
PC = 0.4755, PE = 0.9982. Clusters: {M2000, Cerfiro}, {Virage, New Lancer, Tierra Activa}, {Galant, Corolla, Premio G2.0} and {Tercel, March}.

Table 6. FRC memberships of five part families for ten automobiles from Table 2.
PC = 0.5534, PE = 0.9282. Clusters: {M2000}, {Tercel, March}, {Galant}, {Virage, New Lancer, Tierra Activa} and {Corolla, Premio G2.0, Cerfiro}.

Comparison. The MVFCM algorithm is applied with the distance measure of Yang et al. [15], and the improved version of the FRC algorithm is applied with the proposed distance measure.


Table 7. Memberships of two part families for ten automobiles from Table 1.

The clustering results differ for the automobile New Lancer. We found that the clustering result of the proposed approach is better than those of all previous methods.

8. Conclusion

This paper presents a novel approach to forming partition clusters based on an FRC algorithm. The approach modifies the Hung et al. [7] distance measure for fuzzy data to obtain a dissimilarity matrix. The improved version of the FRC algorithm is then applied to the dissimilarity matrix formed with the modified distance measure. Based on the PE and PC validity index values, we can obtain an optimal partition.

References
[1] R. Bellman, R. Kalaba, L.A. Zadeh, Abstraction and pattern classification, J. Math. Anal. Appl. 2 (1966) 581-585.
[2] J.C. Bezdek, Cluster validity with fuzzy sets, J. Cybernetics 3 (1974) 58-74.
[3] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.


[4] R.N. Davé, S. Sen, Robust fuzzy clustering of relational data, IEEE Transactions on Fuzzy Systems 10 (2002) 713-727.
[5] K.C. Gowda, E. Diday, Symbolic clustering using a new similarity measure, IEEE Transactions on Systems, Man and Cybernetics 22 (1992) 368-378.
[6] K.C. Gowda, E. Diday, Symbolic clustering using a new dissimilarity measure, Pattern Recognition 24 (1991) 567-578.
[7] W.L. Hung, M.S. Yang, E.S. Lee, Cell formation using fuzzy relational clustering algorithm, Mathematical and Computer Modelling 53 (2011) 1776-1787.
[8] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[9] T.W. Liao, Classification and coding approaches to part family formation under a fuzzy environment, Fuzzy Sets and Systems 122 (3) (2001) 425-441.
[10] E.H. Ruspini, A new approach to clustering, Inform. and Control 15 (1969) 22-32.
[11] S.F. Schnatter, On statistical inference for fuzzy data with applications to descriptive statistics, Fuzzy Sets and Systems 50 (1992) 143-165.
[12] K.L. Wu, M.S. Yang, Alternative c-means clustering algorithms, Pattern Recognition 35 (2002) 2267-2278.
[13] M.S. Yang, C.H. Ko, On a class of fuzzy c-numbers clustering procedures for fuzzy data, Fuzzy Sets and Systems 84 (1996) 49-60.
[14] M.S. Yang, W.L. Hung, F.C. Cheng, Mixed-variable fuzzy clustering approach to part family and machine cell formation for GT applications, Int. J. of Production Economics 103 (2006) 185-198.
[15] M.S. Yang, P.Y. Hwang, D.H. Chen, Fuzzy clustering algorithms for mixed feature variables, Fuzzy Sets and Systems 141 (2004) 301-317.
[16] M.S. Yang, A survey of fuzzy clustering, Math. Comput. Modelling 18 (1993) 1-16.
[17] H.J. Zimmermann, Fuzzy Set Theory and Its Applications, Kluwer, Dordrecht, 1991.

Received October, 2012