Researches of Data Mining Method Based on Closeness ... - ISIF

Researches of Data Mining Method Based on Closeness Relationship of Fuzzy Clustering*

Xian-chen Hao, De-gan Zhang, Hai Zhao

Communication & Information System Institute, School of Information Science & Engineering (134#), Northeastern University, Shenyang, 110006, P. R. China
[email protected] or [email protected]

Abstract – Data mining, as a subset of data fusion, is becoming more and more important because of its function and efficiency, and new methods for data mining are designed constantly. In this paper, a new mining method based on the closeness relationship of fuzzy clustering is presented. Fuzzy clustering is one of the important and valid methods for data mining. One problem in fuzzy clustering is to determine a fuzzy classification of the samples in a given finite sample space. Another is its validity, that is to say, if samples resemble each other in the sample space, their fuzzy types should resemble each other too. In our …

… fuzzy clustering is one of the important and valid methods … fundamental problems of fuzzy clustering is to define a fuzzy division of a given finite sample space [3-5]. Suppose there is a finite sample set $X = \{x_1, x_2, \ldots, x_m\}$ and $U = [0,1]$. A family of fuzzy subsets of $X$, $P = \{p_1, p_2, \ldots, p_k \mid 2 \le k \le m\}$, is considered if it meets the following conditions:

I) $\forall x \in X$, $\forall p_i \in P$: $p_i(x) \in U$.
II) $\forall x \in X$: $\sum_{1 \le i \le k} p_i(x) = 1$.
III) $\forall p_i \in P$: $0 < \sum_{x \in X} p_i(x)$ …
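Conditions I and II, together with the visible lower bound of condition III, can be checked mechanically. The following is a minimal Python sketch under stated assumptions: the function name `is_fuzzy_division`, the row-per-class matrix layout, and the tolerance `tol` are illustrative choices, not taken from the paper, and the upper bound of condition III is not checked because it does not survive in the text.

```python
def is_fuzzy_division(P, tol=1e-9):
    """Check conditions I and II and the lower bound of condition III.

    P is a list of k rows; P[i][j] is the membership p_i(x_j).
    The layout and the tolerance are assumptions made for this sketch.
    """
    k = len(P)
    m = len(P[0]) if P else 0
    # The definition requires 2 <= k <= m and a rectangular membership table.
    if not (2 <= k <= m) or any(len(row) != m for row in P):
        return False
    # I) every membership lies in U = [0, 1]
    in_unit = all(0.0 <= v <= 1.0 for row in P for v in row)
    # II) the memberships of each sample sum to 1 over the k classes
    sums_to_one = all(
        abs(sum(P[i][j] for i in range(k)) - 1.0) <= tol for j in range(m)
    )
    # III) lower bound only: no class has zero total membership
    non_empty = all(sum(row) > 0.0 for row in P)
    return in_unit and sums_to_one and non_empty


# Hypothetical memberships of three samples in two classes.
P_ok = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
P_bad = [[0.9, 0.8, 0.1], [0.3, 0.2, 0.9]]  # first sample sums to 1.2
```

Here `is_fuzzy_division(P_ok)` holds, while `P_bad` fails condition II.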

… $= r(x,y)$, then $h$ is called an r-subset of $X$ based on the $N_t$-relation $R$, and $H$ is called an r-fuzzy coverage of $X$ based on the $N_t$-relation $R$ (denoted r/R-F coverage). According to this definition, the similarity of samples under an r/R-F coverage is in harmony with the similarity of their fuzzy clustering: the more similar two samples are, the more similar their fuzzy clusterings are. Therefore, r/R-F coverage can …

To sum up, given an arbitrary group of fuzzy subsets of $X$ and a t-operator, Theorem 4 and Theorem 5 yield the $N_t$-relation $r$ of $X$ and an r/R-F coverage $G$; dividing $G$ then gives the fuzzy division $P$ of $X$.

3.2 The validity of fuzzy clustering

According to the fundamental idea of evaluating the validity of fuzzy clustering, the closer the logical similarity of the fuzzy clustering of samples is to the physical similarity of the samples, the more effective the clustering is.

According to the $N_t$-relation of fuzzy similarity, the $N_t$-relation $r(x,y)$ of $X$ reports the original similarity of samples in the sample space and can measure their physical similarity. The $N_t$-relation $R(g(x), g(y))$ of $U$ reports the similarity of the fuzzy clustering of samples and can measure their logical similarity. Therefore, the validity of fuzzy clustering can be described as follows.

Definition 5 Suppose that $T$ is a continuous t-operator, $R$ is the $N_t$-relation of $U$, $r$ is the $N_t$-relation of $X$, and $P = \{p_1, p_2, \ldots, p_k\}$ is an $F^k$-division of $X$. If

$$E(P) = \inf_{x,y \in X} R\big(R(S_P(x), S_P(y)),\, r(x,y)\big),$$

then $E(P)$ is called the validity index of $P$, where $R(S_P(x), S_P(y)) = \inf_{1 \le i \le k} R(p_i(x), p_i(y))$. Obviously, the bigger $E(P)$ is (the closer $E(P)$ is to 1), the more effective the clustering is.

Another problem is what kind of t-operator should be used to cluster and calculate validity when there are arbitrary fuzzy subclasses $H$ of $X$, or which t-operator and $N_t$-relation are better used to cluster and calculate validity when there is a group of definite subclasses. The definition of the validity index can also be used to evaluate the t-operator and the $N_t$-relation.

Definition 6 Suppose that $T$ is a continuous t-operator, $R$ is the $N_t$-relation of $U$, $r$ is the $N_t$-relation of $X$, and $G = \{g_1, g_2, \ldots, g_n\}$ is a fuzzy coverage of $X$. If

$$E(G) = \inf_{x,y \in X} R\big(R(S_G(x), S_G(y)),\, r(x,y)\big),$$

then $E(G)$ is called the validity index of $G$, where $R(S_G(x), S_G(y)) = \inf_{1 \le i \le n} R(g_i(x), g_i(y))$. Obviously, the bigger $E(G)$ is (the closer $E(G)$ is to 1), the more effective the fuzzy coverage is. Especially for r/R-F coverage, the closer $E(P)$ is to 1, the more effective the chosen t-operator is, the better the $N_t$-relation reports the structure of the sample space, and the more reliable the evaluation $E(G)$ is.

4 The realization of fuzzy clustering

According to the discussion above of the ways of fuzzy clustering and the evaluation of validity, we can get the following algorithm of fuzzy clustering.

Algorithm 1 Given the number of clusters $k$, a group of t-operators $T = \{t_1, t_2, \ldots, t_v\}$, and a group of fuzzy subclasses $H = \{h_1, h_2, \ldots, h_n\}$ of $X = \{x_1, x_2, \ldots, x_m\}$, obtain for every t-operator the $F^k$-division $P = \{p_1, p_2, \ldots, p_k\}$ of $X$, the validity index of $P$, and the r/R-F coverage $G = \{g_1, g_2, \ldots, g_m\}$.

Step one: for every t-operator $t_i$, $i = 1, 2, \ldots, v$, complete step two, step three and step four;
Step two: according to Theorem 4 and Theorem 5, make the r/R-F coverage $G$ on the basis of $H$;
Step three: get the validity index $E(G)$ of $G$;
Step four: make the fuzzy k-division of $G$ to get $P$ and the validity index $E(P)$ of $P$;
Step five: end.

In step four, when the fuzzy k-division of the fuzzy coverage is made, there are two problems to be solved. One is that the number of fuzzy subclasses of $G$, $m$, must be reduced to the number of divisions, $k$. The other is that the $k$ fuzzy subclasses must be turned into a fuzzy k-division.

As to the first problem, the intuitive way is to set up a similarity relation on the $n$ fuzzy subclasses of $G$ and then group the most similar classes into one fuzzy subclass, that is, to make a k-division of the group of fuzzy subclasses $G$. This carries the merging idea of common clustering over to fuzzy clustering, but it may lose more of the original structural information of the sample space. Here is another way: cross out from $G$ the $n-k$ fuzzy subclasses that are similar to others, and build the model on the basis of the remaining $k$ fuzzy subsets. In this way, the $k$ fuzzy subclasses that are kept are original fuzzy subclasses of $G$. Although they no longer make up that fuzzy coverage, they are all r-subsets of the fuzzy close relation $R$ of $X$ and hold the original structural information of the sample space as much as possible. The concrete steps are as follows:

Step one: make the similarity relation $W$ of $G$, for example $W(g_i, g_j) = \inf_{x \in X} R(g_i(x), g_j(x))$ or $W(g_i, g_j) = \sum_{x \in X} R(g_i(x), g_j(x)) / |X|$, etc. ($|X|$ is the number of elements of $X$).

Step two: choose the fuzzy subclass $g^c$ which is to be crossed out. According to the physical meaning of the algorithm, there are two ways. I) For every $g \in G$, compute $S(g) = \sum_{g' \in G} W(g', g)$, and choose as $g^c$ one of the fuzzy subsets with the biggest $S(g)$. The chosen $g^c$ is relatively in the center of the fuzzy subclasses. The fault of this way is that if two subclasses are similar to each other but not similar to the others at all, neither of them can be crossed out. II) First get $G^r$: if $W(g', g) = \sup\{ W(g_i, g_j) \mid g_i, g_j \in G,\ g_i \ne g_j \}$, then $g' \in G^r$. For every $g^r \in G^r$, the elements of the set $\{ W(g^r, g_j) \mid g_j \in G \}$ are arranged in descending order. The fuzzy subclass which has the biggest alignment number is chosen as $g^c$. The chosen $g^c$ has the highest degree of similarity with another fuzzy subclass and is also similar to the other subclasses.

Step three: cross out the fuzzy subset $g^c$ from $G$: $G = G - \{g^c\}$.

Step four: if $|G| = k$, return; otherwise go back to step one.

As to the second problem, the k-division of the $k$ fuzzy subclasses is made as follows: for all $x \in X$, $p_i(x) = w_{ix} \times g_i(x) / \sum_{1 \le j \le k} w_{jx} \times g_j(x)$, where $w_{ix} \in [0,1]$ is a weight. According to the facts, the following formulas can be used: $w_{ix} = |g_i| = \sum_{x \in X} g_i(x)$, or $w_{ix} = g_i(x)$, or $w_{ix} = 1$.

5 Example and analysis

Based on the algorithm above, we give an example from electronic commerce, with an analysis for Shenyang Dongyu e-commerce Co., Ltd. In electronic commerce, good programs and contents are offered in order to attract more guests and customers. From time to time, the owners of the e-commerce company put questions or choices about their programs and contents to customers and guests; after the answers or choices are fed back, the owners analyze the quality of their programs, contents, merchandise, and so on. A sample distribution of the nature of merchandise or programs is shown in Fig. 1, which we now study.

Fig. 1 is a 2-D sample set $Z = \{ z_i \mid z_i = (x_i, y_i),\ i \in [1,21] \}$. The X-axis may be the number of the program or the quantity of merchandise, and the Y-axis may be their nature number. We now give the fuzzy clustering. The result of clustering and the comparison with reference [9] are in Table 3 and Table 4, respectively. The realization process is as follows:

I) According to the Euclidean distance $d(z_i, z_j)$ in the sample space, the fuzzy subsets $H = \{ h_i \mid h_i(z_j) = 1 - d(z_i, z_j)/L \}$ can be obtained, where $d(z_i, z_j) = ((x_i - x_j)^2 + (y_i - y_j)^2)^{1/2}$, $L = ((x_{max} - x_{min})^2 + (y_{max} - y_{min})^2)^{1/2}$, $x_{max} = \sup_{i \in [1,21]} x_i$, $x_{min} = \inf_{i \in [1,21]} x_i$, $y_{max} = \sup_{i \in [1,21]} y_i$, $y_{min} = \inf_{i \in [1,21]} y_i$.
II) The t-operator set is $T = \{ t_1$ = minimum product, $t_2$ = probability product, $t_3$ = Einstein product, $t_4$ = Giles product, $t_5$ = Schweizer-Sklar product ($s = -2$) $\}$.
III) Build the similarity relation $w$ of the fuzzy coverage $G$: $w(g_i, g_j) = \sum_{z \in Z} R(g_i(z), g_j(z)) / 21$.
IV) By the second method, the subset $g^c$ is selected.
V) The fuzzy division is given by $p_i(x) = (g_i(x))^2 / ((g_1(x))^2 + (g_2(x))^2)$.
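The realization process I) to V) can be sketched in Python (3.8 or later, for `math.dist`). This is a minimal illustration under several assumptions, not the authors' implementation: the six sample points are hypothetical (the 21 points of Fig. 1 are not available here); the $N_t$-relation on $[0,1]$ is taken to be $R(a,b) = 1 - |a-b|$, whereas the paper derives it from the chosen t-operator; the "biggest alignment number" of the second method is read as a lexicographic comparison of the descending similarity lists; and $r(x,y)$ in the validity index of Definition 5 is taken to be $1 - d(x,y)/L$. All function names are illustrative.

```python
import math


def memberships(points):
    """Step I: h_i(z_j) = 1 - d(z_i, z_j) / L, with L the bounding-box diagonal."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    diag = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    return [[1.0 - math.dist(zi, zj) / diag for zj in points] for zi in points]


def closeness(a, b):
    """ASSUMED N_t-relation on [0, 1]; the paper derives R from the t-operator."""
    return 1.0 - abs(a - b)


def similarity(G):
    """Step III: w(g_i, g_j) = sum_z R(g_i(z), g_j(z)) / |Z|."""
    n = len(G[0])
    return [[sum(closeness(a, b) for a, b in zip(gi, gj)) / n for gj in G] for gi in G]


def pick_second_method(W):
    """Step IV as read here: among the subsets in a most-similar pair, take the one
    whose descending list of similarities to the others is lexicographically largest."""
    idx = range(len(W))
    best = max(W[i][j] for i in idx for j in idx if i != j)
    candidates = sorted({i for i in idx for j in idx if i != j and W[i][j] == best})
    return max(candidates, key=lambda i: sorted((W[i][j] for j in idx if j != i), reverse=True))


def reduce_cover(G, k):
    """Elimination loop (steps one to four): cross out subsets until k remain."""
    G = [list(g) for g in G]
    while len(G) > k:
        del G[pick_second_method(similarity(G))]
    return G


def divide(G):
    """Step V with w_ix = g_i(x): p_i(x) = g_i(x)^2 / sum_j g_j(x)^2."""
    P = [[0.0] * len(G[0]) for _ in G]
    for j in range(len(G[0])):
        total = sum(g[j] ** 2 for g in G)
        if total == 0.0:
            raise ValueError("sample has zero membership in every kept subset")
        for i, g in enumerate(G):
            P[i][j] = g[j] ** 2 / total
    return P


def validity_index(P, r):
    """E(P) = inf_{x,y} R(inf_i R(p_i(x), p_i(y)), r(x, y)), using the assumed R."""
    m = len(P[0])
    return min(
        closeness(min(closeness(p[x], p[y]) for p in P), r[x][y])
        for x in range(m)
        for y in range(m)
    )


# Hypothetical data: two well-separated groups of three points each.
Z = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
H = memberships(Z)                 # used both as the coverage G and as r(x, y)
P = divide(reduce_cover(H, k=2))   # fuzzy 2-division of the six samples
E = validity_index(P, r=H)
```

With these points the elimination keeps one subset from each group, so the first three samples receive their larger membership in the first class and the last three in the second. Crossing out subsets rather than merging them keeps every remaining class an original member of the coverage, which is the design point argued for in Section 4.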

From Table 3, we can find that the clustering results of $t_4$ and $t_5$ are valid, but those of $t_1$ and $t_2$ are invalid. From Table 4, we can see that the clustering results of $t_4$ and $t_5$ are better than that of reference [9]. According to the analysis, we can easily decide the quality of merchandise or programs, so the owners can make the next decision for future development. Through the closeness relationship of fuzzy clustering, we can mine the data of interest from the Internet and obtain good benefits from the mining. When this data mining method is used for electronic commerce on the Internet, more information and profit may be obtained.

6 Conclusion

In this paper, in order to find an efficient data mining method, we have discussed the following content:

I) Clustering is made by introducing the triangle operator (t-operator) and triangle transitivity, moving from the fuzzy equivalent relation to the fuzzy close relation. Geometrically, the fuzzy equivalent relation requires that the two shorter sides of a triangle are equal, but the fuzzy close relation only requires that the shortest side of a triangle is longer than or equal to the triangular norm of the two longer sides. Low requirements and a wide applicable scope make it suitable for other fields involving fuzzy classes.

II) r/R-F coverage is the fuzzy coverage introduced on the basis of $[0,1]$ and the fuzzy close relation of $X$. We discussed how to get the r/R-F coverage of $X$ when there is an arbitrary group of fuzzy subclasses of $X$ (or a fuzzy relation). r/R-F coverage requires that the fuzzy clusterings of close samples are also similar. Therefore, it gives a full picture of the original distribution of samples and ensures that clustering is effective.

III) A way of clustering that makes use of the fuzzy close relation and r/R-F coverage is given. It improves on the way of clustering that loses much information in order to get a fuzzy equivalent relation.

IV) A set of ways to evaluate the validity of fuzzy clustering is put forward. It emphasizes the relationship between the results of clustering and the description of samples, and unifies the results of clustering with the original information of samples. On the basis of the original information, the results of clustering should hold the original information as much as possible. That is, the classes of close samples should be close in relationship. This law reveals the true meaning of clustering.

V) Making use of these results, we clustered the samples of reference [9]. From the point of view of intuitive geometric meaning and distribution, when we use the Giles product operator and the Schweizer-Sklar product operator ($s = -2$), the results of clustering report the distribution of samples better than reference [9].

The discussion and method above have been applied to data mining for electronic commerce on the Internet, with good results. Our experiments have been done for Dongyu e-commerce Co. Ltd., Shenyang, China, and good economic benefits have already been obtained. So the validity and feasibility of this method have been tested.

Acknowledgement

This research is sponsored by the National Natural Science Foundation of China (No. 69873007).

About authors

HAO Xian-chen: professor; interests include data mining and so on.
ZHANG De-gan: Ph.D. candidate at NEU; interests include information fusion and so on.
ZHAO Hai: professor at NEU, Ph.D. supervisor; interests include information fusion and so on.

References

[1] W. H. Inmon. The Data Warehouse and Data Mining. Communications of the ACM, 1996, 39(11).
[2] Adriaans P, Zantinge D. Data Mining. England: Addison Wesley Longman, 1998, 40~100.
[3] Ragab M Z, Emam E G. On the min-max composition of fuzzy matrices[J]. Fuzzy Sets and Systems, 1995, (75): 83~92.
[4] Kamel S Mohamed. New algorithms for solving the fuzzy c-means clustering problem[J]. Pattern Recognition, 1994, 27: 421~428.
[5] Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithm[J]. Plenum Press, New York, 1997.
[6] Tan C, Witting G. A Study of the Parameters of a back propagation stock price prediction model[J]. In Proc. ANNES, 1998.
[7] Deboeck G J. Trading on the Edge: Neural, Genetic and Fuzzy Systems for Chaotic Financial Markets[M]. Wiley, 1999.
[8] Agrawal R, Mannila H, Srikant R, et al. Fast discovery of association rules[M]. In Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI/MIT Press, 1999, 307~328.
[9] Andrew K, Wing S. Relational Duals of the C-means Clustering Algorithms. Pattern Recognition, 1989, 22.
[10] Xu Ling-yu, Zhao Hai. Application of Neural Fusion to Accident Forecast in Hydropower Station. Proceedings of the Second International Conference on Information Fusion, Vol. 2, 1999.
[11] Du Qing-dong, Zhao Hai. D-S Evidence Theory Applied to Fault Diagnosis of Generator Based on Embedded Sensors. Proceedings of the Third International Conference on Information Fusion, Vol. 1, 2000.
