Researches of Data Mining Method Based on Closeness Relationship of Fuzzy Clustering*
Xian-chen Hao
De-gan Zhang
Hai Zhao
Communication & Information System Institute, School of Information Science & Engineering(134 #), Northeastern University, Shenyang, 110006, P. R. China
[email protected] or
[email protected]
Abstract – Data mining, as a subset of data fusion, is becoming more and more important because of its function and efficiency. New methods for data mining are being designed constantly. In this paper, a new method for data mining based on the closeness relationship of fuzzy clustering is presented. Fuzzy clustering is one of the important and valid methods of data mining. One problem in fuzzy clustering is to determine a fuzzy classification of the samples in a given finite sample space. Another is its validity, that is to say, if samples resemble each other in the sample space, their fuzzy types should resemble each other too. In our [...]

Fuzzy clustering is one of the important and valid methods of data mining. One of the fundamental problems of fuzzy clustering is to determine, for a given finite sample space, a fuzzy division of the samples [3-5]. Suppose there is a finite sample set $X=\{x_1, x_2, \dots, x_m\}$ and $U=[0,1]$. A collection of fuzzy subsets of X, $P=\{p_1, p_2, \dots, p_k\}$ with $2 \le k \le m$, is considered when it meets the following conditions:
I) $\forall x \in X$, $p_i \in P$: $p_i(x) \in U$.
II) $\forall x \in X$: $\sum_{1 \le i \le k} p_i(x) = 1$.
III) $\forall p_i \in P$: $0 < \sum_{x \in X} p_i(x)$ [...]

[...] $= r(x,y)$, h will be called X's r-subset based on the Nt-relation R, and H will be called X's r-fuzzy coverage based on the Nt-relation R (denoted r/R-F coverage). According to the above definition, in an r/R-F coverage the similarity of samples is in harmony with the similarity of their fuzzy clustering. Therefore, the more similar two samples are, the more similar their fuzzy clusterings are. Therefore, r/R-F coverage can [...]

To sum up, if there is an arbitrary group of fuzzy subsets of X and a t-operator, then by using Theorem 4 and Theorem 5 we obtain X's Nt-relation r and the r/R-F coverage G; G is then divided and X's fuzzy division P is obtained.

3.2 The validity of fuzzy clustering
According to the fundamental idea of evaluating the validity of fuzzy clustering, the closer the logical similarity of the fuzzy clustering of samples is to the physical similarity of the samples, the more effective the clustering is.
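As an illustration of this idea, the following minimal sketch scores a clustering by the worst-case agreement between logical and physical similarity over all sample pairs. The closeness function R(a, b) = 1 - |a - b|, the function names and the toy data are assumptions made for this example only; they stand in for the paper's Nt-relations, which are not reproduced here.

```python
from itertools import combinations


def closeness(a, b):
    """Assumed closeness of two degrees in [0, 1]; 1.0 means identical."""
    return 1.0 - abs(a - b)


def logical_similarity(memberships, x, y):
    """Worst-case closeness of the membership degrees of samples x and y."""
    return min(closeness(p[x], p[y]) for p in memberships)


def agreement(memberships, physical):
    """Smallest closeness between logical and physical similarity over all pairs."""
    pairs = combinations(range(len(physical)), 2)
    return min(
        closeness(logical_similarity(memberships, x, y), physical[x][y])
        for x, y in pairs
    )


# Hypothetical data: samples 0 and 1 are physically close, sample 2 is far away.
physical = [
    [1.0, 0.9, 0.2],
    [0.9, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
# Each row is one fuzzy class; each column is one sample.
consistent = [[0.9, 0.8, 0.1], [0.1, 0.2, 0.9]]
inconsistent = [[0.9, 0.1, 0.8], [0.1, 0.9, 0.2]]

score_consistent = agreement(consistent, physical)
score_inconsistent = agreement(inconsistent, physical)
```

With these toy values the clustering that keeps the two close samples together scores about 1, while the one that separates them scores about 0.3, which matches the intuition that a higher score means a more effective clustering.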
According to the Nt-relation of fuzzy similarity, X's Nt-relation r(x,y) reports the original similarity of samples in the sample space and can measure the physical similarity of samples. U's Nt-relation R(g(x),g(y)) reports the similarity of the fuzzy clustering of samples and can measure the logical similarity of samples. Therefore, the validity of fuzzy clustering can be described as follows.

Definition 5 Suppose that T is a continuous t-operator, R is U's Nt-relation, r is X's Nt-relation, and $P=\{p_1, p_2, \dots, p_k\}$ is X's $F_k$-division. If $E(P)=\inf_{x,y \in X} R\big(R(S_P(x), S_P(y)),\, r(x,y)\big)$, then E(P) will be called P's validity index, where $R(S_P(x), S_P(y))=\inf_{1 \le i \le k} R(p_i(x), p_i(y))$. Obviously, the bigger E(P) is (the closer E(P) is to 1), the more effective the clustering is.

Another problem is what kind of t-operator should be used to cluster and calculate validity when there are arbitrary fuzzy subclasses H of X. Or, which t-operator and Nt-relation are better used to cluster and calculate validity when there is a definite group of subclasses? The definition of the validity index can also be used to evaluate the t-operator and the Nt-relation.

Definition 6 Suppose that T is a continuous t-operator, R is U's Nt-relation, r is X's Nt-relation, and $G=\{g_1, g_2, \dots, g_n\}$ is X's fuzzy coverage. If $E(G)=\inf_{x,y \in X} R\big(R(S_G(x), S_G(y)),\, r(x,y)\big)$, then E(G) will be called G's validity index, where $R(S_G(x), S_G(y))=\inf_{1 \le i \le n} R(g_i(x), g_i(y))$. Obviously, the bigger E(G) is (the closer E(G) is to 1), the more effective the fuzzy coverage is. Especially for r/R-F coverage, the closer E(P) is to 1, the more effective the chosen t-operator is and the better the Nt-relation reports the structure of the sample space; at the same time, the more reliable the evaluation given by E(G) is.

4 The realization of fuzzy clustering
According to the discussion of the above ways of fuzzy clustering and the evaluation of validity, we can get the following algorithm of fuzzy clustering.

Algorithm 1 Given the number of clusters k, the group of t-operators $T=\{t_1, t_2, \dots, t_v\}$, and the group of fuzzy subclasses $H=\{h_1, h_2, \dots, h_n\}$ of the set $X=\{x_1, x_2, \dots, x_m\}$, obtain for every t-operator X's $F_k$-division $P=\{p_1, p_2, \dots, p_k\}$, the validity index of P, and the r/R-F coverage $G=\{g_1, g_2, \dots, g_m\}$.
Step one: for every t-operator $t_i$, $i=1,2,\dots,v$, carry out step two, step three and step four;
Step two: according to Theorem 4 and Theorem 5, the r/R-F coverage G is made on the basis of H;
Step three: get the validity index of G, E(G);
Step four: the fuzzy k-division of G is made to get P and the validity index of P, E(P);
Step five: the end.

In step four, when the fuzzy k-division of the fuzzy coverage is made, there are two problems to be solved. One is that the number of fuzzy subclasses of G, m, has to be turned into the number of divisions, k. The other is that the resulting k fuzzy subclasses have to be turned into a fuzzy k-division.

As to the first problem, the intuitive way is to set up a similarity relation among the n fuzzy subclasses of G and then group the most similar classes into one fuzzy subclass, that is, to make a k-division of the group of fuzzy subclasses G. This carries the merging idea of ordinary clustering over to fuzzy clustering, but it may lose more of the original structural information of the sample space. Here is another way: cross out from G the n-k fuzzy subclasses which are similar to others, and build the model on the basis of the remaining k fuzzy subsets. In this way, the k fuzzy subclasses which are kept are original fuzzy subclasses of G. Although they no longer necessarily form a fuzzy coverage, they are all r-subsets of the fuzzy close relation R of X and hold the original structural information of the sample space as much as possible. The concrete steps are as follows:
Step one: make the similarity relation W of G, for example $W(g_i, g_j)=\inf_{x \in X} R(g_i(x), g_j(x))$ or $W(g_i, g_j)=\sum_{x \in X} R(g_i(x), g_j(x)) / |X|$, etc. (|X| is the number of elements of the set X).
Step two: choose the fuzzy subclass $g^c$ which is to be crossed out. According to the physical meaning of the algorithm, there are two ways. I) For every $g \in G$, $S(g)=\sum_{g' \in G} W(g', g)$ is computed, and one of the fuzzy subsets with the biggest S(g) is chosen as $g^c$. The chosen $g^c$ is relatively in the center of the fuzzy subclasses. The fault of this way is that if there are two subclasses that are similar to each other but not similar to the others at all, one of them cannot be crossed out. II) First get the set $G^r$: if $W(g', g)=\sup\{W(g_i, g_j) \mid g_i, g_j \in G,\ g_i \ne g_j\}$, then $g' \in G^r$. For every $g \in G^r$, the elements of the set $\{W(g, g_j) \mid g_j \in G\}$ are arranged in descending order. The fuzzy subclass whose arrangement is the biggest will be chosen as $g^c$. The chosen $g^c$ has the highest degree of similarity with another fuzzy subclass and is also similar to the other subclasses.
Step three: cross out the fuzzy subset $g^c$ from G, $G=G-\{g^c\}$.
Step four: if |G| = k, stop; otherwise return to step one.

As to the second problem, the k-division of the k fuzzy subclasses is made as follows: for all $x \in X$, let $p_i(x)=w_{ix}\, g_i(x) \big/ \sum_{1 \le j \le k} w_{jx}\, g_j(x)$, where $w_{ix} \in [0,1]$ is a weight. According to the actual situation, the following formulas can be used: $w_{ix}=|g_i|=\sum_{x \in X} g_i(x)$, or $w_{ix}=g_i(x)$, or $w_{ix}=1$.

5 Example and analysis
Based on the algorithm above, we give an example from electronic commerce, with an analysis for Shenyang Dongyu e-commerce Co., Ltd. In electronic commerce, good programs and contents are included in order to attract more guests and customers. From time to time, the owners of the e-commerce company put questions or choices about their programs and contents to customers and guests. After the answers or choices are fed back, the owners analyze the quality of their programs, contents, merchandise, and so on. Fig. 1 shows a sample distribution of the nature of merchandise or programs, which we now study.

Fig. 1 is a 2-D sample set $Z=\{z_i \mid z_i=(x_i, y_i),\ i \in [1,21]\}$. The X-axis may be the number of the program or the quantity of merchandise, and the Y-axis may be a number describing their nature. We now give the fuzzy clustering. The result of clustering and the comparison with reference [9] are in Table 3 and Table 4, respectively. The realization process is as follows:
I) According to the Euclidean distance $d(z_i, z_j)$ in the sample space, the fuzzy subsets $H=\{h_i \mid h_i(z_j)=1-d(z_i, z_j)/L\}$ can be obtained, where $d(z_i, z_j)=\big((x_i-x_j)^2+(y_i-y_j)^2\big)^{1/2}$, $L=\big((x_{max}-x_{min})^2+(y_{max}-y_{min})^2\big)^{1/2}$, $x_{max}=\sup_{i \in [1,21]} x_i$, $x_{min}=\inf_{i \in [1,21]} x_i$, $y_{max}=\sup_{i \in [1,21]} y_i$, $y_{min}=\inf_{i \in [1,21]} y_i$.
II) The t-operator set is T = {$t_1$ = minimum product, $t_2$ = probability product, $t_3$ = Einstein product, $t_4$ = Giles product, $t_5$ = Schweizer-Sklar product (s = -2)}.
III) Build the similarity relation w of the fuzzy coverage G, $w(g_i, g_j)=\sum_{z \in Z} R(g_i(z), g_j(z))/21$.
IV) By the second method, the subset $g^c$ is selected.
V) The fuzzy division is given by $p_i(x)=(g_i(x))^2 \big/ \big((g_1(x))^2+(g_2(x))^2\big)$.
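Parts of the realization process of this example (steps I, II, III and V) can be sketched as follows. This is only a minimal sketch under stated assumptions: the six sample points are invented because the 21 samples of Fig. 1 are not reproduced in the text; the formulas for the five t-operators are the usual textbook forms and may differ from the paper's own definitions; the closeness R(a, b) = 1 - |a - b| is an assumed stand-in for the Nt-relation; and two hand-picked subsets of H stand in for the two subsets kept after step IV, since Theorems 4 and 5 and the selection of $g^c$ are not implemented here. The t-operators are only listed, not used to build the coverage.

```python
import math

# Hypothetical 2-D samples; the 21 samples of Fig. 1 are not reproduced in the text.
Z = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]


def fuzzy_subsets(samples):
    """Step I: h_i(z_j) = 1 - d(z_i, z_j) / L, with L the bounding-box diagonal."""
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    diagonal = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    return [[1.0 - math.dist(zi, zj) / diagonal for zj in samples] for zi in samples]


def t_schweizer_sklar(a, b, s=-2.0):
    """Assumed textbook form for s < 0: (a^s + b^s - 1)^(1/s), and 0 if a or b is 0."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return (a ** s + b ** s - 1.0) ** (1.0 / s)


# Step II: assumed textbook forms of the five t-operators.
T = {
    "minimum": min,
    "probability": lambda a, b: a * b,
    "einstein": lambda a, b: a * b / (1.0 + (1.0 - a) * (1.0 - b)),
    "giles": lambda a, b: max(0.0, a + b - 1.0),
    "schweizer_sklar": t_schweizer_sklar,
}


def closeness(a, b):
    """Assumed stand-in for the Nt-relation R on [0, 1]."""
    return 1.0 - abs(a - b)


def similarity(g_i, g_j):
    """Step III: w(g_i, g_j) = sum over z of R(g_i(z), g_j(z)) / |Z|."""
    return sum(closeness(a, b) for a, b in zip(g_i, g_j)) / len(g_i)


def squared_division(g1, g2):
    """Step V: p_i(x) = g_i(x)^2 / (g_1(x)^2 + g_2(x)^2)."""
    p1, p2 = [], []
    for a, b in zip(g1, g2):
        total = a * a + b * b
        # Split evenly in the degenerate case where both degrees are zero.
        p1.append(a * a / total if total else 0.5)
        p2.append(b * b / total if total else 0.5)
    return p1, p2


H = fuzzy_subsets(Z)
# Hand-picked stand-ins for the two subsets kept after step IV.
g1, g2 = H[0], H[3]
w12 = similarity(g1, g2)
p1, p2 = squared_division(g1, g2)
```

In this toy setting the two membership functions returned by `squared_division` sum to 1 at every sample, and the samples near the first hand-picked center receive a membership above 0.5 in the first class.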
From Table 3, we can find that the clustering results of t4 and t5 are valid, but those of t1 and t2 are invalid. From Table 4, we can see that the clustering results of t4 and t5 are better than that of reference [9]. According to the analysis, we can easily decide the quality of merchandise or programs, so the owners can make the next decision for future development. Through the closeness relationship of fuzzy clustering, we can mine the data of interest from the Internet and obtain good benefits from the mining. When this data mining method is applied to electronic commerce on the Internet, more information and profit may be obtained.

6 Conclusion
In this paper, in order to find an efficient data mining method, we have discussed the following content:
I) Clustering is carried out by introducing the triangular operator and triangular transitivity, moving from the fuzzy equivalence relation to the fuzzy close relation. In geometric terms, the fuzzy equivalence relation requires that the two shorter sides of a triangle be equal, but the fuzzy close relation only requires that the shortest side of a triangle be longer than or equal to the triangular product of the two longer sides of the triangle. Its low requirements and wide applicable scope make it suitable for other fields of fuzzy classification.
II) The r/R-F coverage is the fuzzy coverage introduced on the basis of [0,1] and X's fuzzy close relation. We have shown how to get X's r/R-F coverage when there is an arbitrary group of fuzzy subclasses of X (or a fuzzy relation). The r/R-F coverage requires that the fuzzy clusterings of close samples are also similar. Therefore, it gives a full picture of the original distribution of samples and ensures that the clustering is effective.
III) A way of clustering that makes use of the fuzzy close relation and the r/R-F coverage is given. It improves on the clustering approach that loses much information in order to obtain a fuzzy equivalence relation.
IV) A set of ways to evaluate the validity of fuzzy clustering is put forward. It emphasizes the relationship between the results of clustering and the description of samples, and unifies the results of clustering with the original information of samples. On the basis of the original information, the results of clustering should hold the original information as much as possible. That is, the classes of close samples should be close to each other. This law reveals the true meaning of clustering.
V) Making use of the results, we clustered the samples of reference [9]. From the point of view of intuitive geometric meaning and distribution, when we use the Giles product operator and the Schweizer-Sklar product operator (s = -2), the results of clustering report the distribution of samples better than reference [9].

Based on the discussion and method above, applied to data mining for electronic commerce on the Internet, we have got good results. Our experiments have been done for Dongyu e-commerce Co. Ltd., Shenyang, China, and good economic benefits have already been obtained. So the validity and feasibility of this method have been tested.

Acknowledgement
This research is sponsored by the National Natural Science Foundation of China (No. 69873007).

About authors
HAO Xian-chen: professor, interests in data mining and so on.
ZHANG De-gan: Ph.D candidate of NEU, interests in information fusion and so on.
ZHAO Hai: professor of NEU, director of Ph.D, interests in information fusion and so on.

References
[1] W. H. Inmon. The Data Warehouse and Data Mining. Communication of the ACM, 1996, 39(11).
[2] Adriaans P, Zantinge D. Data Mining. England: Addison Wesley Longman, 1998, 40~100.
[3] Ragab M Z, Emam E G. On the min-max composition of fuzzy matrices[J]. Fuzzy Sets and Systems, 1995, (75): 83~92.
[4] Kamel S Mohamed. New algorithms for solving the fuzzy c-means clustering problem[J]. Pattern Recognition, 1994, 27: 421~428.
[5] Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithm[J]. Plenum Press, New York, 1997.
[6] Tan C, Witting G. A Study of the Parameters of a back propagation stock price prediction model[J]. In Proc. ANNES, 1998.
[7] Deboeck G J. Trading on the Edge: Neural, Genetic and Fuzzy Systems for Chaotic Financial Markets[M]. Wiley, 1999.
[8] Agrawal R, Mannila H, Srikant R, et al. Fast discovery of association rules[M]. In Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI/MIT Press, 1999, 307~328.
[9] Andrew K, Wing S. Relational Duals of the C-means Clustering Algorithms. Pattern Recognition, 1989, 22.
[10] Xu Ling-yu, Zhao Hai. Application of Neural Fusion to Accident Forecast in Hydropower Station. Proceedings of The Second International Conference on Information Fusion, Vol 2, 1999.
[11] Du Qing-dong, Zhao Hai. D-S Evidence Theory Applied to Fault Diagnosis of Generator Based on Embedded Sensors. Proceedings of The Third International Conference on Information Fusion, Vol 1, 2000.