ICTAI 2014
Limassol-Cyprus
CATEGORICAL DATA CLUSTERING: A CORRELATION-BASED APPROACH FOR UNSUPERVISED ATTRIBUTE WEIGHTING
Joel Carbonera, Mara Abel
Institute of Informatics, UFRGS, Porto Alegre, Brazil
Joel Luis Carbonera
[email protected] BDI group - UFRGS www.inf.ufrgs.br/bdi
INTRODUCTION
Clustering
• Clustering:
  – Technique in which objects are partitioned into groups in such a way that objects in the same group (or cluster) are more similar to one another than to objects in other clusters.
Categorical data clustering
• Categorical data clustering:
  – Refers to the clustering of objects that are described by categorical attributes.
• Categorical values have no inherent order, so they are not directly comparable.
Challenges
• Categorical data sets are often high-dimensional.
  – In high-dimensional spaces, the dissimilarity between a given object x and its nearest object tends to be close to the dissimilarity between x and its farthest object.
  – Discovering meaningful, separable clusters becomes a very challenging task.
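The effect described above can be illustrated with a small experiment. This is a minimal sketch under assumptions that are not in the slides: the data is uniformly random, dissimilarity is the simple matching (Hamming) distance, and the function name `nearest_to_farthest_ratio` and all parameter values are hypothetical.

```python
import random

def hamming(a, b):
    """Simple matching dissimilarity: number of attributes whose values differ."""
    return sum(1 for u, v in zip(a, b) if u != v)

def nearest_to_farthest_ratio(num_attributes, num_objects=200, num_values=10, seed=0):
    """Ratio between the nearest and farthest dissimilarity from one query object.

    Objects are drawn uniformly at random (an assumption made only for this
    illustration). A ratio close to 1 means nearest and farthest objects are
    almost equally dissimilar from the query.
    """
    rng = random.Random(seed)
    objects = [[rng.randrange(num_values) for _ in range(num_attributes)]
               for _ in range(num_objects)]
    query, others = objects[0], objects[1:]
    distances = [hamming(query, other) for other in others]
    return min(distances) / max(distances)

for m in (2, 20, 500):
    print(m, round(nearest_to_farthest_ratio(m), 2))
```

With few attributes the ratio stays low, while with hundreds of attributes it moves toward 1, which is the loss of contrast that makes separable clusters hard to find.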
Subspace clustering approaches
• To handle high dimensionality, some works take advantage of the fact that clusters usually occur in a subspace defined by a subset of the initially selected attributes.
Soft subspace clustering approaches
• Soft subspace clustering approaches
  – Different weights are assigned to each attribute in each cluster, measuring the attribute's contribution to the formation of that cluster.
  – Instead of assigning a global weight vector for the whole data set, different weight vectors are assigned to different clusters.
  – In these approaches, the strategy for attribute weighting plays a crucial role.
Our approach
• We propose a strategy for measuring the contribution of each attribute by considering its correlations with other attributes.
  – Studies in the cognitive sciences suggest that humans spontaneously learn categories by exploring the correlations among the attributes of the perceived objects.
Our approach
• We propose the Correlational relevance index (cri).
  – It can be used for measuring the relevance of a given categorical attribute ai.
  – cri(ai) is directly proportional to the degree to which the values of all attributes of the dataset (a1,…,am) coincide with the values of the attribute ai.
  – The larger cri(ai) is, the higher the relevance of ai.
CORRELATIONAL RELEVANCE INDEX
Example of dataset

Data objects  a1  a2  a3  a4  a5
x1            l   a   d   g   i
x2            m   a   d   h   i
x3            n   a   e   g   i
x4            o   b   e   h   j
x5            p   b   e   g   j
x6            q   b   e   h   j
x7            r   b   e   g   k
x8            s   c   e   h   k
x9            t   c   f   g   k
x10           u   c   f   h   k
Correlational relevance index
• It is necessary to measure the co-occurrence of values of distinct attributes.
• The function Ψ(ah(l), aj(p)) maps the values ah(l) (of attribute ah) and aj(p) (of attribute aj) to the number of objects in the cluster in which these values co-occur.
Ψ(b, e) = |{x4,x5,x6,x7}| = 4
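The co-occurrence count can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the compact string encoding of the example dataset (one string per object, one character per attribute) and the name `psi` are assumptions made here.

```python
# Example dataset from the slides: objects x1..x10, attributes a1..a5.
# Each object is encoded as a 5-character string (illustrative encoding).
DATA = ["ladgi", "madhi", "naegi", "obehj", "pbegj",
        "qbehj", "rbegk", "scehk", "tcfgk", "ucfhk"]

def psi(data, h, value_h, j, value_j):
    """Number of objects whose attribute h equals value_h and attribute j equals value_j."""
    return sum(1 for obj in data if obj[h] == value_h and obj[j] == value_j)

# Attributes are 0-indexed in the code: a2 is index 1 and a3 is index 2.
print(psi(DATA, 1, "b", 2, "e"))  # 4
```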
Correlational relevance index
• The function M(ah(l), aj) maps a value ah(l) (of attribute ah) and a given attribute aj to the greatest value that Ψ(ah(l), aj(p)) can take, considering all possible values aj(p) of attribute aj.
Ψ(b, d) = |{}| = 0
Ψ(b, e) = |{x4,x5,x6,x7}| = 4
Ψ(b, f) = |{}| = 0
M(b, a3) = 4
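A minimal sketch of M, assuming the example dataset is encoded as one string per object (one character per attribute). The encoding and the name `max_cooccurrence` are illustrative choices, not part of the slides.

```python
from collections import Counter

# Example dataset from the slides: objects x1..x10, attributes a1..a5.
DATA = ["ladgi", "madhi", "naegi", "obehj", "pbegj",
        "qbehj", "rbegk", "scehk", "tcfgk", "ucfhk"]

def max_cooccurrence(data, h, value_h, j):
    """M: the largest co-occurrence count between value_h (attribute h) and any value of attribute j."""
    # Count the values of attribute j among the objects that carry value_h.
    counts = Counter(obj[j] for obj in data if obj[h] == value_h)
    return max(counts.values(), default=0)

# a2 is index 1 and a3 is index 2 (0-indexed attributes).
print(max_cooccurrence(DATA, 1, "b", 2))  # 4
```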
Correlational relevance index
The function α is defined as:
$$\alpha(a_h^{(l)}, a_j) = \frac{M(a_h^{(l)}, a_j)}{\mathrm{frequency}(a_h^{(l)})}$$
α(b, a3) = 4/4 = 1
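The ratio can be sketched as follows, assuming that frequency(ah(l)) is the number of objects that carry the value ah(l), which is consistent with the 4/4 example. The string encoding of the dataset and the name `alpha` are illustrative assumptions.

```python
from collections import Counter

# Example dataset from the slides: objects x1..x10, attributes a1..a5.
DATA = ["ladgi", "madhi", "naegi", "obehj", "pbegj",
        "qbehj", "rbegk", "scehk", "tcfgk", "ucfhk"]

def alpha(data, h, value_h, j):
    """alpha: M(value_h, attribute j) divided by the frequency of value_h."""
    # Values of attribute j among the objects that carry value_h on attribute h.
    co_values = [obj[j] for obj in data if obj[h] == value_h]
    if not co_values:
        raise ValueError("value_h does not occur in attribute h")
    return max(Counter(co_values).values()) / len(co_values)

# a2 is index 1 and a3 is index 2 (0-indexed attributes).
print(alpha(DATA, 1, "b", 2))  # 1.0
```

A value of 1 means that every object with the given value of ah shares a single value of aj.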
Correlational relevance index
• Maximum co-occurrence correlation index (mcci). It can be measured between two given categorical attributes ah and aj through the function mcci:
$$\mathrm{mcci}(a_h, a_j) = \frac{\sum_{l=1}^{|\mathrm{dom}(a_h)|} \alpha(a_h^{(l)}, a_j)}{|\mathrm{dom}(a_h)|}$$
where |dom(ah)| is the number of values that the attribute ah can assume.
$$\mathrm{mcci}(a_2, a_3) = \frac{\alpha(a, a_3) + \alpha(b, a_3) + \alpha(c, a_3)}{3} = \frac{0.67 + 1 + 0.67}{3} = 0.78$$
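The mcci computation for the example can be reproduced with a short sketch. It assumes that dom(ah) is the set of values observed in the data and that the dataset is encoded as one string per object; both, like the function name, are illustrative assumptions.

```python
from collections import Counter

# Example dataset from the slides: objects x1..x10, attributes a1..a5.
DATA = ["ladgi", "madhi", "naegi", "obehj", "pbegj",
        "qbehj", "rbegk", "scehk", "tcfgk", "ucfhk"]

def mcci(data, h, j):
    """Average, over the observed values of attribute h, of alpha(value, attribute j)."""
    domain_h = sorted({obj[h] for obj in data})
    total = 0.0
    for value in domain_h:
        # Values of attribute j among objects carrying this value of attribute h.
        counts = Counter(obj[j] for obj in data if obj[h] == value)
        total += max(counts.values()) / sum(counts.values())
    return total / len(domain_h)

# a2 is index 1 and a3 is index 2 (0-indexed attributes).
print(round(mcci(DATA, 1, 2), 2))  # 0.78
```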
Correlational relevance index
• Correlational relevance index (cri). It can be assigned to a given attribute ah as defined by the function cri:
$$\mathrm{cri}(a_h) = \frac{\sum_{j=1}^{|A|} \mathrm{mcci}(a_j, a_h)}{|A|}$$
where |A| is the number of attributes in the dataset.
Attributes  a1    a2    a3    a4    a5
CRI         0.44  0.84  0.82  0.74  0.83
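The CRI values for the example dataset can be recomputed with the sketch below. It is an illustration under stated assumptions, not the authors' released code: the dataset is encoded as one string per object, dom(ah) is taken to be the set of observed values, and the sum runs over all attributes, including ah itself.

```python
from collections import Counter

# Example dataset from the slides: objects x1..x10, attributes a1..a5.
DATA = ["ladgi", "madhi", "naegi", "obehj", "pbegj",
        "qbehj", "rbegk", "scehk", "tcfgk", "ucfhk"]

def mcci(data, h, j):
    """Average, over the observed values of attribute h, of alpha(value, attribute j)."""
    domain_h = sorted({obj[h] for obj in data})
    total = 0.0
    for value in domain_h:
        counts = Counter(obj[j] for obj in data if obj[h] == value)
        total += max(counts.values()) / sum(counts.values())
    return total / len(domain_h)

def cri(data, h):
    """Mean of mcci(a_j, a_h) over every attribute a_j, with a_h as the target."""
    num_attributes = len(data[0])
    return sum(mcci(data, j, h) for j in range(num_attributes)) / num_attributes

print([round(cri(DATA, h), 2) for h in range(5)])
# [0.44, 0.84, 0.82, 0.74, 0.83]
```

The attribute a1, whose values are all distinct, says little about the other attributes and receives the lowest index.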
CONCLUSION
Conclusion
• We investigate how to use the correlation among categorical attributes for measuring their relevance in clustering tasks.
• As a result, we propose a correlation-based attribute weighting approach for categorical attributes.
Conclusion
• Currently, we are applying this approach to develop variations of well-known extensions of the k-modes algorithm.
• In our preliminary experiments, the resulting algorithms achieve good results, comparable to the performance of state-of-the-art algorithms.
Download The source code of the algorithm that implements our approach for attribute weighting can be downloaded at: http://www.inf.ufrgs.br/~jlcarbonera/?page_id=51
Appendix CK-modes
• CK-modes (correlation-based k-modes)
  – Is a subspace clustering algorithm.
  – Extends the basic k-modes.
  – Considers the Correlational relevance index (cri) for measuring the relevance of each attribute (globally and locally).
Appendix Dissimilarity measure
CK-modes adopts a function d that computes the dissimilarity between an object xi and a cluster mode zl:
$$d(x_i, z_l) = \sum_{j=1}^{|A|} \theta_{a_j}(x_i, z_l)$$
where
$$\theta_{a_j}(x_i, z_l) = \begin{cases} 1, & x_{ij} \neq z_{lj} \\ 1 - LW_{a_j} \cdot GW_{a_j}, & x_{ij} = z_{lj} \end{cases}$$
and
• LWaj (local weights): cri applied to the attribute aj within the cluster.
• GWaj (global weights): cri globally applied to the attribute aj.
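The dissimilarity function can be sketched as below. The weights passed in are placeholders chosen only for illustration; in the approach described above they would be cri values computed within the cluster (local) and over the whole dataset (global). The function name and argument layout are assumptions.

```python
def dissimilarity(obj, mode, local_weights, global_weights):
    """Sum of per-attribute terms: 1 on a mismatch, 1 - LW * GW on a match."""
    total = 0.0
    for x, z, lw, gw in zip(obj, mode, local_weights, global_weights):
        if x != z:
            total += 1.0             # values differ: full penalty
        else:
            total += 1.0 - lw * gw   # values agree: penalty reduced by the weights
    return total

# Hypothetical weights (all 0.5) for a five-attribute object and mode.
obj, mode = "obehj", "pbegj"
weights = [0.5] * 5
print(dissimilarity(obj, mode, weights, weights))  # 4.25
```

Here two attributes mismatch (contributing 2) and three match (contributing 0.75 each), so matches on highly weighted attributes pull an object closer to the mode.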
Appendix Algorithm
Appendix Comparison of accuracy
Each cell shows both the best performance (at the top) and the average performance (at the bottom).