ICTAI 2014

Limassol-Cyprus

CATEGORICAL DATA CLUSTERING: A CORRELATION-BASED APPROACH FOR UNSUPERVISED ATTRIBUTE WEIGHTING

Joel Carbonera, Mara Abel

Institute of Informatics, UFRGS, Porto Alegre, Brazil

Joel Luis Carbonera [email protected] BDI group - UFRGS www.inf.ufrgs.br/bdi

INTRODUCTION


Clustering
• Technique in which objects are partitioned into groups in such a way that objects in the same group (or cluster) are more similar to one another than to objects in other clusters.

Categorical data clustering
• Refers to the clustering of objects that are described by categorical attributes.
• The values of a categorical attribute are not inherently comparable: they have no natural order or distance.

Challenges
• Categorical data sets are often high-dimensional.
– In high-dimensional spaces, the dissimilarity between a given object x and its nearest object tends to be close to the dissimilarity between x and its farthest object.
– As a result, discovering meaningful, separable clusters becomes a very challenging task.

Subspace clustering approaches
• To handle high dimensionality, some works take advantage of the fact that clusters usually occur in a subspace defined by a subset of the initially selected attributes.

Soft subspace clustering approaches
• Each attribute receives a different weight in each cluster, measuring its contribution to the formation of that cluster.
• Instead of a single global weight vector for the whole data set, different weight vectors are assigned to different clusters.
• In these approaches, the strategy for attribute weighting plays a crucial role.

Our approach
• We propose a strategy for measuring the contribution of each attribute by considering its correlations with the other attributes.
– Studies in the cognitive sciences suggest that humans spontaneously learn categories by exploring the correlations among the attributes of the perceived objects.

Our approach
• We propose the correlational relevance index (cri).
– It measures the relevance of a given categorical attribute ai.
– cri(ai) is directly proportional to the degree of coincidence between the values of all attributes of the dataset (a1,…,am) and the values of the attribute ai.
– The larger cri(ai) is, the higher the relevance of ai.

CORRELATIONAL RELEVANCE INDEX


Example of dataset

Data objects  a1  a2  a3  a4  a5
x1            l   a   d   g   i
x2            m   a   d   h   i
x3            n   a   e   g   i
x4            o   b   e   h   j
x5            p   b   e   g   j
x6            q   b   e   h   j
x7            r   b   e   g   k
x8            s   c   e   h   k
x9            t   c   f   g   k
x10           u   c   f   h   k

Correlational relevance index
• It is necessary to measure the co-occurrence of values of distinct attributes.
• The function Ψ(ah(l), aj(p)) maps the values ah(l) (of attribute ah) and aj(p) (of attribute aj) to the number of objects in the cluster in which these values co-occur.

Correlational relevance index
• Example, using the dataset above (value b of a2 and value e of a3):

Ψ(b, e) = |{x4,x5,x6,x7}| = 4
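The co-occurrence count Ψ can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `DATA` encoding, the function name `psi`, and the 0-based attribute indices are assumptions made here, and the whole toy dataset is treated as a single cluster.

```python
# Toy dataset from the slides: one string per object x1..x10,
# one character per attribute a1..a5 (encoding chosen for this sketch).
DATA = [tuple(row) for row in
        "ladgi madhi naegi obehj pbegj qbehj rbegk scehk tcfgk ucfhk".split()]


def psi(data, h, value_h, j, value_j):
    """Count the objects where attribute h equals value_h and attribute j equals value_j.

    Attribute indices are 0-based (a2 -> 1, a3 -> 2).
    """
    return sum(1 for obj in data if obj[h] == value_h and obj[j] == value_j)


print(psi(DATA, 1, "b", 2, "e"))  # 4, the value shown on the slide
```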


Correlational relevance index
• The function M(ah(l), aj) maps a value ah(l) (of attribute ah) and a given attribute aj to the greatest value that Ψ(ah(l), aj(p)) can take, considering all possible values aj(p) of attribute aj.



• Example, using the dataset above:
Ψ(b, d) = |{}| = 0
Ψ(b, e) = |{x4,x5,x6,x7}| = 4
Ψ(b, f) = |{}| = 0
M(b, a3) = 4
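One possible way to compute M is to tally the values of aj among the objects that hold the given value of ah and take the largest tally. The sketch below assumes the same toy encoding as the slides' dataset; the name `max_cooccurrence` and the 0-based indices are illustrative choices, not the authors' code.

```python
from collections import Counter

# Toy dataset from the slides: one string per object x1..x10,
# one character per attribute a1..a5 (encoding chosen for this sketch).
DATA = [tuple(row) for row in
        "ladgi madhi naegi obehj pbegj qbehj rbegk scehk tcfgk ucfhk".split()]


def max_cooccurrence(data, h, value_h, j):
    """M: largest co-occurrence count between value_h (attribute h) and any value of attribute j."""
    # Values of attribute j that never co-occur with value_h are simply absent
    # from the counter; their count of 0 cannot change the maximum.
    counts = Counter(obj[j] for obj in data if obj[h] == value_h)
    return max(counts.values(), default=0)


print(max_cooccurrence(DATA, 1, "b", 2))  # 4, i.e. M(b, a3)
```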

Correlational relevance index
• The function α is defined as:

$$\alpha(a_h^{(l)}, a_j) = \frac{M(a_h^{(l)}, a_j)}{\mathit{frequency}(a_h^{(l)})}$$



α(b, a3) = 4/4 = 1
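A small sketch of α, taking the frequency of a value to be the number of objects that hold it (as the 4/4 example suggests). The function name `alpha`, the data encoding, and the use of exact fractions are assumptions made for illustration.

```python
from collections import Counter
from fractions import Fraction

# Toy dataset from the slides: one string per object x1..x10,
# one character per attribute a1..a5 (encoding chosen for this sketch).
DATA = [tuple(row) for row in
        "ladgi madhi naegi obehj pbegj qbehj rbegk scehk tcfgk ucfhk".split()]


def alpha(data, h, value_h, j):
    """Ratio of the best co-occurrence count M to the frequency of value_h.

    Assumes value_h occurs at least once in attribute h (0-based index).
    """
    counts = Counter(obj[j] for obj in data if obj[h] == value_h)
    frequency = sum(counts.values())  # number of objects holding value_h
    return Fraction(max(counts.values()), frequency)


print(alpha(DATA, 1, "b", 2))  # 1
print(alpha(DATA, 1, "a", 2))  # 2/3
```

Exact fractions are used only to make the ratios easy to read; floating-point division would serve equally well.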


Correlational relevance index
• Maximum co-occurrence correlation index (mcci): measured between two given categorical attributes ah and aj through the function mcci:

$$mcci(a_h, a_j) = \frac{\sum_{l=1}^{|dom(a_h)|} \alpha(a_h^{(l)}, a_j)}{|dom(a_h)|}$$

where |dom(ah)| is the number of values that the attribute ah can assume.



$$mcci(a_2, a_3) = \frac{\alpha(a, a_3) + \alpha(b, a_3) + \alpha(c, a_3)}{3} = \frac{0.67 + 1 + 0.67}{3} = 0.78$$
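The averaging step can be sketched as below. This is an illustrative reading of the definition, assuming that dom(ah) is the set of values actually observed in the data; the function name `mcci` mirrors the slides, while the data encoding and 0-based indices are choices made here.

```python
from collections import Counter, defaultdict

# Toy dataset from the slides: one string per object x1..x10,
# one character per attribute a1..a5 (encoding chosen for this sketch).
DATA = [tuple(row) for row in
        "ladgi madhi naegi obehj pbegj qbehj rbegk scehk tcfgk ucfhk".split()]


def mcci(data, h, j):
    """Average, over the observed values of attribute h, of alpha(value, attribute j)."""
    # For every value of attribute h, tally the values of attribute j it occurs with.
    by_value = defaultdict(Counter)
    for obj in data:
        by_value[obj[h]][obj[j]] += 1
    # alpha = largest co-occurrence count divided by the frequency of the value.
    alphas = [max(c.values()) / sum(c.values()) for c in by_value.values()]
    return sum(alphas) / len(alphas)


print(round(mcci(DATA, 1, 2), 2))  # 0.78, i.e. mcci(a2, a3)
```

Note that the measure is not symmetric: on this dataset `mcci(DATA, 2, 1)` evaluates to roughly 0.89.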


Correlational relevance index
• Correlational relevance index (cri): assigned to a given attribute ah by the function cri:

$$cri(a_h) = \frac{\sum_{j=1}^{|A|} mcci(a_j, a_h)}{|A|}$$

where |A| is the number of attributes in the dataset.



Attributes  a1    a2    a3    a4    a5
CRI         0.44  0.84  0.82  0.74  0.83
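The CRI values in the table can be reproduced with a short sketch that averages mcci(aj, ah) over all attributes aj, including j = h (that term equals 1). The code is an illustration under assumptions, not the authors' implementation: the data encoding, the helper names, and the 0-based indices are chosen here.

```python
from collections import Counter, defaultdict

# Toy dataset from the slides: one string per object x1..x10,
# one character per attribute a1..a5 (encoding chosen for this sketch).
DATA = [tuple(row) for row in
        "ladgi madhi naegi obehj pbegj qbehj rbegk scehk tcfgk ucfhk".split()]


def mcci(data, h, j):
    """Average, over the observed values of attribute h, of alpha(value, attribute j)."""
    by_value = defaultdict(Counter)
    for obj in data:
        by_value[obj[h]][obj[j]] += 1
    alphas = [max(c.values()) / sum(c.values()) for c in by_value.values()]
    return sum(alphas) / len(alphas)


def cri(data, h):
    """Mean of mcci(a_j, a_h) over every attribute a_j of the dataset."""
    n_attributes = len(data[0])
    return sum(mcci(data, j, h) for j in range(n_attributes)) / n_attributes


print([round(cri(DATA, h), 2) for h in range(5)])
# [0.44, 0.84, 0.82, 0.74, 0.83]
```

The attribute a1, whose values are all distinct, receives the lowest relevance, which matches the intuition that an identifier-like attribute says little about cluster structure.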


CONCLUSION


Conclusion
• We investigated how to use the correlation among categorical attributes to measure their relevance in clustering tasks.
• As a result, we propose a correlation-based attribute weighting approach for categorical attributes.

Conclusion
• Currently, we are applying this approach to develop variations of well-known extensions of the k-modes algorithm.
• In our preliminary experiments, the resulting algorithms achieved performance comparable to that of state-of-the-art algorithms.

Download
The source code of the algorithm that implements our approach for attribute weighting can be downloaded at: http://www.inf.ufrgs.br/~jlcarbonera/?page_id=51


Appendix: CK-modes
• CK-modes (correlation-based k-modes):
– Is a subspace clustering algorithm.
– Extends the basic k-modes algorithm.
– Uses the correlational relevance index (cri) to measure the relevance of each attribute, both globally and locally.

Appendix: Dissimilarity measure
CK-modes adopts a function d that computes the dissimilarity between an object xi and a cluster mode zl:

$$d(x_i, z_l) = \sum_{j=1}^{|A|} \theta_{a_j}(x_i, z_l)$$

where

$$\theta_{a_j}(x_i, z_l) = \begin{cases} 1, & x_{ij} \neq z_{lj} \\ 1 - LW_{a_j} \cdot GW_{a_j}, & x_{ij} = z_{lj} \end{cases}$$

and
• LW_aj (local weights): cri applied to the attribute aj within the cluster.
• GW_aj (global weights): cri applied globally to the attribute aj.
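The weighted dissimilarity can be sketched as follows. This is a hedged illustration of the formula only: the function name `dissimilarity`, the example mode, and the local weights (all set to 1.0) are hypothetical, while the global weights are the CRI values from the table in the main slides.

```python
def dissimilarity(obj, mode, local_w, global_w):
    """Sum the per-attribute terms: 1 on a mismatch, 1 - LW * GW on a match."""
    total = 0.0
    for x, z, lw, gw in zip(obj, mode, local_w, global_w):
        total += 1.0 if x != z else 1.0 - lw * gw
    return total


GLOBAL_W = [0.44, 0.84, 0.82, 0.74, 0.83]  # CRI values from the slides
LOCAL_W = [1.0, 1.0, 1.0, 1.0, 1.0]        # hypothetical placeholder weights

x4 = tuple("obehj")    # object x4 from the example dataset
mode = tuple("pbegj")  # hypothetical cluster mode

# Mismatches on a1 and a4 cost 1 each; matches on a2, a3, a5 cost
# 0.16, 0.18 and 0.17 respectively.
print(round(dissimilarity(x4, mode, LOCAL_W, GLOBAL_W), 2))  # 2.51
```

With this form, a match on a highly relevant attribute reduces the dissimilarity more than a match on a weakly relevant one.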


Appendix Algorithm


Appendix Comparison of accuracy

Each cell shows both the best performance (at the top) and the average performance (at the bottom).

