Clustering Categorical Data Using an Extended Modularity Measure

Lazhar Labiod, Nistor Grozavu and Younès Bennani
LIPN-UMR 7030, Université Paris 13, 99, av. J-B Clément, 93430 Villetaneuse, France
email: {firstname.secondname}@lipn.univ-paris13.fr

Abstract. Newman and Girvan [12] recently proposed an objective function for graph clustering called the Modularity function, which allows automatic selection of the number of clusters. Empirically, higher values of the Modularity function have been shown to correlate well with good graph clusterings. In this paper we propose an extended Modularity measure for categorical data clustering; first, we establish its connection with the Relational Analysis criterion. The proposed Modularity measure introduces an automatic weighting scheme which takes into account the profile of each data object. A modified Relational Analysis algorithm is then presented to search for the partitions maximizing the criterion. This algorithm scales linearly with large data sets and allows natural cluster identification, i.e., it does not require fixing the number of clusters or the size of each cluster. Experimental results indicate that the new algorithm is efficient and effective at finding both good clusterings and the appropriate number of clusters across a variety of real-world data sets.

1 Introduction

In the exploratory analysis of high-dimensional data, one of the main tasks is the formation of a simplified, usually visual, overview of data sets. This can be achieved through simplified descriptions or summaries, which should provide the possibility of discovering the most relevant features or patterns. Clustering is one of the useful methods for achieving this task: classical clustering algorithms produce a grouping of the data according to a chosen criterion. Most algorithms use similarity measures based on the Euclidean distance. However, there are several types of data for which the use of this measure is not adequate. This is the case with categorical data since, generally, there is no known ordering between the feature values. If the data vectors contain categorical variables, geometric approaches are inappropriate and other strategies must be developed [4]. The problem of clustering becomes more challenging when the data is categorical, that is, when there is no inherent distance measure between data values. This is often the case in many applications where data is described by a set of descriptive or binary attributes, many of which are not numerical; examples include the country of origin and the color of eyes in demographic data. Many algorithms have been developed for clustering categorical data, e.g., (Barbara et al. [3], 2002; Gibson et al. [7], 1998; Huang [9], 1998). Modularity measures have been used recently for graph clustering [12][13][1]. In this paper, we show that the Modularity clustering criterion can be formally extended to categorical data clustering. We also establish the connections between the Modularity criterion and the Relational Analysis (RA) approach [10][11], which is based on Condorcet's criterion. We then develop an efficient procedure, inspired by the RA heuristic, to find the partition maximizing the Modularity criterion. Experiments demonstrate the efficiency and effectiveness of our approach. The rest of the paper is organized as follows: Section 2 introduces some notations and definitions; Section 3 presents the Relational Analysis (RA) approach; Section 4 provides the extended modularity measure and its connection with the RA criterion; Section 5 discusses the proposed optimization procedure; Section 6 shows our experimental results; and finally, Section 7 presents our conclusions.

2 Definitions and Notations

Let D be a dataset with a set I of N objects (O_1, O_2, ..., O_N) described by the set V of M categorical attributes (or variables) V^1, V^2, ..., V^m, ..., V^M, having p_1, ..., p_m, ..., p_M categories respectively, and let P = \sum_{m=1}^{M} p_m denote the total number of categories over all variables. Each categorical variable can be decomposed into a collection of indicator variables. For each variable V^m, let its p_m values correspond to the numbers from 1 to p_m and let V^m_1, V^m_2, ..., V^m_{p_m} be the binary variables such that, for each j, 1 ≤ j ≤ p_m, V^m_j = 1 if and only if V^m takes the j-th value. Then the data set can be expressed as a collection of M matrices K^m of size N × p_m (m = 1, ..., M) of general term k^m_{ij} such that:

k^m_{ij} = \begin{cases} 1 & \text{if object } i \text{ takes category } j \text{ of } V^m \\ 0 & \text{otherwise} \end{cases}   (1)

which gives the N × P binary disjunctive matrix K = (K^1 | K^2 | ... | K^m | ... | K^M). Let us recall that each variable V^m provides a similarity matrix S^m which can be expressed as S^m = K^m \, {}^t K^m, and the global similarity matrix as S = K \, {}^t K, where {}^t K^m and {}^t K are the transposes of K^m and K respectively.

2.1 Undirected graph and data matrices

An interesting connection between data matrices and graph theory can be established. A data matrix can be viewed as a weighted undirected graph G = (V, E), where V = I is the set of vertices and E the set of edges. In particular, the similarity matrix S can be viewed as a weighted undirected graph where each node i in I corresponds to a row, and the edge between i and i' has weight s_{ii'}, the element of the matrix at the intersection of row i and column i'.
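As an illustration, the disjunctive matrix K and the global similarity matrix S = K {}^t K can be built in a few lines (a minimal sketch; the toy dataset and its values are ours, not from the paper):

```python
import numpy as np

# Toy categorical dataset: N = 4 objects, M = 2 variables (invented values).
data = np.array([
    ["red",  "yes"],
    ["blue", "no"],
    ["red",  "no"],
    ["blue", "yes"],
])

# Build the disjunctive (one-hot) matrix K = (K^1 | ... | K^M).
blocks = []
for m in range(data.shape[1]):
    cats = np.unique(data[:, m])                    # the p_m categories of V^m
    blocks.append((data[:, m][:, None] == cats).astype(int))
K = np.hstack(blocks)                               # N x P binary matrix

# Global similarity matrix S = K K^t: s_ii' counts the agreeing variables.
S = K @ K.T
```

Each diagonal entry s_{ii} equals M, and each off-diagonal entry counts the variables on which the two objects agree.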


2.2 Equivalence relation

We set down some notation. Suppose that a set of N P-dimensional binary data vectors, represented as an N × P matrix K, is partitioned into L classes C = {C_1, C_2, ..., C_L}, and we want the points within each class to be similar to each other. We view C as an equivalence relation X which models a partition in a relational space and must respect the following properties:

\left\{
\begin{array}{ll}
x_{ii} = 1, \; \forall i & \text{(reflexivity)} \\
x_{ii'} - x_{i'i} = 0, \; \forall (i, i') & \text{(symmetry)} \\
x_{ii'} + x_{i'i''} - x_{ii''} \le 1, \; \forall (i, i', i'') & \text{(transitivity)} \\
x_{ii'} \in \{0, 1\}, \; \forall (i, i') & \text{(binarity)}
\end{array}
\right.   (2)
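A partition given by cluster labels induces such a relational matrix X, and the properties in (2) can be checked directly (a minimal sketch; the labels are illustrative):

```python
import numpy as np

# Cluster membership for N = 5 objects in 2 clusters (illustrative labels).
labels = np.array([0, 0, 1, 1, 0])

# Relational matrix X: x_ii' = 1 iff objects i and i' share a cluster.
X = (labels[:, None] == labels[None, :]).astype(int)

# Verify the equivalence-relation properties (2).
assert np.all(np.diag(X) == 1)                     # reflexivity
assert np.array_equal(X, X.T)                      # symmetry
N = len(labels)
for i in range(N):                                 # transitivity
    for j in range(N):
        for k in range(N):
            assert X[i, j] + X[j, k] - X[i, k] <= 1
```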

3 The Relational Analysis approach

The relational analysis theory is a data analysis technique initiated and developed at IBM in the 1970s by F. Marcotorchino and P. Michaud [10][11]. This technique has been used to solve many problems arising in fields such as preferences, voting systems, and clustering. The Relational Analysis approach is a clustering model that automatically provides a suitable number of clusters; it takes a similarity matrix as input. In our context, since we want to cluster the objects of the set I, the similarity matrix S is given, and we want to maximize the following clustering function:

R_{RA}(S, X) = \sum_{i} \sum_{i'} (s_{ii'} - m_{ii'}) \, x_{ii'}   (3)

where M = \left[ m_{ii'} = \frac{s_{ii} + s_{i'i'}}{4} \right]_{i,i'=1,...,N} is the matrix of threshold values. In other words, objects i and i' have a chance of being in the same cluster provided their similarity measure s_{ii'} is greater than or equal to their majority threshold m_{ii'}. X is the solution we are looking for: a binary relational matrix with general term x_{ii'} = 1 if object i is in the same cluster as object i', and x_{ii'} = 0 otherwise. X represents an equivalence relation, so it must respect the properties in (2).
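A direct computation of the criterion (3) from a similarity matrix and a candidate partition can be sketched as follows (the toy matrix is invented for illustration):

```python
import numpy as np

def ra_criterion(S, labels):
    """Condorcet / RA criterion of Eq. (3): sum_{i,i'} (s_ii' - m_ii') x_ii'."""
    S = np.asarray(S, dtype=float)
    d = np.diag(S)
    thr = (d[:, None] + d[None, :]) / 4.0            # m_ii' = (s_ii + s_i'i')/4
    X = (labels[:, None] == labels[None, :]).astype(float)
    return ((S - thr) * X).sum()

# Invented toy similarity matrix for 3 objects described by M = 2 variables.
S_toy = np.array([[2.0, 0.0, 1.0],
                  [0.0, 2.0, 1.0],
                  [1.0, 1.0, 2.0]])
```

Pairs whose similarity exceeds the threshold contribute positively when grouped; pairs below it contribute negatively, so the best partition is not forced to a fixed number of clusters.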

4 Extensions of the Modularity measure

This section shows how to adapt the Modularity measure for categorical data clustering.

4.1 Modularity and Graphs

Modularity is a recently proposed quality measure for graph clusterings; it has immediately received considerable attention in several disciplines [12][1]. As for the RA clustering problem, maximizing the modularity measure can be expressed as an integer linear program. Given the graph G = (V, E), let A be a binary, symmetric matrix whose (i, j) entry a_{ij} = 1 if there is an edge between nodes i and j, and a_{ij} = 0 if there is no edge between them. We note that in our problem setting, A is an input holding all the information on the given graph G and is often called the adjacency matrix. Finding a partition of the set of nodes V into homogeneous subsets leads to the resolution of the following integer linear program:

\max_X Q(A, X)   (4)

where

Q(A, X) = \frac{1}{2|E|} \sum_{i,i'=1}^{N} \left( a_{ii'} - \frac{a_{i.} \, a_{i'.}}{2|E|} \right) x_{ii'}   (5)

is the modularity measure, 2|E| = \sum_{i,i'} a_{ii'} = a_{..} is twice the number of edges, and a_{i.} = \sum_{i'} a_{ii'} is the degree of node i. X is the solution we are looking for, which must satisfy the properties of an equivalence relation defined on I × I.
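A minimal computation of Q(A, X) from an adjacency matrix and cluster labels can be sketched as follows (the example graph, two disconnected edges, is ours):

```python
import numpy as np

def modularity(A, labels):
    """Q(A, X) of Eq. (5): (1/2|E|) sum_{i,i'} (a_ii' - a_i. a_i'. / 2|E|) x_ii'."""
    A = np.asarray(A, dtype=float)
    two_E = A.sum()                        # 2|E| = a..
    deg = A.sum(axis=1)                    # degrees a_i.
    X = (labels[:, None] == labels[None, :]).astype(float)
    return ((A - np.outer(deg, deg) / two_E) * X).sum() / two_E

# Graph with two disconnected edges: (0, 1) and (2, 3).
A = np.array([[0, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 0, 0, 1],
              [0, 0, 1, 0]])
```

On this graph, grouping each edge into its own cluster yields Q = 1/2, while the trivial one-cluster partition yields Q = 0.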

4.2 First extension: Early Integration

The early integration consists in a direct combination of the graphs from all variables into a single dataset (graph) before applying the learning algorithm. Let us consider the Condorcet matrix S, where each entry s_{ii'} = \sum_{m=1}^{M} s^m_{ii'}, which can be viewed as the weight matrix associated with a graph G = (I, E) in which each edge e_{ii'} has weight s_{ii'}. Similarly to the classical Modularity measure, we define the extension Q_1(S, X) as follows:

Q_1(S, X) = \frac{1}{2|E|} \sum_{i,i'=1}^{N} \left( s_{ii'} - \frac{s_{i.} \, s_{i'.}}{2|E|} \right) x_{ii'}   (6)

where 2|E| = \sum_{i,i'} s_{ii'} = s_{..} is the total edge weight and s_{i.} = \sum_{i'} s_{ii'} is the degree of i.

4.3 Modularity extensions as a modified RA criterion

This subsection shows the theoretical connection between the RA criterion and the proposed extension of the modularity measure. Indeed, the function Q_1(S, X) can be expressed as a modified RA criterion in the following way:

Q_1(S, X) = \frac{1}{2|E|} \left( R_{RA}(S, X) + \psi_1(S, X) \right)   (7)

where

\psi_1(S, X) = \sum_{i=1}^{N} \sum_{i'=1}^{N} \left( m_{ii'} - \frac{s_{i.} \, s_{i'.}}{2|E|} \right) x_{ii'}   (8)

is a weighting term that depends on the profile of each pair of objects (i, i'). This extension of the modularity measure thus makes it possible to introduce a weighting scheme depending on the profile of each data object.
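The decomposition (7) can be verified numerically on a small example (the Condorcet matrix below is an invented toy, not data from the paper):

```python
import numpy as np

# Invented toy Condorcet matrix S for N = 3 objects (M = 2 variables, so s_ii = 2).
S = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 2.0]])
labels = np.array([0, 0, 1])
X = (labels[:, None] == labels[None, :]).astype(float)

two_E = S.sum()                                            # 2|E| = s..
deg = S.sum(axis=1)                                        # s_i.
thr = (np.diag(S)[:, None] + np.diag(S)[None, :]) / 4.0    # m_ii'

Q1   = ((S - np.outer(deg, deg) / two_E) * X).sum() / two_E   # Eq. (6)
R_RA = ((S - thr) * X).sum()                                  # Eq. (3)
psi1 = ((thr - np.outer(deg, deg) / two_E) * X).sum()         # Eq. (8)
```

The identity Q_1 = (R_RA + ψ_1) / 2|E| holds term by term: adding and subtracting the threshold m_{ii'} inside (6) splits it into the RA criterion and the weighting term.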

5 Optimization Procedure

As the objective function is linear with respect to X and the constraints that X must respect are linear equations, we could theoretically solve the problem using an integer linear programming solver. However, this problem is NP-hard; as a result, in practice we use heuristics to deal with large data sets.

5.1 Modularity decomposition

The extension of the modularity measure can be decomposed in terms of the contribution of each object i to each cluster C_l of the searched partition as follows:

Q_1(S, X) = \sum_{l=1}^{L} \sum_{i \in C_l} cont_{Q_1}(i, C_l)   (9)

where

cont_{Q_1}(i, C_l) = \frac{1}{2|E|} \sum_{i' \in C_l} \left( s_{ii'} - \frac{s_{i.} \, s_{i'.}}{2|E|} \right)   (10)

Using the transformations s_{ii'} = \langle K_i, K_{i'} \rangle and s_{i.} = \sum_{i''} \langle K_i, K_{i''} \rangle, the contribution expression becomes¹

cont_{Q_1}(i, C_l) = \frac{1}{2|E|} \sum_{i' \in C_l} \left( \langle K_i, K_{i'} \rangle - \frac{\sum_{i''} \langle K_i, K_{i''} \rangle \sum_{i''} \langle K_{i'}, K_{i''} \rangle}{2|E|} \right)   (11)

= \frac{1}{2|E|} \left( \langle K_i, P_l \rangle - \sum_{i' \in C_l} \delta_{ii'} \right)   (12)

where P_l = \sum_{i' \in C_l} K_{i'} is the prototype of cluster C_l and

\delta_{ii'} = \frac{\sum_{i''} \langle K_i, K_{i''} \rangle \sum_{i''} \langle K_{i'}, K_{i''} \rangle}{2|E|}   (13)
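The prototype-based contribution formula (12) can be checked against the direct computation of Q_1 (a sketch on an invented toy disjunctive matrix; the function and variable names are ours):

```python
import numpy as np

# Toy disjunctive matrix K: N = 4 objects, M = 2 variables, P = 4 categories.
K = np.array([[1, 0, 1, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 1, 0, 1]], dtype=float)
S = K @ K.T                                # s_ii' = <K_i, K_i'>
labels = np.array([0, 0, 1, 1])
two_E = S.sum()                            # 2|E| = s..
s_dot = S.sum(axis=1)                      # s_i. = sum_i'' <K_i, K_i''>

def contribution(i, cluster):
    """cont_Q1(i, C_l) via Eq. (12), using the cluster prototype P_l."""
    P_l = K[cluster].sum(axis=0)                   # prototype: sum of K_i', i' in C_l
    delta = s_dot[i] * s_dot[cluster] / two_E      # delta_ii' of Eq. (13)
    return (K[i] @ P_l - delta.sum()) / two_E

# Eq. (9): Q_1 is the sum of contributions of each object to its own cluster.
clusters = [np.where(labels == l)[0] for l in np.unique(labels)]
Q1_from_cont = sum(contribution(i, c) for c in clusters for i in c)
```

The prototype form only needs the N × P matrix K, not the full N × N similarity matrix, which is the computational gain mentioned in the footnote.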

The new contribution formula introduces an automatic weighting scheme: depending on the weight δ_{ii'}, its value will be greater than, less than, or equal to the RA contribution.¹ The contribution formula cont_{Q_1} can be written in terms of the RA contribution cont_{RA} by adding a weighting term that depends on the profile of each pair of objects (i, i'):

cont_{Q_1}(i, C_l) = \frac{1}{2|E|} \left[ \left( \langle K_i, P_l \rangle - \sum_{i' \in C_l} m_{ii'} \right) + \sum_{i' \in C_l} (m_{ii'} - \delta_{ii'}) \right]   (14)

= \frac{1}{2|E|} \left[ cont_{RA}(i, C_l) + \sum_{i' \in C_l} (m_{ii'} - \delta_{ii'}) \right]   (15)

¹ Let us recall that this rewriting of the contribution formula considerably reduces the computational cost related to the square similarity matrix S and allows each cluster C_l to be characterized by its prototype P_l.

The change in the contribution formula is interesting because it automatically introduces a weighting relative to the profiles of the data objects, without requiring the intervention of an expert. There are three scenarios:
1. Taking δ_{ii'} = m_{ii'}, ∀(i, i'), we recover the case of the RA algorithm.
2. If the weight δ_{ii'} is less than m_{ii'}, ∀(i, i'), the contribution cont_{Q_1} is greater than the old contribution cont_{RA}, and therefore more likely to be positive; observation i then tends to be aggregated to an existing cluster, and the number of clusters will be smaller.
3. If the weight δ_{ii'} is greater than m_{ii'}, ∀(i, i'), the contribution cont_{Q_1} is smaller than the old contribution cont_{RA}, and therefore more likely to be negative; observation i then tends to be placed in a new cluster, and the number of clusters will be larger.

5.2 Relational Analysis heuristic

The heuristic process starts from an initial cluster (a singleton) and builds, in an incremental way, a partition of the set I by increasing the value of Condorcet's criterion R_RA(S, X) at each assignment. We give in Algorithm 1 the description of the relational analysis algorithm used by the Relational Analysis methodology (see Marcotorchino and Michaud for further details). The presented algorithm aims at maximizing the criteria (R_RA and Q_1) given above, based on the contribution computation. We have to fix a number of iterations in order to obtain an approximate solution in a reasonable processing time. A maximum number of clusters is also required, but since we do not need to fix this parameter, we set L_max = N by default. Basically, this algorithm has an O(N_iter × L_max × N) computational cost. In general terms, we can assume that N_iter
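A simplified reading of this incremental heuristic can be sketched as follows; this is our own sketch of the idea, not the authors' exact Algorithm 1 (the sweep order, tie-breaking, and stopping rule are assumptions):

```python
import numpy as np

def ra_heuristic(S, n_iter=10):
    """Incremental RA-style heuristic (simplified sketch): each object joins the
    cluster maximizing its RA contribution, or starts a new cluster when every
    contribution is negative, so the number of clusters emerges automatically."""
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    thr = (np.diag(S)[:, None] + np.diag(S)[None, :]) / 4.0   # thresholds m_ii'
    labels = -np.ones(N, dtype=int)
    labels[0] = 0                          # initial singleton cluster
    for _ in range(n_iter):                # fixed number of sweeps (N_iter)
        for i in range(N):
            labels[i] = -1                 # temporarily remove object i
            best_l, best_c = None, 0.0
            for l in np.unique(labels[labels >= 0]):
                members = np.where(labels == l)[0]
                c = (S[i, members] - thr[i, members]).sum()   # cont_RA(i, C_l)
                if c > best_c:
                    best_l, best_c = l, c
            if best_l is None:             # no positive contribution: new cluster
                best_l = labels.max() + 1
            labels[i] = best_l
    # relabel clusters consecutively
    _, labels = np.unique(labels, return_inverse=True)
    return labels
```

On a block-structured similarity matrix the sweep converges in one or two passes, and the cost per sweep matches the stated O(L_max × N) per iteration.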