Dynamic clustering of histograms using Wasserstein metric

Antonio Irpino¹, Rosanna Verde¹, and Yves Lechevallier²

¹ Facoltà di studi politici e per l'alta formazione europea e mediterranea, Seconda Università degli studi di Napoli, Caserta, Italy
  [email protected], [email protected]
² INRIA Rocquencourt, France
  [email protected]

In the present paper we introduce a new distance, based on the Wasserstein metric, for clustering a set of data described by distributions with finite continuous support. The proposed distance allows us to define a measure of the inertia of the data with respect to a barycenter that satisfies the Huygens theorem of decomposition of inertia. This measure is therefore proposed as the allocation function in the dynamic clustering process, since it allows the criterion of minimum within-class inertia, computed with respect to the class barycenters, to be optimized. An application to real data illustrates the procedure.

1 Introduction

In many real applications, data are collected and/or represented by frequency distributions. If Y is a numerical, continuous variable, many distinct values y_i can be observed. In these cases, the values are usually grouped into a smaller number H of consecutive and disjoint bins I_h (groups, classes, intervals, etc.). The frequency distribution of the variable Y is obtained by counting the number of data values n_h falling in each I_h. The histogram is then the typical graphical representation of the variable Y.

Data expressed by frequency distributions, or by histograms, are of interest in many fields of research. In particular, we may refer to experimental data that are collected as ranges of values because the measurement instrument gives only approximated (or rounded) values. An example is given by air pollution sensors located in different zones of an urban area: the distributions of the pollutant levels measured across a day make it possible to compare the monitored zones and then to group them into homogeneous clusters.

In a different context of analysis, histograms are the key to understanding digital images. A digital image is basically a mosaic of square tiles, or "pixels", of uniform color which are so tiny that the image appears uniform and smooth. Instead of sorting the pixels by color, they can be sorted into 256 levels of brightness, from black (value 0) to white (value 255), with 254 gray levels in between. The height of each vertical "bar" of the histogram then tells how many pixels there are at that particular brightness level.

In the present paper we propose to analyze data expressed by distributions, as well as by "histograms" of values. The classification of this kind of data can be useful to discover typologies of phenomena on the basis of the similarity of their frequency distributions. Dynamic Clustering (DC) [Did71] is proposed as a suitable method to partition a set of data represented by frequency distributions. We recall that DC needs a proximity function, in order to assign the individuals to the clusters, and a way to represent each cluster by a description that optimizes a representation function. Further, the representation of a cluster, called "prototype", is consistent with the description of the clustered elements: i.e., if the data to be clustered are distributions, then the prototype is also a distribution. According to the nature of the data, we suggest using a distance derived from the Wasserstein metric [GS02].

In Section 2 we outline the general schema of DC. In Section 3, after recalling the definition of histogram data, we present an extension of the Wasserstein distance in order to compare two histogram descriptions. We also prove that it is possible to define an inertia measure among the data that satisfies the Huygens theorem of decomposition of inertia, with the prototypes playing the role of barycenters. In Section 4 we present some results on a climatic dataset. Section 5 reports some concluding remarks.
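To fix ideas before describing the clustering scheme, the following minimal Python sketch (our own illustration; the paper contains no code, and the function name and simulated data are assumptions) builds the histogram description, i.e. the pairs (I_h, π_h), of a single unit, such as one day of readings from an air-pollution sensor:

```python
import numpy as np

def histogram_description(values, bins):
    """Return bin edges and relative frequencies (I_h, pi_h) for the data."""
    counts, edges = np.histogram(values, bins=bins)
    pi = counts / counts.sum()   # empirical frequencies pi_h = n_h / N
    return edges, pi

# e.g. simulated pollutant readings collected across one day
readings = np.random.gamma(shape=2.0, scale=15.0, size=288)  # one per 5 minutes
edges, pi = histogram_description(readings, bins=10)
```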

2 Dynamic clustering algorithm

Let E be a set of n individuals described by p continuous variables Y_j (j = 1, ..., p). A proximity measure δ is a non-negative function defined on each pair of elements of the space of descriptions of E, such that the closer two individuals are, the lower the value assumed by δ. The dynamic clustering algorithm looks for the partition P ∈ P_K of E into K classes, among all the possible partitions P_K, and for the vector L ∈ L_K of K prototypes representing the classes of P, such that the following fitting criterion ∆ between L and P is minimized:

$$\Delta(P^*, L^*) = \min\{\Delta(P, L) \mid P \in P_K,\ L \in L_K\}. \quad (1)$$

Such a criterion is defined as the sum of the dissimilarity or distance measures δ(y_i, G_k) of fit between each element y_i belonging to a class C_k ∈ P and the class representation G_k ∈ L:

$$\Delta(P, L) = \sum_{k=1}^{K} \sum_{y_i \in C_k} \delta(y_i, G_k).$$


A prototype G_k associated with a class C_k is an element of the space of descriptions of E and, in this context, it can be represented as a histogram. The algorithm is initialized by generating K random clusters or, alternatively, K random prototypes. Generally, the criterion ∆(P, L) is based on a distance that is additive over the p descriptors. A similar approach has been proposed by [CDL03], in a different context of analysis.
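A minimal sketch of this alternating allocation/representation scheme is given below. The dissimilarity `delta` and the prototype update `barycenter` are placeholders for the Wasserstein-based choices discussed in the next section, and the stopping rule (a stable partition) is a standard assumption of ours rather than a detail stated in the text:

```python
import random

def dynamic_clustering(data, K, delta, barycenter, max_iter=100):
    """Alternate allocation and representation steps until the
    partition of `data` into K classes is stable."""
    prototypes = random.sample(data, K)      # K random initial prototypes
    assignment = None
    for _ in range(max_iter):
        # Allocation step: each element goes to its closest prototype
        new_assignment = [min(range(K), key=lambda k: delta(y, prototypes[k]))
                          for y in data]
        if new_assignment == assignment:     # partition unchanged: converged
            break
        assignment = new_assignment
        # Representation step: each prototype becomes its class barycenter
        for k in range(K):
            members = [y for y, a in zip(data, assignment) if a == k]
            if members:                      # keep old prototype if class empty
                prototypes[k] = barycenter(members)
    return assignment, prototypes
```

When the barycenter minimizes the within-class sum of δ, each allocation and each representation step can only decrease ∆(P, L), so the scheme converges to a local minimum of criterion (1).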

3 Wasserstein metric for histogram data

Let Y be a continuous variable defined on a finite support $S = [\underline{y}; \overline{y}]$, where $\underline{y}$ and $\overline{y}$ are the minimum and the maximum value of the domain of Y. The domain of Y is supposed partitioned into a set of contiguous intervals (bins) $\{I_1, \ldots, I_h, \ldots, I_H\}$, where $I_h = [\underline{y}_h; \overline{y}_h)$. Given N observations of the variable Y, with each semi-open interval $I_h$ is associated a random variable equal to

$$\Psi(I_h) = \sum_{u=1}^{N} \Psi_{y_u}(I_h), \qquad \text{where } \Psi_{y_u}(I_h) = 1 \text{ if } y_u \in I_h \text{ and } 0 \text{ otherwise}.$$

Thus, it is possible to associate with $I_h$ an empirical frequency $\pi_h = \Psi(I_h)/N$. A histogram of Y is then the graphical representation in which each pair $(I_h, \pi_h)$ (for $h = 1, \ldots, H$) is drawn as a vertical bar whose base is the interval $I_h$ along the horizontal axis and whose area is proportional to $\pi_h$.

Consider E as a set of n empirical distributions Y(i) (i = 1, ..., n). In the case of histogram descriptions it is possible to assume that $S(i) = [\underline{y}_i; \overline{y}_i]$, where $y_i \in$
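Although the derivation above is cut off, the L2 Wasserstein (Mallows) metric on which the proposed distance is based can be written, for two univariate distribution functions $F_1$ and $F_2$, through their quantile functions:

$$d_W(F_1, F_2) = \left( \int_0^1 \bigl( F_1^{-1}(t) - F_2^{-1}(t) \bigr)^2 \, dt \right)^{1/2}.$$

The sketch below approximates this distance for two histogram descriptions; the within-bin uniformity assumption and the midpoint-rule integration are our own simplifications, not details recovered from the paper:

```python
import numpy as np

def histogram_quantile(edges, pi, t):
    """Quantile function F^{-1}(t) of a histogram (I_h, pi_h), assuming
    values are uniformly distributed within each bin."""
    edges = np.asarray(edges, dtype=float)
    pi = np.asarray(pi, dtype=float)
    cum = np.concatenate(([0.0], np.cumsum(pi)))     # cumulative weights
    h = np.clip(np.searchsorted(cum, t, side="right") - 1, 0, len(pi) - 1)
    frac = (t - cum[h]) / np.where(pi[h] > 0, pi[h], 1.0)
    return edges[h] + frac * (edges[h + 1] - edges[h])

def wasserstein_l2(hist1, hist2, grid=1000):
    """Approximate the L2 Wasserstein distance between two histograms
    by a midpoint rule on the squared quantile differences."""
    t = (np.arange(grid) + 0.5) / grid               # midpoints over (0, 1)
    q1 = histogram_quantile(*hist1, t)
    q2 = histogram_quantile(*hist2, t)
    return np.sqrt(np.mean((q1 - q2) ** 2))
```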
