Pyramidal Clustering Algorithms in ISO–3D Project
Oldemar Rodríguez¹ and Edwin Diday¹
¹ University Paris 9 Dauphine, Ceremade. Pl. du Ml. de L. de Tassigny. 75016
Abstract. The pyramidal clustering method generalizes hierarchies by allowing non-disjoint classes at a given level instead of a partition. Moreover, the clusters of the pyramid are intervals of a total order on the set being clustered. [Diday 1984], [Bertrand, Diday 1990] and [Mfoumoune 1998] proposed algorithms that build a pyramid starting from an arbitrary order of the individuals. In this paper we present two new algorithms named CAPS and CAPSO. CAPSO builds a pyramid starting from an order given on the set of individuals (or symbolic objects), while CAPS finds this order. Moreover, these two algorithms can cluster more complex data than the tabular model can handle, by taking into account variation in the values taken by the variables; in this way, our method produces a symbolic pyramid. Each cluster thus formed is defined not only by the set of its elements (i.e. its extent) but also by a symbolic object that describes its properties (i.e. its intent). These two algorithms were implemented in C++ and Java for the ISO–3D project.
1 Definitions
Diday [5, Diday (1984)] proposes the algorithm CAP to build numeric pyramids. Algorithms for the same purpose are also presented in [2, Bertrand and Diday (1990)], [10, Gil (1998)] and [11, Mfoumoune (1998)]. Paula Brito [3, Brito (1991)] proposes a macro-algorithm that generalizes to the symbolic case the algorithm proposed by Bertrand to build numeric pyramids. In this article we propose two algorithms designed to build symbolic pyramids (CAPS and CAPSO), that is to say, pyramids in which each node is again a symbolic object. These algorithms also calculate the extent of each of these symbolic objects and verify its completeness.
Notation:
– Ω is the set of individuals.
– Oj is the description space for the variable j.
– P(Oj) is the set of parts of Oj.
– The description of an individual ω is represented by the vector (y1(ω), . . . , yp(ω)), where each variable yj, j = 1, 2, . . . , p, is a mapping from Ω to P(Oj). The value of yj(ω) can be a set of values, an interval or a histogram, among others.
– Let D = P(O1) × P(O2) × · · · × P(Op) be the set of possible descriptions and d ∈ D a description, so that for every j = 1, 2, . . . , p, dj is a description given as a set of values.
In [9, Diday (1999)] the following definition of symbolic object is presented:
Definition 1. A symbolic object is a triple (a, R, d) where R is a vector of relations Ri, d = (d1, d2, . . . , dp) is a vector of descriptions di, and a is a mapping from Ω to {T, F}.
If in the previous definition we take a(w) = [y1(w) R1 d1] ∧ [y2(w) R2 d2] ∧ · · · ∧ [yp(w) Rp dp], where a(w) = T iff yj(w) Rj dj for all j = 1, 2, . . . , p, then the symbolic object is called an assertion object. If [yj(w) Rj dj] ∈ L = {T, F} for all j = 1, 2, . . . , p, the symbolic object is called a Boolean object, and if [yj(w) Rj dj] ∈ L = [0, 1] for all j = 1, 2, . . . , p, it is called a modal object. For Boolean objects the extent is defined by extΩ(a) = {w ∈ Ω such that a(w) = T}, while for modal symbolic objects the extent of a at level α is defined by extΩ(a, α) = {w ∈ Ω such that a(w) ≥ α}.
The construction of symbolic pyramids requires computing the union of symbolic objects; this operation is defined as follows [7, Diday (1987)].
Definition 2. Let s1 = (a1, R, d1) and s2 = (a2, R, d2) be two symbolic objects. The union of s1 and s2, denoted by s1 ∪ s2, is defined as the union of all the symbolic objects ei such that, for all i, extΩ(ei) ⊇ extΩ(s1) ∪ extΩ(s2).
An important concept in symbolic pyramidal classification is the completeness of a symbolic object. A symbolic object is complete if it describes its extent exhaustively; a formal definition is presented in [3, Brito (1991)].
Definition 3. Let s =
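$\bigwedge_{j=1}^{p} e_j$ be a symbolic object. The degree of generality of $s$ is defined by

$$G(s) = \prod_{j=1}^{p} G(e_j)$$

where

$$G(e_j) = \begin{cases}
\dfrac{|V_j|}{|Y_j|} & \text{if } e_j = [y_j \in V_j],\ V_j \subseteq Y_j \text{ with } Y_j \text{ discrete,} \\[2ex]
\dfrac{\mathrm{length}(V_j)}{\mathrm{length}(Y_j)} & \text{if } e_j = [y_j \in V_j],\ V_j \subseteq Y_j \text{ with } Y_j \text{ continuous,} \\[2ex]
\dfrac{\sum_{h=1}^{k} w_h}{k} & \text{if } e_j = [y_j = \{m_1(w_1), \ldots, m_k(w_k)\}] \text{ is a frequency distribution of the discrete variable } y_j.
\end{cases}$$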
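To make Definitions 1 and 3 concrete, the following Python fragment is a minimal sketch for Boolean assertion objects with set-valued and interval-valued variables. The class name `Assertion`, the helper names, the choice of inclusion (⊆) for every relation Rj, and the toy individuals are all assumptions made for illustration; they are not taken from the CAPS or CAPSO implementations. Only the first two cases of G(ej) are covered.

```python
from dataclasses import dataclass
from fractions import Fraction

# Hypothetical encoding (an assumption of this sketch):
#   discrete description   -> frozenset of categories
#   continuous description -> (low, high) tuple of numbers

def _included(value, desc):
    """Relation R_j, taken here to be inclusion: value is contained in desc."""
    if isinstance(desc, frozenset):
        return value <= desc
    return desc[0] <= value[0] and value[1] <= desc[1]

def _degree(desc, domain):
    """G(e_j): |V_j|/|Y_j| (discrete) or length(V_j)/length(Y_j) (continuous)."""
    if isinstance(desc, frozenset):
        return Fraction(len(desc), len(domain))
    length_v = Fraction(desc[1]) - Fraction(desc[0])
    length_y = Fraction(domain[1]) - Fraction(domain[0])
    return length_v / length_y

@dataclass(frozen=True)
class Assertion:
    d: tuple        # descriptions d_1, ..., d_p
    domains: tuple  # domains Y_1, ..., Y_p

    def a(self, y):
        """a(w) = T iff y_j(w) R_j d_j holds for every j."""
        return all(_included(v, dj) for v, dj in zip(y, self.d))

    def extent(self, individuals):
        """ext(a) = set of individuals w with a(w) = T."""
        return {w for w, y in individuals.items() if self.a(y)}

    def generality(self):
        """G(s) = product over j of G(e_j)."""
        g = Fraction(1)
        for dj, yj in zip(self.d, self.domains):
            g *= _degree(dj, yj)
        return g

# Toy data: one discrete variable (colour) and one interval variable (weight).
colours = frozenset({"red", "green", "blue", "white"})
individuals = {
    "w1": (frozenset({"red"}), (10, 20)),
    "w2": (frozenset({"red", "blue"}), (15, 40)),
    "w3": (frozenset({"green"}), (50, 60)),
}
s = Assertion(d=(frozenset({"red", "blue"}), (10, 40)),
              domains=(colours, (0, 100)))

print(sorted(s.extent(individuals)))  # ['w1', 'w2']
print(s.generality())                 # 3/20, i.e. (2/4) * (30/100)
```

Exact fractions are used so that the product of the per-variable degrees is not affected by floating-point rounding; a floating-point version would work equally well for ranking candidate objects.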
Diday [5, Diday (1984)] generalizes the concept of binary hierarchy to that of pyramid, as the following definitions show.
Definition 4.
– Let θ be a total order on Ω and P a set of non-empty parts of Ω. An element h ∈ P is said to be connected according to the total order θ if every w ∈ Ω that lies between min(h) and max(h) (min(h) θ w θ max(h)) satisfies w ∈ h.
– A total order θ on Ω is compatible with P, a set of parts of Ω, if every element h ∈ P is connected according to the total order θ.
Definition 5. Let Ω be a finite set and P a set of non-empty parts of Ω (called nodes). P is a pyramid if it has the following properties:
1. Ω ∈ P.
2. ∀ w ∈ Ω we have {w} ∈ P (terminal nodes).
3. ∀ (h, h′) ∈ P × P we have h ∩ h′ ∈ P or h ∩ h′ = ∅.
4. There exists a total order θ on Ω compatible with P.
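The four properties of Definition 5 can be checked mechanically on small examples. The sketch below is an illustration only: the function names `is_connected` and `is_pyramid` are hypothetical, and property 4 is tested against one candidate order supplied by the caller (assumed to be a permutation of Ω) rather than by searching for an order, which is the task CAPS addresses.

```python
from itertools import combinations

def is_connected(h, order):
    """Definition 4: h occupies consecutive positions in the total order."""
    positions = sorted(order.index(w) for w in h)
    return positions[-1] - positions[0] + 1 == len(positions)

def is_pyramid(omega, nodes, order):
    """Check Definition 5 for the family `nodes` and a candidate order."""
    family = {frozenset(h) for h in nodes}
    if any(not h for h in family):            # nodes must be non-empty
        return False
    if frozenset(omega) not in family:        # property 1
        return False
    if any(frozenset({w}) not in family for w in omega):  # property 2
        return False
    for h1, h2 in combinations(family, 2):    # property 3
        common = h1 & h2
        if common and common not in family:
            return False
    # property 4, restricted to the supplied order
    return all(is_connected(h, order) for h in family)

omega = [1, 2, 3, 4]
order = [1, 2, 3, 4]
nodes = [{1}, {2}, {3}, {4},
         {1, 2}, {2, 3}, {3, 4},
         {1, 2, 3}, {2, 3, 4},
         {1, 2, 3, 4}]

print(is_pyramid(omega, nodes, order))             # True
print(is_pyramid(omega, nodes + [{1, 3}], order))  # False
```

In the first family the nodes {1, 2} and {2, 3} overlap without being nested, so the family is a pyramid but not a hierarchy. Adding {1, 3} breaks compatibility with the order 1, 2, 3, 4 because 2 lies between its minimum and maximum without belonging to it.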
Definition 6. Let Ω be a finite set of symbolic objects and let P be a set of non-empty parts of Ω (also called nodes). P is a symbolic pyramid if it satisfies the following properties:
1. P is a pyramid.
2. Each node of P has an associated complete symbolic object.
We now present the definitions needed to specify the algorithms. These definitions differ slightly from those presented in [3, Brito (1991)], [2, Bertrand and Diday (1990)] and [11, Mfoumoune (1998)], because all of them are local to the “connected component”. For the following definitions we consider a set P ⊆ P(Ω) (the set of parts of Ω) that is not necessarily a pyramid and is possibly a “pyramid under construction”; by abuse of language we will call every element of P a node.
Definition 7.
– Let C ∈ P. C is called a connected component if there exists a total order ≤C associated with C.
– A node G ∈ P belongs to a connected component C of P if G ⊆ C. We will also say that the total order ≤C associated with C induces a total order ≤G on G in the following way: if x, y ∈ G then x ≤G y ⇔ x ≤C y.
– Let G1 and G2 be nodes of P. Then G1 is interior to G2 if:
• G1 ≠ G2.
• G1 and G2 belong to the same connected component C.
• min(G2 )