In Proceedings of 4th Groningen Intl. Information Tech. Conf. for Students, pp. 103-107, 1997.
A Parallel Algorithm to build Concept Lattice

Patrick Njiwoua and Engelbert Mephu Nguifo
Centre de Recherche en Informatique de Lens, Université d'Artois - IUT de Lens, Rue de l'Université SP 16, 62307 Lens cedex, France.
E-mail: fnjiwoua, [email protected]

Abstract

The growing development of lattice-based machine learning systems faces a hard problem caused by the lattice construction itself. This step has exponential time and space complexity, which dominates the complexity of the rest of the system. We focus on the Galois lattice, a lattice that builds maximal similarities over a binary relation between two sets. In this paper, we propose a parallel algorithm that builds the Galois lattice in time polynomial in the number of rows and columns of the relation's matrix table. This algorithm is based on the same idea as Bordat's sequential one.

1 Introduction

The principal references for lattice theory are Birkhoff [1] and Dilworth [4]. Many other authors have studied this domain of mathematics, which deals with hierarchical systems in general. Lattice theory has been studied intensively, and a body of results at the heart of the subject has been brought out. One reason for this success is that a lattice provides a rich mathematical framework that is natural to understand and that represents the hierarchy inside the specified problem. For these reasons, the past few years have witnessed the development of lattice-based systems in different domains such as machine learning, document retrieval and the social sciences.

Working in the Machine Learning (ML) area, we are interested in finite Galois lattices, which are lattices defined from a Galois connection between two sets. The Galois lattice, also termed "concept lattice" [10], is a mathematical framework that allows embedded classes to be built from a set of objects. The main difficulty with Galois lattice-based systems comes from the lattice construction itself [5]. The useful exhaustiveness of the Galois lattice, when representing the concept hierarchy, makes the lattice size grow exponentially with the data. Some ML systems, such as LEGAL [7], reduce their target to the construction of a join semi-lattice. However, the computational complexity remains exponential, so such systems are hard to use in practice.

With the development of new computer architectures that integrate more than one processor, parallelism has become an increasingly interesting research topic [9]. We consider a parallel program as a collection of cooperating and communicating modules called tasks. The aim is, first, to solve problems faster than the sequential approach, ideally with a speedup equal to the number of processors used, and then to handle larger data by splitting the initial data set. Therefore, a parallel algorithm able to construct the Galois lattice of a binary relation should have an important impact.

Our approach is based on a sequential algorithm for Galois lattice construction (Bordat's), used by several ML systems such as LEGAL [7] for concept formation. We chose Bordat's algorithm [2] because it takes a matrix table as input and proceeds by selecting lattice nodes from a FIFO (First In First Out) ordered list. For the current node, only a sub-matrix table is considered, and the sub-nodes under construction are inserted in the list. A parallel algorithm that treats nodes on different levels at the same time can therefore be derived. The proposed parallel algorithm follows the Multiple Instruction Multiple Data (MIMD) paradigm, in which each task is in charge of a subset of the initial data and may be at a different step of the common algorithm.

This paper is structured as follows: first we describe the parallel algorithm. Second, a proof of its correctness and a study of its theoretical complexity are presented. Finally, concluding remarks and possible extensions to this work are proposed.
2 Galois lattice construction

In this section, we give some definitions about concepts and the Galois lattice. A brief presentation of Bordat's sequential algorithm is followed by a detailed description of our parallel algorithm, which is based on the same idea.
Bordat's algorithm constructs the lattice step by step, as shown in Figure 1. A node is computed by each of its predecessors. Consider a node $(O_k, A_k)$ of the lattice and $I_k = I \cap (O_k \times (A - A_k))$, the restriction of the binary relation $I$ to the two subsets $O_k$ and $A - A_k$. Consider $Col_k \subseteq A - A_k$, the set of labels of a maximal subset in $I_k$. The node $(O_{k_i}, A_{k_i})$ is then created as a sub-node of $(O_k, A_k)$ as follows:

$$O_{k_i} = O_k - \{o \in O_k ;\ \exists a \in Col_k \text{ and } (o, a) \notin I_k\}, \qquad A_{k_i} = A_k \cup Col_k.$$

An upper bound for the lattice size $|L|$ is $2^{\min(n,m)}$, with $n = |O|$ and $m = |A|$ [5]. Consider $L_2 = \{(X, Y) \in L^2 ;\ X \prec Y\}$; $|L_2|$ is the number of edges in the lattice ($|L| \le |L_2|$). Let $T_C$ be the time needed to compute a maximal subset of a given matrix in the worst case; the time complexity of this sequential algorithm is then $|L_2| \cdot T_C$.

For data coming from real domains, the matrix is sometimes huge, so the amount of computation needed for the lattice construction is very large in practice.
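As an illustration that is not part of the original text, the sub-node rule can be applied by hand to the small context of Figure 1, in which object 1 has attributes $a, b$, object 2 has $b, c$ and object 3 has $a, c$. The worked example below starts from the node with extent $\{1, 3\}$ and intent $\{a\}$. It assumes that a "maximal subset" of $I_k$ is a set of columns whose common set of rows is maximal for inclusion; the superscripts $(1)$ and $(2)$ are only labels introduced here to distinguish the two maximal subsets.

```latex
\begin{align*}
(O_k, A_k) &= (\{1,3\}, \{a\}), \qquad A - A_k = \{b, c\},\\
I_k &= I \cap (O_k \times (A - A_k)) = \{(1,b),\ (3,c)\},\\
Col_k^{(1)} &= \{b\}: \quad O_{k_1} = \{1,3\} - \{3\} = \{1\}, \quad A_{k_1} = \{a\} \cup \{b\} = \{a,b\},\\
Col_k^{(2)} &= \{c\}: \quad O_{k_2} = \{1,3\} - \{1\} = \{3\}, \quad A_{k_2} = \{a\} \cup \{c\} = \{a,c\}.
\end{align*}
```

The two results correspond to the nodes labelled "ab, 1" and "ac, 3" in Figure 1.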
2.1 General points
Consider two sets $O$ and $A$, and a binary relation $I \subseteq O \times A$ between these two sets. $C = (O, A, I)$ is called a context. We define the two mappings $f : O \to \mathcal{P}(A)$ and $g : A \to \mathcal{P}(O)$ as follows:

$$f(o) = \{a \in A ;\ (o, a) \in I\}, \qquad g(a) = \{o \in O ;\ (o, a) \in I\}.$$

$f$ and $g$ are extended to $O_1 \subseteq O$ and $A_1 \subseteq A$ by the following formulas:

$$f(O_1) = \bigcap_{o \in O_1} f(o), \qquad g(A_1) = \bigcap_{a \in A_1} g(a).$$

The pair $(f, g)$ forms a Galois connection between the two power sets over $O$ and $A$.

The pair $(O_1, A_1)$ is a node (concept) if $f(O_1) = A_1$ and $g(A_1) = O_1$. $O_1$ (resp. $A_1$) is called the extent (resp. the intent) of the node $(O_1, A_1)$.
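The definitions above translate almost directly into code. The following Python sketch is only an illustration: the context is the one of Figure 1, read off the node labels of that figure, and the function name `is_node` is introduced here for convenience. The intersection over an empty set is taken to be the whole of $A$ (resp. $O$), which is the usual convention.

```python
# Context of Figure 1 as a set of (object, attribute) pairs.
# Assumption: the pairs are read off the node labels of the figure.
O = {1, 2, 3}
A = {"a", "b", "c"}
I = {(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a"), (3, "c")}


def f(objs):
    """Attributes shared by every object in objs (all of A if objs is empty)."""
    return {a for a in A if all((o, a) in I for o in objs)}


def g(attrs):
    """Objects having every attribute in attrs (all of O if attrs is empty)."""
    return {o for o in O if all((o, a) in I for a in attrs)}


def is_node(objs, attrs):
    """True if (objs, attrs) is a node (concept): f(objs) = attrs and g(attrs) = objs."""
    return f(objs) == set(attrs) and g(attrs) == set(objs)


print(sorted(f({1, 3})))        # ['a']
print(sorted(g({"a"})))         # [1, 3]
print(is_node({1, 3}, {"a"}))   # True
print(is_node({1}, {"a"}))      # False, because f({1}) = {a, b}
```

The last call shows why $(\{1\}, \{a\})$ is not a node: the extent $\{1\}$ shares the larger intent $\{a, b\}$.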
An order relation (denoted $\le$) is defined over the set of all nodes. For two nodes $(O_1, A_1)$ and $(O_2, A_2)$, $(O_1, A_1) \le (O_2, A_2) \iff O_1 \subseteq O_2$ (i.e. $A_1 \supseteq A_2$).

With $\le$, the set of all nodes has the mathematical structure of a complete lattice (i.e. any two nodes of the lattice have a common super-node and a common sub-node) and is called the Galois lattice $L(C)$ of the context $C$.

    O\A   a   b   c
    1     1   1   0
    2     0   1   1
    3     1   0   1

    Lattice nodes, written "intent, extent", from top to bottom:
                ($\emptyset$, 123)
    step 1:     (a, 13)    (b, 12)    (c, 23)
    step 2:     (ab, 1)    (ac, 3)    (bc, 2)
    step 3:     (abc, $\emptyset$)

Figure 1: A matrix context $C = (O, A, I)$ and its complete Galois lattice. A rectangle represents a node and lines between rectangles denote the top-down order relation.
2.2 A sequential algorithm
We present the sequential algorithm proposed by Bordat [2] to build a Galois lattice. Other sequential algorithms for non-incremental construction of the Galois lattice have been reported (see for example [8]), among which only Bordat's integrates the order relation between nodes. Recently, two incremental algorithms have been published ([3], [6]) which use different approaches to build the lattice. When constructing the lattice $L$, Bordat proceeds by top-down specialization, beginning with the most general node. For each node, the algorithm computes all its sub-nodes. The algorithm scheme is the following ($F$ denotes the FIFO list):
    Begin
        $F \leftarrow (O, \emptyset)$
        Repeat
            Remove $(O_k, A_k)$ from $F$ and insert it in $L$
            Compute $C_k$, the set of sub-nodes of $(O_k, A_k)$
            Insert each newly created node in $F$
        Until $F$ is empty
    End

2.3 A parallel algorithm

A detailed description of our parallel algorithm to build the Galois lattice follows. Bordat's algorithm above brings out some facts. The input is the binary matrix table $I$. For a given node, only a part of $I$ needs to be considered to compute its sub-nodes. The idea of the algorithm is to repeatedly compute all sub-nodes of the current new node. The program ends when there are no more new nodes to consider.
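The sequential scheme with the FIFO list $F$ can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `bordat_lattice` and its helpers are hypothetical, the sub-nodes of a node are taken to be the maximal (for inclusion) restrictions of the remaining columns to the node's extent, and the starting intent is computed as $f(O)$, which equals $\emptyset$ for the context of Figure 1.

```python
from collections import deque


def bordat_lattice(objects, attributes, relation):
    """Return (nodes, edges); a node is a pair (extent, intent) of frozensets."""

    def column(attr, within):
        # Rows of `within` that carry attribute `attr`.
        return frozenset(o for o in within if (o, attr) in relation)

    def sub_nodes(extent, intent):
        rest = [a for a in attributes if a not in intent]
        candidates = {column(a, extent) for a in rest}
        # Keep only the candidate extents that are maximal for inclusion.
        maximal = [e for e in candidates
                   if not any(e < other for other in candidates)]
        children = []
        for e in maximal:
            col = {a for a in rest if e <= column(a, extent)}
            children.append((e, frozenset(intent | col)))
        return children

    top_extent = frozenset(objects)
    top_intent = frozenset(
        a for a in attributes if all((o, a) in relation for o in objects)
    )
    start = (top_extent, top_intent)

    fifo = deque([start])   # the FIFO list F
    seen = {start}
    nodes, edges = [], []
    while fifo:
        node = fifo.popleft()
        nodes.append(node)              # insert the node in L
        for child in sub_nodes(*node):
            edges.append((node, child))
            if child not in seen:       # only new nodes enter F
                seen.add(child)
                fifo.append(child)
    return nodes, edges


# Context of Figure 1 (assumed from the node labels of the figure).
I = {(1, "a"), (1, "b"), (2, "b"), (2, "c"), (3, "a"), (3, "c")}
nodes, edges = bordat_lattice({1, 2, 3}, {"a", "b", "c"}, I)
print(len(nodes), len(edges))  # 8 12
```

On this context the sketch produces the eight nodes of Figure 1. The `seen` set plays the role of the check that a computed sub-node has not already been created, which is the part that becomes delicate once sub-nodes are computed by separate processes.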
Since the computation of all sub-nodes of a given node is an "independent task" (call it Subnode), a simple scheme for the parallel algorithm is derived:

    Input: The binary matrix table $I$.
    Output: The Galois lattice of $I$.

    Subroutine Subnode($O_k$, $A_k$, $I_k$)
    Begin
        Insert $(O_k, A_k)$ in $L$
        Compute the maximal subsets of $I_k$
        If a new node $(O_{k_i}, A_{k_i})$ is created, spawn Subnode($O_{k_i}$, $A_{k_i}$, $I_{k_i}$)
    End

The main program consists of spawning Subnode($O$, $\emptyset$, $I$). Each process has to verify that the sub-node it has just computed has not already been created. This is the major difficulty in our algorithm, since maintaining distributed lists of already created nodes would dramatically increase the space and time complexity of the algorithm. Our solution is therefore an ordered list updated by a special process (the first one, with number $id_0$). A node has to send a request to this process before spawning a created sub-node. Below is a detailed version of the parallel algorithm.

    Subroutine Subnode($O_k$, $A_k$, $I_k$, $id_0$)
    Begin
        newnode $\leftarrow$ ok
        If $A_k \neq \emptyset$ Then
            Receive(newnode)
        If newnode = ok Then
            For each $Col_i$, maximal set of labels in $I_k$
                Construct $O_{k_i}$, $A_{k_i}$, $I_{k_i}$
                If $O_{k_i} \neq \emptyset$ Then
                    id $\leftarrow$ Spawn Subnode($O_{k_i}$, $A_{k_i}$, $I_{k_i}$, $id_0$)
                    Send(id, $A_{k_i}$; $id_0$)
    End

    Algorithm ParGaL($O$, $A$, $I$)
    Begin
        For $i = 1$ to $m$
            $L[i] \leftarrow \emptyset$
        Spawn Subnode($O$, $\emptyset$, $I$, $id_0$)
        Repeat
            Receive(origin, $A_k$)
            newnode $\leftarrow$ ok
            size $\leftarrow |A_k|$
            If Insert($L[size]$, $A_k$) = False Then
                newnode $\leftarrow$ no
            Send(newnode, origin)
        Until no more process executing Subnode
    End

$L$ is a vector such that all nodes with an intent size equal to $i$ are stored in the ordered list $L[i]$. We assume that only the Receive procedure for message reception is blocking in the whole program; all other message passing is non-blocking. The first process (with label $id_0$) stores nodes according to their intent size. The procedure Insert($L[size]$, $A_k$) returns the value "False" if the node with intent $A_k$ has already been created; otherwise $A_k$ is inserted in $L[|A_k|]$.

2.4 Parallel algorithm correctness

Theorem 1: ParGaL constructs the same lattice as the previous sequential one.

Proof. Each node computes all its sub-nodes, as shown by the subroutine Subnode. $(O, \emptyset)$ is a node of the lattice constructed both by ParGaL and by Bordat's algorithm. Consider $(O_k, A_k)$, $O_k \subseteq O$, $A_k \subseteq A$, a node computed by the sequential algorithm. There is a totally ordered path (a chain) $(O, \emptyset) \ge (O_1, A_1) \ge \dots \ge (O_{k-1}, A_{k-1}) \ge (O_k, A_k)$ of length $k$ from node $(O, \emptyset)$ to node $(O_k, A_k)$, since we have $(O, \emptyset) \ge (O_k, A_k)$.
- If $(O, \emptyset)$ is the direct predecessor of $(O_k, A_k)$, then $(O_k, A_k)$ is constructed by Subnode($O$, $\emptyset$, $I$, $id_0$).
- Assume that all nodes $(O_i, A_i)$ such that there is a path of length $i$ ($i < k$) from $(O, \emptyset)$ to $(O_i, A_i)$ are constructed by ParGaL. Then the node $(O_{k-1}, A_{k-1})$ is in the lattice. Since $(O_k, A_k)$ is a sub-node of $(O_{k-1}, A_{k-1})$, $(O_k, A_k)$ is constructed by a call to Subnode($O_{k-1}$, $A_{k-1}$, $I_{k-1}$, $id_0$).
Since ParGaL uses the same sequential process as Bordat's algorithm to compute maximal subsets of a binary table, all nodes inserted in the lattice by the parallel algorithm are also created by the sequential one.

Theorem 2: The program ParGaL always ends.

Proof. The subroutine Subnode always ends. Subnode first has to wait until the response to the request message sent by its predecessor arrives. If the message is "ok" (the current node is a new one), it begins constructing its sub-nodes; otherwise it stops. Since the main program treats all messages sequentially, the response to Subnode will be sent. For a given matrix, there is a finite number of maximal subsets, so after computing all of them, Subnode ends. When no more processes are executing the Subnode code, the program ParGaL ends.

Remarks:

The previous algorithm has one principal drawback: its execution time depends on the first process, which treats all messages "sequentially". To avoid this problem, some modifications are proposed:

1. Each message is treated according to the intent size of the node to be created. A different process is assigned to each size.
2. The first process (label $id_0$), after spawning the process Subnode($O$, $\emptyset$, $I$, $id_0$), only has to redirect messages from Subnode processes to the process which treats the node creation requests for the given size.

The modified ParGaL algorithm follows (there is no change in subroutine Subnode):
    Algorithm ParGaL($O$, $A$, $I$)
    Begin
        For $i = 1$ to $m$
            $P[i] \leftarrow$ False
        Spawn Subnode($O$, $\emptyset$, $I$, $id_0$)
        Repeat
            Receive(origin, $A_k$)
            If $P[|A_k|]$ = False Then
                Send(ok, origin)
                $P[|A_k|] \leftarrow$ Create-size($A_k$, $id_0$)
            Else
                Send(origin, $A_k$; $P[|A_k|]$)
        Until no more process executing Subnode
        For $i = 1$ to $m$
            Get the node list of process $P[i]$
    End

The process $id_0$ only has to redirect messages. The structure $P$ is used here to store the ids of the processes which have to treat messages. The algorithm for these special processes is given below.

    Subroutine Create-size($A_k$, $id_0$)
    Begin
        $L \leftarrow A_k$
        Repeat
            Receive(origin, $A_k$)
            newnode $\leftarrow$ ok
            If Insert($L$, $A_k$) = False Then
                newnode $\leftarrow$ no
            Send(newnode, origin)
        Until "Get message" from process $id_0$
    End

3 Complexity analysis

In this section, we study the theoretical complexity of ParGaL. Let $T_C$ be the time needed to compute a maximal subset of a given matrix in the worst case, $w \le \min(n, m)$ the lattice width (the maximum size of an anti-chain in $(L, \le)$), $h \le \min(n, m)$ the lattice height (the maximum size of a chain in $(L, \le)$), and $\varepsilon$ the initialization time needed by a child process before executing the code of the program Subnode.

Theorem 3: The time complexity of ParGaL is $h\,T_C + (w - 1)(T_C + \varepsilon)$.

Proof. The main process (with label $id_0$) treats each message in constant time, since it only has to make a redirection after checking the intent size of the node to create. $L$ (the lattice) is an ordered set which contains exactly $w$ chains of length $h$. Therefore each node in a maximal anti-chain is an element of a maximal chain. Each pair of nodes in the lattice has a common super-node. Therefore two distinct chains have an intersection node. When this node sequentially computes its sub-nodes, it takes $T_C$ time for each of them. The computed sub-node number $i$ is then spawned $(i - 1)(T_C + \varepsilon)$ time after the first one. The algorithm consists of a parallel construction of all maximal chains. The construction of the first one takes $h\,T_C$ time. The second maximal chain ends $T_C + \varepsilon$ time after the first, and so on; the construction of the last chain is finished $(w - 1)(T_C + \varepsilon)$ time after the first one. The total time complexity is therefore $h\,T_C + (w - 1)(T_C + \varepsilon)$.

To compute a maximal subset of a given matrix $I \subseteq O \times A$, one has to compare (in the worst case) all pairs of columns of $I$. $T_C$ is then of the order of $|O| \times |A|$, and the lattice construction time is polynomial in the sizes of the two sets $O$ and $A$. Figure 2 shows a decomposition of the execution of ParGaL on the example lattice of Figure 1. Each step takes $T_C$ time and all new nodes in a step are computed in parallel.

4 Conclusion

Lattices, and especially Galois lattices, represent an interesting mathematical framework. They are used in many domains, for example in Artificial Intelligence for machine learning systems based on similarity detection. Their principal drawback comes from the fact that the lattice size is exponential in the matrix size. To address this problem, a parallel algorithm for Galois lattice construction (ParGaL) is proposed and its theoretical complexity is studied. The main result concerns the lattice construction time, which is polynomial in the number of rows and columns of the matrix table. Forthcoming work will deal with the speedup of the ParGaL algorithm. The algorithm is currently being implemented using the Parallel Virtual Machine (PVM) architecture, and experiments will be made on large real-world data. Parallelizing an excellent sequential algorithm does not guarantee good results, since communication management and scheduling problems arise. Ongoing research is analyzing the usefulness of such a system in machine learning.
Acknowledgements

Financial support for this work has been partially made available by the Ganymede project of the Contrat de Plan Etat Nord-Pas-de-Calais. One of the authors, Patrick Njiwoua, is financially supported by the Association Internationale pour le Développement Intégré de l'Afrique (AIDIA). The authors thank Stephane Allanos and Renaud Eliet for helpful technical discussions.
References

[1] G. Birkhoff. Lattice Theory. American Mathematical Society, Providence, R.I., 3rd edition, 1967.
[2] J. P. Bordat. Calcul pratique du treillis de Galois d'une correspondance. Math. et Sci. Humaines, 24ème année, 96:31-47, 1986.
[3] C. Carpineto and G. Romano. A Lattice Conceptual Clustering System and its Application to Browsing Retrieval. Machine Learning, 24:95-122, 1996.
[4] R. Dilworth. A decomposition theorem for partially ordered sets. Ann. of Math., 2(51):161-166, 1950.
[5] R. Godin. Complexité de structures de treillis. Ann. sc. math., Québec, 13(1):19-38, 1989.
[6] R. Godin, R. Missaoui, and H. Alaoui. Learning Algorithms using a Galois Lattice Structure. In Proc. of the IEEE Intl. Conf. on Tools for AI, pages 22-29, San Jose, CA, November 1991.
[7] E. Mephu Nguifo. Galois Lattice: A framework for Concept Learning. Design, Evaluation and Refinement. In Proc. of the sixth International Conf. on Tools with Artificial Intelligence, pages 461-467, New Orleans, Louisiana, LA, November 6-9 1994. IEEE Press.
[8] M. Sahami. Learning Classification Rules Using Lattices. In Machine Learning: ECML-95, pages 343-346, Heraclion, Crete, Greece, April 1995. Nada Lavrac and Stefan Wrobel eds.
[9] M. Tchuente. Parallel Computation on Regular Arrays. Algorithms and Architectures for Advanced Scientific Computing. Manchester University Press, 1991.
[10] R. Wille. Concept Lattices & Conceptual Knowledge Systems. Computer Mathematic Applied, 23(6-9):493-515, 1992.

    Panels of Figure 2 (nodes written "intent, extent"; each panel is labelled with the elapsed time):
    $T_C + \varepsilon$:      ($\emptyset$, 123); (a, 13)
    $2\,T_C + 2\,\varepsilon$:  ($\emptyset$, 123); (a, 13), (b, 12); (ab, 1)
    $3\,T_C + 2\,\varepsilon$:  ($\emptyset$, 123); (a, 13), (c, 23), (b, 12); (ac, 3), (ab, 1); (abc, $\emptyset$)
    $4\,T_C + 2\,\varepsilon$:  ($\emptyset$, 123); (a, 13), (c, 23), (b, 12); (bc, 2), (ac, 3), (ab, 1); (abc, $\emptyset$)
    $5\,T_C + 2\,\varepsilon$:  ($\emptyset$, 123); (a, 13), (c, 23), (b, 12); (bc, 2), (ac, 3), (ab, 1); (abc, $\emptyset$)

Figure 2: Steps of the parallel construction for the Galois lattice of Figure 1. The sequential algorithm will take at least $12\,T_C$ time.
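The figures quoted in this caption can be checked with a short computation. The Python sketch below is an illustration only: it takes the eight intents of Figure 1, counts the edges of the lattice, and evaluates the two time formulas of the paper ($|L_2| \cdot T_C$ for the sequential algorithm and $h\,T_C + (w - 1)(T_C + \varepsilon)$ for ParGaL). It assumes that $h$ is counted in edges along a longest chain, which is the reading that reproduces the last label of Figure 2; the function names and the integer time units are arbitrary choices made here.

```python
from itertools import combinations

# Intents of the eight nodes of Figure 1; nodes are ordered by inclusion of intents.
intents = [frozenset(s) for s in ["", "a", "b", "c", "ab", "ac", "bc", "abc"]]


def covers(x, y):
    """True if y is a direct sub-node of x (nothing lies strictly between them)."""
    return x < y and not any(x < z < y for z in intents)


edges = [(x, y) for x in intents for y in intents if covers(x, y)]


def longest_chain_edges(x):
    """Number of edges on a longest downward chain starting at x."""
    below = [y for (p, y) in edges if p == x]
    return 0 if not below else 1 + max(longest_chain_edges(y) for y in below)


def is_antichain(group):
    return all(not (x <= y or y <= x) for x, y in combinations(group, 2))


# Assumption: h is the number of edges on a longest chain from the top node.
h = longest_chain_edges(frozenset())
# Width: size of a largest anti-chain, found by brute force (8 nodes only).
w = max(len(group)
        for r in range(1, len(intents) + 1)
        for group in combinations(intents, r)
        if is_antichain(group))


def pargal_time(h, w, tc, eps):
    """Bound of Theorem 3: h*T_C + (w - 1)*(T_C + eps)."""
    return h * tc + (w - 1) * (tc + eps)


def sequential_time(n_edges, tc):
    """Sequential cost |L_2| * T_C."""
    return n_edges * tc


print(len(edges), h, w)                    # 12 3 3
print(pargal_time(h, w, tc=10, eps=1))     # 52, i.e. 5*T_C + 2*eps
print(sequential_time(len(edges), tc=10))  # 120, i.e. 12*T_C
```

With these values the parallel bound is $5\,T_C + 2\,\varepsilon$ against $12\,T_C$ for the sequential algorithm, in agreement with the last panel of Figure 2 and with its caption.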