A generic scheme for the design of efficient on-line algorithms for lattices

Petko Valtchev¹, Mohamed Rouane Hacene¹, and Rokia Missaoui²

¹ DIRO, Université de Montréal, C.P. 6128, Succ. “Centre-Ville”, Montréal, Québec, Canada, H3C 3J7
² Département d’informatique et d’ingénierie, UQO, C.P. 1250, succursale B, Gatineau, Québec, Canada, J8X 3X7

Abstract. A major issue with large dynamic datasets is the processing of small changes in the input through correspondingly small rearrangements of the output. This was the motivation behind the design of incremental or on-line algorithms for lattice maintenance, whose work amounts to a gradual construction of the final lattice by repeatedly adding rows/columns to the data table. As an attempt to put the incremental trend on strong theoretical grounds, we present a generic algorithmic scheme that is based on a detailed analysis of the lattice transformation triggered by a row/column addition and of the underlying sub-structure. For each task from the scheme we suggest an efficient implementation strategy and establish a lower bound on its worst-case complexity. Moreover, an instantiation of the incremental scheme is presented whose complexity matches that of the best batch algorithm.

1 Introduction

Formal concept analysis (FCA) [5] studies the lattice structures built on top of binary relations (called concept lattices or Galois lattices as in [1]). As a matter of fact, the underlying algorithmic techniques are increasingly used in the resolution of practical problems from software engineering [6], data mining [7] and information retrieval [3]. Our study investigates the new algorithmic problems related to the analysis of volatile data sets. As a particular case, on-line or incremental lattice algorithms, as described in [8, 3], basically maintain lattice structures upon the insertion of a new row/column into the binary table. Thus, given a binary relation K with its corresponding lattice L and a new row/column o, the lattice L+ corresponding to the augmented relation K+ = K ∪ {o} is computed. Most of the existing on-line algorithms have been designed with practical concerns in mind, e.g., the efficient handling of large but sparse binary tables [8], and therefore prove inefficient whenever data sets get denser [9]. Here, we explore the suborder of L+ made up of all nodes that are new with respect to L and use an isomorphic suborder of L (the generators of the new nodes) as a guideline for the completion of L into L+. Structural properties of the latter suborder underlie the design of a generic completion scheme, i.e., a sequence of steps that can be separately examined for efficient implementations. As a first offspring of the scheme, we describe a novel on-line algorithm that relies both on insights into the generator suborder and on some cardinality-based reasoning, while bringing the overall cost of lattice construction by subsequent completions down to the current lower bound for batch construction.

The paper starts by recalling some basic FCA results (Section 2) and fundamentals of lattice construction (Section 3). The structure of the generator/new suborders in the initial/target lattice, respectively, is then examined (Section 4). Next, a generic scheme for lattice completion is sketched and, for each task of the scheme, implementation directions are discussed (Section 5). Finally, the paper presents an effective algorithm for lattice maintenance and clarifies its worst-case complexity (Section 6).

2 Formal concept analysis background

FCA [5] studies the partially ordered structure, known under the names of Galois lattice [1] or concept lattice, which is induced by a binary relation over a pair of sets O (objects) and A (attributes).

Definition 1. A formal context is a triple K = (O, A, I) where O and A are sets and I is a binary (incidence) relation, i.e., I ⊆ O × A.

Within a context (see Figure 1 on the left), objects are denoted by numbers and attributes by lowercase letters. Two functions, f and g, summarize the context-related links between objects and attributes.

Definition 2. The function f maps a set of objects into the set of their common attributes, whereas g (hereafter, both f and g are denoted by ′) is the dual for attribute sets:
– f : P(O) → P(A), f(X) = X′ = {a ∈ A | ∀o ∈ X, oIa}
– g : P(A) → P(O), g(Y) = Y′ = {o ∈ O | ∀a ∈ Y, oIa}

For example, f(14) = fgh (we use a separator-free form for sets, e.g., 127 stands for {1, 2, 7} and g(abc) = 127 w.r.t. the table K in Figure 1, on the left, and ab for {a, b}). Furthermore, the compound operators g ◦ f(X) and f ◦ g(Y) are closure operators over P(O) and P(A), respectively. Thus, each of them induces a family of closed subsets, called C^o and C^a respectively, with f and g as bijective mappings between both families. A couple (X, Y) of mutually corresponding closed subsets is called a (formal) concept.

Definition 3. A formal concept is a couple (X, Y) where X ∈ P(O), Y ∈ P(A), X = Y′ and Y = X′. X is called the extent and Y the intent of the concept.

Thus, (178, bcd) is a concept, but (16, efh) is not. Moreover, the set C_K of all concepts of the context K = (O, A, I) is partially ordered by intent/extent inclusion:
(X_1, Y_1) ≤_K (X_2, Y_2) ⇔ X_1 ⊆ X_2 (Y_2 ⊆ Y_1).
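To make the two derivation operators and the induced closures concrete, here is a minimal Python sketch. It encodes a context as a set of (object, attribute) pairs; the tiny relation used below is purely illustrative (it is not the context of Figure 1), and all names are ours rather than the paper's.

def f(incidence, attributes, X):
    """Attributes shared by every object in X (written X′ in the paper)."""
    return frozenset(a for a in attributes if all((o, a) in incidence for o in X))

def g(incidence, objects, Y):
    """Objects possessing every attribute in Y (written Y′ in the paper)."""
    return frozenset(o for o in objects if all((o, a) in incidence for a in Y))

def is_concept(incidence, objects, attributes, X, Y):
    # (X, Y) is a formal concept iff X′ = Y and Y′ = X.
    return f(incidence, attributes, X) == frozenset(Y) and \
           g(incidence, objects, Y) == frozenset(X)

# Tiny illustrative context (not the one of Figure 1).
O = {1, 2, 3}
A = {'a', 'b', 'c'}
I = {(1, 'a'), (1, 'b'), (2, 'b'), (2, 'c'), (3, 'b')}

print(f(I, A, {1, 2}))                        # frozenset({'b'})
print(is_concept(I, O, A, {1, 2, 3}, {'b'}))  # True: ({1,2,3}, {b}) is a concept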

Fig. 1. Left: Binary table K = (O = {1, 2, ..., 8}, A = {a, b, ..., h}, R) and the object 9. Right: The Hasse diagram of the lattice derived from K.

Theorem 1. The partial order L = ⟨C_K, ≤_K⟩ is a complete lattice with joins and meets as follows:
– ⋁_{i=1}^{k} (X_i, Y_i) = ((⋃_{i=1}^{k} X_i)″, ⋂_{i=1}^{k} Y_i),
– ⋀_{i=1}^{k} (X_i, Y_i) = (⋂_{i=1}^{k} X_i, (⋃_{i=1}^{k} Y_i)″).

The Hasse diagram of the lattice L drawn from K = ({1, 2, ..., 8}, A, R) is shown on the right-hand side of Figure 1, where intents and extents are drawn on both sides of a node representing a concept. For example, the join and the meet of c#6 = (26, ac) and c#3 = (13678, d) are (12345678, ∅) and (6, abcd), respectively.
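As an illustration of the theorem, joins and meets can be computed directly from the formulas. The following sketch reuses the f and g helpers above (our own naming; concepts are (extent, intent) pairs of frozensets, and (X)″ on an object set is obtained as g(f(X))):

def join(incidence, objects, attributes, c1, c2):
    (X1, Y1), (X2, Y2) = c1, c2
    X = X1 | X2
    # ((X1 ∪ X2)″, Y1 ∩ Y2)
    return (g(incidence, objects, f(incidence, attributes, X)), Y1 & Y2)

def meet(incidence, objects, attributes, c1, c2):
    (X1, Y1), (X2, Y2) = c1, c2
    Y = Y1 | Y2
    # (X1 ∩ X2, (Y1 ∪ Y2)″)
    return (X1 & X2, f(incidence, attributes, g(incidence, objects, Y)))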

3 Constructing the lattice efficiently

A variety of efficient algorithms exists for constructing the concept set or the entire concept lattice of a context (see [9] for a detailed study). As we are interested in incremental algorithms as opposed to batch ones, we consider those two groups separately.

3.1 Batch approaches

The construction of a Galois lattice may be carried out with two different levels of output structuring. Indeed, one may look only for the set C of all concepts of a given context K without any hierarchical organization:

Problem Compute-Concepts
Given: a context K = (O, A, I),
Find: the set C of all concepts from K.

An early FCA algorithm has been suggested by Ganter [4], based on a particular order among concepts that helps avoid computing a given concept more than once. However, of greater interest to us are algorithms that not only discover C but also infer the lattice order ≤, i.e., construct the entire lattice L. This more complex problem may be formalized as follows:

Problem Compute-Lattice
Given: a context K = (O, A, I),
Find: the lattice L = ⟨C, ≤⟩ corresponding to K.

Batch algorithms for the Compute-Lattice problem have been proposed first by Bordat [2] and later on by Nourine and Raynaud [10]. The former algorithm relies on structural properties of the precedence relation in L to generate the concepts in an appropriate order. Thus, from each concept the algorithm generates its upper covers, which means that a concept will be generated a number of times that corresponds to the number of its lower covers. Recently, Nourine and Raynaud suggested an efficient procedure for constructing a family of open sets and showed how it may be used to construct the lattice (see Section 5.4).

There is a known difficulty in estimating the complexity of lattice construction algorithms uniquely with respect to the size of the input data. Actually, there is no known bound (other than the trivial one, i.e., the number of all subsets of O or A) on the number of concepts depending on the dimensions of the binary relation, i.e., the size of the object set, of the attribute set, or of the binary relation. Even worse, it has been recently proven that the problem of estimating the size of L from K is #P-complete. For the above reasons, it is customary to include the size of the result, i.e., the number of concepts, in the complexity estimation. Thus, with |L| as a factor, the worst-case complexity of the classical algorithms solving Compute-Concepts is O((k + m)lkm), where l = |L|, k = |O|, and m = |A|. The algorithm of Bordat can be assessed to be of complexity O((k + m)l|I|), where the size of the binary relation (i.e., the number of positive entries in K) is taken into account. Finally, the work of Nourine and Raynaud has helped reduce the complexity order of the problem to O((k + m)lk).

3.2 Incremental approaches

On-line or incremental algorithms do not actually construct the lattice, but rather maintain its integrity upon the insertion of a new object/attribute into the context:

Problem Compute-Lattice-Inc
Given: a context K = (O, A, I) with its lattice L and an object o,
Find: the lattice L+ corresponding to K+ = (O ∪ {o}, A, I ∪ {o} × {o}′).

Obviously, the problem Compute-Lattice may be polynomially reduced to Compute-Lattice-Inc by iterating Compute-Lattice-Inc over the entire set O (or A). In other words, an (extended) incremental method can construct the lattice L starting from a single object o_1 and gradually incorporating any new object o_i (on its arrival) into the lattice L_{i−1} (over a context K = ({o_1, ..., o_{i−1}}, A, I)), each time carrying out a set of structural updates.

Godin et al. [8] suggested an incremental procedure which locally modifies the lattice structure (insertion of new concepts, completion of existing ones, deletion of redundant links, etc.) while keeping large parts of the lattice untouched. The basic approach follows a fundamental property of the Galois connection established by f and g on (P(O), P(A)): both families C^o and C^a are closed under intersection [1]. Thus, the whole insertion process is aimed at the integration into L_{i−1} of all concepts whose intents correspond to intersections of {o_i}′ with intents from C^a_{i−1} that are not themselves in C^a_{i−1}. These additional concepts (further called new concepts, collected in N+(o)) are inserted into the lattice at a particular place, i.e., each new concept is preceded by a specific counterpart from the initial lattice, called its generator (the set of generators is denoted G(o)). Two other categories of concepts in L = L_{i−1} are distinguished: modified concepts (M(o)) correspond to intersections of {o_i}′ with members of C^a_{i−1} that already exist in C^a_{i−1}, while the remaining concepts of the initial lattice are called old or unchanged. In the final lattice L+ = L_i, the old concepts preserve all their characteristics, i.e., intent, extent, as well as upper and lower covers. Generators do not experience changes in their information content, i.e., intent and extent, but a new concept is added to their upper covers. In a modified concept, the extent is augmented by the new object o while, in the set of its lower covers, any generator is replaced by the corresponding new concept. In the next sections, we shall stick to this intuitive terminology, but we shall put it on a formal ground while distinguishing the sets of concepts in the initial lattice (M(o) and G(o)) from their counterparts in the final one (M+(o) and G+(o), respectively).

Example 1 (Insertion of object 9). Assume L is the lattice induced by the object set 12345678 (see Figure 1 on the right) and consider 9 as the new object. The set of unchanged concepts has two elements, {c#6, c#10}, whereas the sets of modified concepts and of generators are M(o) = {c#1, c#2, c#3, c#4, c#5, c#8} and G(o) = {c#7, c#9, c#11, c#12, c#13}, respectively. The result of the whole operation is the lattice L+ in Figure 2. Thus, the set of the new concept intents is: {cd, fh, cdgh, dfgh, cdfgh}.

Another incremental algorithm for lattice construction has been suggested by Carpineto and Romano [3]. In a recent paper [11], we generalized the incremental approach of Godin et al. For this purpose, we applied some structural results from the lattice assembly framework defined in [14]. In particular, we showed that the incremental problem Compute-Lattice-Inc is a special case of the more general lattice assembly problem Assembly-Lattice. More recently, we have presented a theoretical framework that clarifies the restructuring involved in the resolution of Compute-Lattice-Inc [13] and further enables the design of procedures that explore only a part of the lattice L (see Section 6). In the next section, we recall the basic results from our framework.

Fig. 2. The Hasse diagram of the concept (Galois) lattice derived from K with O = {1, 2, 3, ..., 9}.
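The classification just described reduces to intent intersections, so it can be computed without touching the lattice order. The following Python sketch (our own naming, not code from the paper; concepts are (extent, intent) pairs of frozensets) returns the modified concepts and the generators of L for a new object whose attribute set is o_prime:

def classify(concepts, o_prime):
    # Sorting by intent size lists each class maximum before the other members
    # of its class, so the first concept producing a given intersection q is
    # the maximum of the class associated with q.
    seen = set()
    modified, generators = [], []
    for extent, intent in sorted(concepts, key=lambda c: len(c[1])):
        q = intent & o_prime                     # Q(c) = Intent(c) ∩ {o}′
        if q in seen:
            continue                             # not the maximum of its class
        seen.add(q)
        if q == intent:
            modified.append((extent, intent))    # Intent(c) ⊆ {o}′: extent gains o
        else:
            generators.append((extent, intent))  # spawns a new concept with intent q
    return modified, generators

With the concepts of Figure 1 and o_prime = frozenset('cdfgh') (which appears to be {9}′, judging from the trace in Example 2 below), this would reproduce the sets M(o) and G(o) of Example 1.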

4 Theoretical foundations

For space limitation reasons, only key definitions and results that help the understanding of the more topical developments are provided in this section. First, a set of mappings is given linking the lattices L and L+ (in the following, the correspondence operator ′ is computed in the respective context of the application co-domain, i.e., K or K+). The mapping σ sends a concept from L to the concept in L+ with the same intent, whereas γ works the other way round but respects extent preservation (modulo o). The mappings χ and χ+ send a concept in L to the maximal element of its class []Q in L and L+, respectively.

Definition 1 Assume the following mappings:
– γ : C+ → C with γ(X, Y) = (X_1, X_1′), where X_1 = X − {o},
– σ : C → C+ with σ(X, Y) = (Y′, Y), where Y′ is computed in K+,
– χ : C → C with χ(X, Y) = (Y_1′, Y_1″), where Y_1 = Y ∩ {o}′,
– χ+ : C → C+ with χ+(X, Y) = (Y_1′, Y_1), where Y_1 = Y ∩ {o}′ (′ over K+).

The above mappings are depicted in Figure 3. Observe that σ is a join-preserving order embedding, whereas γ is a meet-preserving function with γ ◦ σ = id_C. Moreover, both mappings underlie the necessary definitions (skipped here) for the sets G(o) and M(o) in L and their counterparts G+(o) and M+(o) in L+, which replace the intuitive descriptions we used so far.
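Definition 1 translates almost literally into code. The sketch below reuses the f and g helpers of Section 2; inc and inc_plus stand for the incidence relations of K and K+, objects_plus for O ∪ {o}, and all names are ours. The mapping χ is analogous to chi_plus, with both primes computed over K.

def sigma(inc_plus, objects_plus, concept):
    X, Y = concept
    return (g(inc_plus, objects_plus, Y), Y)     # same intent, extent recomputed in K+

def gamma(inc, attributes, concept, o):
    X, Y = concept
    X1 = frozenset(X) - {o}
    return (X1, f(inc, attributes, X1))          # extent shrunk by o, intent recomputed in K

def chi_plus(inc_plus, objects_plus, concept, o_prime):
    X, Y = concept
    Y1 = frozenset(Y) & o_prime                  # Q(c) = Intent(c) ∩ {o}′
    return (g(inc_plus, objects_plus, Y1), Y1)   # maximum of the class of c, seen in L+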

Fig. 3. The lattices L, L+ and 2^A related by the mappings χ, χ+, σ, γ and Q.

A first key result states that G(o) and M(o) are exactly the maximal concepts in the equivalence classes induced by the function Q : C → 2^A defined as Q(c) = Y ∩ {o}′ where c = (X, Y). Moreover, the suborder of L made up of G(o) and M(o) is isomorphic, via χ+, to ↑ν(o), i.e., the prime filter of L+ generated by the minimal concept including o. Consequently, (G(o) ∪ M(o), ≤) is a meet-semilattice.

Finally, the precedence order in L+ evolves from the precedence in L as follows. Given a new concept c, the image σ(γ(c)) of its generator is a lower cover of c, while the possible other lower covers of c (in Cov^l(c)) lie in N+(o). The upper covers of c are the concepts from M+(o) ∪ N+(o) that correspond, via σ, to the upper covers of the generator γ(c) in the semi-lattice (G(o) ∪ M(o), ≤). The latter set may be extracted from the set of actual upper covers of γ(c) in L, Cov^u(γ(c)), by considering the maxima of their respective classes for Q, i.e., the values of χ on Cov^u(γ(c)), and keeping only the minimal ones among those values. With a modified concept c in M+(o), its lower covers in L+ differ from the lower covers of γ(c) in L by (i) the (possible) inclusion of concepts from N+(o), and (ii) the removal of all members of G+(o). These facts are summarized as follows:

Property 1 The relation ≺+ is obtained from ≺ as follows:
≺+ = {(σ(γ(c)), c) | c ∈ N+(o)}
   ∪ {(c, c̄) | c ∈ N+(o), c̄ ∈ Min({χ(ĉ) | γ(c) ≺ ĉ})}
   ∪ {(c_1, c_2) | (γ(c_1), γ(c_2)) ∈ (≺ − G(o) × M(o))}

5 A generic scheme for incremental lattice construction

The structural results from the previous paragraphs underlie a generic procedure that, given an object o, transforms L into L+.

5.1 Principles of the method

A generic procedure solving Compute-Lattice-Inc may be sketched out of the following main tasks: (i) partition of the concepts in L into classes (by computing intent intersections), (ii) detection of the maximum of every class []Q and test of its status, i.e., modified or generator, (iii) update of modified concepts, (iv) creation of new elements and computation of their intent and extent, (v) computation of lower and upper covers for each new element, and (vi) elimination of obsolete links for each generator. These tasks, when executed in the previously indicated order, complete a data structure representing the lattice L into a structure representing L+, as shown in Algorithm 1 hereafter.

procedure Compute-Lattice-Inc(In/Out: L = ⟨C, ≤⟩ a lattice; In: o an object)
  for all c in C do
    Put c in its class in L/Q w.r.t. Q(c)
  for all []Q in L/Q do
    Find c = max([]Q)
    if Intent(c) ⊆ {o}′ then
      Put c in M(o)
    else
      Put c in G(o)
  for all c in M(o) do
    Extent(c) ← Extent(c) ∪ {o}
  for all c in G(o) do
    ĉ ← New-Concept(Extent(c) ∪ {o}, Q(c))
    Put ĉ in N(o)
  for all ĉ in N(o) do
    Connect ĉ as an upper cover of its generator c
    Compute-Upper-Covers(ĉ, c)
  for all c in G(o) do
    for all c̄ in Cov^u(c) ∩ M(o) do
      Disconnect c and c̄

Algorithm 1: Generic scheme for the insertion of a new object into a concept (Galois) lattice.

The above procedure is an algorithmic scheme that generalizes the existing incremental algorithms in the sense of specifying the full scope of the work to be done and the order of the tasks to be carried out. However, the exact way a particular algorithm might instantiate the scheme deserves further clarification. On the one hand, some of the tasks might remain implicit in a particular method. Thus, task (i) is not explicitly described in most of the methods from the literature, except in some recent work on lattice-based association rule mining [13, 12]. However, all incremental methods do compute the values of the function Q for every concept in L, as a preliminary step in the detection of class maxima. On the other hand, there is a large space for combining subtasks into larger steps, as major existing algorithms actually do. For example, the algorithms in [8, 3] perform all the sub-tasks simultaneously, whereas Algorithm 7 in [13] separates the problem into two stages: tasks (i)–(iii) are first carried out, followed by tasks (iv)–(vi). In the next paragraphs, we discuss various realizations of the above subtasks.

5.2 Partitioning of C into classes []Q

All incremental algorithms explore the lattice, most of the time in a top-down breadth-first traversal of the lattice graph. Classes are usually not directly manipulated. Instead, at each lattice node, the status of the corresponding concept within its class is considered. Classes are explicitly considered in the methods described in [13, 12], which, although designed for a simpler problem, i.e., the update of (C^a, ⊆) and of C^a, respectively, can easily be extended to first-class methods for Compute-Lattice-Inc. Both methods apply advanced techniques in order to avoid the traversal of the entire lattice when looking for class maxima. The method in [13] skips the entire class induced by the empty intersection, i.e., Q⁻¹(∅). Except for small and very dense contexts, where it can even be void, Q⁻¹(∅) is by far the largest class, and skipping it should result in substantial performance gains. An alternative strategy consists in exploiting class convexity (see Property 2 below) in order to only partially examine each class [12]. For this purpose, a bottom-up (partial) traversal of the lattice is implemented: whenever a non-maximal member of a class is examined, the method “jumps” straight to the maximum of that class.

5.3 Detection of class maxima

A top-down breadth-first traversal of the lattice eases the direct computation of each class maximum, i.e., without constructing the class explicitly. The whole traversal may be summarized as a gradual computation of the function Q. Thus, it is enough to detect each concept c that produces a particular intersection Int = Intent(c) ∩ {o}′ for the first time. For this task, the method of Godin et al. relies on a global memory for intersections that have already been met. This approach could be efficiently implemented with a trie structure, which helps speed up the lookups for a particular intersection (see Algorithms 3 and 4 in [13]). However, we suggest here another technique, based exclusively on locally available information about a lattice node. The technique takes advantage of the convexity of the classes []Q:

Property 2 All classes []Q in L are convex sets: ∀c, ĉ, c̄ ∈ C, c ≤ ĉ ≤ c̄ and [c̄]Q = [c]Q ⇒ [ĉ]Q = [c]Q.

In short, for a non-maximal element c, there is always an upper cover of c, say c̄, which is in [c]Q. Thus, the status of c in [c]Q can be established by only looking at its upper covers. Moreover, as Q is a monotone function (c ≤ c̄ entails Q(c̄) ⊆ Q(c)), the set inclusion can be tested on set sizes.
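In code, the resulting local test amounts to a single cardinality comparison against the upper covers. Here is a minimal sketch under the same conventions as before (names are ours):

def q_of(intent, o_prime):
    return intent & o_prime                      # Q(c) = Intent(c) ∩ {o}′

def is_class_maximum(concept, upper_covers, o_prime):
    # concept: (extent, intent); upper_covers: the concepts of Cov^u(concept).
    # By monotonicity, Q(c̄) ⊆ Q(c) for every upper cover c̄, so Q(c̄) = Q(c)
    # iff |Q(c̄)| = |Q(c)|; c is the maximum of its class iff no upper cover ties.
    qc = q_of(concept[1], o_prime)
    if not upper_covers:                         # the top concept is always a maximum
        return True
    return len(qc) != max(len(q_of(c[1], o_prime)) for c in upper_covers)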

5.4 Computation of the upper covers of a new concept

Given a generator c, “connecting” the new concept ĉ = χ+(c) in the lattice requires the upper and lower covers of ĉ. A top-down breadth-first traversal of L allows the focus to be limited to upper covers, while the work on lower covers is done for free. Moreover, at the time ĉ is created, all its upper covers in L+ have already been processed, so they are available for lookup and link creation. In [8], a straightforward technique for upper cover computation is presented which amounts to looking for all successors of c that are not preceded by another successor. A more sophisticated technique, as in [10], uses a property of the set difference between the extents of two concepts (sometimes called the face between the concepts in the literature). The property states that a concept c precedes another concept c̄ in the lattice iff, for any object ō in the set difference Extent(c̄) − Extent(c), the closure of the set {ō} ∪ Extent(c) is Extent(c̄):

Property 3 For any c = (X, Y), c̄ = (X̄, Ȳ) ∈ L, c ≺ c̄ iff X̄ − X = {ō ∈ O | ({ō} ∪ X)″ = X̄}.

This is easily checked through intersections of concept intents and a subsequent comparison of set cardinalities. To detect all upper covers of a concept c = (X, Y), one needs to check the closures of {ō} ∪ X for every ō ∈ O − X and select the successors of c that satisfy the above property. This leads to a complexity of k(k + m) per concept, where k comes from the factor O − X and m is the cost of the set-theoretic operations on intents.

To further cut the complexity of the task, we suggest a method that should at least improve the practical performances. It can be summarized as follows (see [14] for details). First, instead of considering all the potential successors of a new concept c, we select a subset of them, Candidates = {χ+(c̄) | c̄ ∈ Cov^u(γ(c))}, i.e., the images by χ+ of all upper covers of the generator γ(c). Candidates is a (not necessarily strict) subset of ↑c − {c}, whereby the convexity of the classes []Q and the monotonicity of Q ensure the inclusion of all upper covers from Cov^u(c) = min(↑c − {c}) in the former set. Since the concepts in Cov^u(c) coincide with the minima of Candidates, the former set can be computed through a direct application of a basic property of formal concepts stating that the extent faces between c and the members of Cov^u(c) are pairwise disjoint.

Property 4 For any c = (X, Y) ∈ L, and c̄1 = (X̄1, Ȳ1), c̄2 = (X̄2, Ȳ2) ∈ Cov^u(c), X̄1 ∩ X̄2 = X.

For any ĉ = (X̂, Ŷ) from Candidates − Cov^u(c) there is an upper cover c̄ = (X̄, Ȳ) such that c̄ ≤ ĉ, whence X̂ ∩ X̄ = X̄ ⊇ X, where X is the extent of c. The elements of Candidates − Cov^u(c) can therefore be filtered out by a set of inclusion tests on Candidates. To do this efficiently and avoid testing all possible couples, a buffer can be used to cumulate the faces of all the valid upper covers of c met so far. Provided that candidates are listed in an order compatible with ≤ (so that smaller candidates are met before larger ones), a simple intersection with the buffer is enough to test whether a candidate is an upper cover or not. The above filtering strategy eliminates non-minimal candidates while also discarding copies of the same concept (as several upper covers of c may belong to the same class).

Finally, the computation of χ+, which is essential for the upward detection of class maxima, is straightforward: while modified concepts in L take their own σ values for χ+ (same intent), generators take the respective new concept, and unchanged concepts simply “inherit” the appropriate value from an upper cover that belongs to the same class []Q.

To assess the cost of the operation, one may observe that |Cov^u(γ(c))| operations are needed, which is at most d(L), i.e., the (outer) degree of the lattice taken as an oriented graph. Moreover, the operations of extent intersection and union, with ordered sets of objects in concept extents, take linear time in the size of the arguments, i.e., no more than k = |O|. Only a fixed number of such operations are executed per member of Candidates, so the total cost is in the order of O(kd(L)). Although the complexity order remains comparable to O(k²), the factor d(L) will most of the time be strictly smaller than k and, in sparse datasets, the difference could be significant.
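Assuming the buffer accumulates extent faces, as Property 4 suggests, the filtering step can be sketched as follows (Python, our own naming; concepts are (extent, intent) pairs of frozensets and new_extent is the extent of the new concept, i.e., the generator's extent plus o):

def min_closed(candidates, new_extent):
    # Keep only the minimal elements of Candidates, i.e., the genuine upper
    # covers of the new concept. Sorting by extent size is compatible with ≤,
    # so every upper cover is met before any candidate lying above it.
    kept, face_buffer = [], frozenset()
    for extent, intent in sorted(candidates, key=lambda c: len(c[0])):
        face = extent - new_extent           # extent face w.r.t. the new concept
        if face & face_buffer:
            continue                         # above an already kept cover, or a duplicate
        face_buffer |= face
        kept.append((extent, intent))
    return kept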

5.5 Obsolete link elimination

Any modified concept ĉ which is an immediate successor of a generator c̄ in L should be disconnected from c̄ in L+, since χ+(ĉ) is necessarily an upper cover of the corresponding new element c = χ+(c̄):

Property 5 For any c̄ ∈ G(o), ĉ ∈ M(o): c̄ ≺ ĉ ⇒ ĉ ∈ Min({χ+(c̃) | c̃ ∈ Cov^u(c̄)}).

As the set Cov^u(c̄) is required in the computation of Cov^u(c), there is no additional cost in eliminating ĉ from the list of the upper covers of c̄. This is done during the computation of Candidates. Conversely, deleting c̄ from the list of the lower covers of ĉ (if such a list is used) is done free of extra effort, i.e., by replacing c̄ with c = χ+(c̄).
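Concretely, the disconnection can be folded into the very loop that links the new concept to its upper covers. A sketch reusing min_closed from above (upper maps each concept to the set of its upper covers, gen is the generator and c_new its new concept; all names are ours):

def connect_and_clean(c_new, gen, candidates, modified, upper):
    covers = min_closed(candidates, c_new[0])    # upper covers of c_new in L+
    upper[c_new] = set(covers)
    for c in covers:
        if c in modified:                        # Property 5: the link gen ≺ c is obsolete
            upper[gen].discard(c)
    upper[gen].add(c_new)                        # c_new becomes an upper cover of gen
    return covers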

6 An efficient instantiation of the scheme

The algorithm takes a lattice and a new object (the set A is assumed to be known from the beginning, i.e., {o}′ ⊆ A) and outputs the updated lattice, using the same data structure L to represent both the initial and the resulting lattices. The values of Q and χ+ are supposed to be stored in a generic structure allowing indexing on concept identifiers (structure ChiPlus). First, the concept set is sorted along a linear extension of the order ≤, as required for the top-down traversal of L (primitive Sort on line 3). The overall loop (lines 4 to 20) examines every concept c in L and establishes its status in [c]Q by comparing |Q(c)| to the maximal |Q(c̄)| where c̄ is an upper cover of c (line 6). To this end, the variable new-max is used. Initialized with the upper cover maximizing |Q| (line 5), new-max eventually points to the concept in L+ whose intent equals Q(c), i.e., χ+(c). Class maxima are further divided into modified and generators (line 7). A modified concept c (lines 8 to 10) has its extent updated. Then, such a c is set as its own value for χ+, χ+(c) = c (via new-max). Generators first give rise to a new concept (line 12). Then, the values of χ+ for their upper covers are picked up (in the Candidates list, line 13) to be further filtered for minimal concepts (Min-Closed, line 14). Minima are connected to the new concept, and those of them which are modified in L are disconnected from the generator c (lines 15 to 17). Finally, the correct maximum of the class [c]Q in L+, i.e., χ+(c), is set (line 18) and the new concept is added to the lattice (line 19). At the end of the loop, the value of χ+ is stored for further use (line 20).

1: procedure Add-Object(In/Out: L = ⟨C, ≤⟩ a lattice; In: o an object)
2:
3:   Sort(C)
4:   for all c in C do
5:     new-max ← argmax({|Q(c̄)| | c̄ ∈ Cov^u(c)})
6:     if |Q(c)| ≠ |Q(new-max)| then
7:       if |Q(c)| = |Intent(c)| then
8:         Extent(c) ← Extent(c) ∪ {o}    {c is modified}
9:         M(o) ← M(o) ∪ {c}
10:        new-max ← c
11:      else
12:        ĉ ← New-Concept(Extent(c) ∪ {o}, Q(c))    {c is a generator}
13:        Candidates ← {ChiPlus(c̄) | c̄ ∈ Cov^u(c)}
14:        for all c̄ in Min-Closed(Candidates) do
15:          New-Link(ĉ, c̄)
16:          if c̄ ∈ M(o) then
17:            Drop-Link(c, c̄)
18:        new-max ← ĉ
19:        L ← L ∪ {ĉ}
20:     ChiPlus(c) ← new-max

Algorithm 2: Insertion of a new object into a Galois lattice.

Example 2. Consider the same situation as in Example 1. The trace of the algorithm is given in the following table, which provides the intent intersection Q(c), the image χ+(c) and the category of each concept (the identifiers in the χ+(c) column refer to concepts of the final lattice L+, not to be confused with their counterparts in L).

c      Q(c)    χ+(c)   Cat.
c#1    ∅       c#1     mod.
c#2    c       c#2     mod.
c#3    d       c#3     mod.
c#4    g       c#4     mod.
c#5    h       c#5     mod.
c#6    c       c#2     old
c#7    cd      c#14    gen.
c#8    dgh     c#8     mod.
c#9    fh      c#15    gen.
c#10   cd      c#14    old
c#11   cdgh    c#16    gen.
c#12   dfgh    c#17    gen.
c#13   cdfgh   c#18    gen.

To illustrate the way our algorithm proceeds, consider the processing of the concept c#12 = (3, defgh). The value of Q(c#12) is dfgh, whereas Candidates contains the images by χ+ of the upper covers of c#12, i.e., c#8 and c#9: Candidates = {c#8 = (139, dgh), c#15 = (359, fh)}. Obviously, neither of the intents is as big as Q(c#12), so c#12 is a maximum, more precisely a generator. The new concept c#17 is (39, dfgh) and its upper covers are both concepts in Candidates (since they are incomparable). Finally, as c#8 is in M(o), its link to c#12 is removed.

6.1 Complexity issues

Let ∆(l) = |C+| − |C| and let us split the cost of a single object addition into two factors: the cost of the traversal of L (lines 3–7 and 20 of Algorithm 2) and the cost of the restructuring of L, i.e., the processing of class maxima (lines 8–19). First, as sorting the concepts along a linear extension of ≤ only requires comparisons of intent sizes, which are bounded by m, it can be done in O(l). Moreover, the proper traversal takes O(l) concept examinations, each of which is in O(k + m). Thus, the first factor is in O(l(k + m)). The second factor is further split into modified and generator costs, whereby the first cost is linear in the size of M(o) (since lines 8–10 may be executed in constant time even with sorted extents) and therefore can be ignored. The generator-related cost has a factor ∆(l), whereas the remaining factor is the cost of creating and properly connecting a single new concept. The dominant component of the latter is the cost of the lattice order update (lines 14–17), which is in O(k²) as mentioned earlier. Consequently, the global restructuring overhead is in O(∆(l)k²). This leads to a worst-case complexity of O(∆(l)k² + l(k + m)) for a single insertion, which lowers the known upper bound for the complexity of Compute-Lattice-Inc (see also [11]).

The assessment of the entire lattice construction via incremental updates is delicate, since it requires summing over all k insertions, whereas the cost of steps 1 to k − 1 depends on parameters of the intermediate structures. Once again, we sum the above high-level complexity factors separately. Thus, the total cost of the k lattice traversals is bounded by k times the cost of the most expensive traversal (the last one), i.e., it is in O(kl(k + m)). The total cost of lattice restructuring is in turn bounded by the number of all new concepts (the sum of the ∆(l_i)) times the maximal cost of processing a single new concept. The first factor is exactly l = |C+|, since each concept of the final lattice is created exactly once, which means the restructuring factor of the construction is in O(l(k + m)k), thus leading to a global complexity in the same class O(l(k + m)k). The above figures indicate that the complexity of Compute-Lattice, whenever reduced to a series of Compute-Lattice-Inc, remains in the same class as the best known bound for batch methods [10].
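Spelling out the summation behind this argument (a sketch of the bookkeeping, where l_i denotes the size of the intermediate lattice at step i, every intermediate object count is bounded by k, and ∑_{i=1}^{k} ∆(l_i) = l):

∑_{i=1}^{k} [ l_i(k + m) + ∆(l_i)k² ]  ≤  k·l·(k + m) + (∑_{i=1}^{k} ∆(l_i))·k²  =  k·l·(k + m) + l·k²  =  O(l(k + m)k).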

7 Conclusion

The present study is motivated by the need for both efficient and theoretically grounded algorithms for incremental lattice construction. In this paper, we complete our own characterization of the substructure that should be integrated into the initial lattice upon each insertion of an object/attribute into the context. Moreover, we show how the relevant structural properties support the design of an effective maintenance method which, unlike previous algorithms, avoids redundant computations. As a guideline for such a design, we provide a generic algorithmic scheme that states the limits of the minimal work that needs to be done in the restructuring. A concrete method instantiating the scheme is proposed, whose worst-case complexity is O(ml + ∆(l)k²), i.e., a function which puts a new and smaller upper bound on the cost of the problem Compute-Lattice-Inc. Surprisingly enough, when applied as a batch method for lattice construction, the new algorithm shows the best known theoretical complexity, O((k + m)lk), which is otherwise achieved by only one algorithm. As a next stage of our study, we are currently examining the pragmatic benefits of the scheme, i.e., the practical performances of specific scheme instantiations.

References

[1] M. Barbut and B. Monjardet. Ordre et Classification: Algèbre et combinatoire. Hachette, 1970.
[2] J.-P. Bordat. Calcul pratique du treillis de Galois d'une correspondance. Mathématiques et Sciences Humaines, 96:31–47, 1986.
[3] C. Carpineto and G. Romano. A Lattice Conceptual Clustering System and Its Application to Browsing Retrieval. Machine Learning, 24(2):95–122, 1996.
[4] B. Ganter. Two basic algorithms in concept analysis. Preprint 831, Technische Hochschule, Darmstadt, 1984.
[5] B. Ganter and R. Wille. Formal Concept Analysis, Mathematical Foundations. Springer-Verlag, 1999.
[6] R. Godin and H. Mili. Building and maintaining analysis-level class hierarchies using Galois lattices. In Proceedings of OOPSLA'93, Washington (DC), USA, special issue of ACM SIGPLAN Notices, 28(10), pages 394–410, 1993.
[7] R. Godin and R. Missaoui. An Incremental Concept Formation Approach for Learning from Databases. Theoretical Computer Science, 133:378–419, 1994.
[8] R. Godin, R. Missaoui, and H. Alaoui. Incremental concept formation algorithms based on Galois (concept) lattices. Computational Intelligence, 11(2):246–267, 1995.
[9] S. Kuznetsov and S. Ob'edkov. Algorithms for the Construction of the Set of All Concept and Their Line Diagram. Preprint MATH-AL-05-2000, Technische Universität, Dresden, June 2000.
[10] L. Nourine and O. Raynaud. A Fast Algorithm for Building Lattices. Information Processing Letters, 71:199–204, 1999.
[11] P. Valtchev and R. Missaoui. Building concept (Galois) lattices from parts: generalizing the incremental methods. In H. Delugach and G. Stumme, editors, Proceedings, ICCS-01, volume 2120 of Lecture Notes in Computer Science, pages 290–303, Stanford (CA), USA, 2001. Springer-Verlag.
[12] P. Valtchev and R. Missaoui. A Framework for Incremental Generation of Frequent Closed Itemsets. Discrete Applied Mathematics, submitted.
[13] P. Valtchev, R. Missaoui, R. Godin, and M. Meridji. Generating Frequent Itemsets Incrementally: Two Novel Approaches Based On Galois Lattice Theory. Journal of Experimental & Theoretical Artificial Intelligence, 14(2-3):115–142, 2002.
[14] P. Valtchev, R. Missaoui, and P. Lebrun. A partition-based approach towards building Galois (concept) lattices. Discrete Mathematics, 256(3):801–829, 2002.
