Structured Ontology and Information Retrieval for Email Search and Discovery

Peter Eklund and Richard Cole

School of Information Technology and Electrical Engineering
The University of Queensland, St. Lucia, QLD 4072
[email protected],
[email protected]
Abstract. This paper discusses a document discovery tool based on formal concept analysis. The program allows users to navigate email using a visual lattice metaphor rather than a tree. It implements a virtual file structure over email where files and entire directories can appear in multiple positions. The content and shape of the lattice formed by the conceptual ontology can assist in email discovery. The system described provides more flexibility in retrieving stored emails than is normally available in email clients. The paper discusses how conceptual ontologies can leverage traditional document retrieval systems.
1 Introduction
Client-side email management systems are document management systems that store email in a tree structure, by analogy with the physical directory/file structure. This has the advantage that trees are simply explained as a direct mapping from the structure of the file system to the email. The disadvantage is that, at the moment of storing an email, the user must anticipate the way she will later retrieve it. How then should email be organized? Should we store it as a specialization or a generalization hierarchy? Are we trying to give every email a unique key based on its content, or to cluster emails broadly on their content? This problem generalizes to other document types: organization is both context dependent and query dependent (after the fact).

One such organization of an associative store is a virtual file structure that maps the physical file structure to a view based on content. Information retrieval gives us the ability to index every meaningful word in a text by generating an inverted file index. The index can then be reduced by stemming, compression and frequency analysis. These scalable techniques from information retrieval can be extended by re-using conceptual ontologies as a virtual file structure.

In this paper, we profile HierMail¹ (previously referred to in various stages as Cem, ECA or Warp9), which follows from earlier work in medical document retrieval reported in [3]. HierMail is a lattice-based email retrieval and storage program that aids knowledge discovery through a conceptual and virtual view over email. It uses a conceptual ontology, rather than a tree, as a data structure for storing emails. In turn, formal concept analysis can be used to generate a concept lattice of the file structure. This permits clients to retrieve emails along different paths and discover interesting associations between email content.

In HierMail, email retrieval is independent of the physical organization of the file system. This idea is not new: for instance, the concept of a virtual folder was introduced in a program called View Mail (VM) [5]. A virtual folder is a collection of email documents retrieved in response to a query. The virtual folder concept has more recently been popularized by a number of open-source projects². Other commercial discovery tools for email are also available; see http://80-20.com for example. HierMail differs from those systems in the understanding of the underlying structure, via formal concept analysis, as well as in the details of implementation. It therefore extends the virtual file system idea into document discovery.

Concept lattices are defined in the mathematical theory of Formal Concept Analysis [4]. A concept lattice is derived from a binary relation which assigns attributes to objects. In our application, the objects are all emails stored by the system, and the attributes are classifiers like ‘conferences’, ‘administration’ or ‘teaching’. We call the string-matching regular expressions classifiers since HierMail is designed to accommodate any form of pattern matching algorithm against text, images or multimedia content. The idea of automatically learning classifiers from documents has been the focus of the machine learning and text classification communities [1] but is not specifically considered in this treatment.

¹ see http://www.hiermail.com
2 Background
Formal Concept Analysis (FCA) [4] is a long-standing data analysis technique. Two software tools, Toscana [7] and Anaconda, embody a methodology for data analysis based on FCA. A Java-based open-source variant of these programs, called ToscanaJ, has also been developed³.

² see http://gmail.linuxpower.org/
³ see http://toscanaj.sourceforge.net

Following the FCA methodology, data is organized as a table in an RDBMS and modeled mathematically as a multi-valued context $(G, M, W, I)$, where $G$ is a set of objects, $M$ is a set of attributes, $W$ is a set of attribute values and $I$ is a relation between $G$, $M$ and $W$ such that if $(g, m, w_1) \in I$ and $(g, m, w_2) \in I$ then $w_1 = w_2$. In the RDBMS there is one row for each object, one column for each attribute, and each cell can contain an attribute value.

Organization over the data is achieved via conceptual scales that map attribute values to new attributes and are represented by a mathematical entity called a formal context. A conceptual scale is defined for a particular attribute of the multi-valued context: if $S_m = (G_m, M_m, I_m)$ is a conceptual scale of $m \in M$ then we require $W_m \subseteq G_m$. The conceptual scale can be used to produce a summary of data in the multi-valued context as a derived context. The context derived by $S_m = (G_m, M_m, I_m)$ with respect to plain scaling from data stored in the multi-valued context $(G, M, W, I)$ is the context $(G, M_m, J_m)$ where, for $g \in G$ and $n \in M_m$,

$$g \, J_m \, n \;:\Leftrightarrow\; \exists w \in W : (g, m, w) \in I \text{ and } (w, n) \in I_m.$$

Scales for two or more attributes can be combined together into a derived context. Consider a set of scales $S_m$, where each $m \in M$ gives rise to a different scale. The new attributes supplied by each scale can be combined using a union:

$$N := \bigcup_{m \in M} \{m\} \times M_m.$$

Then the formal context derived from combining these scales is $(G, N, J)$ with

$$g \, J \, (m, n) \;:\Leftrightarrow\; \exists w \in W : (g, m, w) \in I \text{ and } (w, n) \in I_m.$$

The derived context is then displayed to the user as a lattice of concepts (see Fig. 1 (right)).

Fig. 1. Scale, classifier and concept lattice. The central dialogue box shows how the scale function ($\alpha$) can be edited.

In practice, it is easier to define a conceptual scale by attaching expressions to objects rather than attribute values. The expressions denote a range of attribute values all having the same scale attributes. To represent these expressions in conceptual scaling we introduce a function called the composition operator for attribute $m$, $\alpha_m : W_m \to G_m$, where $W_m = \{w \in W \mid \exists g \in G : (g, m, w) \in I\}$. This maps attribute values to scale objects. The derived context then becomes $(G, N, J)$ with

$$g \, J \, (m, n) \;:\Leftrightarrow\; \exists w \in W : (g, m, w) \in I \text{ and } (\alpha_m(w), n) \in I_m.$$
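As an illustration of the definitions above, the following Python sketch derives the incidence relation J of the context (G, N, J) from a small multi-valued context, using one composition operator per attribute. The sample data, the attributes `size` and `sender`, and all function names are hypothetical and chosen only for the example; this is not taken from HierMail.

```python
# Multi-valued context I as a set of (object, attribute, value) triples.
I = {
    ("mail1", "size", 3), ("mail1", "sender", "cole"),
    ("mail2", "size", 250), ("mail2", "sender", "eklund"),
}

# Composition operators alpha_m: attribute value -> scale object (hypothetical).
alpha = {
    "size": lambda w: "small" if w < 100 else "large",
    "sender": lambda w: w,
}

# Scale incidence I_m: (scale object, scale attribute) pairs for each attribute m.
scale_incidence = {
    "size": {("small", "<100KB"), ("large", ">=100KB")},
    "sender": {("cole", "from cole"), ("eklund", "from eklund")},
}

def derive_context(I, alpha, scale_incidence):
    """Return J as a set of (g, (m, n)) pairs, where g J (m, n) holds iff
    there is (g, m, w) in I with (alpha_m(w), n) in I_m."""
    J = set()
    for g, m, w in I:
        scale_object = alpha[m](w)
        for obj, n in scale_incidence[m]:
            if obj == scale_object:
                J.add((g, (m, n)))
    return J

J = derive_context(I, alpha, scale_incidence)
```

Each derived attribute is a pair (m, n), which mirrors the disjoint union used to build N from the individual scales.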
The purpose of this summary is to reinforce that in practice FCA works with structured object-attribute data in RDBMS form, in conjunction with a collection of conceptual scales. An inverted file index is a kind of relational database in which any significant, non-stemmed, non-stop word is a primary key. Documents can therefore be seen as objects, significant keywords as attributes, and the formation of conceptual scales as an interactive process.

Given this background we now describe the system on a structural level, abstracting from implementation details. We distinguish three fundamental structures: (i) a formal context that assigns to each email a set of classifiers; (ii) a hierarchy on the set of classifiers in order to define more general classifiers; (iii) a mechanism for creating conceptual scales used as a graphical interface for email retrieval.

Attaching String Classifiers to Email. In HierMail, we use a formal context $(G, M, I)$ for storing email and assigning classifiers. The set $G$ contains all emails stored in the system, the set $M$ contains all classifiers. For the moment, we consider $M$ to be unstructured. The incidence relation $I$ indicates emails assigned to each classifier. The incidence relation is generated in a semi-automatic process: (i) a string-search algorithm recognizes words within sections of an email and suggests relations between email attributes; (ii) the client may accept the suggestion of the string-search algorithm or otherwise modify it; and (iii) the client may attach his own attributes to the email. It is the generation of this incidence relation $I$ that can be replaced with text classification algorithms. Instead of a tree of disjoint folders and sub-folders, we consider the concept lattice $\mathfrak{B}(G, M, I)$ as the navigation space. The formal concepts replace the folders. In particular, this means that the same emails may appear in different concepts and therefore in different folders. The most general concept contains all email in the collection and the deeper the user moves into the multiple inheritance hierarchy, the more specific the concepts, and subsequently the fewer emails they contain.

Organizing Hierarchies of String Classifiers. To support the semi-automatic assignment of classifiers to email, we provide the set $M$ of classifiers with a partial order $\leq$. For this subsumption hierarchy, we assume that the following compatibility condition holds:

$$\forall g \in G,\; m, n \in M : (g, m) \in I,\; m \leq n \Rightarrow (g, n) \in I \qquad (\ddagger)$$

i.e., the assignment of classifiers respects the transitivity of the partial order. Hence, when assigning classifiers to emails, it is sufficient to assign the most specific classifiers only. More general classifiers are automatically added. For instance, the user may want to say that ismis is a more specific classifier than conferences, and that ismis2002 is more specific than ismis (i.e., $\text{ismis2002} \leq \text{ismis} \leq \text{conferences}$). Emails concerning the creation of a paper for the ismis’02 conference are assigned by the email client to the ismis2002 label only (and possibly also to additional classifiers like cole and eklund). When the client wants to retrieve this email, she is not required to recall the complete pathname. Instead, the emails also appear under the more general label conferences. If conferences provides too large a list of email, the client can refine the search by choosing a sub-term like ismis, or by adding a new classifier, for instance cole.
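The compatibility condition (‡) can be illustrated with a short Python sketch that closes a set of explicitly assigned classifiers upward under the partial order. The dictionary `parents` (the direct generalizations of each classifier) and the function names are assumptions made for this example and are not part of HierMail.

```python
# Direct generalizations of each classifier; hypothetical data mirroring the
# example ismis2002 <= ismis <= conferences from the text.
parents = {
    "ismis2002": {"ismis"},
    "ismis": {"conferences"},
    "conferences": set(),
    "cole": set(),
}

def order_filter(m, parents):
    """Return all classifiers n with m <= n (the classifier itself included)."""
    seen, stack = set(), [m]
    while stack:
        x = stack.pop()
        if x not in seen:
            seen.add(x)
            stack.extend(parents.get(x, ()))
    return seen

def close_assignment(assigned, parents):
    """Add every more general classifier so that condition (‡) holds."""
    closed = set()
    for m in assigned:
        closed |= order_filter(m, parents)
    return closed

# Only the most specific classifiers are assigned by hand.
labels = close_assignment({"ismis2002", "cole"}, parents)
```

With this closure in place, an email labelled only with ismis2002 is also found under ismis and conferences, which is the retrieval behavior described above.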
Navigating Email. Conceptual scaling deals with many-valued attributes. Often attributes are not one-valued, as the string classifiers given above are, but allow a range of values. This is modeled by a many-valued context. A many-valued context is roughly equivalent to a relation in a relational database with one field being a primary key. As one-valued contexts are special cases of many-valued contexts, conceptual scaling can also be applied to one-valued contexts to reduce the complexity of the visualization. In this paper, we only deal with one-valued formal contexts. Readers interested in the exact definition of many-valued contexts are referred to Ganter & Wille [4].

Applied to one-valued contexts, conceptual scales are used to determine the concept lattice that arises from one vertical slice of a large context: a conceptual scale for a subset $B \subseteq M$ of attributes is a (one-valued) formal context $S_B := (G_B, B, \ni)$ with $G_B \subseteq \mathcal{P}(B)$. The scale is called consistent with respect to $K := (G, M, I)$ if $\{g\}' \cap B \in G_B$ for each $g \in G$. For a consistent scale $S_B$, the context $S_B(K) := (G, B, I \cap (G \times B))$ is called its realized scale. Conceptual scales are used to group together related attributes. They are determined as required by the user, and the realized scales are derived from them when a diagram is requested by the user.

HierMail stores all scales that the client has defined in previous sessions. To each scale, the client can assign a unique name. This is modeled by a function $\alpha$. Let $S$ be a set, whose elements are called scale names. The mapping $\alpha : S \to \mathcal{P}(M)$ defines for each scale name $s \in S$ a scale $S_s := S_{\alpha(s)}$. For instance, the user may introduce a new scale which classifies emails according to being related to a conference by adding a new element ‘Conferences’ to $S$ and by defining

$$\alpha(\text{Conferences}) := \{\text{ISMIS02},\ \text{K-CAP '01},\ \text{ADCS '01},\ \text{PKDD 2000}\}.$$
Observe that $S$ and $M$ need not be disjoint. This allows the following construction, deducing conceptual scales directly from the subsumption hierarchy: let $S := \{m \in M \mid \exists n \in M : n < m\}$, and define, for $s \in S$, $\alpha(s) := \{m \in M \mid m \prec s\}$ (with $x \prec y$ if and only if $x < y$ and there is no $z$ such that $x < z < y$). This means all classifiers $m \in M$, neither minimal nor maximal in the hierarchy, are considered as the name of scale $S_m$ and as a classifier of another scale $S_n$ (where $m \prec n$). This last construction defines a hierarchy of conceptual scales [6].
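The construction of scales from the subsumption hierarchy can be sketched in Python as follows. The hierarchy is given as a hypothetical set `less` of pairs (x, y) meaning x < y, assumed here to be transitively closed; the names `lower_covers` and `scales_from_hierarchy` are illustrative only.

```python
def lower_covers(s, elements, less):
    """Return {m | m ≺ s}: m < s with no z such that m < z < s."""
    below = {m for m in elements if (m, s) in less}
    return {m for m in below
            if not any((m, z) in less and (z, s) in less for z in elements)}

def scales_from_hierarchy(elements, less):
    """Map every non-minimal classifier s to alpha(s), its set of lower covers."""
    names = {m for m in elements if any((n, m) in less for n in elements)}
    return {s: lower_covers(s, elements, less) for s in names}

# Hypothetical hierarchy: ismis2002 < ismis < conferences, pkdd2000 < conferences.
M = {"ismis2002", "ismis", "pkdd2000", "conferences"}
less = {
    ("ismis2002", "ismis"), ("ismis", "conferences"),
    ("ismis2002", "conferences"), ("pkdd2000", "conferences"),
}
alpha = scales_from_hierarchy(M, less)
```

In this small example ismis is both the name of a scale (containing ismis2002) and a classifier within the scale named conferences, which is the overlap between S and M that the text describes.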
3 Requirements of HierMail
Requirements are divided along the same lines as the underlying algebraic structures given previously: (i) assist the user in editing and browsing a classifier hierarchy; (ii) help the client visualize and modify the scale function $\alpha$; (iii) allow the client to manage the assignment of classifiers to emails; (iv) assist the client in searching the conceptual space of emails for both individual emails and conceptual groupings of emails. In addition to the requirements stated above, a good email client needs to be able to send, receive and display emails, processing the various email formats and interacting with popular protocols. Since these requirements are already well understood and implemented by existing email programs, they are not discussed further. This does not mean they are unimportant; rather, they are implicit in the realization of HierMail.
Modifying a String Classifier. The classifier hierarchy is a partially ordered set $(M, \leq)$ where each element of $M$ is a classifier. The requirements for editing and browsing the classifier hierarchy are: (i) to graphically display the structure of $(M, \leq)$ so that the ordering relation is evident; (ii) to make accessible a series of direct manipulations to alter the order relation.

The Scale Function $\alpha$. The user must be able to visualize the scale function $\alpha$. The program must allow an overlap between the set of scale labels $S$ and the set of classifiers $M$; this is shown in Fig. 1.

Managing Classifier Assignment. The formal context associates email with classifiers via the incidence relation $I$. Also introduced earlier was the compatibility condition $(\ddagger)$. The program should store the formal context $(G, M, I)$ and ensure that the compatibility condition is always satisfied. It is inevitable that the program will have to modify the formal context in order to satisfy the compatibility condition after a change is made to the classifier hierarchy. The program must support two mechanisms for associating classifiers with emails. Firstly, a mechanism in which emails are automatically associated with classifiers based on the email content. Secondly, the user should be able to view and modify email classifiers.

The Conceptual Space. The program must allow the navigation of the conceptual space of the emails by drawing line diagrams of concept lattices derived from conceptual scales [4], as shown in Fig. 1. These line diagrams should extend to locally scaled nested-line diagrams, as shown in Fig. 2. The program must allow retrieval and display of emails forming the extension of concepts displayed in the line diagrams.
4 Implementation of HierMail
This section describes the implementation of HierMail following a structure similar to that presented earlier.

Browsing the Hierarchy. The user is presented with a view of the hierarchy $(M, \leq)$ as a tree widget, shown in Fig. 1 (left). The tree widget has the advantage that most users are familiar with its behavior and it provides a compact representation of a tree structure. The classifier hierarchy, being a partially ordered set, is a more general structure than a tree. The following is a definition of a tree derived from the classifier hierarchy for the purpose of defining the contents and structure of the tree widget. Let $(M, \leq)$ be a partially ordered set and denote the set of all sequences of elements from $M$ by $\langle M \rangle$. Then the tree derived from the classifier hierarchy is comprised of $(T, \mathrm{parent}, \mathrm{label})$, where $T \subseteq \langle M \rangle$ is a set of tree nodes, $\langle\rangle$ (the empty sequence) is the root of the tree, $\mathrm{parent} : T \setminus \{\langle\rangle\} \to T$ is a function giving the parent node of each node (except the root node), and $\mathrm{label} : T \to M$ assigns a classifier to each tree node:

$$T = \{\langle m_1, \ldots, m_n \rangle \in \langle M \rangle \mid m_i \prec m_{i+1} \text{ and } m_n \in \mathrm{top}(M)\}$$
$$\mathrm{parent}(\langle m_1, \ldots, m_n \rangle) := \langle m_2, \ldots, m_n \rangle$$
$$\mathrm{parent}(\langle m_1 \rangle) := \langle\rangle$$
$$\mathrm{label}(\langle m_1, \ldots, m_n \rangle) := m_1$$
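A minimal Python sketch of the derived tree is given below. It reads each tree node as a path from a classifier up to a top element of the hierarchy, and it assumes that the parent of a node is the same path with its first element removed, so that the parent is again a path ending at a top element. The sample hierarchy and the names `upper_covers`, `tree_nodes`, `parent` and `label` are hypothetical.

```python
# Upper covers of each classifier in a hypothetical hierarchy; "ismis2002"
# has two upper covers, so it appears at two positions in the derived tree.
upper_covers = {
    "conferences": set(),
    "papers": set(),
    "ismis": {"conferences"},
    "ismis2002": {"ismis", "papers"},
}

def tree_nodes(upper_covers):
    """Enumerate tree nodes as tuples (m1, ..., mn) where each m_{i+1} is an
    upper cover of m_i and mn is a top element (it has no upper covers)."""
    def paths_from(m):
        if not upper_covers[m]:
            yield (m,)
        for cover in sorted(upper_covers[m]):
            for rest in paths_from(cover):
                yield (m,) + rest
    return {path for m in upper_covers for path in paths_from(m)}

def parent(node):
    """Parent node: drop the labelling classifier; () stands for the root."""
    return node[1:]

def label(node):
    """The classifier displayed at this tree node."""
    return node[0]

T = tree_nodes(upper_covers)
```

Because a classifier with several upper covers yields several paths, it is displayed once under each of its parents, which is how a partial order can be shown in an ordinary tree widget.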
Fig. 2. Scales and the nested-line diagram.
Each tree node is identified by a path from a classifier to the top of the classifier hierarchy.

Modifying the Hierarchy $(M, \leq)$. The program provides four operations for modifying the hierarchy: insert classifier, remove classifier, insert ordering and remove ordering. More complex operations provided to the client, such as moving an item in the hierarchy, are resolved to a sequence of these basic operations. In this section we denote the order filter of $m$ as $\uparrow m := \{x \in M \mid m \leq x\}$, the order ideal of $m$ as $\downarrow m := \{x \in M \mid x \leq m\}$, and the upper cover of $m$ as $m^{\succ} := \{x \in M \mid x \succ m\}$. The operation of inserting a classifier simply adds a new classifier to $M$ and leaves the $\leq$ relation unchanged. The remove classifier operation takes a single parameter $a \in M$ for which the lower cover is empty, and removes $a$ from $M$ and $(\uparrow a) \times \{a\}$ from the ordering relation. The insert ordering operation takes two parameters $a, b \in M$ and inserts into the relation $\leq$ the set $(\uparrow a) \times (\downarrow b)$. The remove ordering operation takes two parameters $a, b \in M$, where $a$ is an upper cover of $b$, and removes from $\leq$ the set $(\uparrow a \setminus \uparrow (b^{\succ} \setminus \{a\})) \times (\downarrow b)$.

The Scale Function $\alpha$. The set of scales $S$ is not disjoint from $M$, thus the tree representation of $M$ already presents a view of a portion of $S$. In order to reduce the complexity of the graphical interface, we make $S$ equal to $M$, i.e. all classifiers are scale labels, and all scale labels are classifiers. A result of this definition is that classifiers with no lower covers lead to trivial scales containing no other classifiers. The function $\alpha$ maps each classifier $m$ to a set of classifiers. The program displays this set of classifiers, when requested by the user, using a dialog box (see Fig. 1 (center)). The dialog box lists all classifiers in the down-set of $m$, each with an icon (either a tick or a cross) to indicate membership in the set of classifiers given by $\alpha(m)$. Clicking on the icon toggles the tick or cross and changes the membership of $\alpha(m)$. By only displaying the down-set of $m$ in the dialog box, the program restricts the definition of $\alpha$ to $\alpha(m) \subseteq \downarrow m$. This has an effect on the remove ordering operation defined on $(M, \leq)$. When the ordering $a \leq b$ is removed, the image of the function $\alpha$ for attributes in $\uparrow a$ must be checked and possibly modified.

Associating Emails to Classifiers. Each member of $(M, \leq)$ is associated with a query term, which in this application is a set of section/word pairs. That is, let $H$ be the set of sections found in the email documents and $W$ the set of words found in email documents; then a function $\mathrm{query} : M \to \mathcal{P}(H \times W)$ attaches to each attribute a set of section/word pairs. Let $G$ be a set of emails. An inverted file index stores a relation $R_1 \subseteq G \times (H \times W)$ between documents and section/word pairs: $(g, (h, w)) \in R_1$ indicates that document $g$ has word $w$ in section $h$. A relation $R_2 \subseteq G \times M$ is derived from the relation $R_1$ and the function $\mathrm{query}$ via: $(g, m) \in R_2$ iff $(g, (h, w)) \in R_1$ for some $(h, w) \in \mathrm{query}(m)$. A relation $R_3$ stores user judgments saying that an email should have an attribute $m$. A relation $R_4$ respecting the compatibility condition $(\ddagger)$ is then derived from the relations $R_2$ and $R_3$ via: $(g, m) \in R_4$ iff there exists $m_1 \leq m$ with $(g, m_1) \in R_2 \cup R_3$.
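The derivation of R2 and R4 from the inverted file index can be sketched in Python with plain sets. The sample emails, the query terms and the precomputed table `below` (all classifiers less than or equal to a given classifier) are hypothetical; a real inverted file index would not be held as an in-memory set of pairs.

```python
# R1: inverted file index as a set of (email, (section, word)) pairs.
R1 = {
    ("mail1", ("subject", "ismis")),
    ("mail1", ("body", "deadline")),
    ("mail2", ("from", "cole")),
}

# query(m): section/word pairs attached to each classifier (hypothetical).
query = {
    "ismis2002": {("subject", "ismis")},
    "conferences": set(),
    "cole": {("from", "cole")},
}

# below[m]: all classifiers m1 with m1 <= m, including m itself.
below = {
    "ismis2002": {"ismis2002"},
    "conferences": {"conferences", "ismis2002"},
    "cole": {"cole"},
}

# R3: user judgments as (email, classifier) pairs.
R3 = {("mail2", "ismis2002")}

def derive_r2(R1, query):
    """(g, m) in R2 iff some (h, w) in query(m) has (g, (h, w)) in R1."""
    return {(g, m) for (g, hw) in R1 for m, pairs in query.items() if hw in pairs}

def derive_r4(R2, R3, below):
    """(g, m) in R4 iff there is m1 <= m with (g, m1) in R2 or R3."""
    base = R2 | R3
    return {(g, m) for m, lower in below.items() for (g, m1) in base if m1 in lower}

R2 = derive_r2(R1, query)
R4 = derive_r4(R2, R3, below)
```

Here mail2 reaches conferences only through the user judgment in R3, which shows how automatic matches and manual assignments are merged before the upward closure is applied.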
Compatibility Condition. Inserting the ordering $b \leq a$ into $\leq$ requires the insertion of the set $(\uparrow a \setminus \uparrow b) \times \{g \in G \mid (g, b) \in R_4\}$ into $R_4$. Such an insertion into an inverted file index is $O(nm)$, where $n$ is the average number of entries in the inverted index in the shaded region, and $m$ is the number of elements in the shaded region. The real complexity of this operation is best determined via experimentation with large document sets and a large user-defined hierarchy [2]. Similarly, the removal of the ordering $b \leq a$ from $\leq$ will require a re-computation of the inverted file entries for elements in $\uparrow a$.

New Email and User Judgments. When new emails, $G_b$, are presented to HierMail, the relation $R_1$ is updated by inserting new pairs, $R_{1b}$, into the relation. The modification of $R_1$ into $R_1 \cup R_{1b}$ causes an insertion of pairs $R_{2b}$ into $R_2$ according to $\mathrm{query}(m)$ and then subsequently an insertion of new pairs $R_{4b}$ into $R_4$:

$$R_{1b} \subseteq G_b \times (H \times W),$$
$$R_{2b} = \{(g, m) \mid \exists (h, w) \in \mathrm{query}(m) \text{ and } (g, (h, w)) \in R_{1b}\},$$
$$R_{4b} = \{(g, m) \mid \exists m_1 \leq m \text{ with } (g, m_1) \in R_{2b}\}.$$

For new emails presented to the system for automated indexing, the modification to the inverted file index inserts new entries, which is a very efficient operation: each pair inserted is $O(1)$.

When the user makes a judgment that an indexed email should be associated with an attribute $m$, an update must be made to $R_3$, which will in turn cause all attributes in the order filter of $m$ to be updated in $R_4$. The expense of such an update depends on the implementation of the inverted file index and could be as bad as $O(n)$, where $n$ is the average number of documents per attribute. In the case that a client retracts a judgment, saying that an email is no longer associated with an attribute $m$, a possible update is required to each attribute $n$ in the order filter of $m$.

Conceptual Space Navigation. When the user requests that the concept lattice derived from the scale with name $s \in S$ be drawn, the program computes $S_{\alpha(s)}$ via the algorithm reported in Cole and Eklund [2]. In the case that the user requests a diagram combining two scales with names $s$ and $t$, the scale $S_{B \cup C}$ with $B = \alpha(s)$ and $C = \alpha(t)$ is calculated by the program and its concept lattice $\mathfrak{B}(S_{B \cup C})$ is drawn as a projection into the lattice product $\mathfrak{B}(S_B) \times \mathfrak{B}(S_C)$.
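To make the combination of two scales concrete, the following Python sketch restricts a small one-valued context to the attribute set B ∪ C and enumerates its formal concepts. It uses a naive enumeration over all attribute subsets, which is exponential and is not the algorithm of Cole and Eklund [2]; the sample context and the names `realized_scale` and `concepts` are hypothetical.

```python
from itertools import combinations

# Hypothetical one-valued context: email -> set of classifiers.
context = {
    "mail1": {"conferences", "ismis", "cole"},
    "mail2": {"conferences", "cole"},
    "mail3": {"teaching"},
}

def realized_scale(context, B):
    """Restrict the incidence relation to the attribute subset B."""
    return {g: attrs & B for g, attrs in context.items()}

def concepts(scale, B):
    """Return all (extent, intent) pairs of the restricted context."""
    found = set()
    for k in range(len(B) + 1):
        for attrs in combinations(sorted(B), k):
            # Extent: emails carrying every attribute in the chosen subset.
            extent = frozenset(g for g, a in scale.items() if set(attrs) <= a)
            # Intent: attributes of B shared by every email in the extent.
            intent = frozenset(m for m in B if all(m in scale[g] for g in extent))
            found.add((extent, intent))
    return found

B = {"conferences", "ismis"}   # alpha(s) for one scale name
C = {"cole"}                   # alpha(t) for another scale name
lattice = concepts(realized_scale(context, B | C), B | C)
```

Each resulting pair corresponds to one node of the line diagram: the extent is the list of emails retrieved at that node and the intent is the set of classifiers they share.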
5 Conclusion
This paper provides a description of the algebraic structures used to create a lattice-based view of email. The structure, its implementation and operation, aid the process of knowledge discovery in large collections of email. By using such a conceptual multi-hierarchy, the content and shape of the lattice view are varied. An efficient implementation of the index promotes client iteration. The work shows that the principles of formal concept analysis can be supported by an inverted file index and that a useful and scalable document browsing system results. The program we have realized, HierMail, is available from http://hiermail.com.
References

1. J. Brutlag and J. Meek: Challenges of the Email Domain for Text Classification. In: Proceedings of the 17th International Conference on Machine Learning ICML00, pp. 103-110, 2000.
2. R. Cole, P. Eklund: Scalability in Formal Concept Analysis: A Case Study using Medical Texts. Computational Intelligence, Vol. 15, No. 1, pp. 11-27, 1999.
3. R. Cole, P. Eklund: Analyzing Email using Formal Concept Analysis. Proc. of the European Conf. on Knowledge and Data Discovery, pp. 309-315, LNAI 1704, Springer-Verlag, Prague, 1999.
4. B. Ganter, R. Wille: Formal Concept Analysis: Mathematical Foundations. Springer-Verlag, Heidelberg, 1999.
5. K. Jones: View Mail Users Manual. http://www.wonderworks.com/vm, 1999.
6. G. Stumme: Hierarchies of Conceptual Scales. Proc. Workshop on Knowledge Acquisition, Modeling and Management. Banff, 16.–22. October 1999.
7. F. Vogt, R. Wille: TOSCANA: A Graphical Tool for Analyzing and Exploring Data. In: R. Tamassia, I.G. Tollis (Eds.): Graph Drawing ’94, LNCS 894, pp. 226-233, 1995.