Information Processing and Management xxx (2011) xxx–xxx
Contents lists available at ScienceDirect
Information Processing and Management journal homepage: www.elsevier.com/locate/infoproman
Hierarchical web resources retrieval by exploiting Fuzzy Formal Concept Analysis Carmen De Maio, Giuseppe Fenza, Vincenzo Loia, Sabrina Senatore ⇑ Dipartimento di Matematica e Informatica, Università degli Studi di Salerno, via ponte don Melillo, 84084 Fisciano (SA), Italy
Article info
Article history: Received 28 February 2010. Received in revised form 24 April 2011. Accepted 26 April 2011. Available online xxxx.
Keywords: Fuzzy Formal Concept Analysis; Fuzzy ontology; Lattice; Facet-based navigation; Information Retrieval
Abstract
In recent years, knowledge structuring has assumed an important role in several real-world applications such as decision support, cooperative problem solving, e-commerce, the Semantic Web and even planning systems. Ontologies play an important role in supporting automated processes to access information and are at the core of new strategies for the development of knowledge-based systems. Yet developing an ontology is a time-consuming task that often requires accurate domain expertise to tackle structural and logical difficulties in the definition of concepts and of conceivable relationships. This work presents an ontology-based retrieval approach that supports data organization and visualization and provides a friendly navigation model. It exploits the fuzzy extension of Formal Concept Analysis theory to elicit conceptualizations from datasets and generate a hierarchy-based representation of the extracted knowledge. An intuitive graphical interface provides a multi-facet view of the built ontology. Through transparent query-based retrieval, final users navigate across concepts, relations and population.
2011 Elsevier Ltd. All rights reserved.
⇑ Corresponding author.
0306-4573/$ - see front matter 2011 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2011.04.003
Please cite this article in press as: De Maio, C., et al. Hierarchical web resources retrieval by exploiting Fuzzy Formal Concept Analysis. Information Processing and Management (2011), doi:10.1016/j.ipm.2011.04.003
1. Introduction
In the last decade, the success of the Semantic Web has proved to depend on the spread of ontologies as a means for communication and data sharing. Ontologies provide conceptualizations of the underlying knowledge and, at the same time, enable a shared understanding of common domains. They also yield a feasible solution for application interoperability and data access and for modeling complex domains, through common standard languages such as OWL and RDF. On the other hand, the use of ontologies for describing web resources or, in general, information semantically promises effectiveness, efficiency and accuracy in Information Retrieval activities, especially since most knowledge is enclosed in collections of unstructured textual documents.
Traditional keyword-based search techniques exploit simple term statistics or indexes to identify the documents that are most relevant to a query, and often show low accuracy because of term polysemy, ambiguity of meaning, etc. Ontologies provide an adequate instrument for structuring knowledge in an information domain, supplying controlled vocabularies for the classification of content. Techniques such as thesaurus expansion, query formulation based on the semantics of data models (Loia & Senatore, 2007) or blind relevance feedback to bring inference into the retrieval process (Finin, Mayfield, Joshi, Cost, & Fink, 2005) reveal the need to annotate web resources with the semantic markup provided by ontologies. Recent approaches overcome classic keyword-based search through, e.g., query expansion based on class hierarchies and rules on relationships, or multifaceted searching and browsing (Gibbins, Harris, Dix, & Schraefel, 2004). Indeed, the expressive power of an ontology-based knowledge representation is well above that of keyword-based search: structured information enables automatic tools (as well as software agents) to collect and understand marked-up knowledge and draw further inferences related to the domain of interest.
At the same time, the growth of digital knowledge leads to an exponential proliferation of data and web resources as well as of ontology instances, classes and relations of different complexity. There is a clear need to integrate this heterogeneity for the purposes of knowledge sharing, updating and communication. The construction of a unique, global ontology-based knowledge repository which holds the whole information corpus is an interesting challenge, but it is still far from becoming a reality. The availability of domain ontologies strictly relies on the methodology exploited to support the process of ontology building. In fact, ontology building presents structural and logical difficulties in the definition of concepts, properties and relations; developing an ontology from scratch is a time-consuming, labor-intensive task which requires an accurate understanding of the reference domain.
Starting from these considerations, this work presents a formal method to automatically extract an ontology from the content analysis of a collection of web resources. It extends a previous approach (De Maio, Fenza, Loia, & Senatore, 2009) and exploits the fuzzy extension of Formal Concept Analysis (briefly, FCA) as a mathematical model to automatically build an ontology. The navigation of the ontology is accomplished through a friendly interface, which allows the browsing of the ontology concepts as well as the exploration of the relationships and the population. In this approach, we emphasize the role of the ontology as a formal and reusable model for knowledge representation, which supports advanced queries and visualization procedures.
In fact, the interface provides a "multi-facet" visualization of the knowledge and a classification of the analyzed resources that keeps the raw data and the semantic information completely separate.
The rest of the paper is organized as follows. Section 2 is devoted to related work on knowledge extraction and modeling through ontologies. Section 3 briefly introduces the workflow of the whole framework through its main activities, which are detailed in Sections 4–6. In particular, Section 5 describes the formal model, based on a fuzzy extension of FCA, and then the ontology building through its translation into an effective OWL-based ontological structure. Section 7 presents some possible applications of the built ontology and, in particular, Section 8 describes the user interface for the facet-based ontology exploration. Section 9 is devoted to presenting experimental results. Finally, conclusions close the paper.
2. Related works
In the era of the Semantic Web, ontologies have acquired a strategic value in many fields such as knowledge management, Information Retrieval, natural language processing and information integration. Ontologies play an important role in supporting automated processes to access information and are at the core of new strategies for the development of knowledge-based systems. Conventional knowledge acquisition approaches are mainly based on the acquisition of knowledge from documents and are human-driven, labor-intensive and time-consuming tasks. To solve these problems, many automatic and semi-automatic methods have been designed which, using text mining and machine learning techniques, support ontology generation. In particular, many tools have been designed for knowledge structuring in specific domains, for instance through the automatic discovery of taxonomic and non-taxonomic relationships from domain data.
OntoGen (Fortuna, Grobelnik, & Mladenic, 2007), OntoLT (Buitelaar, Olejnik, & Sintek, 2004), OntoLearn (De Maio et al., 2009) and OntoEdit (Sure et al., 2002) represent well-known tools for knowledge structuring and ontology engineering activities; ASIUM (Faure, Nédellec, & Rouveirol, 1998) supports the semantic acquisition of knowledge from textual resources and the automatic ontology building, whereas in Chau (2007), an ontology-based knowledge system is modeled to assist engineering activities for sharing and maintaining knowledge. Recent approaches exploit ontologies for knowledge modeling; Text2Onto (Cimiano & Volker, 2005) is a framework for ontology learning from textual resources, able to automatically create ontologies from a corpus of documents within a certain domain. The work evidences some similarities with our approach: it generates ontologies through automatic data driven tasks, meanwhile human intervention is just aimed at selecting the input knowledge resources and tuning the parameters. Many methodologies and techniques have been reported in the literature regarding ontology building (Grüninger & Fox, 1995; Lopez & Perez, 2002; Uschold & Grüninger, 1996); in particular the methodologies presented by Grüninger and Fox (1995) deal with ontologies using first-order-logic languages, while the method in Uschold and Grüninger (1996) starts from the identification of the ontology goals and the needs for the domain knowledge acquisition. Some approaches instead, exploit techniques based on Formal Concept Analysis (FCA) theory for knowledge structuring and ontology building: in Cho and Richards (2007) a method has been adapted to maintain a concept map for the reuse of knowledge. In Carpineto and Romano (2004), instead, FCA is used in Information Retrieval applications to improve the search engine capabilities in the query/answering activities. 
Although ontologies contribute widely to domain knowledge modeling, their expressiveness is sometimes insufficient to provide a coherent representation of the imprecise information of the real world. In the real world, customers or end users prefer to get messages in linguistic expressions rather than in numeric values. Many research works aim at introducing tolerance to imprecision, uncertainty and vagueness into the process of ontology generation. A possible solution to this issue is to use fuzzy techniques to relax rigid constraints, managing uncertainty in the relationships and in the conceptual information.
Quite a few approaches integrate Formal Concept Analysis and fuzzy logic theory for ontology-based approaches (De Maio et al., 2009; Maedche & Staab, 2001). Pollandt (1996) introduces the L-fuzzy context as an attempt to combine fuzzy logic with FCA, where linguistic variables are used to represent ambiguities in the context. Because human support is required to define the linguistic variables, the approach does not seem practicable for large document sets. Moreover, the fuzzy concept lattice generated from the L-fuzzy context tends to contain a larger number of concepts than the traditional one. In Tho, Hui, Fong, and Cao (2006), a framework named FOGA achieves the automatic generation of a fuzzy ontology from data. Like our approach, FOGA exploits FCA theory to build a hierarchical structure of ontology classes, but it uses fuzzy conceptual clustering to group formal concepts into conceptual clusters. Our approach instead, besides building an OWL ontology through the hierarchical structure derived from FCA theory, yields a classification system for the collected resources. The exploration tree then allows users to navigate across the web resources and retrieve specific data derived from the conceptualization of the extracted knowledge. An interesting proposal is presented in Dubois and Prade (2009), where the representation of uncertainty is defined by a possibility-theory-inspired view of FCA. New operators have been introduced to describe the different possible relations between a set of objects and a set of properties, leading to new Galois connections that give rise to the notion of concept (Dubois & Prade, 2009). In the literature, many other approaches deal with similar problems. To give some examples, in Dattolo and Luccio (2008) zz-structures have been exploited to represent contextual interconnections among different information.
Similar to our approach, it provides automatic knowledge structuring and visualization which helps the user to create semantic interconnections among web resources, yielding personalized concept maps. In Suksomboon, Herin, and Sala (2006), in turn, an S-node representation of web graphs has been exploited to model learning objects. The main advantage of this approach is the space efficiency of the structure representation as well as the reduced query execution time. Our approach instead aims at eliciting intrinsic relations among the given data and resources, without any human mediation. The fuzzy extension of FCA considers a degree of interrelation (i.e., an approximate subsumption) between linked concepts. The fuzzy lattice reveals knowledge-based, hierarchical dependences among concepts, emphasizing the taxonomic nature of the structure. Besides, it naturally provides a classification of the collected data, and the generated ontology enables a promising retrieval model.
3. Overview of the framework
The building of an ontology and its facet-based visualization require some preliminary activities which are part of the execution process of the system. Fig. 1 provides a general idea of the framework, sketching the whole workflow through its main phases. Initially, a collection of web resources is given as input to the system. A text-based parsing of these documents produces a meaningful set of keywords associated with each resource. In general, preprocessing activities such as normalization,
Fig. 1. The whole process of the framework.
POS tagging (Cutting, Kupiec, Pederson, & Sibun, 1992; Greene & Rubin, 1971; Klein & Simmons, 1963; Schmid, 1994), stop-word removal, stemming and lemmatization (Hooper & Paice, 2005; Porter, 1980, 2001) produce a matrix composed of vector-based representations of each analyzed resource. Then, through Fuzzy Formal Concept Analysis theory, the fuzzy formal context (i.e., the computed matrix) and the fuzzy formal lattice are generated. The process continues with the translation of the lattice into an OWL-based ontology; the output of this phase is an OWL encoding of the ontology. The ontology generated by the lattice mapping becomes the tool for the facet-based navigation: through a user interface, a navigable version of the ontology allows the exploration of concepts and their instances. The following sections describe the phases depicted in Fig. 1.
4. Frequent term measures
The parsing activity aims to extract content descriptors for the collection of web resources. This activity is required to map the real domain (resources from the Web) into a fuzzy formal context for modeling through Formal Concept Analysis (additional details are given in the following sections). The mathematical modeling of the fuzzy formal context needs the keywords which are representative of the textual content of the collection of web resources. The goal is to exploit a vector-based representation of each web resource and then build the matrix which represents the fuzzy formal context. This matrix shows the relationships (in terms of degree values) between the extracted keywords and the resources in the application domain. In the preprocessing phase, all the text is normalized, POS-tagged and stemmed. We have exploited the well-known lexical dictionary WordNet.¹ Specifically, the system normalizes the text by expanding abbreviations and converting numbers, acronyms, dates, etc. into their spoken form, in order to make the text uniform.
Then, the system uses the Tree Tagger algorithm (Schmid, 1994) to assign a part of speech (i.e., noun, verb, adverb, etc.) to the normalized input sentences. Non-informative words, such as articles, prepositions, conjunctions and stop words, and all words not included in WordNet are filtered out. Finally, the system uses the WordNet Lemmatizer² code (the WordNet database is used to look up lemmas) to reduce inflectional forms and derivationally related forms of the extracted words to a common base form. Then, we have exploited the well-known TF-IDF technique (Ramos, 2003) to extract the set of the most recurrent keywords appearing in the textual content of the resources. The nouns and adjectives extracted in the text analysis represent the set of keywords which contribute to building the matrix for the mathematical modeling of the formal context. Specifically, let $W = \{w_1, w_2, \dots, w_m\}$ be the set of keywords extracted by means of the text processing activities. Then, in order to capture the correct meaning of the keywords, a further analysis is carried out. Let us consider the WordNet synonyms according to the sense associated with the words. The evaluation of the synonyms helps us to extract further terms related to the given word $w_i$. In particular, the weight associated with each selected word in the content of the web resources is computed by constructing the synonym dictionary $D_h = \mathrm{Dict}_{syn}(w_h)$, with $w_h \in W$. Let us compute the term frequency $tf^{h}_{i,j}$ as the measure of the importance of a term $w_i$ belonging to a synonym dictionary $D_h = \{w_1, w_2, \dots, w_h, \dots, w_l\}$ in the content of resource $j$:
\[ tf^{h}_{i,j} = \frac{f^{h}_{i,j}}{\sum_{k=1}^{l} f^{h}_{k,j}} \tag{1} \]
where $f^{h}_{i,j}$ is the number of occurrences of the term $w_i \in D_h$ in resource $j$. Then, for a fixed set $D_h$, let us select the maximum value of $tf^{h}_{i,j}$ among all the terms $w_i \in D_h$ ($i = 1, \dots, l$) in resource $j$:
\[ tf^{h}_{\hat{i},j} = \max_{i=1,\dots,l} \left( tf^{h}_{i,j} \right) \tag{2} \]
where $\hat{i}$ is the index of the term $w_i$ corresponding to the maximum value of $tf^{h}_{i,j}$. This value is exploited to compute the final value that characterizes the frequency associated with each synonym dictionary $D_h$:
\[ \widehat{tf}_{h,j} = tf^{h}_{\hat{i},j} \left( 1 + \sum_{\substack{i=1 \\ i \neq \hat{i}}}^{l} tf^{h}_{i,j} \right) \tag{3} \]
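In particular, this quantity represents the relevance of the dictionary $D_h$ with respect to the given web resource $j$. Then, given all $m$ dictionaries, let us select the maximum value of $\widehat{tf}_{h,j}$ in resource $j$, where $\hat{i}$ now denotes the index of the dictionary that attains the maximum:
\[ \widehat{tf}_{\hat{i},j} = \max_{h=1,\dots,m} \left( \widehat{tf}_{h,j} \right) \tag{4} \]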
Finally, according to the augmented normalized term frequency (Salton, 1987), the final weight associated with the dictionary $D_h$ of the word $w_h$ for resource $j$ is:
\[ wtf_{h,j} = 0.5 + \left( 0.5 \, \widehat{tf}_{h,j} / \widehat{tf}_{\hat{i},j} \right) \tag{5} \]
¹ http://wordnet.princeton.edu/.
² http://nltk.googlecode.com/svn/trunk/doc/api/nltk.stem.wordnet.WordNetLemmatizer-class.html.
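The weighting scheme of Eqs. (1)–(5) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `dictionary_weights`, the input format (a list of lemmatized tokens and a mapping from each keyword to its synonym set) and the handling of a resource in which no dictionary term occurs are assumptions made here.

```python
from collections import Counter


def dictionary_weights(tokens, dictionaries):
    """Return the weight wtf_{h,j} of Eq. (5) for a single resource j.

    tokens       -- lemmatized words of resource j (assumed input format)
    dictionaries -- maps each keyword w_h to its synonym set D_h (assumed given)
    """
    counts = Counter(tokens)
    tf_hat = {}
    for head, synonyms in dictionaries.items():
        freqs = [counts[w] for w in synonyms]      # f^h_{i,j}
        total = sum(freqs)
        if total == 0:                             # no term of D_h occurs in j
            tf_hat[head] = 0.0
            continue
        tf = [f / total for f in freqs]            # Eq. (1)
        tf_max = max(tf)                           # Eq. (2)
        rest = sum(tf) - tf_max                    # sum over i != i_hat
        tf_hat[head] = tf_max * (1 + rest)         # Eq. (3)
    best = max(tf_hat.values(), default=0.0)       # Eq. (4)
    if best == 0:                                  # assumption: empty resource gets zero weights
        return {head: 0.0 for head in tf_hat}
    return {head: 0.5 + 0.5 * v / best for head, v in tf_hat.items()}  # Eq. (5)


weights = dictionary_weights(
    ["study", "study", "research", "series"],
    {"study": {"study", "research"}, "series": {"series", "sequence"}},
)
```

In this toy call the "series" dictionary reaches the maximum relevance and therefore gets weight 1.0, while the "study" dictionary gets 0.5 + 0.5 · (8/9). Note that, following Eq. (5) literally, a dictionary that never occurs in a non-empty resource still receives the floor value 0.5.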
At this point, for each web resource $j$, the associated vector is composed of all the weights $wtf_{h,j}$ relative to each synonym dictionary $D_h$, for all the words in the set $W$.
5. Fuzzy Formal Concept Analysis: theoretical background
The formal model behind the construction of the ontology is Formal Concept Analysis (briefly, FCA) (Ganter & Wille, 1999). FCA is a theoretical framework which supplies a basis for conceptual data analysis and knowledge processing. It enables the representation of the relationships between objects and attributes in a given domain. FCA provides an alternative graphical representation of tabular data that is somewhat natural to navigate and use (Ganter & Wille, 1999). Formal concepts can be interpreted from the concept lattice (Birkhoff, 1993) using FCA. The following subsections give the basics of FCA as far as they are needed for this paper. A more extensive overview is given in Ganter and Wille (1999).
5.1. Formal Concept Analysis
FCA considers the notion of a formal context specifying the attributes of each object; thus a formal context may be viewed as a binary relation between the object set and the attribute set with the values 0 and 1. FCA starts with a formal context defined as a triple $K = (G, M, I)$, where $G$ is a set of objects, $M$ is a set of attributes, and $I$ is a binary relation such that $I \subseteq G \times M$; $(g, m) \in I$ is read "object g has attribute m". The use of "object" and "attribute" is indicative, because in many applications it may be useful to choose object-like items as formal objects and then choose their features as formal attributes. For instance, in the Information Retrieval domain, documents could be considered object-like and terms attribute-like. The context is often represented as a "cross table" (see Fig. 2a): the rows represent the formal objects and the columns the formal attributes; the relations between them are represented by the crosses. In our approach, keywords and resources play the roles of attributes and objects, respectively, in the matrix describing the formal context, as shown in Fig. 1.
Fig. 2. FCA vs. FFCA. (a) Formal context and formal concept lattice. (b) Fuzzy formal context and fuzzy formal concept lattice.
Definition 1. Formal Concept. Given a context $(G, M, I)$, the derivation operators are defined as $A' = \{m \in M \mid \forall g \in A: (g, m) \in I\}$ for $A \subseteq G$ and $B' = \{g \in G \mid \forall m \in B: (g, m) \in I\}$ for $B \subseteq M$. A formal concept is a pair $(A, B)$, where $A \subseteq G$, $B \subseteq M$, such that $A' = B$ and $B' = A$. $A$ is called the extent and $B$ is called the intent of the concept $(A, B)$.
The subconcept–superconcept relation is formalized as follows.
Definition 2. Given two concepts $c_1 = (A_1, B_1)$ and $c_2 = (A_2, B_2)$, $c_1$ is a subconcept of $c_2$ (equivalently, $c_2$ is a superconcept of $c_1$) when $(A_1, B_1) \le (A_2, B_2) \Leftrightarrow A_1 \subseteq A_2 \; (\Leftrightarrow B_2 \subseteq B_1)$.
The set of all concepts of a particular context, ordered in this way, forms a complete lattice. Fig. 2a shows a so-called line diagram of the concept lattice corresponding to the formal context at the top of the figure. A concept lattice is composed of the set of concepts of a formal context and the subconcept–superconcept relation between the concepts (Ganter & Wille, 1999). The nodes represent formal concepts. Formal objects are noted below and formal attributes above the nodes which they label (see Fig. 2). Note that the names c1, c2, etc. are identifiers of the concepts. We have introduced them to simplify reference, but they are not part of the lattice representation. For instance, the node labeled with the formal attribute "series" and the formal object "URL_3" shall be referred to as c2.
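Definitions 1 and 2 can be illustrated with a small Python sketch. The context below is a hypothetical toy example chosen for illustration (it is not the context of Fig. 2a), and the function names are assumptions made here. The enumeration closes every subset of attributes, which is exponential in the number of attributes and therefore only suitable for tiny contexts.

```python
from itertools import combinations

# Hypothetical toy context (NOT the one in Fig. 2a): object -> attributes it has.
CONTEXT = {
    "URL_1": {"study", "science"},
    "URL_2": {"study", "series", "science"},
    "URL_3": {"study", "series"},
}
ATTRIBUTES = set().union(*CONTEXT.values())


def intent(objects):
    """A': attributes shared by every object in A (all attributes if A is empty)."""
    result = set(ATTRIBUTES)
    for g in objects:
        result &= CONTEXT[g]
    return result


def extent(attributes):
    """B': objects that have every attribute in B."""
    return {g for g, attrs in CONTEXT.items() if attributes <= attrs}


def formal_concepts():
    """Enumerate all pairs (A, B) with A' = B and B' = A (naive closure of attribute subsets)."""
    concepts = set()
    for r in range(len(ATTRIBUTES) + 1):
        for subset in combinations(sorted(ATTRIBUTES), r):
            a = extent(set(subset))
            b = intent(a)
            concepts.add((frozenset(a), frozenset(b)))
    return concepts


def is_subconcept(c1, c2):
    """(A1, B1) <= (A2, B2) iff A1 is a subset of A2 (Definition 2)."""
    return c1[0] <= c2[0]
```

For this toy context the enumeration yields four concepts; for example, ({URL_2, URL_3}, {study, series}) is a subconcept of ({URL_1, URL_2, URL_3}, {study}), since its extent is smaller and its intent is larger.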
To retrieve the extension of a formal concept, one needs to trace all the paths which lead down from the node and collect the formal objects. In the example of Fig. 2a, the formal objects of c2 are URL_3, URL_2 and URL_5. To retrieve the intension of a formal concept, one needs to trace all the paths which lead up and collect all the formal attributes. In the example, there is a node above c2 with "study" attached as a formal attribute. Thus c2 represents the formal concept with the extension "URL_3, URL_2, URL_5" and the intension "study, series". Finally, c2 is a subconcept of c1. The subconcept–superconcept relation is transitive: a concept is a subconcept of all the concepts which can be reached by traveling upwards from it. If a formal concept has a formal attribute, then that attribute is inherited by all its subconcepts. This corresponds to the notion of "inheritance". In fact, the lattice can also support multiple inheritance.
5.2. Fuzzy extension of FCA
Recently, FCA has been exploited in many applications in which uncertain and vague information occurs in the representation of the domain. Pioneering studies on introducing fuzziness into FCA include a generalization of Wille's model of FCA to fuzzy formal contexts (Burusco & Fuentes-Gonzalez, 1998) and, mainly, the extension of the original Formal Concept Analysis that assigns truth degrees to the propositions "object x has property y" in fuzzy formal contexts by employing a residuated lattice (Belohlavek, 1999, 2001, 2002). Degrees are taken from an appropriate scale L of truth degrees; usually, L is the real interval [0, 1]. Thus the entries of a table describing objects and attributes become degrees from L instead of values from {0, 1}, as is the case in the basic setting of FCA. This extension is known as Fuzzy Formal Concept Analysis (FFCA). Some definitions which incorporate fuzzy logic into Formal Concept Analysis are given below (Tho et al., 2006).
Definition 3.
A Fuzzy Formal Context is a triple $K = (G, M, I = \varphi(G \times M))$, where $G$ is a set of objects, $M$ is a set of attributes, and $I$ is a fuzzy set on the domain $G \times M$. Each pair $(g, m) \in I$ has a membership value $\mu_I(g, m)$ in $[0, 1]$. The set $I = \varphi(G \times M) = \{((g, m), \mu_I(g, m)) \mid \forall g \in G, m \in M, \mu_I: G \times M \to [0, 1]\}$ is a fuzzy relation on $G \times M$. In the literature, a similar definition considers a multi-valued context (Ganter & Wille, 1999), which emphasizes the weight (rather than a membership value) associated with each pair (object, attribute).
Definition 4. Fuzzy Representation of Object. Each object $g$ in a fuzzy formal context $K$ can be represented by a fuzzy set $\Phi(g) = \{(m_1, \mu_I(m_1)), (m_2, \mu_I(m_2)), \dots, (m_m, \mu_I(m_m))\}$, where $\{m_1, m_2, \dots, m_m\}$ is the set of attributes in $K$ and $\mu_I(m_i)$ is the membership associated with attribute $m_i$. $\Phi(g)$ is called the fuzzy representation of $g$.
Fig. 2b shows a fuzzy version of the formal context by means of a cross table. According to fuzzy theory, the definition of Fuzzy Formal Concept is given as follows.
Definition 5. Fuzzy Formal Concept. Given a fuzzy formal context $K = (G, M, I)$ and a confidence threshold $T$, we define $A^{*} = \{m \in M \mid \forall g \in A: \mu_I(g, m) \ge T\}$ for $A \subseteq G$ and $B^{*} = \{g \in G \mid \forall m \in B: \mu_I(g, m) \ge T\}$ for $B \subseteq M$. A fuzzy formal concept (or fuzzy concept) $A_f$ of a fuzzy formal context $K$ with a confidence threshold $T$ is a pair $(\varphi(A), B)$, where $A \subseteq G$, $\varphi(A) = \{(g, \mu_{\varphi(A)}(g)) \mid \forall g \in A\}$, $B \subseteq M$, $A^{*} = B$ and $B^{*} = A$. Each object $g$ has a membership $\mu_{\varphi(A)}(g)$ defined as
\[ \mu_{\varphi(A)}(g) = \min_{m \in B} \mu_I(g, m) \]
where $\mu_I(g, m)$ is the membership value between object $g$ and attribute $m$, which is defined in $I$. Note that if $B = \{\}$ then $\mu_{\varphi(A)}(g) = 1$ for every $g$. $A$ and $B$ are the extent and intent of the formal concept $(\varphi(A), B)$, respectively. In Fig. 2b the fuzzy formal context has a confidence threshold $T = 0.6$ (as said, all the relations between objects and attributes with membership values less than 0.6 are not shown).
Definition 6. Let $(\varphi(A_1), B_1)$ and $(\varphi(A_2), B_2)$ be two fuzzy concepts of a fuzzy formal context $(G, M, I)$. $(\varphi(A_1), B_1)$ is the subconcept of $(\varphi(A_2), B_2)$, denoted as $(\varphi(A_1), B_1) \le (\varphi(A_2), B_2)$, if and only if $\varphi(A_1) \subseteq \varphi(A_2) \; (\Leftrightarrow B_2 \subseteq B_1)$. Equivalently, $(\varphi(A_2), B_2)$ is the superconcept of $(\varphi(A_1), B_1)$.
For instance, observe in Fig. 2b that the concept c5 is a subconcept of the concepts c2 and c3. Equivalently, the concepts c2 and c3 are superconcepts of the concept c5.
Definition 7. A Fuzzy Concept Lattice of a fuzzy formal context $K$ with a confidence threshold $T$ is the set $F(K)$ of all fuzzy concepts of $K$ with the partial order $\le$ and the confidence threshold $T$.
FCA theory proposes a hierarchical model, where the concepts (objects and their attributes) are arranged in a subsumption relation (also known as the "hyponym–hypernym" or "is–a" relationship). The fuzzy formal lattice shows the membership associated with the objects and the class–subclass relationship. More formally:
Definition 8. The Fuzzy Formal Concept Similarity between a concept $K_1 = (\varphi(A_1), B_1)$ and its subconcept $K_2 = (\varphi(A_2), B_2)$ is defined as
\[ E(K_1, K_2) = \frac{|\varphi(A_1) \cap \varphi(A_2)|}{|\varphi(A_1) \cup \varphi(A_2)|} \]
where $\cap$ and $\cup$ refer to the intersection and union operators³ on fuzzy sets, respectively. As an example, the Fuzzy Formal Concept Similarity computed between the concept c2 = {(URL_2, URL_3, URL_5), (series, study)} and the concept c5 = {(URL_2, URL_5), (science, study, series)}, shown in Fig. 2b, is given as follows:
\[ E(c_2, c_5) = \frac{\min\{0.71, 0.94\} + \min\{0.78, 0.78\}}{\max\{0.71, 0.94\} + \max\{0.78, 0.78\} + \max\{0.76\}} = 0.60 \]
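The worked example for Definition 8 can be reproduced with a short Python sketch. The function name and the dictionary-based encoding of fuzzy extents are assumptions made here for illustration; the membership values are the ones used in the example, the cardinality of a fuzzy set is taken as the sum of its memberships (as the example does), and min and max play the roles of t-norm and t-conorm.

```python
def fuzzy_similarity(a1, a2):
    """E(K1, K2) = |A1 ∩ A2| / |A1 ∪ A2| with min as t-norm and max as t-conorm.

    a1, a2 -- fuzzy extents as dicts object -> membership;
              an object missing from a dict has membership 0.
    """
    objects = set(a1) | set(a2)
    inter = sum(min(a1.get(g, 0.0), a2.get(g, 0.0)) for g in objects)
    union = sum(max(a1.get(g, 0.0), a2.get(g, 0.0)) for g in objects)
    return inter / union if union else 0.0


# Fuzzy extents of c2 and c5 with the memberships used in the worked example.
c2 = {"URL_2": 0.94, "URL_3": 0.76, "URL_5": 0.78}
c5 = {"URL_2": 0.71, "URL_5": 0.78}
print(round(fuzzy_similarity(c2, c5), 2))  # 0.6
```

The numerator is 0.71 + 0.78 = 1.49 and the denominator is 0.94 + 0.78 + 0.76 = 2.48, which gives approximately 0.60.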
Fig. 2 presents a comparative view of FCA (Fig. 2a) and fuzzy FCA (Fig. 2b), showing the differences in modeling between the two methods on the same sample of objects (i.e., web resources). In classical FCA, the matrix representing the formal context contains binary values indicating whether or not the relation between objects and attributes exists. In the corresponding "fuzzy" table, a cell contains a value in the range [0, 1] that says whether there is a relation and, in addition, gives a valuation of the strength of such a relation. The concept lattice represents a mathematical modeling of knowledge which is more informative than traditional tree-like conceptual structures (Zhou, Hui, & Chang, 2005). Compared to the formal lattice, the fuzzy lattice introduces further information about knowledge structuring and relationships, such as the fuzziness enclosed in each object/resource and the similarities between fuzzy formal concepts.
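To make the thresholded derivation operators and the membership of Definition 5 concrete, the following Python sketch derives a fuzzy concept from a set of attributes. The fuzzy context is a hypothetical toy example (it is not the table of Fig. 2b), and the function names are assumptions made here; only the threshold T = 0.6 is taken from the text.

```python
# Hypothetical toy fuzzy context (NOT the table of Fig. 2b): object -> {attribute: membership}.
FUZZY_CONTEXT = {
    "URL_1": {"study": 0.61, "science": 0.89, "series": 0.20},
    "URL_2": {"study": 0.94, "science": 0.71, "series": 1.00},
    "URL_3": {"study": 1.00, "science": 0.43, "series": 0.76},
}
T = 0.6  # confidence threshold, as in the text


def star_of_objects(objects):
    """A*: attributes whose membership is at least T for every object in A."""
    attributes = {m for row in FUZZY_CONTEXT.values() for m in row}
    return {m for m in attributes
            if all(FUZZY_CONTEXT[g].get(m, 0.0) >= T for g in objects)}


def star_of_attributes(attributes):
    """B*: objects whose membership is at least T for every attribute in B."""
    return {g for g, row in FUZZY_CONTEXT.items()
            if all(row.get(m, 0.0) >= T for m in attributes)}


def fuzzy_concept(attributes):
    """Close a set of attributes into a fuzzy concept (phi(A), B)."""
    a = star_of_attributes(set(attributes))
    b = star_of_objects(a)
    # Membership of g is the minimum over the intent; 1.0 when the intent is empty.
    memberships = {g: min((FUZZY_CONTEXT[g][m] for m in b), default=1.0) for g in a}
    return memberships, b
```

For instance, `fuzzy_concept({"series"})` selects URL_2 and URL_3 (the only objects with membership at least 0.6 for "series"), closes the intent to {study, series}, and assigns each object the minimum of its memberships over that intent.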
6. Ontology building Our approach aims at building an OWL ontology including information from the fuzzy lattice. A sequence of mapping steps transforms the fuzzy formal context using the fuzzy lattice, into an ontology. The building of an ontology requires the mapping of both intent and extent of the concepts of the lattice. In particular, the mapping of a concept of the fuzzy lattice to ontology class exploits the ‘‘owl:Class’’ construct; instances of these class, i.e. individuals, represent the extent of concepts of the fuzzy lattice; the mapping of the intent (i.e. attributes) to ontology properties exploits ‘‘owl:DatatypeProperty’’ and ‘‘owl:ObjectProperty’’. Moreover, each attribute in a concept of the lattice is mapped as an ontological class that represents the dictionary of its synonyms. The mapping process enables us to extract the knowledge embedded in the resources collection, producing the ontology conceptualization and population. In order to focus on the steps required to map the lattice to an ontology, Fig. 3 shows on
3 The fuzzy intersection and union are calculated using t-norm and t-conorm, respectively. The most commonly adopted t-norm is the minimum, while the most common t-conorm is the maximum. That is, given two fuzzy sets A and B with membership functions lA(x) and lB(x) lA\B(x) = min(lA(x), lB(x)) and lAUB(x) = max(lA(x), lB(x)).
Please cite this article in press as: De Maio, C., et al. Hierarchical web resources retrieval by exploiting Fuzzy Formal Concept Analysis. Information Processing and Management (2011), doi:10.1016/j.ipm.2011.04.003
Fig. 3. Building of ontology from fuzzy FCA.
the left, a portion of the fuzzy formal context and the associated concepts in the lattice, and on the right, the corresponding generated OWL ontology. The mapping concerns the formal concepts and the super/sub-concept relations evinced in the lattice, as formally described in the following definition.
Definition 9. Let $K = (\varphi(A), B)$ be a fuzzy concept of the fuzzy formal context $(G, M, I)$. The mapping function $\mathcal{M}$ applied to $K$ produces a class $C$, where the objects of the extent $A \subseteq G$ become individuals (instances) of $C$, while the attributes of the intent $B \subseteq M$ describe the properties of $C$. Furthermore, let $C_1$ and $C_2$ be the classes obtained by mapping two fuzzy formal concepts $K_1 = (\varphi(A_1), B_1)$ and $K_2 = (\varphi(A_2), B_2)$ of the fuzzy formal context $(G, M, I)$, respectively, and let $K_1$ be a subconcept of $K_2$ (i.e., $K_1 \le K_2$). The mapping function $\mathcal{M}$ applied to the subsumption relation $K_1 \le K_2$ produces the corresponding subclass relation, i.e., $C_1$ is a subclass of $C_2$. Equivalently, since $K_2$ is a superconcept of $K_1$, $C_2$ is a superclass of $C_1$.

The building of the ontology is based on the object-attribute relationships of the formal context. In our approach, the objects of the fuzzy formal context are web resources and the attributes are keywords extracted by parsing the content of these resources. According to Definition 9, the translation of a fuzzy concept requires the characterization of the class and its properties, and then the specification of the related population. Specifically, four mapping steps, sketched in Fig. 3, have been defined as follows:

– Dictionary mapping: Each attribute (i.e., keyword/word/term) of the fuzzy formal context is mapped onto a Synonym Dictionary that represents the set of its synonyms. In other words, for each attribute (e.g., study) we define an ontological class that represents the dictionary of its synonyms (e.g., StudyDictionary). More precisely, we define an abstract class called Dictionary (that is, the root concept of the synonym dictionaries). Each specialization of it is a concrete class composed of the set of synonyms associated to a specific term. For instance, Fig. 3 shows the OWL code of the class StudyDictionary, associated to the attribute study and composed of two other terms, "learning" and "work", besides the term "study" itself. At present, there are no relations among the built dictionaries.

– Property/relation mapping: Two kinds of OWL properties are defined by mapping the attributes in the concepts of the fuzzy lattice. hasAttribute is an owl:ObjectProperty that enables us to associate each attribute (i.e., word) of the intent of a concept of the fuzzy lattice to a corresponding object (i.e., resource). The domain of this object property is the top concept (i.e., Concept_0) and the range is the class Dictionary, as shown in Fig. 3. hasMembership is an owl:DatatypeProperty that represents the membership value associated to an object, according to Definition 5. The domain of hasMembership is the top concept (i.e., Concept_0), while the range is a float datatype. This property allows us to associate a membership value to an actual individual of a class.

– Class mapping: As said, each formal concept of the lattice becomes an ontology class, described by the construct owl:Class. Our approach automatically produces a class name, identified by a specific progressive number. In Fig. 3, the OWL code describes Concept_2 (highlighted in the lattice too), which has some value from the classes SeriesDictionary and StudyDictionary for the property hasAttribute. This mapping allows a DL reasoner to infer which attributes are associated to a concept and vice versa. In general, this association enables a DL reasoner to classify new objects (resources) according to their content (words). Once the classes are defined, an automatic labeling is applied to the concepts, as detailed in the following sections.

– Individuals generation: For each web resource in the extent of a concept, an instance of the corresponding ontology class is generated. The individuals of the class Concept_2 of Fig. 3 are URL_2, URL_3 and URL_5; furthermore, they have two associated attributes (as highlighted in the lattice in Fig. 3): the inherited attribute study and the concept's own attribute series. In particular, the OWL code describing the individual URL_2 is shown in Fig. 3.
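The four mapping steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: it uses plain dictionaries instead of an OWL library, and the function name `map_lattice_to_ontology`, the membership values and the reduced two-concept lattice are assumptions made only for illustration. The names Concept_0, Concept_2, StudyDictionary, SeriesDictionary, hasAttribute, hasMembership and URL_2/URL_3/URL_5 are taken from the text.

```python
def map_lattice_to_ontology(concepts, parents, synonyms):
    """Map fuzzy formal concepts to a dictionary-based ontology sketch.

    concepts: {concept_id: (extent, intent)}, where extent is
              {object_name: membership} and intent is a set of attributes.
    parents:  {concept_id: [ids of direct superconcepts]}
    synonyms: {attribute: [synonym, ...]}
    """
    ontology = {"dictionaries": {}, "classes": {}, "individuals": {}}

    # Step 1, dictionary mapping: one dictionary class per attribute.
    for attribute, terms in synonyms.items():
        name = attribute.capitalize() + "Dictionary"
        ontology["dictionaries"][name] = {
            "subClassOf": "Dictionary",
            "terms": sorted(set(terms) | {attribute}),
        }

    # Steps 2 and 3, property and class mapping: each concept becomes a
    # class restricted on hasAttribute by the dictionaries of its intent.
    for cid, (extent, intent) in concepts.items():
        ontology["classes"][f"Concept_{cid}"] = {
            "subClassOf": [f"Concept_{p}" for p in parents.get(cid, [])],
            "hasAttribute": sorted(a.capitalize() + "Dictionary" for a in intent),
        }

    # Step 4, individuals generation: each object of an extent becomes an
    # instance that carries its membership value (hasMembership).
    for cid, (extent, intent) in concepts.items():
        for obj, membership in extent.items():
            record = ontology["individuals"].setdefault(
                obj, {"types": [], "hasMembership": {}})
            record["types"].append(f"Concept_{cid}")
            record["hasMembership"][f"Concept_{cid}"] = membership
    return ontology


# Hypothetical, reduced lattice: the extent of Concept_0 is omitted for
# brevity and the membership values are invented for the example.
CONCEPTS = {
    0: ({}, set()),
    2: ({"URL_2": 0.8, "URL_3": 0.7, "URL_5": 0.9}, {"study", "series"}),
}
PARENTS = {2: [0]}
SYNONYMS = {"study": ["learning", "work"], "series": []}

ONTOLOGY = map_lattice_to_ontology(CONCEPTS, PARENTS, SYNONYMS)
print(ONTOLOGY["classes"]["Concept_2"])
```

A real system would serialize these records as owl:Class, owl:ObjectProperty and owl:DatatypeProperty axioms; the dictionary layout above only mirrors the information that each step contributes.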
7. Applications of the ontology

Our approach achieves a fair arrangement of web resources through the extraction of the ontology and supports hierarchical exploration and query processing. In particular, the extracted ontology, as a formal and reusable knowledge representation, lends itself to several applications and potential uses. The most important ones are detailed as follows.

7.1. Classification

The ontology provides a mechanism for automatically classifying web resources according to its conceptualization. This is accomplished by a training workflow that carries out an ontology-based classification of the input collection of resources, as described in the mapping steps for the ontology building (see Section 6). The ontology built according to the mapping steps enables a DL reasoner to classify a new resource with respect to the concepts available in the extracted ontology. As an example, let us take into account the definition of "owl:Class" given in the mapping steps for the ontology building (see Section 6). Once an incoming resource is processed and translated into a vector-based representation (see Section 4), a generic DL reasoner is able to infer that the resource is an instance of a specific "owl:Class", according to its characterization, i.e., the content (words) of the resource. Specifically, let us consider the following "owl:Class" definition extracted by our framework:
Now, let us suppose that the incoming resource is processed and its content is reduced to a vector-based representation, according to the frequent-term measures described in Section 4. Thus, the resource can be represented by the following illustrative OWL code:
. . .
This code states that the resource has two attributes, Series and Study. These attributes are part of two available dictionaries, SeriesDictionary and StudyDictionary respectively, as shown in Section 6. Then, by applying DL reasoning and according to the owl:Class definition, the new resource is classified as an instance of Concept_2. In this way, thanks to the characterization of the ontology classes, new resources can be classified even though they are not in the input training data set.
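The classification step can be approximated by a short Python sketch. It is only an emulation under stated assumptions: a real DL reasoner works on the OWL axioms, whereas here each class is reduced to the list of dictionaries from which it must take some value of hasAttribute. The class Concept_1, the content of SeriesDictionary and the function name `classify` are hypothetical; Concept_2 and the terms of StudyDictionary follow the text.

```python
# Synonym dictionaries; only the terms of StudyDictionary come from the text.
DICTIONARIES = {
    "StudyDictionary": {"study", "learning", "work"},
    "SeriesDictionary": {"series"},
}

# Each class requires at least one term from every listed dictionary.
# Concept_1 is a hypothetical superclass added for the example.
CLASS_DEFINITIONS = {
    "Concept_1": ["StudyDictionary"],
    "Concept_2": ["StudyDictionary", "SeriesDictionary"],
}


def classify(resource_terms):
    """Return the most specific classes whose restrictions the terms satisfy."""
    terms = {t.lower() for t in resource_terms}
    matched = [
        name for name, required in CLASS_DEFINITIONS.items()
        if all(terms & DICTIONARIES[d] for d in required)
    ]
    # Drop a class when another matched class has strictly more restrictions.
    return [
        c for c in matched
        if not any(set(CLASS_DEFINITIONS[c]) < set(CLASS_DEFINITIONS[o])
                   for o in matched)
    ]


print(classify(["Series", "Study"]))
```

With the attributes Series and Study, the sketch returns Concept_2, in line with the example discussed above; a resource containing only "learning" would fall into the hypothetical Concept_1.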
7.2. Query system

The ontology also serves as a tool for querying the ontology-based structure in a hierarchical or free-form manner. In this version of the framework, the ontology is exploited to support the classification of resources and, mainly, the hierarchical facet-based browsing (as detailed in the following sections). However, given the way the ontology is built (see Section 6), it is also possible to support DL queries, either through plain DL-based syntax or through a convenient interface. In particular, users could create any complex query involving search terms, including union, intersection, or more complex restrictions, to retrieve matching documents. The system could define and then formalize an owl:Class from the query specified by the user and then, by applying DL reasoning, retrieve the right set of documents/resources. As an example, let us suppose that a user wants to retrieve resources on mathematics, or just related studies, that deal with differential equations. The query could contain the following terms with the relative constraints:

– Study or Math
– Equation
– Differential
Then, the system should be able to define an owl:Class as follows:
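As a rough illustration of how such a query behaves, the following Python sketch treats it as a conjunction of disjunctive constraints and evaluates it with set operations. It is not the OWL class itself and does not involve a DL reasoner; the resource names, their terms and the function names are hypothetical, while the query terms are those listed above.

```python
# Query from the text: (Study OR Math) AND Equation AND Differential.
QUERY = [{"study", "math"}, {"equation"}, {"differential"}]

# Hypothetical resources with hypothetical extracted terms.
RESOURCES = {
    "URL_A": {"math", "differential", "equation"},
    "URL_B": {"study", "series", "fourier"},
    "URL_C": {"study", "equation", "differential", "modeling"},
}


def matches(terms, query):
    """True if the terms satisfy every constraint (an AND of ORs)."""
    return all(terms & alternatives for alternatives in query)


def retrieve(resources, query):
    """Return the names of the resources that satisfy the query."""
    return sorted(name for name, terms in resources.items()
                  if matches(terms, query))


RESULT = retrieve(RESOURCES, QUERY)
print(RESULT)
```

In an OWL setting, each inner set would roughly correspond to a union of dictionary classes and the outer list to an intersection of restrictions on hasAttribute.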
In this way, the "virtual" class is exploited by the DL reasoner to infer the right correlations among resources and to retrieve the documents that satisfy the given class definition.

7.3. A facet-based navigation

A "multi-facet" web interface, tailored ad hoc to the extracted ontology, provides a conceptual representation and visualization of the knowledge extracted from a web collection and supplies an instrument to browse the ontology concepts and navigate the relationships. The remainder of this paper is mainly devoted to developing this aspect of the ontology application.

8. Hierarchical facet-based ontology navigation

Facet-based navigation (Yee, Swearingen, Li, & Hearst, 2003) is an efficient technique for accessing a collection of information (or resource content) through a navigation based on filtering. The goal of this data exploration technique is to restrict the search space to a set of relevant resources. Unlike a simple traditional hierarchical category scheme, users have the ability to drill down to concepts based on multiple dimensions. These dimensions are called facets and represent important characteristics of the information units. Each facet has multiple restriction values, and the user selects a restriction value to constrain the relevant items in the information space. A known project, called SWED (see footnote 4), exploits facet navigation; it uses a number of thesauri, ontologies and lists to categorize, publish and arrange data in its directory web site. Facet-based navigation can improve Semantic Web approaches which make Web-based data more accessible via the use of Community Portals, i.e., Web portals that provide customized 'views' of information (Paul, 2004). Our approach provides a prototype directory web site which, thanks to the lattice structure, collects data in categories. It is navigable by users through customized hierarchical facet-based views of the collected resources.

Fig. 4 provides a sketched overview of the whole mapping process, from the lattice representation to the ontology tree. The lattice is mapped into an ontology schema and the relative population through the OWL language, as described above. Then, the OWL ontology, represented by a navigation tree, becomes the browsable structure in our interface. Similar to the Protégé
4 www.swed.org.uk/.
Fig. 4. Ontology tree generated by OWL-based lattice representation.
editor (Gennari et al., 2002), our tree structure supports multiple inheritance by duplicating the concepts whenever they appear in a path between concepts. This generates different views of the subsumption paths in the tree. In Fig. 4, the mapping process is highlighted for the class of the OWL ontology named Concept_2. Note that Concept_2 comes from the mapping of the formal concept c2 in the lattice; thus it is translated into a concept of the ontology tree. The proper individuals of Concept_2 appear when the relative node in the tree is expanded by clicking on it (see Fig. 4). Let us note that only URL_3 is a proper object of Concept_2, whereas URL_2 and URL_5 appear in Concept_5, which is a specialization of Concept_2. Moreover, each node/concept of the ontology tree shows, in square brackets, the number of its proper individuals, i.e., the instances associated to that concept. Let us remark that we use the term "ontology tree" to refer to the ontology generated by the mapping process even though it is a hierarchy.

In order to facilitate the exploration and the navigation of an ontology as well as of its individuals (i.e., resources), a user-friendly interface has been designed. Fig. 5 shows the interface that contains the ontology tree, associated to the concepts coming from the described mapping. It provides a conceptual representation and visualization of the knowledge extracted from the web collection and supplies an instrument to browse the ontology concepts and navigate the relationships. The hierarchy of concept classes and the relative individuals can indeed be explored by the hierarchical facet-based navigation. The nature of facet-based structuring provides a flexible navigation of the dataset, and the exploratory interface allows users to find information without a priori knowledge of its schema.
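The duplication of concepts that have more than one parent can be sketched in a few lines of Python. This is an illustrative reading of the description above, not the authors' code: the function name `unfold`, the class Concept_1 and the edge from Concept_1 to Concept_5 are assumptions, while the proper individuals of Concept_2 and Concept_5 follow the text.

```python
def unfold(children, proper, node):
    """Unfold a concept hierarchy into a tree rooted at `node`.

    A concept reachable through several parents is duplicated once per
    path. Each label shows, in square brackets, the number of proper
    individuals of the concept.
    """
    return {
        "label": f"{node} [{len(proper.get(node, []))}]",
        "children": [unfold(children, proper, c)
                     for c in children.get(node, [])],
    }


# Hypothetical hierarchy: Concept_5 has two parents, so it appears twice.
CHILDREN = {
    "Concept_0": ["Concept_1", "Concept_2"],
    "Concept_1": ["Concept_5"],
    "Concept_2": ["Concept_5"],
}
PROPER = {"Concept_2": ["URL_3"], "Concept_5": ["URL_2", "URL_5"]}

TREE = unfold(CHILDREN, PROPER, "Concept_0")
print(TREE["children"][1]["label"])
```

The recursion terminates because a concept lattice has no cycles; the price of the duplication is that the tree can hold more nodes than the lattice has concepts.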
Fig. 5. Ontology representation through facet-based web interface.
The ontology tree generated by FCA theory provides a natural and automatic classification of data, keeping the raw data and the semantic information completely separate. This 'self publishing' makes the navigation and the interpretation of data more flexible. The following sections present a simple method to label the generated ontology; then the user interface is described through the exploration technique.

8.1. Ontology labeling

As said, the ontology generation process does not produce meaningful labels for the concepts, but only an identifier for each concept (a unique progressive number) to which the resources/attributes are related. Usually, human intervention is required to assign proper names to the generated ontology concepts, according to the meaning of the content of the resources. We propose an automatic labeling of the ontology. Each class is labeled by considering the meaning of the associated attributes (i.e., keywords/words/terms). Specifically, each class takes the name of the most representative attribute belonging to it. In order to compute the most representative attribute, we exploit the matrix of the fuzzy formal context to get the membership value at the intersection of a resource and an attribute. We can exploit the fuzzy formal context because there is a direct correspondence between the attributes and individuals of the ontology and the attributes and objects of the formal context, respectively. Thus, we use the name of the attribute with the highest membership value to label the relative class. If the class has more than one attribute with the same membership, the label is the concatenation of the names of all these attributes. If the class has no attributes of its own, a possible solution is to ask a domain expert or ontology designer for help. Alternatively, an automatic solution is to generate a name as the concatenation of the labels of the proper ancestors.
The following pseudo-code gives an idea of how to select the most representative attribute name as a candidate label of a class. Let an ontology class C be a pair (A, I), where A is a set of attributes and I is a set of individuals.

for each class C = (A, I) do
    for each attribute a ∈ A do
        μ_a ← min_{i ∈ I} μ(i, a)
    end for
    max ← maximum among all μ_a
    label(C) ← attribute name whose μ_a = max
end for
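A minimal runnable Python version of this labeling rule is sketched below. The function name `label_class`, the sample membership values and the choice of returning None for a class without its own attributes are assumptions made for illustration; the tie-breaking by concatenation follows the description in the text.

```python
def label_class(attributes, individuals, membership):
    """Pick the label of a class C = (A, I).

    membership maps (individual, attribute) to a value in [0, 1].
    For each attribute, the minimum membership over the individuals is
    taken; the attribute with the highest minimum gives the label, and
    attributes with equal values are concatenated.
    """
    if not attributes or not individuals:
        return None  # no own attributes: defer to an expert or to ancestors
    mu = {
        a: min(membership.get((i, a), 0.0) for i in individuals)
        for a in attributes
    }
    best = max(mu.values())
    return ", ".join(sorted(a for a in attributes if mu[a] == best))


# Hypothetical membership values for two resources and three attributes.
MEMBERSHIP = {
    ("URL_2", "fourier"): 0.8, ("URL_3", "fourier"): 0.7,
    ("URL_2", "series"): 0.9, ("URL_3", "series"): 0.7,
    ("URL_2", "study"): 0.6, ("URL_3", "study"): 0.5,
}

LABEL = label_class({"fourier", "series", "study"},
                    {"URL_2", "URL_3"}, MEMBERSHIP)
print(LABEL)
```

Here "fourier" and "series" share the highest minimum membership, so the label is their concatenation, similar in form to the composite labels shown in Fig. 5.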
In other words, according to Definition 4, we compute $\mu_a$ (dually to the definition of $\mu_g$) and then, among all the attributes, we assign as a label the name of the attribute whose membership is the maximum among all $\mu_a$.

8.2. Facet-based retrieval

Because the ontology is generated from the lattice, the built ontology tree reveals intrinsic relationships among attributes or objects through the exploration of the concepts and the relative population. This allows a flexible navigation among its concepts and provides many different criteria to explore and retrieve its data. The facet-based navigation is further emphasized by a visual query paradigm (Gibbins et al., 2004): the user gradually builds a query by exploring the concepts/subconcepts in the ontology tree and, by selecting a concept, adds constraints; each action executed on the interface is a step of the query construction. Moreover, the user sees the intermediate results of the query while browsing the tree.

Fig. 5 shows the graphical interface of our system. It is composed of two main parts: on the left, the labeled ontology tree for the exploration and, on the right, a web page which provides details about the concept selected in the ontology tree. Once a concept is selected on the tree, additional details about it appear on the right-hand side of the interface. In fact, as shown in Fig. 5, by clicking on the concept named "mathematics" on the tree, this concept appears on the right-hand side of the interface with its descendants, i.e., all the subconcepts and individuals. The navigation can proceed by considering the subconcepts of "mathematics" (frame in the foreground, at the bottom of Fig. 5: "fourier, series", "math, modeling" and "differential, equation") and the relative individuals (in the lower frame). Note that the application domain is the Web; thus the individuals, i.e., instances, are the URLs of the web pages that represent the concept. Thus, in Fig. 5 the individuals of the concept labeled "fourier, series", i.e., some actual web pages, are highlighted. The frame in the foreground, at the bottom of Fig. 5, represents a facet of "mathematics" in the ontology tree. This facet can be interpreted as an implicit query on the system which returns all the pages related to mathematics; in particular, the system not only returns all the pages, but also classifies them according to some elicited categories, in this case three: "fourier, series", "math, modeling" and "differential, equation". In summary, the ontology tree provides a general view of the global structure, i.e., objects and relative individuals; details about an object appear on the right-hand side. In particular, by clicking on a
Fig. 6. Inheritance richness tendency by varying the threshold value.
concept, the first sublevel of the structure is shown in the frame in the foreground, at the bottom of Fig. 5; the lower frame instead provides the list of all the involved objects that are descendants of the selected object.

The synergy between semantic technologies and fuzzy data analysis allows an automatic characterization of the subject domain and its categorization with respect to ontology concepts. Moreover, the fuzzy FCA based construction of our ontology tree enables the generation of a family of ontologies that may have a deep and specialized structure or just a high-level, abstract conceptualization. In fact, by varying the threshold T related to the fuzzy context (see Definition 3), some relations between objects and attributes are discarded, and thus a different ontological structure is generated (additional details are given in the following section). This scalability is one of the main properties required in facet-based navigation, especially when the dataset has a considerable size. Finally, the ontology tree generated by our approach can be exploited as a controlled vocabulary. Generally, cataloging or indexing systems (e.g., libraries, product catalogues, etc.) use controlled vocabularies, i.e., they index all items using a constrained set of terms (e.g., in the Dewey Decimal system for the classification of books, a book about zoology is classified under the term "Zoological Sciences"). The use of controlled vocabularies simplifies the maintenance of a consistent classification and makes it easier to find relevant items.

9. System validation and experimental results

The approach has been tested on a collection of web resources extracted from the OpenLearn Project (see footnote 5), a public, online accessible repository of learning materials. OpenLearn provides course materials from the Open University, manually arranged in categories according to the main educational subjects and courses.
Our goal is to automatically elicit a categorization of the collected materials, and then to evaluate the retrieval performance by comparison with the OpenLearn categories. The sample is composed of 488 web resources. The analysis of the computed ontology can be carried out at two different abstraction levels: on the one hand, by analyzing the categorization that results from the ontology structure generated by the lattice; on the other hand, by comparing the meaning of the concepts (through the analysis of the enclosed resources and terms) and their naming to the OpenLearn categories.

9.1. Analysis of the ontology structure

This framework achieves a fair arrangement of web resources through the extraction of the ontology tree and supports hierarchical exploration and query processing. Specifically, the main contributions are two:

– A mechanism for automatically generating a classification of web resources according to the conceptualization from the ontology.
– A tool for querying the tree in a hierarchical or free-form manner.

In the literature, many methods extract ontologies but do not annotate the content during the process of ontology extraction. Our approach, instead, produces the annotation of the resources. By exploiting the Formal Concept Analysis theory, objects are grouped into the concepts of the fuzzy lattice. Each concept is described (i.e., annotated) by the attributes associated to the resources in that concept.
5 http://openlearn.open.ac.uk/.
Fig. 7. Fuzzy contexts with relative generated lattices (by varying a threshold T) and the computed inheritance richness (ir) values.
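The effect of the threshold T on a fuzzy formal context, as depicted in Fig. 7, can be sketched in Python as follows. The function name `apply_threshold` and the sample context are assumptions for illustration; the rule itself (entries below T are not considered) is the one stated in the text.

```python
def apply_threshold(context, t):
    """Keep only the object-attribute entries whose membership is >= t."""
    return {
        obj: {attr: m for attr, m in row.items() if m >= t}
        for obj, row in context.items()
    }


# Hypothetical fuzzy formal context: rows are resources, columns are terms.
CONTEXT = {
    "URL_1": {"study": 0.9, "series": 0.3},
    "URL_2": {"study": 0.6, "series": 0.85},
}

LOW = apply_threshold(CONTEXT, 0.2)   # every relation survives
HIGH = apply_threshold(CONTEXT, 0.8)  # only the strongest relations survive
print(HIGH)
```

Building the lattice from LOW or from HIGH yields different concept hierarchies, which is why a single dataset gives rise to a family of ontologies.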
Because the ontology built by fuzzy FCA is a taxonomic structure, it is difficult to compare our framework with other tools for ontology extraction, which can often extract any kind of relation among concepts. Comparisons with other approaches are therefore based on the analysis of the generated ontology structure in terms of resource distribution and of the depth and width of the class hierarchy. We exploit the measure of inheritance richness (briefly, ir) (Tartir, Arpinar, Moore, Sheth, & Aleman-Meza, 2005) for the analysis of the ontology structure. The ir is a measure which describes the distribution of information across the different levels of the inheritance tree or, in other words, the fan-out of the parent classes. More formally:

Definition 11. The inheritance richness (ir) of an ontology schema is defined as the average number of subclasses per class:
$$ ir = \frac{\sum_{C_i \in C} \left| H^C(C_s, C_i) \right|}{|C|} $$
where $H^C(C_s, C_i)$ is the set of classes $C_s$ that are subclasses of $C_i$, $H^C$ is a hierarchy of classes (a directed, transitive relation $H^C \subseteq C \times C$, which is also called class taxonomy) and $C$ is the set of all the classes. Fig. 6 shows the ir measure, according to the nature of the ontology structure. Specifically, Fig. 6a shows the inheritance richness of our ontology extraction system on a sample of resources from OpenLearn, for different values of the threshold T. Recall that the fuzzy extension of the lattice is strictly related to a confidence threshold T (see Section 5).
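As a small worked example of Definition 11, the following Python sketch computes ir on a toy hierarchy. It assumes that only direct subclasses are counted for each class; the function name `inheritance_richness` and the four-class hierarchy are hypothetical.

```python
def inheritance_richness(subclasses):
    """Average number of direct subclasses per class.

    subclasses maps every class of the ontology to the list of its
    direct subclasses (an empty list for leaf classes).
    """
    if not subclasses:
        return 0.0
    return sum(len(subs) for subs in subclasses.values()) / len(subclasses)


# Hypothetical hierarchy with four classes and three subclass links.
HIERARCHY = {
    "Concept_0": ["Concept_1", "Concept_2"],
    "Concept_1": ["Concept_3"],
    "Concept_2": [],
    "Concept_3": [],
}

IR = inheritance_richness(HIERARCHY)
print(IR)
```

For this hierarchy the three subclass links are spread over four classes, giving ir = 3/4. Note that a class with several parents is counted once for each parent.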
Fig. 8. Example of concepts in our ontology and the corresponding OpenLearn categories.
As said, fuzzy FCA allows the building of a family of fuzzy formal lattices by varying the threshold T. The lattice structure can change with T, because some relations can be revealed or pruned according to the values of the entries of the fuzzy formal context (all the matrix entries whose value is less than T are not considered, see Definition 5). Compared with other tools, our model allows a flexible construction of an ontology which, by varying the threshold, can provide either a high-level conceptualization of the modeled domain or a deep specialization that makes all the relations explicit. For instance, fixing the threshold T = 0.5 (Fig. 6a) means considering the ontology (lattice) generated by a fuzzy formal context whose matrix values are greater than or equal to 0.5. Fig. 6a emphasizes that the ontology presents quite high values of ir when the threshold is smaller than 0.5. A high value of ir characterizes a "horizontal" ontology (Tartir et al., 2005), i.e., an ontology with a small number of inheritance levels, where each class has a relatively large number of subclasses. This means a general, high-level representation of the content. On the other hand, an ontology with a low ir is considered "vertical", i.e., it is composed of many inheritance levels and the classes have a small number of subclasses. A more accurate specialization is provided in some paths from the root concept to the leaf concepts.

Comparisons with other tools in terms of the inheritance richness measure are shown in Fig. 6b. OntoGen (footnote 6), OntoLT (footnote 7) and TermMine (footnote 8) are the candidate tools. Let us point out that, for these systems, the evaluation considers a fixed ontology, because no input thresholds exist. In general, it is not possible to know a priori which value of ir is the best one. Fig. 6b shows that, with ir ≤ 0.5 and the threshold T ≥ 0.5, our approach generates a vertical ontology comparable to the value computed for TermMine, while similar values appear for OntoLT and our approach when T = 0.8. Fig. 7 shows some sketched views of the lattice for two values of the threshold T, 0.2 and 0.8 respectively (the relative fuzzy formal contexts, computed by discarding the values below the given thresholds, are shown too). Fig. 7 highlights two different ontology structures: a horizontal one (with T = 0.2), with a more general conceptualization, and a vertical, more specialized one (with T = 0.8), where the remaining resources appear distributed among the concepts.

9.2. Analysis of the lattice consistency and retrieval performance

A further analysis of the ontology tree focuses on the classification of resources and aims at estimating its consistency in terms of the concepts generated by the mapping. We compare our ontology with the OpenLearn categories. In this study, we have considered the ontology generated from the lattice by setting the threshold T = 0.5. Fig. 8 shows the ontology as a lattice-like structure rather than as the navigation tree: this visualization facilitates the comparison with OpenLearn by providing a clear representation of the classification of resources. From the analysis of the content of each ontology class, we have verified that 86% of the whole collection of individuals (i.e., resources) finds a coherent classification in OpenLearn, i.e., the same groups of resources are collected together to form the corresponding OpenLearn category. Of these, 33% of the resources appear in more specific classes. This means that the ontology generated by the lattice allows a natural refinement of the classification of resources. Fig. 8 sketches an example which shows some OpenLearn categories (at a high level) and the relative specialization obtained by our approach. For instance, the OpenLearn category
6 http://ontogen.ijs.si/.
7 http://olp.dfki.de/OntoLT/OntoLT.htm.
8 http://protegewiki.stanford.edu/wiki/TerMine_Plugin.
Mathematics and Statistics appears more specialized in our approach. Furthermore, by comparison with the OpenLearn categories, the remaining 14% of the resources appear badly classified: their analysis reveals some semantic ambiguity in the content, because topics of different categories are gathered in the same category. Additional studies in the Natural Language Processing domain could help us to improve the textual parsing of these resources and reduce this error percentage.

At this point, let us evaluate the retrieval effectiveness on our generated ontology. The first step has been to analyze the relevance of the resources enclosed in each class of the ontology. We adapt the standard measures of precision, recall and F-measure (Van Rijsbergen, 1979) to our application context. Our goal is indeed to get a synthetic measure which represents how semantically relevant each resource (i.e., individual belonging to a class) is with respect to a user query. To do this, we use the F-measure, which is the harmonic mean of the values of precision and recall. In this way, the result of a given query is a ranked list of resources. Let us note that the facet-based navigation described in Section 8 runs queries on the system too, although they are implicit because they are related to the exploration of the classes. In this case, instead, the user explicitly submits a query to the system. More formally, let us define:

– $D = \{d_1, d_2, \ldots, d_n\}$, the whole set of dictionaries defined for the attributes (terms and keywords) $A = \{a_1, a_2, \ldots, a_n\}$ of a given fuzzy formal context;
– $I = \{i_1, i_2, \ldots, i_t\}$, the whole set of individuals (resources) of a given OWL ontology;
– $C_i = (D_i, I_i)$, the ith ontology class (where $I_i \subseteq I$ represents the individuals and $D_i \subseteq D$ the synonym dictionaries associated to the attributes of $C_i$);
– $Q_i = \{q_1, q_2, \ldots, q_m\}$, a query, i.e., a set of attributes (i.e., terms) to search in the lattice.
Thus, given an ontological class $C_i$, we define a local precision $P_i$ and recall $R_i$:

$$ P_i = \frac{|Q_i \cap A_j|}{|A_j|}, \qquad R_i = \frac{|Q_i \cap A_j|}{|Q_i|} \qquad (6) $$
where $A_j$ is the set of all the attributes in the dictionaries associated to the attributes of $C_i$. $P_i$ can be seen as a measure of exactness or fidelity of the ontological class $C_i$. It is computed as the ratio between the number of relevant attributes of the dictionaries in $C_i$ and the number of all the attributes associated to $C_i$. $R_i$ represents a measure of completeness of the ontological class $C_i$ and is given by the number of relevant attributes of the dictionaries in $C_i$ divided by the total number of query terms. The F-measure value $F_i$ relative to the class $C_i$ is computed as follows:
$$ F_i = 2 \cdot \frac{P_i \cdot R_i}{P_i + R_i} \qquad (7) $$
The final result of this computation is the list of the F-measure values $F = \{F_1, \ldots, F_s\}$ on all the ontological classes in the given ontology, for a given query. Now, for each individual of the ontology, $i_k \in I$, we calculate its score as follows:
$$\mathrm{score}(i_k) = \sum_{j=1}^{s} \left( \mu_j(i_k) \cdot F_j \right) \tag{8}$$
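As a rough illustration of Eq. (8), the sketch below computes the score of each individual as the membership-weighted sum of the class F-measures, where the weight of class $C_j$ is the membership degree of the resource in that class (the hasMembership value of Section 6). It then keeps only the individuals whose score exceeds a confidence threshold and ranks them by score. The data layout (nested dictionaries), the function names and the sample values are assumptions made for the example only.

```python
def score_individuals(memberships, f_values):
    """Score each individual as in Eq. (8): sum over classes of membership * F_j.

    memberships -- dict: individual -> {class name -> membership degree in [0, 1]}
    f_values    -- dict: class name -> F-measure F_j for the current query
    (Both structures are hypothetical; classes without an F value count as 0.)
    """
    return {
        individual: sum(mu * f_values.get(cls, 0.0) for cls, mu in per_class.items())
        for individual, per_class in memberships.items()
    }


def rank_above_threshold(scores, alpha):
    """Keep individuals whose score is strictly greater than alpha, best first."""
    selected = [(ind, s) for ind, s in scores.items() if s > alpha]
    return sorted(selected, key=lambda pair: pair[1], reverse=True)


# Toy data: two classes and three resources (all values are made up).
f_values = {"C1": 0.8, "C2": 0.5}
memberships = {
    "r1": {"C1": 1.0, "C2": 0.5},
    "r2": {"C2": 0.4},
    "r3": {"C1": 0.5},
}

scores = score_individuals(memberships, f_values)
ranked = rank_above_threshold(scores, alpha=0.3)
print(ranked)
```

With these toy values, resource r2 falls below the threshold of 0.3 and is dropped, while r1 is ranked ahead of r3.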
Table 1. Summary of training datasets and queries.

Domain                       Number of resources   Number of queries
Art and history              65                    3
Science and nature           40                    3
Mathematics and statistics   37                    2
Education                    42                    2
Law                          40                    2
IT and computing             26                    1
Technology                   9                     1
Study skills                 16                    1
Society                      85                    2
Business and management      65                    2
Modern languages             30                    1
Health and lifestyle         33                    1
Tot.                         488                   21
Please cite this article in press as: De Maio, C., et al. Hierarchical web resources retrieval by exploiting Fuzzy Formal Concept Analysis. Information Processing and Management (2011), doi:10.1016/j.ipm.2011.04.003
C. De Maio et al. / Information Processing and Management xxx (2011) xxx–xxx
Fig. 9. Micro-averaging precision/recall by varying the ir values.
where $\mu_j(i_k)$ represents the membership of resource $i_k$ mapped in the ontology (hasMembership, see Section 6). This synthetic value represents how relevant this resource (i.e., individual) is with respect to the given query.

The next step is the evaluation of the retrieval performance on the whole ontology hierarchy. Given the OWL ontology $O$, we select all the individuals in the ontology whose score is greater than a given confidence threshold $\alpha$: let $I_\alpha(O)$ be this set. The resources belonging to $I_\alpha(O)$ are then ranked with respect to their scores. The retrieval performance of the approach is assessed in terms of precision and recall, through the micro-average of the individual precision-recall curves (Van Rijsbergen, 1979). Let $\hat{Q} = \{Q_1, Q_2, \ldots, Q_n\}$ be a set of queries, and let $D$ be the set of all relevant resources in $I_\alpha(O)$ for the given set of queries $\hat{Q}$. For each query $Q_i$, we consider $k = 20$ steps up to its maximum recall value and measure the number of relevant documents retrieved at each step $k$. Table 1 summarizes all the categories taken into account, together with the number of resources and queries used in the evaluation of the micro-averaged precision and recall. In particular, twenty-one queries are analyzed in this study. According to Van Rijsbergen (1979), the micro-averaging of recall and precision (at the generic step $k$) is defined as follows:
$$\mathrm{Rec}_k = \sum_{Q_i} \frac{|R_{Q_i} \cap B_{k,Q_i}|}{|R|}, \qquad \mathrm{Prec}_k = \sum_{Q_i} \frac{|R_{Q_i} \cap B_{k,Q_i}|}{|B_k|} \tag{9}$$
where $R_{Q_i}$ is the set of relevant resources for a given query $Q_i$, $B_k$ is the set of retrieved resources at step $k$, and $B_{k,Q_i}$ is the set of all relevant resources retrieved at step $k$ for the query $Q_i$.

Fig. 9 shows the trend of the micro-averaged recall/precision curve evaluated on the collection, comparing our approach with a well-known keyword-based search engine, Lucene.⁹ Precisely, three different ontology hierarchies have been built for different values of inheritance richness (ir). Note that the curve for ir = 0.3 corresponds to an ontology with a vertical structure; in particular, for recall values greater than 0.6, its precision is higher than that of the curve for ir = 0.7, because the resources are distributed among the ontology classes through more specialization relations. On the contrary, in the case of ir = 0.7 the resources are grouped together, because each class of the ontology is richer in individuals (i.e., resources) and poorer in specialization.

So, with different ontology hierarchies associated with different ir values, we observe that the retrieval performance of the system changes. The performance of web resource retrieval depends on the vertical or horizontal nature of the generated ontology, which varies with the ir level of the generated knowledge structure. In particular, the curve for ir = 0.5 shows a better precision/recall trend than the other cases (ir = 0.3, ir = 0.7 and Lucene search). Furthermore, the curve for ir = 0.5 shows that the precision is nearly constant: it does not decrease significantly as the recall grows. The dominance of this curve indicates that the best performance is obtained with an intermediate ir value, i.e., with a suitable trade-off between the horizontal and vertical structures.

10. Conclusions

This paper presents an approach that, starting from the fuzzy extension of Formal Concept Analysis, automatically extracts an ontology from the content analysis of a collection of web resources.
The generated ontology is exploited as a retrieval model whose data can be clustered, organized, and visualized in different ways to support user navigation. A friendly interface allows the browsing of the ontology concepts as well as the exploration of the relationships and the population. This framework represents a worthwhile contribution in terms of information processing and management. In particular, it accomplishes knowledge extraction and structuring as well as ontology-driven discovery; it then supports hierarchical exploration and query processing. The implementation outcomes focus on the hierarchical facet-based navigation, even
⁹ http://lucene.apache.org.
though there are many other applications that are not extensively developed in this paper (see Section 7). In brief, the main benefits are summarized as follows:

– concept elicitation to support semi-automatic ontology design;
– automatic conceptualization of the knowledge embedded in web resources;
– automatic generation of a web resource classification according to the conceptualization from the ontology;
– classification of the analyzed resources that keeps the raw data and the semantic information completely separate;
– the fuzzy FCA model, which introduces more flexibility in the relation between object and attribute, yields families of lattices, according to the threshold fixed in the context description and the weight on the subsumption relation;
– support for querying the tree in a hierarchical or free-form manner.

In our previous work (De Maio, Fenza, Loia, Orciuoli, & Senatore, 2010), the exploitation of fuzzy FCA contributed to the design of a system for augmenting personalized e-learning experiences by supporting learners in their learning activities. Furthermore, the overall framework will be applied in the near future to support recommender systems in collaborative knowledge discovery activities, in order to support the capabilities of human resources and enterprise competency management systems.
References

Belohlavek, R. (1999). Fuzzy Galois connections. Mathematical Logic Quarterly, 45(4), 497–504.
Belohlavek, R. (2001). Lattices of fixed points of fuzzy Galois connections. Mathematical Logic Quarterly, 47, 111. 1998.
Belohlavek, R. (2002). Fuzzy relational systems, foundations and principles. New York: Kluwer.
Birkhoff, G. (1993). Lattice theory (3rd ed.). American Mathematical Society. In Incremental clustering for dynamic information processing, proceedings of the ACM transaction on information processing systems (pp. 143–164).
Buitelaar, P., Olejnik, D., & Sintek, M. (2004). A protege plug-in for ontology extraction from text based on linguistic analysis. In Proceedings of the 1st European semantic web symposium (ESWS).
Burusco, A., & Fuentes-Gonzalez, R. (1998). Construction of the L-fuzzy concept lattice. Fuzzy Sets and Systems, 97, 109icss.
Carpineto, C., & Romano, G. (2004). Exploiting the potential of concept lattices for information retrieval with CREDO. Journal of Universal Computer Science, 10(8), 985–1013.
Chau, K. W. (2007). An ontology-based knowledge management system for flow and water quality modeling. Advances in Engineering Software, 38(3), 172–181.
Cho, W. C., & Richards, D. (2007). Ontology construction and concept reuse with formal concept analysis for improved web document retrieval. Web Intelligence and Agent Systems, 5(1), 109–126.
Cimiano, P., & Volker, J. (2005). A framework for ontology learning and data-driven change discovery. In Proc. of the NLDB’2005.
Cutting, D., Kupiec, J., Pederson, J., & Sibun, P. (1992). A practical part-of-speech tagger. In Proceedings of the third conference on applied natural language processing (pp. 133–140), ACL, Trento, Italy.
Dattolo, A., & Luccio, F. (2009). A new concept map model for E-learning environments (pp. 404–417). Web Information Systems and Technologies (WEBIST 2008, Funchal, Madeira, Portugal, May 2008, revised selected papers), LNBIP 18. Springer.
De Maio, C., Fenza, G., Loia, V., & Senatore, S. (2009). A multi facet representation of a fuzzy ontology population, web intelligence and intelligent agent technology. In IEEE/WIC/ACM international conference on (Vol. 3, pp. 401–404). In IEEE/WIC/ACM international joint conference on web intelligence and intelligent agent technology.
De Maio, C., Fenza, G., Loia, V., Orciuoli, F., & Senatore, S. (2010). An enhanced approach to improve enterprise competency management, FUZZ-IEEE 2010 Barcelona, Spain, July 18–23, 2010 in the 2010 IEEE WCCI 2010.
Dubois, D., & Prade, H. (2009). Possibility theory and formal concept analysis in information systems. In Proc 13th international fuzzy systems association world congress IFSA-EUSFLAT 2009, Lisbon, July 20–24, 2009.
Faure, D., Nédellec, C., & Rouveirol, C. (1998). Acquisition of semantic knowledge using machine learning methods: The system ASIUM. Technical report number ICS-TR-88-16.
Finin, T., Mayfield, J., Joshi, A., Cost, R. S., & Fink, C. (2005). Information retrieval and the semantic web, hicss. In Proceedings of the 38th annual Hawaii international conference on system sciences (Vol. 4, p. 113a).
Fortuna, B., Grobelnik, M., & Mladenic, D. (2007). OntoGen: Semi-automatic ontology editor. HCI International, Beijing, July 2007.
Ganter, B., & Wille, R. (1999). Formal concept analysis: Mathematical foundations. Berlin, Heidelberg: Springer.
Gennari, J., Musen, M. A., Fergerson, R. W., Grosso, W. E., Crubézy, M., Eriksson, H., et al. (2002). The evolution of Protégé: An environment for knowledge-based systems development, Tech. rep. SMI-2002-0943. Stanford University.
Gibbins, N., Harris, S., Dix, A., & Schraefel, M. C. (2004). Applying mspace interfaces to the semantic web. Tech. rep. 8639, ECS, Southampton.
Greene, B., & Rubin, G. (1971). Automatic grammatical tagging of English, Technical report. Department of Linguistics, Brown University, Providence, Rhode Island.
Grüninger, M., & Fox, M. S. (1995). Methodology for the design and evaluation of ontologies. In IJCAI’95, Workshop on basic ontological issues in knowledge sharing, Montreal.
Hooper, R., & Paice, C. (2005). The Lancaster stemming algorithm. Accessed 18.04.06.
Klein, S., & Simmons, R. (1963). A computational approach to grammatical coding of English words. JACM 10.
Loia, V., & Senatore, S. (2007). Synergistic approach to efficient web searching. Journal of Intelligent Decision Technologies, 1(2), 83–96 (IOS Press).
Lopez, M. F., & Perez, A. G. (2002). Overview and analysis of methodologies for building ontologies. Knowledge Engineering Review, 17(2).
Maedche, A., & Staab, S. (2001). Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2), 72–79.
Paul, Shabajee (2004). Summary report from SWARA survey of biodiversity/wildlife information in the UK, HP technical report.
Pollandt, S. (1996). Fuzzy-Begriffe: Formale Begriffsanalyze unscharfer Daten. Berlin, Heidelberg: Springer-Verlag.
Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130–137.
Porter, M. F. (2001). Snowball: A language for stemming algorithms.
Ramos, J. (2003). Using TF-IDF to determine word relevance in document queries. First International Conference on Machine Learning.
Salton, G., & Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report. UMI order number: TR87-881. Cornell University.
Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing (pp. 44–49), Manchester, UK.
Suksomboon, P., Herin, H., & Sala, M. (2006). Pedagogical resources representation conceptual model (pp. 45–50). KICSS 2006, Ayutthaya Thailand, August, 1–4.
Sure, Y., Erdmann, M., Angele, J., Staab, S., Studer, R., & Wenke, D. (2002). OntoEdit: Collaborative ontology development for the semantic web. In The semantic web–ISWC 2002, First international semantic web conference, Sardinia, Italy, June 9–12. In Ian Horrocks & James A. Hendler (Eds.) Proceedings, lecture notes in computer science (Vol. 2342). Springer.
Tartir, S., Arpinar, I. B., Moore, M., Sheth, A. P., & Aleman-Meza, B. (2005). OntoQA: Metric-based ontology quality analysis. IEEE ICDM 2005 Workshop on Knowledge Acquisition from Distributed, Autonomous, Semantically Heterogeneous Data and Knowledge Sources. Houston: Texas.
Tho, Q. T., Hui, S. C., Fong, A. C. M., & Cao, T. H. (2006). Automatic fuzzy ontology generation for semantic web. IEEE Transactions on Knowledge and Data Engineering, 18(6), 842–856.
Uschold, M., & Grüninger, M. (1996). Ontologies principles methods and applications. Knowledge Engineering Review, 11(2).
Van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). University of Glasgow: Dept. of Computer Science.
Van Rijsbergen, C. (1979). Information retrieval (2nd ed.). Butterworths.
Yee, K.-P., Swearingen, K., Li, K., & Hearst, M. (2003). Faceted metadata for image search and browsing. In CHI 2003.
Zhou, B., Hui, S. C., & Chang, K. (2005). A formal concept analysis approach for web usage mining. In Z. Shi & Q. He (Eds.), Intelligent information processing II (pp. 437–441). London: Springer-Verlag.