Towards an Automatic Fuzzy Ontology Generation

FUZZ-IEEE 2009, Korea, August 20-24, 2009

Carmen De Maio, Giuseppe Fenza, Vincenzo Loia, Sabrina Senatore

Abstract—In recent years, the success of the Semantic Web has been strongly related to the diffusion of numerous distributed ontologies enabling shared machine-readable contents. Ontologies vary in size, semantics and application domain, but often do not foresee the representation and manipulation of uncertain information. Here we describe an approach for automatic fuzzy ontology elicitation through the analysis of a collection of web resources. The approach exploits a fuzzy extension of Formal Concept Analysis theory and defines a methodological process to generate an OWL-based representation of concepts, properties and individuals. A simple case study in the Web domain validates the applicability and the flexibility of this approach.

Vincenzo Loia, Carmen De Maio, Giuseppe Fenza and Sabrina Senatore are with the Dipartimento di Matematica e Informatica, Università degli Studi di Salerno, via ponte don Melillo, 84084 - Fisciano (SA), Italy (emails: {loia, cdemaio, gfenza, ssenatore}@unisa.it). Corresponding author: V. Loia (phone: +39 089 96 3377; fax: +39 089 96 3303). 978-1-4244-3597-5/09/$25.00 ©2009 IEEE

I. INTRODUCTION

Ontologies are becoming the conceptual backbones of the Semantic Web era. They enable integrated access to web resources as well as intelligent applications for information processing, through conceptualizations that underlie knowledge and, at the same time, enable knowledge sharing. The different nature of application domains facilitates the proliferation of ontologies, which range from general-purpose ontologies (for instance, the Suggested Upper Merged Ontology, or briefly SUMO [14], and Cyc [15]) to domain-specific schemas (e.g. GlycO [16]). They can differ in authoring style and formal modeling, and can contain a restricted number of concepts (such as the Friend Of A Friend, FOAF, ontology [17]) or thousands of terms and relationships (NCI Ontology [18]).

The availability of domain ontologies strictly relies on the methodologies aimed at supporting the crucial process of ontology building. Often, ontology building presents structural and logical difficulties in the characterization of concepts and conceivable relationships. Thus, the manual generation of an ontology is a hard task that often requires expert interpretation. Moreover, two-valued logical methods are not always adequate to represent the imprecise information of the real world. Fuzzy ontology models capture the uncertainty of information and knowledge.

This paper proposes a system for automatic fuzzy ontology generation. The approach is based on the idea of fuzzy theory and Formal Concept Analysis (FCA) [4][3]. In particular, a fuzzy extension of FCA, called Fuzzy Formal Concept Analysis (FFCA), has been exploited. It produces a fuzzy lattice which enables the management of further information, such as the membership values of objects in each fuzzy formal concept and the subsumption relationships among concepts for the construction of a concept hierarchy. The work introduces a formal approach to translate a fuzzy lattice, generated by FFCA theory, into a fuzzy ontology. In particular, the FFCA techniques exhibit a knowledge structure which reveals an intrinsic classification of the processed web resources.

The paper is organized as follows. Section II provides an overview of ontology-oriented application domains; Section III presents the theoretical aspects associated to the fuzzy formal context and the relative fuzzy lattice. The process of translating a fuzzy lattice into a fuzzy ontology is described in Section IV. Then, in order to validate the applicability of this method, a case study is detailed in Section V. Conclusions close the paper.

II. RELATED WORK

An ontology is a conceptualization of an application domain into a human-understandable, machine-readable format. It consists of a collection of concepts and their interrelationships which collectively define an abstract view of that domain [2]. In recent approaches, ontologies play an important role for knowledge modeling and structuring; OntoLearn [19] and OntoEdit [11] are well-known tools for knowledge structuring and ontology engineering activities; for instance, in [21], an ontology-based knowledge system is modeled to assist engineers in sharing and maintaining knowledge. Nevertheless, ontology models are often unable to deal with many real-world cases, where information is vague in meaning. A possible solution to this issue is to integrate fuzzy logic with ontologies to represent data uncertainty. Fuzzy techniques can relax rigid definitions, managing uncertainty in hierarchical representations of concepts.

A fuzzy ontology contains fuzzy concepts and fuzzy membership values to indicate the "confidence degree" between the attributes and relationships of concepts. Different approaches have been proposed in the literature: Lee et al. [2] present a fuzzy ontology exploited in the news summarization context. Quite a few approaches integrate FCA theory and fuzzy logic [1], [13]. Pollandt [7] introduces the L-Fuzzy Context as an attempt to combine fuzzy logic with FCA, where linguistic variables are used to represent ambiguities in the context. Human intervention is required in order to define the linguistic variables, thus this approach does not seem practicable for dealing with large document sets. An approach similar to ours is FOGA [1], which exploits FCA theory for automatic fuzzy ontology generation.


Fig. 1. An example of a portion of Fuzzy Formal Context (a) and corresponding Concept Lattice (b) with threshold T = 0.6. The application domain is a set of scientific documents. Relevant terms extracted from this collection are evaluated through a membership value in the range [0, 1]. This way, documents are not described as term vectors but as fuzzy sets: for each term, a membership value represents its relevance in the document set.

III. FUZZY APPROACH TO FORMAL CONTEXT ANALYSIS

The formal approach behind this component relies on the theory of Formal Concept Analysis (briefly FCA) [4]. Formal Concept Analysis is a technique of data analysis which exploits ordered lattice theory. Through formal contexts, it enables the representation of the relationships between objects and attributes in a given domain. Formal concepts can be interpreted from the concept lattice using FCA. The concept lattice represents a mathematical modeling of knowledge which is more informative than traditional tree-like conceptual structures [5]. FCA may not be sufficient to represent the uncertain and vague information found in many application domains. A possible solution is "Fuzzy Formal Concept Analysis" (FFCA) [6], which incorporates fuzzy logic into FCA, representing uncertain information by a real-valued membership in the range [0, 1]. In the sequel, some definitions which incorporate fuzzy logic into Formal Concept Analysis are given.

Definition 1. A Fuzzy Formal Context is a triple K = (G, M, I = φ(G × M)), where G is a set of objects, M is a set of attributes, and I is a fuzzy set on domain G × M. Each relation (g, m) ∈ I has a membership value μ(g, m) in [0, 1].

Definition 2. Fuzzy Representation of Object. Each object O in a fuzzy formal context K can be represented by a fuzzy set Φ(O) as Φ(O) = {A1(μ1), A2(μ2), …, Am(μm)}, where {A1, A2, …, Am} is the set of attributes in K and μi is the membership of O with attribute Ai in K. Φ(O) is called the fuzzy representation of O.

A fuzzy formal context is often represented as a cross-table, as shown in Fig. 1(a). According to fuzzy theory, the definition of Fuzzy Formal Concept is given as follows.

Definition 3. Fuzzy Formal Concept. Given a fuzzy formal context K = (G, M, I) and a confidence threshold T, we define A* = {m ∈ M | ∀ g ∈ A: μ(g, m) ≥ T} for A ⊆ G and B* = {g ∈ G | ∀ m ∈ B: μ(g, m) ≥ T} for B ⊆ M. A fuzzy formal concept (or fuzzy concept) of a fuzzy formal context K with a confidence threshold T is a pair (Af = φ(A), B), where A ⊆ G, B ⊆ M, A* = B and B* = A. Each object g ∈ φ(A) has a membership μg defined as

$$\mu_g = \min_{m \in B} \mu(g, m)$$

where μ(g, m) is the membership value between object g and attribute m, which is defined in I. Note that if B = {} then μg = 1 for every g. A and B are the extent and intent of the formal concept (φ(A), B), respectively.

Definition 4. Let (φ(A1), B1) and (φ(A2), B2) be two fuzzy concepts of a fuzzy formal context (G, M, I). (φ(A1), B1) is the Subconcept of (φ(A2), B2), denoted as (φ(A1), B1) ≤ (φ(A2), B2), if and only if φ(A1) ⊆ φ(A2) (⇔ B2 ⊆ B1). Equivalently, (φ(A2), B2) is the Superconcept of (φ(A1), B1).

Definition 5. A Fuzzy Concept Lattice of a fuzzy formal context K with a confidence threshold T is the set F(K) of all fuzzy concepts of K, obtained with the confidence threshold T, together with the partial order ≤.
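As an illustration of Definitions 3 and 5, the following minimal Python sketch enumerates the fuzzy concepts of a tiny context by closing every attribute subset. The context values, the document names `d1` and `d2`, and the helper names are hypothetical and chosen only for this example; the exhaustive enumeration is a naive strategy that is exponential in the number of attributes, and it is not claimed to be the lattice construction algorithm used in this work.

```python
from itertools import combinations

def attrs_of(context, objects, attributes, T):
    """A*: attributes whose membership is >= T for every object in `objects`."""
    return frozenset(m for m in attributes
                     if all(context.get((g, m), 0.0) >= T for g in objects))

def objs_of(context, all_objects, attrs, T):
    """B*: objects whose membership is >= T for every attribute in `attrs`."""
    return frozenset(g for g in all_objects
                     if all(context.get((g, m), 0.0) >= T for m in attrs))

def fuzzy_concepts(context, T):
    """Map each closed intent B to its fuzzy extent {g: mu_g}."""
    objects = sorted({g for g, _ in context})
    attributes = sorted({m for _, m in context})
    concepts = {}
    for r in range(len(attributes) + 1):
        for subset in combinations(attributes, r):
            extent = objs_of(context, objects, subset, T)         # B*
            intent = attrs_of(context, extent, attributes, T)     # (B*)*
            # mu_g = min over the intent; 1.0 when the intent is empty
            concepts[intent] = {
                g: min((context[(g, m)] for m in intent), default=1.0)
                for g in extent
            }
    return concepts

# Hypothetical fuzzy formal context: (object, attribute) -> membership
context = {
    ("d1", "equation"): 0.9, ("d1", "mathematics"): 0.7, ("d1", "algebra"): 0.3,
    ("d2", "equation"): 0.8, ("d2", "mathematics"): 0.65, ("d2", "algebra"): 0.75,
}
concepts = fuzzy_concepts(context, 0.6)
for intent, extent in concepts.items():
    print(sorted(intent), extent)
```

With threshold T = 0.6 the pair (d1, algebra) is pruned, so the sketch yields two concepts: a general one whose intent is {equation, mathematics} and a more specific one that also contains algebra.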


Fig. 2. Building of Fuzzy Ontology from Fuzzy FCA.

Fig. 1(b) shows an example of a lattice with T = 0.6 and the relative set of concepts. As said, Fig. 1(a) shows a cross table which describes the relative fuzzy formal context. The rows represent the web resources (objects), whereas the columns are the words (attributes) related to the given collection. In the "fuzzy" extension of Formal Concept Analysis, each cell (i, j) of the table describes an existing "fuzzy" relation between the object i and the attribute j by means of a membership value in the range [0, 1]. The Fuzzy FCA-based analysis generates fuzzy formal concepts from the fuzzy formal context and organizes the generated concepts as a fuzzy concept lattice. The lattice generation is constrained by a confidence threshold T, which is applied to the context for pruning all the relations (i, j) whose membership value is less than T (see Fig. 1). The fuzzy lattice provides information about the knowledge structuring, the relationships and the similarities between fuzzy formal concepts. Formally:

Definition 6. The Fuzzy Formal Concept Similarity between a concept K1 = (φ(A1), B1) and its subconcept K2 = (φ(A2), B2) is defined as

$$E(K_1, K_2) = \frac{|\varphi(A_1) \cap \varphi(A_2)|}{|\varphi(A_1) \cup \varphi(A_2)|}$$

where ∩ and ∪ refer to the intersection and union operators on fuzzy sets, respectively.
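The similarity of Definition 6 can be sketched in a few lines of Python. The sketch assumes the usual min/max operators for fuzzy intersection and union and takes the cardinality of a fuzzy set to be the sum of its membership values; both choices are assumptions of this example, as are the extents `k1` and `k2`.

```python
def fuzzy_similarity(extent1, extent2):
    """|A1 ∩ A2| / |A1 ∪ A2| with min/max operators and sigma-count cardinality."""
    objects = set(extent1) | set(extent2)
    inter = sum(min(extent1.get(g, 0.0), extent2.get(g, 0.0)) for g in objects)
    union = sum(max(extent1.get(g, 0.0), extent2.get(g, 0.0)) for g in objects)
    return inter / union if union else 0.0

# Hypothetical fuzzy extents of a concept and of its subconcept
k1 = {"d1": 0.7, "d2": 0.65}
k2 = {"d2": 0.65}
e = fuzzy_similarity(k1, k2)
print(round(e, 3))
```

A value of this kind can then be attached to the edge that links the two concepts in the lattice.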

Thanks to FCA theory, the concepts (objects and their attributes) are arranged in a hierarchy, emphasizing semantic relationships like subsumption (often known as a "hyponym-hypernym" or "Is-A" relationship).

IV. FUZZY ONTOLOGY GENERATION

This process automatically generates a fuzzy ontology through a semantic OWL-based representation. A sequence of main mapping steps transforms the fuzzy formal context, using the fuzzy lattice, into an ontology. Let us note that the concepts defined in the fuzzy formal context offer both intensional and extensional information [1], whereas a concept in the ontology emphasizes just its intensional aspects. Then, the building of a fuzzy ontology requires a mapping of both intensional and extensional information into the corresponding classes and relations of the ontology. The final results are the fuzzy ontology conceptualization and population, coming from the mapping of intensional and extensional information, respectively. In particular, Fig. 2 shows a portion of a fuzzy formal context and the relative concepts in the lattice (on the left-hand side); the corresponding generated fuzzy ontology is shown on the right-hand side. In the fuzzy formal context, web resources (the objects of the formal context) are described by keywords extracted from their contents. The translation of a lattice (taking into account the object-attribute relationships of the formal context) into a fuzzy ontology can be described by the following mapping steps.

- Class Mapping: this step translates each concept of the lattice into an ontology class. The analysis of the extent and intent of the fuzzy concept guarantees the appropriate characterization of the ontology class (as described in the next steps). Let us note that human interpretation is required to label the ontology class name. In fact, the automatic mapping identifies the ontology concepts with a progressive number. For instance, the concept on the top of the lattice shown in Fig. 2 (described by the object URL_2 and the attributes equation and mathematics) is translated into Concept_1 in the OWL ontology, as shown in the following sketched OWL code:
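A minimal Python sketch of how such class declarations could be emitted is given here, using only the standard library. The function name `class_mapping` and the choice of `rdf:ID` for naming the classes are assumptions of this example and are not taken from the original listing.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("owl", OWL)

def class_mapping(n_concepts):
    """Emit one owl:Class per lattice concept, named by a progressive number."""
    root = ET.Element(f"{{{RDF}}}RDF")
    for k in range(1, n_concepts + 1):
        ET.SubElement(root, f"{{{OWL}}}Class", {f"{{{RDF}}}ID": f"Concept_{k}"})
    return root

# Two concepts, as in the portion of lattice discussed for Fig. 2
xml_text = ET.tostring(class_mapping(2), encoding="unicode")
print(xml_text)
```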




A similar consideration holds for the lower concept, which is called Concept_2 in the OWL code. Let us note that this process is automatic; after analyzing the intent and extent of a class, a human can properly label it.

- Hierarchy Mapping: this step takes into account the connections (ordering relations) among concepts in the lattice. For example, let us consider again the portion of the lattice in Fig. 2, in particular the connection between the upper concept and the lower concept. The relation which exists between these concepts is straightforwardly described by the predicate "rdfs:subClassOf" in the resulting ontology. Yet, the fuzzy nature of the lattice implies a membership value associated to the subClassOf relation. Exploiting reification, the membership value can be assigned to the statement that describes the subClassOf relation, as follows:
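A hedged Python sketch of such a reified statement is reported here. It relies on the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object); the namespace `http://example.org/fuzzy#` used for the Membership property is hypothetical, and the value 0.94 is used only for illustration.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
EX = "http://example.org/fuzzy#"  # hypothetical namespace for Membership
ET.register_namespace("rdf", RDF)
ET.register_namespace("ex", EX)

def reified_subclass(sub, sup, membership):
    """Reify '<sub> rdfs:subClassOf <sup>' and attach a membership value."""
    st = ET.Element(f"{{{RDF}}}Statement")
    ET.SubElement(st, f"{{{RDF}}}subject", {f"{{{RDF}}}resource": f"#{sub}"})
    ET.SubElement(st, f"{{{RDF}}}predicate",
                  {f"{{{RDF}}}resource": RDFS + "subClassOf"})
    ET.SubElement(st, f"{{{RDF}}}object", {f"{{{RDF}}}resource": f"#{sup}"})
    value = ET.SubElement(st, f"{{{EX}}}Membership")
    value.text = f"{membership:.2f}"
    return st

stmt_xml = ET.tostring(reified_subclass("Concept_2", "Concept_1", 0.94),
                       encoding="unicode")
print(stmt_xml)
```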

where the predicate Membership is trivially coded as a datatype property:

- Relation Mapping: this step maps the intent (attributes in the formal context) of each concept in the lattice into a set of properties in the resulting ontology. Just to give an example, let us consider the concepts in Fig. 2. The intent consists of the keywords sketched in the gray boxes. Specifically, the intent of Concept_1 is composed of two keywords: equation and mathematics. For each keyword, this mapping step defines an OWL DatatypeProperty whose domain is Concept_2 and whose range is a float datatype. Thanks to the float datatype, a membership value in the range [0, 1] can be attached to the individuals to which the datatype property is applied. The mapping focuses on the ontology concepts, exploiting just DatatypeProperty (no mappings are defined for the relationships between concepts). For the keyword mathematics, the OWL code is as follows:

- Individuals Generation: in the last step, the mapping process takes into account the extent (i.e. the objects or, specifically, the web resources in the fuzzy formal context) of the concepts in the lattice. Thus, for each web resource in the extent of the class Concept_1, this step generates an instance of the corresponding ontology class; in other words, it generates individuals of class Concept_1. Furthermore, for each datatype property whose domain is the ontology concept Concept_1 (and its relative super concepts), this step produces a predicate instance. The range of the predicate instance is the corresponding membership value of the fuzzy formal context. The associated OWL code is:

The generated fuzzy ontology provides, then, a knowledge-based conceptual model.

V. CASE STUDY

The applicability of this approach has been validated on a collection of RSS-feeds. In recent years, there has been growing interest in using RSS feeds among web sites. RSS (RDF Site Summary) is one of the most popular formats for Web content distribution; it is based on the XML language and thus inherits its simplicity, extensibility and flexibility. Each RSS-feed is composed of a "channel" which contains information about the feed. More specifically, the <channel> tag includes a list of <item> tags, each of which describes a title, a link, a short description (or summary), a publication date, etc. In this study, the title tag and the textual description have been taken into account. The collection of RSS-feeds comes from the OpenLearn Project [8], a public repository accessible online for consulting learning materials. OpenLearn provides course materials from the Open University, manually arranged in feed categories according to the main educational subjects and courses. In order to apply the approach described in the previous sections, a text analysis of the RSS-feeds for data extraction is first described in the following subsection.

A. Preliminary RSS-Feed Analysis

Preprocessing activities such as normalization, POS tagging, lemmatization and stop-word removal have been applied to the collection of feeds. The final result is the set of relevant keywords in the RSS-feed collection. Let $W = \{w_1, w_2, \dots, w_n\}$ be the set of keywords extracted by means of text processing activities. Let us note that the text parsing process can extract atomic words (for instance "expression") or composite words (e.g. "differential equation"). Nevertheless, in order to capture the semantics


behind the keywords, a further analysis for discovering conceivable compound words has been taken into account. To reach this goal, let us consider the WordNet similarity measure [11] based on the sense associated to the words, herein called similarity adjacency $sim(w_i, w_j) \in [0, 1]$, where $w_i, w_j \in W$. In fact, an adjacency matrix has been built: the cell (i, j) contains the $sim(w_i, w_j)$ value associated to the new term composed of the meaningful sequence of the two words $w_i$, $w_j$. The evaluation of the similarity adjacency helps us in the extraction of further terms related to the given word $w_i$. Substantially, for each keyword $w_i$, we evaluate all the words in W whose meaning is similar or just related. More formally,

Definition 7. The set of words associated to the word $w_i$ is called the local dictionary of $w_i$ and it is defined as follows:

$$D_{w_i} = \{ w_j \in W \mid sim(w_i, w_j) \geq \delta \}$$

where $\delta > 0$ is a fixed similarity threshold.

Obviously, $w_i \in D_{w_i}$.
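The construction of the local dictionaries can be sketched as follows. The similarity function is passed in as a parameter, since any measure with values in [0, 1] could be plugged in; the toy similarity table, the helper `make_sim` and the threshold value 0.5 are hypothetical and serve only to make the example self-contained.

```python
def local_dictionaries(words, sim, delta):
    """For each word w, collect the words whose similarity to w is >= delta."""
    return {w: {v for v in words if sim(w, v) >= delta} for w in words}

def make_sim(table):
    """Build a symmetric similarity from a table of unordered word pairs."""
    def sim(a, b):
        if a == b:
            return 1.0  # a word is maximally similar to itself
        return table.get(frozenset((a, b)), 0.0)
    return sim

# Hypothetical similarity values between extracted keywords
toy_table = {
    frozenset(("equation", "expression")): 0.8,
    frozenset(("equation", "history")): 0.1,
    frozenset(("expression", "history")): 0.2,
}
W = ["equation", "expression", "history"]
dictionaries = local_dictionaries(W, make_sim(toy_table), 0.5)
print(dictionaries)
```

In this toy case, equation and expression end up in each other's local dictionary, whereas history is only associated with itself.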

The goal of this activity is to build a vector-based representation of each RSS-Feed, where each element of the vector is a weight (or relevance) of a given local dictionary in the feed. The building of the local dictionary of $w_i$ enables us to enrich the vocabulary relative to the word $w_i$. In particular, the weight associated to the generic word in the feed is computed with respect to each set $D_{w_h}$, with $h = 1, \dots, n$ (for simplicity of notation, $D_{w_h}$ will be renamed $D_h$ in the sequel). In other words, this weight represents a membership degree evaluated between the feed content and each local dictionary. It is computed by exploiting the augmented normalized term-frequency proposed in [10] and adapted for this approach. Formally,

Definition 8. Let $tf^{h}_{i,j}$ be the term frequency that measures the importance of a term $t_i$, which belongs to a local dictionary $D_h = \{t_1, \dots, t_i, \dots, t_l\}$, within the RSS-Feed j; it is computed from $n_{i,j}$, the number of occurrences of the term $t_i$ in RSS-Feed j.

Fixed the set $D_h$, let us select the maximum value among all the terms $t_i$ (i = 1, …, l) in the feed j:

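$$tf^{h}_{MAX,j} = \max_{i=1,\dots,l} tf^{h}_{i,j}$$

This value is exploited to compute the final value that characterizes the frequency associated to each local dictionary $D_h$:

$$tf^{h}_{j} = \frac{tf^{h}_{1,j}}{tf^{h}_{MAX,j}} + \dots + \frac{tf^{h}_{l,j}}{tf^{h}_{MAX,j}}$$

In particular, this sum represents the relevance of the local dictionary $D_h$ with respect to the given RSS-feed j. Finally, according to the augmented normalized term-frequency [10], the final weight associated to the local dictionary of the word $w_h$ for the feed j is:

$$0.5 + 0.5 \cdot \frac{tf^{h}_{j}}{\max_{h} tf^{h}_{j}}$$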
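The weighting chain can be sketched in Python as follows. Several points are assumptions of this example rather than statements of the method: the term frequency is taken as the relative frequency of a term in the feed, a dictionary without occurrences in the feed receives weight 0, and the names `dictionary_weights`, `tokens` and `dictionaries` are hypothetical.

```python
from collections import Counter

def dictionary_weights(tokens, dictionaries):
    """Weight of each local dictionary in one feed (augmented normalized tf)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    summed = {}
    for h, terms in dictionaries.items():
        # assumed tf: relative frequency of each dictionary term in the feed
        tf = {t: counts[t] / total for t in terms if counts[t] > 0}
        if not tf:
            summed[h] = 0.0
            continue
        tf_max = max(tf.values())
        # sum of the term frequencies normalized by the maximum one
        summed[h] = sum(v / tf_max for v in tf.values())
    top = max(summed.values(), default=0.0)
    return {h: 0.5 + 0.5 * s / top if s > 0 else 0.0
            for h, s in summed.items()}

# Hypothetical feed content (after preprocessing) and local dictionaries
tokens = ["equation", "equation", "expression", "history", "equation"]
dictionaries = {
    "equation": {"equation", "expression"},
    "history": {"history"},
}
weights = dictionary_weights(tokens, dictionaries)
print(weights)
```

Here the dictionary of equation obtains the maximum weight 1.0, while the dictionary of history obtains 0.875, so both values can be read as membership degrees of the feed.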

At this point, for each RSS-Feed j, the associated vector is composed of all the weights so computed, relative to each local dictionary for all the words in the set W.

B. Validation of the Generated Fuzzy Ontology

The approach has been validated on a collection of RSS-Feeds coming from the OpenLearn Project. Our goal is to automatically elicit a categorization of the collected RSS-Feeds, and then compare the results with the classification provided by OpenLearn. A sample of 443 objects (RSS-Feeds) has been selected. The Concept Hierarchy Producer builds a lattice composed of 193 concepts. From the analysis of the lattice, we verified that most of the RSS-Feeds are grouped consistently in the appropriate categories; for the sake of coherency, the labelling of our categories is supported by OpenLearn's public naming. More specifically, we have obtained that 87% of the whole collection of feeds is classified coherently with OpenLearn. The generated ontology reveals that 38% of the well-classified feeds are specialized in new and more specific categories. By comparison with OpenLearn's classes, just the remaining 13% of feeds turn out to be misclassified, because they reveal ambiguity in the contents (in fact, some feeds with different topics are placed in the same category). Our method reveals a more specialized classification of feeds; indeed, some feeds are placed in categories that are specializations of the original categories met in OpenLearn. The automatic classification of feeds allows the users to discover specialized feed content and improves RSS information retrieval.

Two measures have been considered for validating the effectiveness of this approach through the analysis of the retrieval performance: Average Uninterpolated Precision (AUP) [3] and Precision. The AUP is defined as the ratio between the sum of the precision values at each point (or node) in a hierarchical structure where a relevant item appears, and the total number of relevant items.
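A small Python sketch of the AUP computation is given here. It assumes that the visit of the hierarchical structure has already been flattened into an ordered list of retrieved items; this flattening, the function name and the sample data are assumptions of the example.

```python
def average_uninterpolated_precision(ranked, relevant):
    """Sum of the precision at each rank holding a relevant item, over |relevant|."""
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this point
    return precision_sum / len(relevant) if relevant else 0.0

# Hypothetical retrieval result: feeds "a" and "b" are found, "c" is missed
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}
aup = average_uninterpolated_precision(ranked, relevant)
print(round(aup, 4))
```

Relevant items that are never retrieved contribute nothing to the sum but still count in the denominator, which is why the missed feed lowers the score.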

Fig. 3. Performance evaluation on Precision and AUP, plotted against the number N of query keywords.


We have manually chosen keywords from the RSS-Feeds to form retrieval queries and evaluated the retrieval performance using AUP. The performance in terms of Precision and AUP is shown in Fig. 3, considering a different number N of query keywords. Let us note that, for N larger than 5, the values of Precision and AUP indicate good performance (between 0.8 and 1).

In Fig. 4, a graph of the Inheritance Richness [19] is shown. This measure describes the distribution of information across different levels of the ontology's inheritance tree, or the fan-out of parent classes. It describes how well knowledge is grouped into categories and subcategories of the ontology.

Definition 9. Formally, the inheritance richness of the schema (IRs) is defined as the average number of subclasses per class. The number of subclasses $C_1$ for a class $C_i$ is defined as $|H^C(C_1, C_i)|$, so that

$$IR_s = \frac{\sum_{C_i \in C} |H^C(C_1, C_i)|}{|C|}$$

In particular, Fig. 4 shows the Inheritance Richness value as the threshold varies. For instance, fixing the threshold to 0.5 means considering the ontology composed of all the concepts that are connected to each other by an edge (relationship "subClassOf") with membership greater than 0.5. In our study, let us note that when the threshold is smaller than 0.5, the ontology has a low IRs. This value characterizes a "vertical" ontology, i.e. one that contains a large number of inheritance levels where classes have a small number of subclasses; an ontology with a high IRs (when the threshold is larger than 0.5) is instead "horizontal", i.e. it has a small number of inheritance levels, and each class has a relatively large number of subclasses.
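The computation of IRs for a given threshold can be sketched as follows. The sketch interprets the thresholded ontology as the set of classes touched by at least one retained subClassOf edge, which is one possible reading of the description above; the edge list and its membership values are hypothetical.

```python
def inheritance_richness(edges, threshold):
    """Average number of subclasses per class after pruning weak edges.

    `edges` holds (subclass, superclass, membership) triples.
    """
    kept = [(sub, sup) for sub, sup, mu in edges if mu > threshold]
    classes = {c for pair in kept for c in pair}
    # every retained edge contributes one subclass to its superclass
    return len(kept) / len(classes) if classes else 0.0

# Hypothetical fuzzy subClassOf edges of a small generated ontology
edges = [
    ("Concept_2", "Concept_1", 0.94),
    ("Concept_3", "Concept_1", 0.63),
    ("Concept_4", "Concept_2", 0.45),
]
for t in (0.5, 0.7):
    print(t, inheritance_richness(edges, t))
```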

Fig. 4. Performance evaluation on Inheritance Richness (IRs plotted against the threshold).

VI. CONCLUSIONS

This paper presents an approach for the automatic generation of a fuzzy ontology. The approach presents the mapping steps for translating the fuzzy lattice into an OWL-based ontology. The "fuzzy" nature of the lattice provides a hierarchical structure, where a similarity value is associated to each pair of connected concepts. The proposed method has been evaluated on a real case study (an RSS-Feeds collection), in order to measure its applicability and effectiveness. The exploited metrics provide direct criteria for evaluating the performance and the validity of the approach.

REFERENCES

[1] Q. T. Tho, S. C. Hui, A. C. M. Fong, and T. H. Cao, "Automatic Fuzzy Ontology generation for Semantic Web", IEEE Transactions on Knowledge and Data Engineering, Vol. 18(6), pp. 842-856, 2006.
[2] C. S. Lee, Z. W. Jian and L. K. Huang, "A fuzzy ontology and its application to news summarization", IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 35 (5), pp. 859-880, 2005.
[3] N. Nanas, V. Uren and A. de Roeck, "Building and Applying a Concept Hierarchy Representation of a User Profile", in Proceedings of the 26th annual international ACM SIGIR Conference on Research and Development in Information Retrieval, ACM Press, 2003.
[4] B. Ganter and R. Wille, Formal Concept Analysis: Mathematical Foundations. Springer, Berlin - Heidelberg, 1999.
[5] B. Zhou, S. C. Hui, and K. Chang, "A formal concept analysis approach for Web Usage Mining", in Intelligent Information Processing II, Z. Shi and Q. He, Eds. Springer-Verlag, London, pp. 437-441, 2005.
[6] T. T. Quan, S. C. Hui, and T. H. Cao, "A Fuzzy FCA-based Approach to Conceptual Clustering for Automatic Generation of Concept Hierarchy on Uncertainty Data", in Proc. of the 2004 Concept Lattices and Their Applications Workshop, pp. 1-12.
[7] S. Pollandt, Fuzzy-Begriffe: Formale Begriffsanalyse unscharfer Daten. Berlin-Heidelberg: Springer-Verlag, 1996.
[8] OpenLearn Project. Available: http://openlearn.open.ac.uk/
[9] OWL Web Ontology Language Overview, Deborah L. McGuinness and Frank van Harmelen, Editors, W3C Recommendation, 10 February 2004. Available: http://www.w3.org/TR/owl-features/
[10] G. Salton and C. Buckley, Term Weighting Approaches in Automatic Text Retrieval. Technical Report, UMI Order Number: TR87-881, Cornell University, 1987.
[11] A. Maedche and S. Staab, "Ontology learning for the Semantic Web", IEEE Intelligent Systems, 16(2):72-79, 2001.
[12] G. Pirrò and N. Seco, "Design, Implementation and Evaluation of a New Similarity Metric Combining Feature and Intrinsic Information Content", ODBASE 2008, LNCS, Springer Verlag, 2008.
[13] G. Fenza, V. Loia, and S. Senatore, "Concept Mining of Semantic Web Services by Means of Fuzzy Formal Concept Analysis (FFCA)", in The IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC) 2008, 12-15 October, Singapore.
[14] A. Pease, I. Niles, and J. Li, "The Suggested Upper Merged Ontology: A Large Ontology for the Semantic Web and its Applications", in Working Notes of the AAAI-2002 Workshop on Ontologies and the Semantic Web, Edmonton, Canada, July 28-August 1, 2002.
[15] D. B. Lenat, "Cyc: A Large-Scale Investment in Knowledge Infrastructure", Communications of the ACM 38, no. 11, 1995.
[16] C. J. Thomas, A. P. Sheth, and W. S. York, "Modular Ontology Design Using Canonical Building Blocks in the Biochemistry Domain", in Proceedings of the International Conference on Formal Ontology in Information Systems (FOIS). IOS Press, November 2006.
[17] FOAF: Friend Of A Friend Project. Available: http://xmlns.com/foaf/spec/
[18] J. Golbeck, G. Fragoso, F. Hartel, J. Hendler, J. Oberthaler and B. Parsia, "The National Cancer Institute's Thesaurus and Ontology", Web Semantics: Science, Services and Agents on the World Wide Web, Volume 1, Issue 1, December 2003, pp. 75-80, ISSN 1570-8268.
[19] S. Tartir, I. B. Arpinar, M. Moore, A. P. Sheth, and B. Aleman-Meza, "OntoQA: Metric-Based Ontology Quality Analysis", LSDIS Lab, Department of Computer Science, University of Georgia, Athens, GA 30602, USA.
[20] R. Navigli, P. Velardi, and A. Gangemi, "Ontology Learning and its application to automated terminology translation", IEEE Intelligent Systems, vol. 18:1, January/February 2003.
[21] K. W. Chau, "An ontology-based knowledge management system for flow and water quality modeling", Advances in Engineering Software 38 (3), pp. 172-181, 2007.