An Approach for the Automatic Recommendation of Ontologies Using Collaborative Knowledge

Marcos Martínez-Romero1, José M. Vázquez-Naya1, Cristian R. Munteanu2, Javier Pereira1, and Alejandro Pazos2

1 IMEDIR Center, University of A Coruña, 15071 A Coruña, Spain
2 Department of Information and Communication Technologies, University of A Coruña, 15071 A Coruña, Spain
{marcosmartinez,jmvazquez,cmunteanu,javierp,apazos}@udc.es

Abstract. In recent years, ontologies have become an essential tool to structure and reuse the exponentially growing amount of information on the Web. As the number of publicly available ontologies increases, researchers face the problem of finding the ontology (or ontologies) that provides the best coverage for a particular context. In this paper, we propose an approach to automatically recommend the best ontology for an initial set of terms. The approach is based on measuring the adequacy of the ontology according to three different criteria: (1) how well the ontology covers the given terms, (2) the semantic richness of the ontology and, importantly, (3) the popularity of the ontology in the Web 2.0. In order to evaluate this approach, we implemented a prototype to recommend ontologies in the biomedical domain. Results show the importance of using collaborative knowledge in the field of ontology recommendation.

Keywords: ontology recommendation, Web 2.0, Semantic Web.

1 Introduction

During the last years, ontologies have proven to be a crucial element in many intelligent applications. They are considered a practical way to describe entities within a domain and the relationships between them in a standardized manner, facilitating the exchange of information between people or systems that use different representations for the same knowledge. What is even more important, knowledge represented using ontologies can be processed by machines, enabling automatic large-scale data search, analysis, inference and mining [1-2]. Due to the importance of ontologies as the backbone of emerging fields like the Semantic Web [3], a large number of ontologies have been developed and published online during the last decade. At the time of writing this paper, the Swoogle Semantic Web search engine1, a well-known tool for searching ontologies, indexes more than 10,000 ontologies from multiple areas, which present different levels of quality and reliability. This huge ontological ocean makes it difficult for users to find the best

http://swoogle.umbc.edu/

R. Setchi et al. (Eds.): KES 2010, Part II, LNAI 6277, pp. 74–81, 2010. © Springer-Verlag Berlin Heidelberg 2010


ontology for a given task. As an example, a query in Swoogle for the keyword cancer returns 264 different results, and asking users to manually select the most suitable ontology is infeasible, since it imposes a considerable labor cost on them. Hence, as the number and variety of publicly available ontologies continues to grow steadily, so does the need for proper and robust strategies and tools to facilitate the task of identifying the most appropriate ontologies for a particular context. This paper presents an approach for ontology recommendation which combines the assessment of three criteria: context coverage, semantic richness and popularity. The combination of these different but complementary evaluation methods aims to provide robust reliability and performance in large-scale, heterogeneous and open environments like the Semantic Web.

2 Related Work

Ontology evaluation is a complex task that can be considered the cornerstone of the ontology recommendation process. The need for evaluation strategies in the field of ontologies emerged as early as 1994 with the work of Gómez-Pérez at Stanford [4]. Since then, several works have proposed to evaluate the quality of an ontology on the basis of ontology metrics such as the number of classes, properties, taxonomical levels, etc. [5]. More recently, other authors have provided different strategies to assess the coverage that an ontology provides for a particular context or domain [6-9]. A complete review of evaluation strategies has been provided by Brank et al. [10]. We argue that even though it is important to know the quality of an ontology as a function of metrics applied to its elements, and to assess how well the ontology represents a particular context, it is also vital to take into account its social relevance or popularity in the ontological community. Some authors have previously stated the importance of taking popularity into account in ontology evaluation [11-12]. To the best of our knowledge, the approach proposed in this paper is the first that aims to recommend the most appropriate ontologies for a given context using collaborative knowledge extracted from the Web 2.0, while also taking into account the semantic richness of the candidate ontologies.

3 Methods

In this section, we present the approach for automatic ontology recommendation, whose overall workflow is shown in Fig. 1. We start by describing how the initial terms are semantically cleaned and expanded. Then, we present the evaluation methods that constitute the core of the recommendation strategy. Finally, we describe the method followed to aggregate the different scores into a single score.

3.1 Semantic Cleaning and Expansion

The process starts from a set of initial terms, which can be provided by a user or a Semantic Web agent, or automatically identified from an information resource (e.g. text, image, webpage, etc.). On the basis of one (or several) lexical or terminological


resources (e.g. WordNet2 as a general-purpose lexical resource for the English language, or UMLS3 as a specialized resource for the biomedical field), the set of initial terms is cleaned of erroneous or redundant terms. A term that is not contained in the terminological resources is considered erroneous and is removed. Also, if several terms from the initial set have the same meaning (synonyms), only one of them is kept. After that, each remaining term is expanded with its synonyms to increase the possibility of finding it in the ontology.

Example 1. Suppose that the initial set of terms is {leukocyte, white cell, blood, celdj}, and that the selected lexical resource is WordNet. Then, the set of cleaned terms would be {leukocyte, blood}. The term white cell has been removed for being a synonym of leukocyte, and celdj because it does not exist. The set of expanded terms for leukocyte would be {leukocyte, leucocyte, white blood cell, white cell, white blood corpuscle, white corpuscle, wbc}, and for blood it would be just {blood}.
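The cleaning and expansion steps above can be sketched in a few lines of Python. The SYNONYMS dictionary below is a hypothetical stand-in for a real lexical resource such as WordNet or UMLS, hard-coded with the synonym sets from Example 1.

```python
# Hypothetical lexical resource: each known term maps to its synonym set.
LEUKOCYTE_SET = frozenset({"leukocyte", "leucocyte", "white blood cell",
                           "white cell", "white blood corpuscle",
                           "white corpuscle", "wbc"})
SYNONYMS = {
    "leukocyte": LEUKOCYTE_SET,
    "white cell": LEUKOCYTE_SET,
    "blood": frozenset({"blood"}),
}

def clean_and_expand(initial_terms):
    """Drop terms absent from the resource (erroneous) and terms whose
    synonym set was already kept (redundant); expand the remaining terms."""
    cleaned, expanded, seen = [], [], set()
    for term in initial_terms:
        synset = SYNONYMS.get(term)
        if synset is None or synset in seen:
            continue
        seen.add(synset)
        cleaned.append(term)
        expanded.append(synset)
    return cleaned, expanded

cleaned, expanded = clean_and_expand(["leukocyte", "white cell", "blood", "celdj"])
print(cleaned)  # → ['leukocyte', 'blood']
```

Here "white cell" is dropped as a synonym of leukocyte and "celdj" as erroneous, reproducing Example 1.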

Fig. 1. Workflow of the ontology recommendation process

3.2 Context Coverage Evaluation

This step is aimed at obtaining the ontologies that (partially or completely) represent the context. This is achieved by evaluating how well each ontology from a given repository covers the expanded sets of terms. We propose a metric, called CCscore, which is based on the CMM metric proposed by Alani and Brewster [13], adapted to take into account only exact matches, because it has been found that considering partial matches (e.g. anaerobic, aerosol, aeroplane) can be problematic and reduce the search quality [14]. The CCscore consists in counting the number of initial terms that are contained in the ontology. An initial term is contained in the ontology if there is a class in the ontology whose label matches the term or one of its expanded terms.

Definition 1. Let o be an ontology, T = {t1, …, tn} the initial set of terms, C = {t1, …, tm} the set of terms after the semantic cleaning step, and S = {S1, …, Sm} the

http://wordnet.princeton.edu/ http://www.nlm.nih.gov/research/umls/


set that contains all the sets of expanded terms, where Si is the set of expanded terms for the term ti ∈ C, with 1 ≤ i ≤ m.

CCscore(o, T) = M(o, S) / |S|    (1)

where M(o, S) is the function that counts the number of sets of expanded terms Si that contain at least one term matching a label from the ontology.

Example 2. Suppose that we start from the set of terms presented in Example 1. Then S = {S1, S2}, with S1 = {leukocyte, leucocyte, white blood cell, white cell, white blood corpuscle, white corpuscle, wbc} and S2 = {blood}. Also suppose that we are working with an ontology o which has only five classes, whose labels are {finger, white cell, cell, bone, skull}. Then M(o, S) = 1 (the matching term is white cell from S1), |S| = 2 and CCscore(o, T) = 0.5.

3.3 Semantic Richness Evaluation

Ontologies with a richer set of elements are potentially more useful than simpler ones. In this section, we present a metric to assess the semantic richness of an ontology, called the Semantic Richness Score (SRscore). It is based on the constructors provided by the Web Ontology Language (OWL)4, which is the most popular and widely used language for ontology construction. In OWL, concepts are called classes, and each class is considered a set of individuals that share similar characteristics. We propose to measure the semantic richness of an ontology according to the presence of the following OWL elements:

─ Object properties. They represent relations between instances of classes. A widely-known example of an object property is the is_a hierarchical relation.
─ Datatype properties. Datatype properties relate individuals to RDF literals or simple types defined in accordance with XML Schema datatypes. For example, the datatype property hasAddress might link individuals of a class Person to the XML Schema datatype xsd:string.
─ Annotation properties. They provide additional semantic information about the ontology, such as definitions of terms, labels, comments, creation date, author, etc. As an example, the NCI Thesaurus5, a popular ontology for the cancer domain, provides several definitions for the concept colorectal carcinoma. One of them is: A cancer arising from the epithelial cells that line the lumen of the colon. Other information provided by this ontology through annotation properties includes the synonyms of this concept (e.g. Cancer of the Large Bowel, Carcinoma of Large Intestine, Colorectal Cancer, etc.), the most common name for the concept, its semantic type, etc.

On the basis of these elements, the Semantic Richness Score (SRscore) is defined.

Definition 2. Let o be an ontology.

http://www.w3.org/TR/owl-features/ http://ncit.nci.nih.gov/


SRscore(o) = wobj ∗ obj(o) + wdat ∗ dat(o) + wann ∗ ann(o)    (2)

where obj(o) is the number of object properties in the ontology, dat(o) the number of datatype properties, ann(o) the number of annotation properties, and wobj, wdat, wann the weight factors.

Example 3. The NCI Thesaurus (version 09.12d) has 124 object properties, 72 datatype properties and 78 annotation properties. With wobj = 0.3, wdat = 0.3 and wann = 0.4, its SRscore would be calculated as: SRscore = 0.3 ∗ 124 + 0.3 ∗ 72 + 0.4 ∗ 78 = 90.

3.4 Popularity Evaluation

Apart from evaluating how well an ontology covers a specific context and how rich its structure is, there is another aspect that requires special attention. According to the definition of ontology provided by Studer et al. [15], which can be considered one of the most precise and complete definitions to date, "an ontology captures consensual knowledge, that is, it is not private to some individual, but accepted by a group". With this in mind, any method aimed at evaluating an ontology should consider its level of collective acceptance or popularity. Our approach proposes to evaluate the popularity of an ontology by using the collective knowledge stored in widely-known Web 2.0 resources, whose value is created by the aggregation of many individual user contributions. Our proposal consists in counting the number of references to the ontology in each resource. Depending on the kind of resource, the criteria to measure the references will differ. For example, in Wikipedia, we could count the number of results obtained when searching for the name of the ontology; in Del.icio.us, we could measure the number of users who have added the URL of the ontology to their favorites; in Google Scholar, we could retrieve the number of papers which reference the ontology; or even in Facebook, we could check the number of users in specific groups dedicated to the ontology. Apart from these general-purpose resources, Web 2.0 resources aimed at covering specific shared domains, like the NCBO BioPortal6 for the biomedical area, could also be taken into account. Based on this idea, we propose to assess the popularity of an ontology with the following metric.

Definition 3. Let o be an ontology, R = {r1, ..., rn} a set of Web 2.0 resources taken into account and P = {p1, ..., pn} the number of references to the ontology from each resource in R, according to a set of predefined criteria.

Pscore(o, R) = ∑i=1..n wi ∗ αi ∗ pi    (3)

where wi is a weight factor and αi is a factor to balance the number of references among criteria that provide results in different ranges (default value 1).

Example 4. Suppose that we are evaluating the popularity of the Gene Ontology (GO)7, a well-known ontology aimed at providing consistent descriptions of gene products, and that we just want to take into account the Wikipedia and Del.icio.us

http://bioportal.bioontology.org/ http://www.geneontology.org/


resources. At the moment of writing this paper, there are 76 Wikipedia pages containing the text "Gene Ontology", and 313 Del.icio.us users have added the GO website to their bookmarks. The Pscore would be worked out according to the following expression: Pscore = 0.5 ∗ 1 ∗ 76 + 0.5 ∗ 1 ∗ 313 = 194.5 (w1 = 0.5, w2 = 0.5, α1 = 1, α2 = 1).

3.5 Aggregated Ontology Score

The final score for an ontology is calculated once the three previously described measures have been applied to all the ontologies being evaluated. It is obtained by aggregating the obtained values, taking into account the weight (importance) of each measure, according to the following definition.

Definition 4. Let o be an ontology from a set of candidate ontologies O, T an initial set of terms and R a set of Web 2.0 resources.

Score(o, T, R) = wCC ∗ CCscore(o, T) + wSR ∗ SRscore(o) / max(SRscore(O)) + wP ∗ Pscore(o, R) / max(Pscore(O, R))    (4)

where wCC, wSR and wP are the weights assigned to CCscore, SRscore and Pscore, respectively. As can be seen in the expression, the values of SRscore and Pscore are normalised to the range [0, 1] by dividing them by the maximum value of the measure over all ontologies. Finally, ontologies are ranked according to their score.
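The whole scoring pipeline of Sections 3.2 to 3.5 can be sketched as follows. This is a minimal sketch, not the authors' implementation: each candidate ontology is reduced to its class labels, its property counts and its reference counts, and "OntoA"/"OntoB" are hypothetical candidates shaped after the paper's examples.

```python
def cc_score(labels, expanded_sets):
    """CCscore (Eq. 1): fraction of expanded term sets Si with at least
    one term exactly matching a class label of the ontology."""
    label_set = {l.lower() for l in labels}
    hits = sum(1 for s in expanded_sets
               if any(t.lower() in label_set for t in s))
    return hits / len(expanded_sets)

def sr_score(n_obj, n_dat, n_ann, w=(0.3, 0.3, 0.4)):
    """SRscore (Eq. 2): weighted sum of the three property counts."""
    return w[0] * n_obj + w[1] * n_dat + w[2] * n_ann

def p_score(refs, weights, alphas):
    """Pscore (Eq. 3): weighted sum of reference counts, with alpha
    factors balancing resources whose counts live in different ranges."""
    return sum(w * a * p for w, a, p in zip(weights, alphas, refs))

def rank(candidates, expanded_sets, w_cc=0.6, w_sr=0.1, w_p=0.3):
    """Score (Eq. 4): CCscore weighted directly; SRscore and Pscore
    normalised by their maxima over the candidate set, then ranked."""
    sr = {o: sr_score(*c["props"]) for o, c in candidates.items()}
    ps = {o: p_score(c["refs"], c["w"], c["alpha"])
          for o, c in candidates.items()}
    max_sr, max_p = max(sr.values()), max(ps.values())
    score = {o: w_cc * cc_score(c["labels"], expanded_sets)
                + w_sr * sr[o] / max_sr + w_p * ps[o] / max_p
             for o, c in candidates.items()}
    return sorted(score, key=score.get, reverse=True)

# Hypothetical candidates shaped after Examples 2-4
S = [{"leukocyte", "white cell", "wbc"}, {"blood"}]
candidates = {
    "OntoA": {"labels": ["white cell", "blood", "bone"],
              "props": (124, 72, 78), "refs": [76, 313],
              "w": [0.5, 0.5], "alpha": [1, 1]},
    "OntoB": {"labels": ["finger", "skull"],
              "props": (10, 5, 3), "refs": [2, 4],
              "w": [0.5, 0.5], "alpha": [1, 1]},
}
print(rank(candidates, S))  # → ['OntoA', 'OntoB']
```

With these hypothetical inputs, OntoA covers both expanded term sets and dominates on every normalised criterion, so it is ranked first.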

4 Evaluation and Results

On the basis of the proposed approach, we have implemented an early prototype of an ontology recommender system for the biomedical domain and carried out an experiment to test the validity of the approach. Starting from the following set of initial terms {colorectal carcinoma, apoptosis, leep, combination therapy, chemotherapy, DNA, skull, pelvic girdle, cavity of stomach, leukocyte}, randomly chosen from the areas of cancer and anatomy, the prototype provides the results shown in Table 1 (only the top 10 recommended ontologies are shown). The repository of candidate ontologies is composed of 30 ontologies: 15 of them are widely known biomedical ontologies, while the other 15 correspond to the top results returned by Swoogle when searching for the initial terms. The prototype uses WordNet and UMLS to semantically clean and expand the initial terms. The weights for Semantic Richness Evaluation are: wobj = 0.2, wdat = 0.2, wann = 0.6. As Web 2.0 resources, we chose Wikipedia and BioPortal. The prototype counts the number of search results returned by Wikipedia when searching for the name of the ontology and also checks whether the ontology is listed in BioPortal. We used the following weights and parameters for popularity: w1 = 0.5, w2 = 0.5, α1 = 1, α2 = 10. Finally, the weights for score aggregation are wCC = 0.6, wSR = 0.1, wP = 0.3. As we consider it vital to recommend ontologies that provide a good coverage of the domain, the greatest weight was given to the Context Coverage Score, followed by the Popularity Score and the Semantic Richness Score.
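Under this configuration, the popularity criterion for BioPortal reduces to a 0/1 listing indicator scaled by α2 = 10, while the Wikipedia criterion uses the raw hit count. A sketch with hypothetical counts (the function name and input values are illustrative, not taken from the prototype):

```python
def prototype_popularity(wikipedia_hits, listed_in_bioportal,
                         w=(0.5, 0.5), alpha=(1, 10)):
    """Pscore as configured in the prototype: Wikipedia result count
    plus a 0/1 BioPortal listing indicator scaled by alpha2 = 10."""
    refs = (wikipedia_hits, 1 if listed_in_bioportal else 0)
    return sum(wi * ai * pi for wi, ai, pi in zip(w, alpha, refs))

# Hypothetical ontology: 40 Wikipedia hits, listed in BioPortal
print(prototype_popularity(40, True))  # → 25.0
```

The α2 = 10 factor keeps the binary BioPortal signal from being drowned out by Wikipedia hit counts, which live in a much larger range.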

Table 1. Results of the evaluation

Ontology                        CCscore  SRscore  Pscore  Total Score
NCI Thesaurus                   0.900    0.089    0.140   0.591
Medical Subject Headings        0.600    0.006    0.628   0.549
SNOMED Clinical Terms           0.800    0.014    0.128   0.520
Gene Ontology                   0.200    0.045    1.000   0.425
Foundational Model of Anatomy   0.500    0.065    0.186   0.362
ACGT Master Ontology            0.400    0.050    0.128   0.283
Ontosem                         0.400    0.001    0.000   0.240
ResearchCyc                     0.300    0.103    0.070   0.211
Systems Biology Ontology        0.100    0.001    0.186   0.116
TAP                             0.100    0.000    0.000   0.060

On examining the results, some interpretations stand out: (1) the NCI Thesaurus cancer ontology is the most recommended ontology, providing 90% coverage of the initial terms. Considering that most of the initial terms belong to the cancer domain, this result makes sense. (2) The top 5 recommended ontologies are widely-known biomedical ontologies. (3) Ontologies with a low popularity are ranked low in the results, even if they provide an acceptable context coverage (e.g. Ontosem). (4) An ontology with a high popularity can have a better aggregated score than another ontology with better domain coverage (e.g. Medical Subject Headings and SNOMED).

5 Conclusions and Future Work

Automatic ontology recommendation is a crucial task to enable ontology reuse in emerging knowledge-based applications and domains like the upcoming Semantic Web. In this work, we have presented an approach for the automatic recommendation of ontologies which, to the best of our knowledge, is unique in the field of ontology recommendation. Our approach is based on evaluating three different aspects of an ontology: (1) how well the ontology represents a given context, (2) its semantic richness and (3) its popularity. Results show that ontology popularity is a crucial criterion to obtain a valid recommendation, and that the Web 2.0 is an adequate resource to obtain the collaborative knowledge required for this kind of evaluation. As immediate future work, we plan to develop strategies to automatically identify the best weights and parameters for the operating context, instead of having to adjust their values by hand. Investigating new methods to evaluate ontology popularity is another future direction. We are also interested in studying whether reasoning techniques can help to improve our scoring strategies.

Acknowledgements

Work supported by the “Galician Network for Colorectal Cancer Research” (REGICC, Ref. 2009/58), funded by the Xunta de Galicia, and the "Ibero-NBIC Network" (209RT0366), funded by CYTED. The work of Marcos Martínez-Romero is supported by a predoctoral grant from the UDC. The work of José M. Vázquez-Naya


is supported by a FPU grant (Ref. AP2005-1415) from the Spanish Ministry of Education and Science. Cristian R. Munteanu acknowledges the support for a research position by “Isidro Parga Pondal” Program, Xunta de Galicia.

References

1. Gómez-Pérez, A., Fernández-López, M., Corcho, O.: Ontological Engineering: With Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. Springer, Heidelberg (2004)
2. Gruber, T.R.: Toward principles for the design of ontologies used for knowledge sharing. Int. Journal of Human Computer Studies 43, 907–928 (1995)
3. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American 284, 34–43 (2001)
4. Gómez-Pérez, A.: From Knowledge Based Systems to Knowledge Sharing Technology: Evaluation and Assessment. Knowledge Systems Lab., Stanford University, CA (1994)
5. Supekar, K., Patel, C., Lee, Y.: Characterizing quality of knowledge on semantic web. In: Seventeenth International FLAIRS Conference, Miami, Florida, USA, pp. 220–228 (2004)
6. Alani, H., Noy, N., Shah, N., Shadbolt, N., Musen, M.: Searching ontologies based on content: experiments in the biomedical domain. In: The Fourth International Conference on Knowledge Capture (K-Cap), Whistler, BC, Canada, pp. 55–62. ACM Press, New York (2007)
7. Netzer, Y., Gabay, D., Adler, M., Goldberg, Y., Elhadad, M.: Ontology Evaluation Through Text Classification. In: Chen, L., Liu, C., Zhang, X., Wang, S., Strasunskas, D., Tomassen, S.L., Rao, J., Li, W.-S., Candan, K.S., Chiu, D.K.W., Zhuang, Y., Ellis, C.A., Kim, K.-H. (eds.) RTBI 2009. LNCS, vol. 5731, pp. 210–221. Springer, Heidelberg (2009)
8. Vilches-Blázquez, L., Ramos, J., López-Pellicer, F., Corcho, O., Nogueras-Iso, J.: An approach to comparing different ontologies in the context of hydrographical information. In: Information Fusion and Geographic Information Systems, vol. 4, pp. 193–207. Springer, Heidelberg (2009)
9. Jonquet, C., Shah, N., Musen, M.: Prototyping a Biomedical Ontology Recommender Service. In: Bio-Ontologies: Knowledge in Biology, Stockholm, Sweden (2009)
10. Brank, J., Grobelnik, M., Mladenic, D.: A survey of ontology evaluation techniques. In: 8th International Multi-Conference Information Society IS, pp. 166–169 (2005)
11. Ding, L., Pan, R., Finin, T., Joshi, A., Peng, Y., Kolari, P.: Finding and ranking knowledge on the semantic web. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729, pp. 156–170. Springer, Heidelberg (2005)
12. Buitelaar, P., Eigner, T., Declerck, T.: OntoSelect: A dynamic ontology library with support for ontology selection. In: Demo Session at the International Semantic Web Conference, Hiroshima, Japan (2004)
13. Alani, H., Brewster, C.: Ontology ranking based on the analysis of concept structures. In: 3rd International Conference on Knowledge Capture (K-Cap), Banff, Canada, pp. 51–58 (2005)
14. Jones, M., Alani, H.: Content-based ontology ranking. In: 9th Int. Protégé Conference, Stanford, CA (2006)
15. Studer, R., Benjamins, V.R., Fensel, D.: Knowledge engineering: principles and methods. Data & Knowledge Engineering 25, 161–197 (1998)
