Available online at www.sciencedirect.com
Data & Knowledge Engineering 64 (2008) 600–623 www.elsevier.com/locate/datak
Learning non-taxonomic relationships from web documents for domain ontology construction

David Sánchez *, Antonio Moreno

Intelligent Technologies for Advanced Knowledge Acquisition (ITAKA) Research Group, Department of Computer Science and Mathematics, Universitat Rovira i Virgili (URV), Avda. Països Catalans, 26, 43007 Tarragona, Spain

Received 16 April 2007; accepted 2 October 2007
Available online 7 October 2007
Abstract

In recent years, much effort has been put into ontology learning. However, the knowledge acquisition process typically focuses on the taxonomic aspect. The discovery of non-taxonomic relationships is often neglected, even though it is fundamental to structuring domain knowledge. This paper presents an automatic and unsupervised methodology that addresses the non-taxonomic learning process for constructing domain ontologies. It is able to discover domain-related verbs, extract non-taxonomically related concepts and label relationships, using the Web as a corpus. The paper also discusses how the obtained relationships can be automatically evaluated against WordNet and presents encouraging results for several domains.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Ontology learning; Non-taxonomic relationships; Web mining; Knowledge acquisition
1. Introduction

Ontologies are defined as formal, explicit specifications of a shared conceptualization [49]. They are an essential component in many knowledge-intensive areas such as the Semantic Web [54], knowledge management, and electronic commerce. The construction of domain ontologies relies on domain modellers and knowledge engineers, who are typically overwhelmed by the potential size, complexity and dynamicity of a specific domain. As a consequence, the definition of exhaustive domain ontologies is a barrier that very few projects can overcome.

For these reasons, there is a need for methods that can tackle, or at least ease, the construction of domain ontologies. Automated Ontology Learning methods reduce the time and effort needed in the ontology development process [2].

From a formal point of view, an ontology boils down to an object model represented by a set of concepts or classes $C$, which are taxonomically related by the transitive IS-A relation $H \subseteq C \times C$ and non-taxonomically related by named object relations $R^{*} \subseteq C \times C \times \mathrm{String}$. Even though many approaches for ontology learning
* Corresponding author. Tel.: +34 977 559681; fax: +34 977 559710. E-mail address: [email protected] (D. Sánchez).
0169-023X/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.datak.2007.10.001
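As a hedged illustration of the object model described in the Introduction (a set of concepts C, a transitive IS-A relation H over pairs of concepts, and named relations R over pairs of concepts plus a string label), the following Python sketch shows one possible in-memory representation. The class name, the method names and the example concepts are assumptions introduced here for illustration only; they are not taken from the paper or from any particular ontology library.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class Ontology:
    """Hypothetical object model: concepts C, IS-A pairs H, labelled relations R."""
    concepts: Set[str] = field(default_factory=set)                    # C
    is_a: Set[Tuple[str, str]] = field(default_factory=set)            # H: (child, parent)
    relations: Set[Tuple[str, str, str]] = field(default_factory=set)  # R: (source, target, label)

    def add_is_a(self, child: str, parent: str) -> None:
        """Record a taxonomic link and register both concepts in C."""
        self.concepts.update((child, parent))
        self.is_a.add((child, parent))

    def add_relation(self, source: str, target: str, label: str) -> None:
        """Record a named non-taxonomic link and register both concepts in C."""
        self.concepts.update((source, target))
        self.relations.add((source, target, label))

    def ancestors(self, concept: str) -> Set[str]:
        """Follow IS-A links repeatedly, which reflects the transitivity of H."""
        found: Set[str] = set()
        frontier = [concept]
        while frontier:
            current = frontier.pop()
            for child, parent in self.is_a:
                if child == current and parent not in found:
                    found.add(parent)
                    frontier.append(parent)
        return found


# Illustrative data only: these concepts and the verb label are made up.
onto = Ontology()
onto.add_is_a("breast cancer", "cancer")
onto.add_is_a("cancer", "disease")
onto.add_relation("radiotherapy", "cancer", "treats")
print(sorted(onto.ancestors("breast cancer")))  # ['cancer', 'disease']
```

Storing H as explicit pairs and computing ancestors on demand keeps the taxonomic and non-taxonomic parts separate, which mirrors the distinction the formal definition draws between the two kinds of relation.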
[42] P. Cimiano, S. Staab, Learning by Googling, SIGKDD Explorations 6 (2) (2004) 24–33.
[43] P. Wiemer-Hastings, A. Graesser, K. Wiemer-Hastings, Inferring the meaning of verbs from context, in: Proc. of the 20th Annual Conference of the Cognitive Science Society, 1998, pp. 1142–1147.
[44] P.D. Turney, Mining the Web for synonyms: PMI-IR versus LSA on TOEFL, in: Proc. of the Twelfth European Conference on Machine Learning, Freiburg, Germany, 2001, pp. 491–498.
[45] R. Byrd, Y. Ravin, Identifying and extracting relations from text, in: Proc. of NLDB’99 – Fourth International Conference on Applications of Natural Language to Information Systems, 1999.
[46] R. Cilibrasi, P.M.B. Vitanyi, Automatic meaning discovery using Google, 2004, Available from: .
[47] R. Girju, D. Moldovan, Text mining for causal relations, in: Proc. of the FLAIRS Conference, 2002, pp. 360–364.
[48] R. Navigli, P. Velardi, Learning domain ontologies from document warehouses and dedicated web sites, Computational Linguistics 30 (2) (2004) 151–179.
[49] R. Studer, V.R. Benjamins, D. Fensel, Knowledge engineering: principles and methods, IEEE Transactions on Knowledge and Data Engineering 25 (1-2) (1998) 161–197.
[50] S. Lamparter, M. Ehrig, C. Tempich, Knowledge extraction from classification schemas, in: Proc. of CoopIS/DOA/ODBASE 2004, Lecture Notes in Computer Science, vol. 3290, 2004, pp. 618–636.
[51] S. Patwardhan, T. Pedersen, Using WordNet-based context vectors to estimate the semantic relatedness of concepts, in: Proc. of the EACL 2006 Workshop, Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, Trento, Italy, 2006, pp. 1–8.
[52] S. Vintar, L. Todorovski, D. Sonntag, P. Buitelaar, Evaluating context features for medical relation mining, in: ECML/PKDD Workshop on Data Mining and Text Mining for Bioinformatics, 2003.
[53] S. Walde, Clustering verbs semantically according to their alternation behaviour, in: Proc. of the 18th International Conference on Computational Linguistics, Saarbrücken, Germany, 2000, pp. 747–753.
[54] T. Berners-Lee, J. Hendler, O. Lassila, The semantic web, Scientific American (2001).
[55] T. Pedersen, S. Patwardhan, J. Michelizzi, WordNet::Similarity – Measuring the Relatedness of Concepts, 2004, Available from: .
[56] T.B. Jans, The effect of query complexity on Web searching results, Information Research 6 (1) (2000).
David Sánchez is currently pursuing his Ph.D. at the Technical University of Catalonia. He is a member of the ITAKA research group (Intelligent Techniques for Advanced Knowledge Acquisition) at the Computer Science and Mathematics Department of the University Rovira i Virgili. His research interests are intelligent agents and ontology learning from the Web.
Antonio Moreno is a Lecturer at the University Rovira i Virgili’s Computer Science and Mathematics Department. He is the founder and head of the ITAKA research group (Intelligent Techniques for Advanced Knowledge Acquisition). His main research interests are the application of agent technology to healthcare problems and ontology learning from the Web. He received a PhD in Artificial Intelligence from UPC (Technical University of Catalonia) in 2000. He is the coordinator of the Spanish network on agents and multi-agent technology.