An ILP Approach to Semantic Web Mining
Francesca A. Lisi and Floriana Esposito
Dipartimento di Informatica, University of Bari, Italy, {lisi, esposito}@di.uniba.it
Abstract. This paper deals with mining the logical layer of the Semantic Web. Our approach adopts the hybrid system AL-log as a KR&R framework and ILP as a methodological apparatus. We illustrate the approach by means of an example of frequent pattern discovery in data and ontologies extracted from the on-line CIA World Fact Book.
1 Introduction
The Semantic Web is the vision of the WWW enriched by machine-processable information which supports users in their tasks [2]. The architecture of the Semantic Web consists of several layers, each of which is equipped with an ad-hoc mark-up language. It therefore poses several challenges in the field of Knowledge Representation and Reasoning (KR&R), mainly attracting researchers working on Description Logics (DLs). For example, the design of the mark-up language OWL [6] for the ontological layer has been based on DLs derived from ALC [16]. Likewise ORL [5], a very recently proposed mark-up language for the logical layer, extends OWL 'to build rules on top of ontologies'. It bridges the expressive gap between DLs and Horn clausal logic (or its fragments) in a way that is similar in spirit to hybridization in KR&R systems such as AL-log [4].

The Semantic Web is also gaining the attention of the machine learning and data mining communities, thus giving rise to the new application area of Semantic Web Mining [1]. Most work in this area simply extends previous work to the new application context, see [11], or concentrates on the RDF/RDFSchema layer, see [12]. Yet learning/mining in representation languages such as OWL and ORL raises more interesting issues. A crucial issue is the definition of generality orders for inductive hypotheses. Indeed many approaches in machine learning, e.g. Inductive Logic Programming (ILP) [14], and data mining, e.g. Mannila's framework for frequent pattern discovery [13], are centered around the mechanism of generalization and propose algorithms based on a search process through a partially ordered space of inductive hypotheses.

In this paper we show how our framework for learning in AL-log [9] can be considered as an ILP approach for mining the logical layer of the Semantic Web. We illustrate the features of the approach by means of an example of frequent pattern discovery in data and ontologies extracted from the on-line CIA World Fact Book.
Our goal is to show that AL-log, though less expressive than ORL, is expressive enough to meet the practical needs of many current Semantic Web applications.
2 Preliminaries
The system AL-log [4] integrates ALC [16] and Datalog [3]. For the sake of brevity we concentrate on hybridization. A constrained Datalog clause is a clause

$\alpha_0 \leftarrow \alpha_1, \ldots, \alpha_m \,\&\, \gamma_1, \ldots, \gamma_n$

where $m \geq 0$, $n \geq 0$, the $\alpha_i$ are Datalog atoms and the $\gamma_j$ are constraints of the form $s : C$, where $s$ is either a constant or a variable already appearing in the clause, and $C$ is an ALC concept. The symbol & separates constraints from Datalog atoms in a clause. An AL-log knowledge base $\mathcal{B}$ is the pair $\langle \Sigma, \Pi \rangle$ where $\Sigma$ is an ALC knowledge base and $\Pi$ is a set of constrained Datalog clauses. The interaction between $\Sigma$ and $\Pi$ allows the notion of substitution to be straightforwardly extended from Datalog to AL-log. It is also the basis of a model-theoretic semantics for AL-log. In particular an interpretation $\mathcal{J}$ for $\mathcal{B}$ is defined as the union of an O-interpretation $\mathcal{I}_O$ (i.e. an interpretation compliant with the unique names assumption) for $\Sigma$ and a Herbrand interpretation $\mathcal{I}_H$ for $\Pi_D$ (i.e. the set of clauses obtained from the clauses of $\Pi$ by deleting their constraints). The notion of logical consequence paves the way to the definition of answer set for queries, i.e. constrained Datalog clauses of the form $\leftarrow \beta_1, \ldots, \beta_m \,\&\, \gamma_1, \ldots, \gamma_n$.

Reasoning in AL-log is based on constrained SLD-resolution, which extends SLD-resolution to deal with constraints. In particular, constrained SLD-refutation is a complete and sound method for answering queries. Note that in AL-log the derivation of a constrained empty clause does not represent a refutation. It actually infers that the query is true in those models of $\mathcal{B}$ that satisfy its constraints. Therefore, in order to answer a query, it is necessary to collect enough derivations ending with a constrained empty clause such that every model of $\mathcal{B}$ satisfies the constraints associated with the final query of at least one derivation.
These satisfiability checks are performed by means of a tableau calculus that operates on both the terminological and the assertional part of $\Sigma$.

In our framework for learning in AL-log, hypotheses are represented as constrained Datalog clauses that are linked and connected (as usual in ILP [14]) and OI-compliant [8]¹. The hypothesis space is ordered according to the generality relation of $\mathcal{B}$-subsumption, denoted $\succeq_{\mathcal{B}}$. The link between $\mathcal{B}$-subsumption and constrained SLD-resolution provides a decidable procedure for checking $\succeq_{\mathcal{B}}$ [9].

¹ The bias of Object Identity (OI) can be considered as an extension of the unique names assumption from the semantic level of ALC to the syntactic one of AL-log. Bindings in OI-substitutions avoid the identification of terms.

Theorem 1. Let $H_1$, $H_2$ be two hypotheses, $\mathcal{B}$ an AL-log knowledge base, and $\sigma$ a Skolem substitution for $H_2$ w.r.t. $\{H_1\} \cup \mathcal{B}$. We say that $H_1 \succeq_{\mathcal{B}} H_2$ iff there exists a substitution $\theta$ for $H_1$ such that (i) $head(H_1)\theta = head(H_2)$ and (ii) $\mathcal{B} \cup body(H_2)\sigma \vdash body(H_1)\theta\sigma$, where $body(H_1)\theta\sigma$ is ground.

It can be proved that $\succeq_{\mathcal{B}}$ is a quasi-order; it therefore enables the definition of refinement operators for searching the hypothesis space [9].

There are very few attempts at learning in hybrid languages [15,7]. Unfortunately none of these works is motivated by and/or aimed at a specific application. Conversely, our framework for learning in AL-log has been implemented in an ILP system, AL-QuIn [9], that supports the task of frequent pattern discovery at multiple levels of description granularity. A previous version of AL-QuIn is described in [10].
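The test stated in Theorem 1 can be sketched in a few lines of Python. This is only an illustration under strong simplifying assumptions, not the procedure implemented in AL-QuIn: the ALC knowledge base is reduced to a toy taxonomy of atomic inclusions (so no tableau calculus is needed), the relational part contributes no clauses to the derivation, and Object Identity is approximated by requiring injective substitutions. The clauses `H1`, `H2`, `H3`, the predicate `p` and the concept names `General` and `Specific` are hypothetical.

```python
from itertools import permutations

# ASSUMPTION: toy taxonomy with the single atomic inclusion Specific ⊑ General.
PARENT = {"Specific": "General"}

def subsumed_by(c, d):
    """True if concept c is reflexively/transitively included in concept d."""
    while c is not None:
        if c == d:
            return True
        c = PARENT.get(c)
    return False

def variables(clause):
    """Variables (terms starting with an uppercase letter) of head and body."""
    head, body, _ = clause
    terms = list(head[1:]) + [t for atom in body for t in atom[1:]]
    return sorted({t for t in terms if t[:1].isupper()})

def substitute(mapping, atom):
    """Apply a substitution to an atom given as (predicate, arg1, ...)."""
    return (atom[0],) + tuple(mapping.get(t, t) for t in atom[1:])

def b_subsumes(h1, h2):
    """Simplified check of H1 >=_B H2 following conditions (i) and (ii)."""
    head1, body1, cons1 = h1
    head2, body2, cons2 = h2
    # Skolem substitution sigma: one fresh constant per variable of H2.
    sigma = {v: "sk_" + v.lower() for v in variables(h2)}
    head2_sk = substitute(sigma, head2)
    facts = {substitute(sigma, a) for a in body2}
    known = {(sigma.get(t, t), c) for t, c in cons2}
    terms = sorted({t for a in facts | {head2_sk} for t in a[1:]})
    vars1 = variables(h1)
    # Enumerate theta composed with sigma directly; permutations keep it injective (OI).
    for image in permutations(terms, len(vars1)):
        theta_sigma = dict(zip(vars1, image))
        if substitute(theta_sigma, head1) != head2_sk:        # condition (i)
            continue
        if not all(substitute(theta_sigma, a) in facts for a in body1):
            continue                                           # Datalog part of (ii)
        if all(any(s == theta_sigma.get(t, t) and subsumed_by(c2, c1)
                   for s, c2 in known) for t, c1 in cons1):    # constraints of (ii)
            return True
    return False

# Hypothetical clauses: (head, body atoms, constraints as (term, concept) pairs).
H1 = (("q", "A"), [("p", "A", "B")], [("B", "General")])
H2 = (("q", "X"), [("p", "X", "Y")], [("Y", "Specific")])
H3 = (("q", "X"), [("p", "X", "Y"), ("p", "X", "Z")], [("Y", "General")])
```

With these definitions, `b_subsumes(H1, H2)` and `b_subsumes(H1, H3)` hold, while the converse checks fail: the first because `General` is not included in `Specific`, the second because no injective substitution maps the three variables of `H3` onto the two terms of `H1`.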
3 Discovering frequent patterns with AL-QuIn
A frequent pattern is an intensional description of a subset of a given data set $r$ whose cardinality exceeds a user-defined threshold. The task of frequent pattern discovery is to generate all frequent patterns expressible in a given language $L$. In particular, we consider a variant of this task which takes concept hierarchies into account during the discovery process, thus yielding descriptions of $r$ at multiple granularity levels. More formally, given

– a data set $r$ including a taxonomy $T$ where a reference concept and task-relevant concepts are designated,
– a set $\{L_l\}_{1 \leq l \leq maxG}$ of languages,
– a set $\{minsup_l\}_{1 \leq l \leq maxG}$ of support thresholds,

the problem of frequent pattern discovery at $l$ levels of description granularity, $1 \leq l \leq maxG$, is to find the set $F$ of all the patterns $P \in L_l$ frequent in $r$, namely those $P$ with support $s$ such that (i) $s \geq minsup_l$ and (ii) all ancestors of $P$ w.r.t. $T$ are frequent. In AL-QuIn, the data set $r$ is actually an AL-log knowledge base.

Example 1. As a running example, we consider an AL-log knowledge base $\mathcal{B}$ that adds ALC ontologies to Datalog facts² extracted from the on-line 1996 CIA World Fact Book³. Note that an ontology already available for the same domain⁴ is not suitable for our illustrative purposes because it contains only primitive concepts and shallow concept hierarchies. The structural subsystem $\Sigma$ focuses on the concepts Country, EthnicGroup, Language, and Religion. The intensional part of $\Sigma$ encompasses inclusion statements such as

  AsianCountry ⊑ Country.
  MiddleEasternEthnicGroup ⊑ EthnicGroup.
  MiddleEastCountry ≡ AsianCountry ⊓ ∃Hosts.MiddleEasternEthnicGroup.
  IndoEuropeanLanguage ⊑ Language.
  IndoIranianLanguage ⊑ IndoEuropeanLanguage.
  MonotheisticReligion ⊑ Religion.
  ChristianReligion ⊑ MonotheisticReligion.
  MuslimReligion ⊑ MonotheisticReligion.

that define four taxonomies, one for each concept above.
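The taxonomies induced by these inclusion statements can be made concrete with a small Python sketch. It is only an illustration: it handles atomic inclusions (no tableau reasoning), it records the inclusion of MiddleEastCountry in AsianCountry that follows from the definition above, and reading the depth of a concept in its taxonomy as its granularity level is an assumption made here, not a definition taken from AL-QuIn. The names `INCLUSIONS`, `ancestors` and `depth` are hypothetical.

```python
# Atomic inclusions of Example 1, as (sub-concept, super-concept) pairs.
INCLUSIONS = [
    ("AsianCountry", "Country"),
    ("MiddleEasternEthnicGroup", "EthnicGroup"),
    # Follows from MiddleEastCountry ≡ AsianCountry ⊓ ∃Hosts.MiddleEasternEthnicGroup.
    ("MiddleEastCountry", "AsianCountry"),
    ("IndoEuropeanLanguage", "Language"),
    ("IndoIranianLanguage", "IndoEuropeanLanguage"),
    ("MonotheisticReligion", "Religion"),
    ("ChristianReligion", "MonotheisticReligion"),
    ("MuslimReligion", "MonotheisticReligion"),
]
PARENT = dict(INCLUSIONS)

def ancestors(concept):
    """Proper ancestors of a concept, from the nearest up to the taxonomy root."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain

def depth(concept):
    """Depth of a concept in its taxonomy; roots such as Country have depth 1."""
    return len(ancestors(concept)) + 1

for name in ("Religion", "MonotheisticReligion", "MuslimReligion"):
    print(name, depth(name), ancestors(name))
```

Under this reading, MonotheisticReligion and IndoEuropeanLanguage sit at depth 2, while MuslimReligion and IndoIranianLanguage sit at depth 3.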
Note that Middle East countries (concept MiddleEastCountry) have been defined as Asian countries that host at least one Middle Eastern ethnic group. Assertions like

  'ARM':AsianCountry.
  'IR':AsianCountry.
  'Arab':MiddleEasternEthnicGroup.
  'Armenian':MiddleEasternEthnicGroup.
  :Hosts.
  :Hosts.
  'Armenian':IndoEuropeanLanguage.
  'Persian':IndoIranianLanguage.
  'Armenian Orthodox':ChristianReligion.
  'Shia':MuslimReligion.
  'Sunni':MuslimReligion.

belong to the extensional part of $\Sigma$. In particular, Armenia ('ARM') and Iran ('IR') are classified as Middle East countries. The relational subsystem $\Pi$ consists of facts such as

  language('ARM','Armenian',96).
  language('IR','Persian',58).
  religion('ARM','Armenian Orthodox',94).
  religion('IR','Shia',89).
  religion('IR','Sunni',10).

and the constrained Datalog clauses

  speaks(CountryID, LanguageName) ← language(CountryID,LanguageName,Percent) & CountryID:Country, LanguageName:Language
  believes(CountryID, ReligionName) ← religion(CountryID,ReligionName,Percent) & CountryID:Country, ReligionName:Religion

that define two views on language and religion.

² http://www.dbis.informatik.uni-goettingen.de/Mondial/mondial-rel-facts.flp
³ http://www.odci.gov/cia/publications/factbook/
⁴ http://www.daml.org/2003/09/factbook/factbook-ont

The languages $L_l$, $1 \leq l \leq maxG$, contain O-queries, i.e. constrained Datalog clauses of the form

$Q = q(X) \leftarrow \alpha_1, \ldots, \alpha_m \,\&\, X : \hat{C}, \gamma_2, \ldots, \gamma_n$

where $X$ is the distinguished variable bound by an ALC concept $\hat{C}$ of reference, and the remaining variables occurring in $body(Q)$ are the existential variables.

Example 2. The following O-queries

  Q4 = q(X) ← believes(X,Y), speaks(X,Z) & X:MiddleEastCountry, Y:MonotheisticReligion, Z:IndoEuropeanLanguage
  Q5 = q(X) ← believes(X,Y), speaks(X,Z) & X:MiddleEastCountry, Y:MuslimReligion, Z:IndoIranianLanguage

describe Middle East countries with respect to the religions believed and the languages spoken at two different levels of granularity.

In [13] the space of patterns is organized according to a generality order between patterns and searched one level at a time, starting from the most general
patterns and iterating between candidate generation and candidate evaluation phases. Furthermore, the generality order must be monotonic w.r.t. the function supp for evaluating the support of patterns. It has been proved that $\succeq_{\mathcal{B}}$ fulfills this requirement [10]. In the following we illustrate the advantages of $\succeq_{\mathcal{B}}$ in frequent pattern discovery.

Example 3. Given the O-queries

  Q1 = q(A) ← believes(A,B) & A:MiddleEastCountry, B:MonotheisticReligion
  Q2 = q(X) ← believes(X,Y) & X:MiddleEastCountry, Y:MuslimReligion

we want to check whether $Q_1 \succeq_{\mathcal{B}} Q_2$ holds. Let $\sigma = \{X/a, Y/b\}$ be a Skolem substitution for $Q_2$ w.r.t. $\mathcal{B} \cup \{Q_1\}$ and $\theta = \{A/X, B/Y\}$ a substitution for $Q_1$. Condition (i) of Theorem 1 is immediately verified. Condition (ii) also holds, since

$\mathcal{B} \cup \{\texttt{believes(a,b)} \,\&\, \texttt{a:MiddleEastCountry}, \texttt{b:MuslimReligion}\} \vdash \texttt{believes(a,b)} \,\&\, \texttt{a:MiddleEastCountry}, \texttt{b:MonotheisticReligion}$.

Therefore we can say that $Q_1 \succeq_{\mathcal{B}} Q_2$. Conversely, $Q_2 \not\succeq_{\mathcal{B}} Q_1$. Given the O-query

  Q3 = q(X) ← believes(X,Y), believes(X,Z) & X:MiddleEastCountry, Y:MonotheisticReligion

it can be easily verified that $Q_1 \succeq_{\mathcal{B}} Q_3$ holds by choosing $\sigma = \{X/a, Y/b, Z/c\}$ as a Skolem substitution for $Q_3$ w.r.t. $\mathcal{B} \cup \{Q_1\}$ and $\theta = \{A/X, B/Y\}$ as a substitution for $Q_1$. Note that $Q_3 \not\succeq_{\mathcal{B}} Q_1$ under the OI bias because the substitution $\{X/A, Y/B, Z/B\}$ for $Q_3$ is not legal.

These generality relations between $Q_1$, $Q_2$, and $Q_3$ can be verified by comparing their answer sets (for the definition of correct/computed answer to O-queries, see [10]). Iran satisfies all three patterns. Armenia, as opposed to Iran, is a well-known borderline case for the geo-political concept of Middle East, though Armenians are usually listed among Middle Eastern ethnic groups. In 1996 the on-line CIA World Fact Book considered Armenia as part of Asia. However, experts nowadays tend to consider it part of Europe, and therefore outside the Middle East. This trend emerges from our small experiment because, among the three characterizations of the Middle East, Armenia satisfies only the weakest one, i.e. $Q_1$. This also confirms that $Q_2 \not\succeq_{\mathcal{B}} Q_1$ and $Q_3 \not\succeq_{\mathcal{B}} Q_1$.
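The comparison of answer sets in Example 3 can be reproduced with a brute-force Python sketch over the fragment of the knowledge base listed in Example 1. It is a simplified illustration rather than the constrained SLD-resolution used by AL-QuIn: concept membership is decided from the atomic inclusions plus the definition of MiddleEastCountry, and Object Identity is approximated by binding distinct variables to distinct individuals. The two `HOSTS` pairs are an assumption introduced for this sketch (the text only states that both countries are classified as Middle East countries), and the variables of Q1 are renamed to X and Y for uniformity.

```python
from itertools import permutations

# Atomic inclusions of Example 1 (sub-concept -> super-concept).
PARENT = {
    "AsianCountry": "Country",
    "MiddleEasternEthnicGroup": "EthnicGroup",
    "MonotheisticReligion": "Religion",
    "ChristianReligion": "MonotheisticReligion",
    "MuslimReligion": "MonotheisticReligion",
}
# Concept assertions of Example 1, as (individual, concept) pairs.
ASSERTED = {
    ("ARM", "AsianCountry"), ("IR", "AsianCountry"),
    ("Arab", "MiddleEasternEthnicGroup"), ("Armenian", "MiddleEasternEthnicGroup"),
    ("Armenian Orthodox", "ChristianReligion"),
    ("Shia", "MuslimReligion"), ("Sunni", "MuslimReligion"),
}
# ASSUMPTION: hypothetical Hosts pairs, chosen so that both countries
# come out as Middle East countries, as stated in Example 1.
HOSTS = {("ARM", "Armenian"), ("IR", "Arab")}
RELIGION = [("ARM", "Armenian Orthodox", 94), ("IR", "Shia", 89), ("IR", "Sunni", 10)]

def subsumed_by(c, d):
    """True if concept c is reflexively/transitively included in concept d."""
    while c is not None:
        if c == d:
            return True
        c = PARENT.get(c)
    return False

def instance_of(ind, concept):
    """Concept membership from assertions, inclusions and the MiddleEastCountry definition."""
    if concept == "MiddleEastCountry":
        return instance_of(ind, "AsianCountry") and any(
            h == ind and instance_of(g, "MiddleEasternEthnicGroup") for h, g in HOSTS)
    return any(i == ind and subsumed_by(c, concept) for i, c in ASSERTED)

# View: believes(C, R) <- religion(C, R, P) & C:Country, R:Religion.
BELIEVES = {(c, r) for c, r, _ in RELIGION
            if instance_of(c, "Country") and instance_of(r, "Religion")}

def answers(body, constraints):
    """Bindings of the distinguished variable X that satisfy the O-query."""
    variables = sorted({v for atom in body for v in atom})
    domain = sorted({t for pair in BELIEVES for t in pair})
    found = set()
    # permutations() binds distinct variables to distinct individuals (OI).
    for values in permutations(domain, len(variables)):
        binding = dict(zip(variables, values))
        if (all((binding[a], binding[b]) in BELIEVES for a, b in body)
                and all(instance_of(binding[v], c) for v, c in constraints.items())):
            found.add(binding["X"])
    return found

# Each query: (believes atoms as variable pairs, constraints per variable).
Q1 = ([("X", "Y")], {"X": "MiddleEastCountry", "Y": "MonotheisticReligion"})
Q2 = ([("X", "Y")], {"X": "MiddleEastCountry", "Y": "MuslimReligion"})
Q3 = ([("X", "Y"), ("X", "Z")], {"X": "MiddleEastCountry", "Y": "MonotheisticReligion"})

for name, query in [("Q1", Q1), ("Q2", Q2), ("Q3", Q3)]:
    print(name, sorted(answers(*query)))
```

Under these assumptions the sketch returns both Armenia and Iran for Q1 and only Iran for Q2 and Q3, which is consistent with the generality relations discussed in Example 3.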
4 Conclusions
In contrast to traditional ILP, the background knowledge in our approach does not ignore the latest developments in knowledge engineering. Indeed $\mathcal{B}$-subsumption, presented as the core ingredient for learning in AL-log, allows ontologies expressed in ALC to be embedded into the more usual inductive reasoning in Datalog. The approach not only takes ontologies as input to the discovery process but also returns knowledge that can be used to make the input ontologies evolve. For example, Example 3 highlights the potential usefulness of $\succeq_{\mathcal{B}}$-ordered spaces of O-queries in conceptual clustering. Indeed each O-query describes a cluster of the individuals of the reference concept, and $\mathcal{B}$-subsumption relations stand for containment relations between clusters. These links are promising and encourage us to continue along this direction of research. We also wish to extend our framework for learning in AL-log to more expressive hybrid languages, notably to ORL. This will allow us, as soon as large ORL data sources become available, to evaluate empirically the intended Semantic Web application sketched in this paper.
References

1. B. Berendt, A. Hotho, and G. Stumme. Towards semantic web mining. In I. Horrocks and J.A. Hendler, editors, International Semantic Web Conference, volume 2342 of Lecture Notes in Computer Science, pages 264–278. Springer, 2002.
2. T. Berners-Lee. Weaving the Web. Harper: San Francisco, 1999.
3. S. Ceri, G. Gottlob, and L. Tanca. Logic Programming and Databases. 1990.
4. F.M. Donini, M. Lenzerini, D. Nardi, and A. Schaerf. AL-log: Integrating Datalog and Description Logics. J. of Intelligent Information Systems, 10(3):227–252, 1998.
5. I. Horrocks and P.F. Patel-Schneider. A Proposal for an OWL Rules Language. In Proc. of the 13th Int. World Wide Web Conference, pages 723–731. ACM, 2004.
6. I. Horrocks, P.F. Patel-Schneider, and F. van Harmelen. From SHIQ and RDF to OWL: The making of a web ontology language. Journal of Web Semantics, 1(1):7–26, 2003.
7. J.-U. Kietz. Learnability of description logic programs. In S. Matwin and C. Sammut, editors, Inductive Logic Programming, volume 2583 of Lecture Notes in Artificial Intelligence, pages 117–132. Springer, 2003.
8. F.A. Lisi, S. Ferilli, and N. Fanizzi. Object Identity as Search Bias for Pattern Spaces. In F. van Harmelen, editor, ECAI 2002. Proceedings of the 15th European Conference on Artificial Intelligence, pages 375–379, Amsterdam, 2002. IOS Press.
9. F.A. Lisi and D. Malerba. Ideal Refinement of Descriptions in AL-log. In T. Horvath and A. Yamamoto, editors, Inductive Logic Programming, volume 2835 of Lecture Notes in Artificial Intelligence, pages 215–232. Springer, 2003.
10. F.A. Lisi and D. Malerba. Inducing Multi-Level Association Rules from Multiple Relations. Machine Learning, 55:175–210, 2004.
11. A. Maedche and S. Staab. Discovering Conceptual Relations from Text. In W. Horn, editor, Proceedings of the 14th European Conference on Artificial Intelligence, pages 321–325. IOS Press, 2000.
12. A. Maedche and V. Zacharias. Clustering Ontology-Based Metadata in the Semantic Web. In T. Elomaa, H. Mannila, and H. Toivonen, editors, Principles of Data Mining and Knowledge Discovery, volume 2431 of Lecture Notes in Computer Science, pages 348–360. Springer, 2002.
13. H. Mannila and H. Toivonen. Levelwise search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery, 1(3):241–258, 1997.
14. S.-H. Nienhuys-Cheng and R. de Wolf. Foundations of Inductive Logic Programming, volume 1228 of Lecture Notes in Artificial Intelligence. Springer, 1997.
15. C. Rouveirol and V. Ventos. Towards Learning in CARIN-ALN. In J. Cussens and A. Frisch, editors, Inductive Logic Programming, volume 1866 of Lecture Notes in Artificial Intelligence, pages 191–208. Springer, 2000.
16. M. Schmidt-Schauss and G. Smolka. Attributive concept descriptions with complements. Artificial Intelligence, 48(1):1–26, 1991.