Pattern based feature construction in semantic data mining
Agnieszka Ławrynowicz, Poznan University of Technology, Poland Jędrzej Potoniec, Poznan University of Technology, Poland
ABSTRACT
We propose a new method for mining sets of patterns for classification, where patterns are represented as SPARQL queries over RDFS. The method contributes to so-called semantic data mining, a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies, rather than only purely empirical data. We have developed a tool that implements this approach. Using this tool, we have conducted an experimental evaluation that includes a comparison of our method with state-of-the-art approaches to the classification of semantic data, and an experimental study within an emerging subfield of meta-learning called semantic meta-mining. The most important research contributions of the paper to the state of the art are as follows. For pattern mining research, or relational learning in general, the paper contributes a new algorithm for the discovery of a new type of patterns. For Semantic Web research, it theoretically and empirically illustrates how semantic, structured data can be used in traditional machine learning methods through a pattern-based approach for constructing semantic features.
Keywords: pattern discovery, semantic data mining, SPARQL, meta-learning, ontology, intelligent system
INTRODUCTION
Pattern discovery is a fundamental data mining task. It deals with the automatic detection of patterns in data. A pattern is any regularity, relation or structure inherent in some source of data (Shawe-Taylor & Cristianini, 2004). Various methods have been proposed for finding patterns in a variety of forms such as item sets, association rules, correlations, sequences, and episodes. In this paper, we are interested in structured domains, where data is represented in complex forms like relational databases, logic programs, and in particular semantic data such as ontology-based knowledge bases or Linked Open Data (LOD)¹.
Relational pattern discovery has been investigated since the development of WARMR (Dehaspe & Toivonen, 1999), an algorithm for mining patterns using the Datalog subset of first-order logic as the representation language for data and patterns. It was followed by further relational pattern mining algorithms such as FARMER (Nijssen & Kok, 2001) and c-armr (De Raedt & Ramon, 2004). They can all be classified as Inductive Logic Programming (ILP) (Nienhuys-Cheng & Wolf, 1997) methods since they use subsets of logic programs as the representation language.
With the rise of the Semantic Web (Berners-Lee, Hendler, & Lassila, 2001), also called the Web of Data, interest has grown in employing the languages and knowledge representation formalisms underpinning the Semantic Web in data mining. This interest is motivated by the increasing popularity, number and size of semantic data sources such as LOD (containing billions of pieces of data linked together²), which require statistical approaches able to handle Semantic Web knowledge representation formalisms. These formalisms include logic-based ontology languages such as description logics (DLs) (Baader, Calvanese, McGuinness, Nardi, & Patel-Schneider, 2003) that constitute the formalism underlying the standard ontology language for the Web, the Web Ontology Language (OWL) (McGuinness & van Harmelen, 2004). In this line, (Lisi & Esposito, 2008) laid the foundations of an extension of relational learning, called onto-relational learning, to account for ontologies. (Fanizzi, d'Amato, & Esposito, 2010) propose the term ontology mining for all activities that allow hidden knowledge to be discovered from ontological knowledge bases, possibly using only a sample of data. Finally, (Kralj-Novak, Vavpetic, Trajkovski, & Lavrac, 2009) coined the term semantic data mining³ to denote a data mining approach where domain ontologies are used as background knowledge, and where the new challenge is to mine knowledge encoded in domain ontologies, rather than to mine purely empirical data.
¹ http://linkeddata.org/
The above-mentioned interest has been reflected in the development of relevant pattern mining algorithms: first onto-relational ones like SPADA (Lisi & Malerba, 2004), SEMINTEC (Józefowska, Ławrynowicz, & Łukaszewski, 2010) or AL-QuIn (Lisi, 2011), and subsequently algorithms fully based on a description logic ontology language, like Fr-ONT (Ławrynowicz & Potoniec, 2011).
In recent years, the use of patterns in predictive models has drawn considerable attention (Bringmann, Nijssen, & Zimmermann, 2009). Especially in complex, structured domains, such as graphs and sequences, pattern mining can be helpful to obtain models. The main idea is that patterns can be used as features to build a predictive model. For instance, pattern-based classification is a process of learning a classification model where patterns are used as features. According to recent studies, classification models that use pattern-based features may be more accurate or simpler to understand than models built on the original feature set (Cheng, Yan, Han, & Hsu, 2007). In structured domains, pattern mining may work as a propositionalisation approach that enables the use of classical propositional data mining and machine learning methods by decoupling the data representation from the learning task.
This paper describes a method for pattern-based classification based on a novel algorithm for pattern mining. The proposed algorithm discovers patterns represented as SPARQL (Prud'hommeaux & Seaborne, 2008) queries over a subset of RDF Schema (RDFS) (Brickley & Guha, 2004) suitable for representing lightweight ontologies. The algorithm takes the semantics of the RDFS vocabulary into account, which enables it to exploit knowledge encoded in ontologies. Through a propositionalisation approach the patterns are used as features in classification. Subsequently, we describe a tool we have developed to support semantic data mining approaches in general, in which the proposed method is implemented.
The tool, an extension to the leading open source data mining environment RapidMiner (Mierswa, Wurst, Klinkenberg, Scholz, & Euler, 2006), enables building data mining processes (workflows) from small blocks (operators) by connecting their inputs and outputs, and contributes to so-called third generation data mining systems (Piatetsky-Shapiro, 1997) (Hilario, Lavrac, Podpecan, & Kok, 2010). Finally, we describe the results of experiments, including an experimental study within an emerging subfield of meta-learning called semantic meta-mining (Hilario, Nguyen, Do, Woznica, & Kalousis, 2011), an ontology-based, process-oriented form of meta-learning that aims to learn over full knowledge discovery processes rather than over individual algorithms.
Consequently, the paper makes the following research contributions to the state of the art: (i) for pattern mining research, or semantic data mining and relational learning in general, it contributes a new algorithm for the discovery of a new type of patterns, namely SPARQL queries over RDFS; (ii) for Semantic Web research, it illustrates how semantic, structured data can be used in traditional data mining and machine learning methods through a propositionalisation approach; (iii) for meta-learning research, it contributes a pattern-based method for learning over knowledge discovery processes with the support of ontologies as background knowledge.
The rest of the paper is organized as follows. The next section discusses related work. In the Preliminaries section, we introduce notions and definitions used in the text. Further sections describe the proposed algorithm, the implemented tool, and the experimental study. The last section concludes.
² http://lod-cloud.net/state/
³ http://semantic.cs.put.poznan.pl/SDM-tutorial2011/doku.php?id=start
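The propositionalisation idea mentioned in the Introduction, where patterns become features for a classical learner, can be illustrated with a minimal sketch. It assumes that each pattern has already been evaluated and is available as the set of examples it covers. The function name `propositionalise`, the pattern names `q1` and `q2`, and the toy data are hypothetical, and the binary encoding is one plausible choice rather than necessarily the one used by the tool described in this paper.

```python
from typing import Dict, List, Set, Tuple


def propositionalise(
    example_ids: List[str],
    answer_sets: Dict[str, Set[str]],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Turn pattern answer sets into one binary feature vector per example.

    Feature j of example e is 1 if e belongs to the answer set of pattern j.
    """
    pattern_names = sorted(answer_sets)  # fixed column order
    rows = {
        e: [int(e in answer_sets[name]) for name in pattern_names]
        for e in example_ids
    }
    return pattern_names, rows


# Hypothetical toy input: three examples and the answer sets of two patterns.
examples = ["t1", "t2", "t3"]
answer_sets = {"q1": {"t1", "t2"}, "q2": {"t2"}}

columns, features = propositionalise(examples, answer_sets)
for e in examples:
    print(e, dict(zip(columns, features[e])))
```

The resulting table has one row per example and one column per pattern, so any propositional classifier can be trained on it without knowing how the patterns were mined.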
RELATED WORK
The work relevant to ours may be grouped into the following research threads: relational pattern mining, pattern-based classification, and classification methods for Semantic Web data.
As already mentioned, relational pattern discovery methods include WARMR (Dehaspe & Toivonen, 1999), FARMER (Nijssen & Kok, 2001), and c-armr (De Raedt & Ramon, 2004). All of them employ Datalog as the representation language for data, background knowledge and patterns, and are aimed at frequent pattern discovery. These systems were generally not designed to exploit ontological knowledge, e.g. in the form of taxonomies of classes. In particular, WARMR and FARMER employ a syntactic generality relation (e.g. so-called θ-subsumption (Plotkin, 1970)) that does not take background knowledge into account. c-armr uses a semantic generality relation, a kind of generalized subsumption (Buntine, 1988), but does not fully exploit it with respect to taxonomic information. More precisely, when c-armr discovers that a pattern containing atom C(x) is infrequent, this does not stop it from subsequently testing a pattern containing atom C'(x), where C is more general than C'.
In turn, the onto-relational methods SPADA (Lisi & Malerba, 2004), SEMINTEC (Józefowska, Ławrynowicz, & Łukaszewski, 2010), and AL-QuIn (Lisi, 2011) all exploit taxonomies in some way, and all use semantic generality relations such as generalized subsumption or query containment. SPADA (further refined to AL-QuIn) uses a hybrid knowledge representation formalism, AL-log (Donini, Lenzerini, Nardi, & Schaerf, 1998), that combines Datalog with description logic. Patterns in SPADA/AL-QuIn are represented as constrained Datalog clauses, where description logic concepts (that is, ontology classes) are used as constraints in the body.
SPADA/AL-QuIn solves a variant of the frequent pattern discovery task, where classes from ontological taxonomies are used in the constraints of the clauses to produce patterns at multiple levels of granularity. These levels of granularity are exploited very systematically, with the drawback that classes from different granularity levels are never mixed in one clause. One may expect this to be a problem in the case of unbalanced class hierarchies. The SEMINTEC method
(Józefowska, Ławrynowicz, & Łukaszewski, 2010) does not have this restriction. It uses so-called DL-safe rules (Motik, Sattler, & Studer, 2005) as the representation language of the knowledge base on which it operates. The formalism of DL-safe rules combines Semantic Web ontologies (represented in description logic) and rules (represented in disjunctive Datalog). Description logic and disjunctive Datalog rules are integrated by allowing description logic concepts and roles (ontology classes and properties) to occur in rules as unary and binary predicates, respectively, forming so-called DL-atoms. Patterns in SEMINTEC are represented as conjunctive (DL-safe) queries over such a knowledge base, and are basically composed of sets of atoms. A more in-depth analysis of the properties of the above-mentioned systems may be found in (Józefowska, Ławrynowicz, & Łukaszewski, 2010).
The Fr-ONT algorithm (Ławrynowicz & Potoniec, 2011) differs from the methods described so far in that it uses a variable-free notation of description logic to represent patterns. Patterns in this approach are represented as concepts of the description logic EL++ (corresponding to classes of the OWL 2 EL profile).
With our proposed method we go in a different direction from all the relational pattern mining methods described above, namely towards mining patterns from Resource Description Framework (RDF) (Manola & Miller, 2004) graphs of the Semantic Web. Therefore, we explicitly use the Semantic Web query language SPARQL as the language for representing patterns. Although it may be argued that SPARQL is essentially no more expressive than Datalog (Angles & Gutierrez, 2008), this argument does not apply to our case. Despite targeting RDF, we also explicitly employ the semantics of a subset of RDFS to properly handle ontological knowledge, and hence we work with SPARQL queries over RDFS. It is also worth mentioning that we handle data types (which was not the case, e.g.,
with the SEMINTEC method). Finally, explicitly using the SPARQL language (rather than transforming SPARQL queries to, e.g., Datalog) allows us to use specific constructs of SPARQL, such as FILTER, in the patterns.
This paper assumes that patterns are mined with the goal of subsequently using them as features to train a classification model. A comprehensive study of the problem of mining sets of patterns for classification is presented in (Bringmann, Nijssen, & Zimmermann, 2009). The authors of the study categorize pattern-based classification methods along the following dimensions: (1) whether they post-process an already pre-computed set of patterns or execute the pattern mining algorithm iteratively; (2) whether they select patterns model-independently or whether the pattern selection is guided by a model. Another study, discussing the problem of computing so-called class-sensitive patterns, may be found in (Kralj-Novak, Lavrac, & Webb, 2009). In this paper, we take the task of computing the classification model into account already during pattern construction, which is reflected in the chosen pattern quality measures and search strategy. In general, our pattern mining method is not dedicated to a single pattern quality measure, and indeed, our tool implementation supports several pattern quality measures that may be chosen during pattern mining.
The classification task on Semantic Web data has already been considered in several works such as DL-FOIL (Fanizzi, D'Amato, & Esposito, 2008), DL-Learner (Lehmann, 2009), SPARQL-ML (Kiefer, Bernstein, & Locher, 2008) and the kernel-based methods proposed by (Bloehdorn & Sure, 2007) and by (Loesch, Bloehdorn, & Rettinger, 2012). We observed that those methods are mainly designed to work on two polar knowledge representations: only RDF data or complex DLs. However, semantic data (such as LOD datasets)
are rarely pure RDF or DL ontologies, but rather a mix of the two. According to a recent survey (Glimm, Hogan, Kroetzsch, & Polleres, 2012), the subset of RDFS that corresponds to the ρdf language (Munoz, Perez, & Gutierrez, 2007), which our method employs, constitutes the most frequently used vocabulary on the Web of Data. Another observation is that the mentioned works address the classification task via concept learning (DL-FOIL and DL-Learner), statistical relational learning (SPARQL-ML), or kernel methods. In this work, we propose a new method that addresses the classification task on Semantic Web data via pattern mining. Our research hypothesis is that if a sufficiently large, good-quality pattern-based feature set is constructed, then this kind of method is able to outperform the other proposed classification methods even in cases where only the semantics of ρdf is used instead of complex DLs.
PRELIMINARIES
In this section we present notions and definitions that are used further in the text. First we provide a short overview of the knowledge representation languages employed in this work, namely RDF and RDFS, and of the query language SPARQL. Subsequently, we formulate the problem of mining pattern sets, where patterns are represented as SPARQL queries.
Language of knowledge representation
RDF and RDFS syntax. RDF is a graph-based data format designed to describe resources on the Web and properties of those resources by means of statements in the form of subject-predicate-object structures. To formally define RDF we follow (Munoz, Perez, & Gutierrez, 2007). We consider pairwise disjoint infinite sets U, B, and L which denote, respectively, URI references, blank nodes and literals. An RDF triple is a tuple τ = (s, p, o) ∈ (U ∪ B ∪ L) × U × (U ∪ B ∪ L). In this tuple, s is called the subject, p is called the predicate, and o is called the object. An RDF graph G is a set of RDF triples. We will also refer to an RDF graph as an RDF dataset.
The universe of G, denoted by universe(G), is the set of elements in U ∪ B ∪ L that occur in the triples of G. The vocabulary of G, denoted by voc(G), is the set universe(G) ∩ (U ∪ L). In this work, we assume a fragment of RDFS, called ρdf (Munoz, Perez, & Gutierrez, 2007), that covers fundamental features of RDFS. This fragment is defined as the following subset of the RDFS vocabulary: ρdf = {sp, sc, type, dom, range}, where sp stands for rdfs:subPropertyOf, sc stands for rdfs:subClassOf, type stands for rdf:type, and dom and range stand for, respectively, rdfs:domain and rdfs:range. Thus, by (p, sp, q) we will denote that property p is a subproperty of property q, by (c, sc, d) we will denote that class c is a subclass of class d, by (a, type, b) we will denote that a is of type b, and by (p, dom, c) and (p, range, c) we will denote, respectively, that the domain of property p is c, and that the range of property p is c.
RDF and RDFS semantics. Following (Munoz, Perez, & Gutierrez, 2007) we will now define the semantics of RDF and RDFS. An interpretation $I$ over a vocabulary $Voc$ is a tuple $I = \langle \Delta_{Res}, \Delta_{P}, \Delta_{C}, \Delta_{L}, P_{\cdot}, C_{\cdot}, \cdot^{I} \rangle$, where $\Delta_{Res}, \Delta_{P}, \Delta_{C}, \Delta_{L}$ are the interpretation domains of $I$, which are finite non-empty sets, and $P_{\cdot}, C_{\cdot}, \cdot^{I}$ are the interpretation functions of $I$, such that:
1. $\Delta_{Res}$ are the resources, called the domain or universe of $I$;
2. $\Delta_{P}$ are property names (not necessarily disjoint from $\Delta_{Res}$);
3. $\Delta_{C} \subseteq \Delta_{Res}$ are the classes;
4. $\Delta_{L} \subseteq \Delta_{Res}$ are the literal values, where $\Delta_{L}$ contains all plain literals in $L \cap Voc$;
5. $P_{\cdot}$ is a function $P_{\cdot}: \Delta_{P} \to 2^{\Delta_{Res} \times \Delta_{Res}}$, a mapping that assigns an extension to each property name;
6. $C_{\cdot}$ is a function $C_{\cdot}: \Delta_{C} \to 2^{\Delta_{Res}}$, a mapping that assigns a set of resources to every resource denoting a class;
7. $\cdot^{I}: (U \cup L) \cap Voc \to \Delta_{Res} \cup \Delta_{P}$ is the interpretation mapping that assigns a resource or a property name to each element of $(U \cup L)$ in $Voc$, and such that $\cdot^{I}$ is the identity for plain literals and assigns an element in $\Delta_{Res}$ to elements in $L$.
An interpretation $I$ is a model of a graph $G$, denoted $I \models G$, iff $I$ is an interpretation over the vocabulary $\rho df \cup universe(G)$ that satisfies the following conditions:
1. Simple: (a) there exists a function $A: B \to \Delta_{Res}$ such that for each $(s, p, o) \in G$, $p^{I_A} \in \Delta_{P}$ and $(s^{I_A}, o^{I_A}) \in P_{p^{I_A}}$, where $\cdot^{I_A}$ is the extension of $\cdot^{I}$ using $A$;
2. Subproperty: (a) $P_{sp^{I_A}}$ is transitive over $\Delta_{P}$; (b) if $(p, q) \in P_{sp^{I_A}}$ then $p, q \in \Delta_{P}$ and $P_{p} \subseteq P_{q}$;
3. Subclass: (a) $P_{sc^{I_A}}$ is transitive over $\Delta_{C}$; (b) if $(c, d) \in P_{sc^{I_A}}$ then $c, d \in \Delta_{C}$ and $C_{c} \subseteq C_{d}$;
4. Typing I: (a) $x \in C_{c}$ iff $(x, c) \in P_{type^{I_A}}$; (b) if $(p, c) \in P_{dom^{I_A}}$ and $(x, y) \in P_{p}$ then $x \in C_{c}$; (c) if $(p, c) \in P_{range^{I_A}}$ and $(x, y) \in P_{p}$ then $y \in C_{c}$;
5. Typing II: (a) for each $e \in \rho df$, $e^{I_A} \in \Delta_{P}$; (b) if $(p, c) \in P_{dom^{I_A}}$ then $p \in \Delta_{P}$ and $c \in \Delta_{C}$; (c) if $(p, c) \in P_{range^{I_A}}$ then $p \in \Delta_{P}$ and $c \in \Delta_{C}$; (d) if $(x, c) \in P_{type^{I_A}}$ then $c \in \Delta_{C}$.
Graph $G$ entails graph $H$ under ρdf, denoted $G \models_{\rho df} H$, iff every model under ρdf of $G$ is also a model under ρdf of $H$.
SPARQL syntax. A SPARQL query Q is composed of the body of the query, denoted body(Q), and the head of the query, denoted head(Q).
The body of a SPARQL query may be a complex RDF graph pattern expression including RDF triples with variables, conjunctions, disjunctions, optional parts and constraints over the values of the variables. The head of the query indicates
how to construct the answer to the query, where the answer can take different forms, such as a yes/no answer, a table of values, or a new RDF graph. In this work we concentrate only on SELECT queries, that is, queries whose answer is a table of values. Let V be an infinite set of variables disjoint from (U ∪ B ∪ L). We assume that the elements of V are prefixed by ?. Following (Perez, Arenas, & Gutierrez, 2009) we present the syntax of SPARQL graph patterns in an algebraic way, using the binary operators UNION, AND and OPT, and FILTER. We define a SPARQL graph pattern recursively as follows:
(1) A tuple from (U ∪ B ∪ L ∪ V) × (U ∪ V) × (U ∪ B ∪ L ∪ V) is a graph pattern (a triple pattern).
(2) If P1 and P2 are graph patterns, then the expressions (P1 AND P2), (P1 OPT P2), and (P1 UNION P2) are graph patterns.
(3) If P is a graph pattern and R is a SPARQL built-in condition, then the expression (P FILTER R) is a graph pattern.
In this paper, we use SPARQL built-in conditions that are constructed using elements of the set L ∪ V and inequality symbols (≤, ≥, <, >). A set of triple patterns (which corresponds in our case to triples connected with the AND operator) is commonly called a Basic Graph Pattern (BGP). A group graph pattern may extend a BGP with a FILTER operator. Optional graph patterns result from the extension with the OPT (OPTIONAL) operator, and alternative graph patterns (where two or more possible patterns are tried) result from the extension with the UNION operator. The algorithm proposed in this paper constructs patterns using only the AND and FILTER operators, and hence in the remainder of the paper we concentrate only on those operators. Given a graph pattern P, by var(P) we denote the set of variables occurring in P, and given a built-in condition R, by var(R) we denote the set of variables occurring in R. Example 1.
The following is a SPARQL graph pattern that intuitively corresponds to all trains having at least one passenger car with at least fifty seats, but no more than eighty seats: (?x, type, Train) AND (?x, hasPassengerCar, ?y) AND (?y, hasNumberOfSeats, ?z) FILTER (50 ≤ ?z && ?z ≤ 80)
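To illustrate why the ρdf semantics matters when such a pattern is evaluated, the following minimal sketch matches the pattern described in Example 1 (trains with a passenger car having between fifty and eighty seats) against a tiny graph, first as plain RDF and then after a naive closure under a handful of ρdf-style rules. The graph, the class ExpressTrain, and the functions `rhodf_closure`, `match_bgp` and `answers` are hypothetical illustrations. The closure covers only some rules in the spirit of the semantics above; it is not the complete deductive system of (Munoz, Perez, & Gutierrez, 2007), nor the implementation used in this paper.

```python
SC, SP, TYPE, DOM, RANGE = "sc", "sp", "type", "dom", "range"


def rhodf_closure(graph):
    """Naive fixpoint over a few rhodf-style rules (illustrative subset)."""
    g = set(graph)
    while True:
        new = set()
        for (s, p, o) in g:
            for (s2, p2, o2) in g:
                if p == SC and p2 == SC and o == s2:
                    new.add((s, SC, o2))      # subclass transitivity
                if p == SP and p2 == SP and o == s2:
                    new.add((s, SP, o2))      # subproperty transitivity
                if p == TYPE and p2 == SC and o == s2:
                    new.add((s, TYPE, o2))    # type inheritance
                if p2 == SP and p == s2:
                    new.add((s, o2, o))       # subproperty inheritance
                if p2 == DOM and p == s2:
                    new.add((s, TYPE, o2))    # domain typing
                if p2 == RANGE and p == s2:
                    new.add((o, TYPE, o2))    # range typing
        if new <= g:
            return g
        g |= new


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def match_bgp(triple_patterns, graph):
    """Return all variable bindings satisfying every triple pattern (AND)."""
    solutions = [{}]
    for tp in triple_patterns:
        extended = []
        for binding in solutions:
            for triple in graph:
                candidate = dict(binding)
                ok = True
                for term, value in zip(tp, triple):
                    if is_var(term):
                        if candidate.setdefault(term, value) != value:
                            ok = False
                            break
                    elif term != value:
                        ok = False
                        break
                if ok:
                    extended.append(candidate)
        solutions = extended
    return solutions


# Hypothetical toy graph: t1 is only asserted to be an ExpressTrain.
graph = {
    ("ExpressTrain", SC, "Train"),
    ("t1", TYPE, "ExpressTrain"),
    ("t1", "hasPassengerCar", "c1"),
    ("c1", "hasNumberOfSeats", 60),
    ("t2", TYPE, "Train"),
    ("t2", "hasPassengerCar", "c2"),
    ("c2", "hasNumberOfSeats", 120),
}

pattern = [
    ("?x", TYPE, "Train"),
    ("?x", "hasPassengerCar", "?y"),
    ("?y", "hasNumberOfSeats", "?z"),
]


def answers(g):
    """Bindings of ?x for the pattern with FILTER 50 <= ?z <= 80."""
    return sorted({b["?x"] for b in match_bgp(pattern, g) if 50 <= b["?z"] <= 80})


print("plain RDF:  ", answers(graph))
print("under rhodf:", answers(rhodf_closure(graph)))
```

On this toy graph the pattern has no answer over the asserted triples alone, because t1 is typed only as ExpressTrain and t2 fails the FILTER. After the closure, the inferred triple (t1, type, Train) makes t1 an answer.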