The Ohio State University Department of Biomedical Informatics 3190 Graves Hall 333 W. 10th Avenue Columbus, OH 43210
Technical Report OSUBMI_TR_2006_n03 DBOWL: Towards Extensional Queries on a Billion Statements using Relational Databases Sivaramakrishnan Narayanan, Tahsin Kurc, Joel Saltz 11/21/2006
DBOWL: Towards Extensional Queries on a Billion Statements using Relational Databases∗ Sivaramakrishnan Narayanan, Tahsin Kurc, Joel Saltz Department of Biomedical Informatics, The Ohio State University Columbus, OH, 43210 {krishnan,kurc,jsaltz}@bmi.osu.edu ABSTRACT This paper is concerned with the problem of managing and querying very large OWL datasets. We target a class of extensional queries: instance retrieval. In our framework, axioms in a given ontology are mapped to a set of relational database views, and instance retrieval queries are executed against these views. We experimentally evaluate the performance of our implementation against other memory-based and database-based systems using the Lehigh University Benchmark (LUBM). Our results show the proposed framework is able to achieve faster data loading times than the other systems. The query execution performance of our framework is comparable to that of memory-based systems for small datasets and better than the other systems for large, out-of-core datasets. Using the framework, we were able to handle a dataset consisting of one billion statements generated using LUBM.
Keywords Web Ontology Language (OWL), ABox, extensional queries, RDBMS.
1. INTRODUCTION
Web and Grid computing technologies have made it easier for users to access distributed data sources. These tools have also enabled a user or a group of users to publish their datasets easily in a distributed environment. The result is an explosion in the number of heterogeneous data sources and the volume of accessible data (in the form of Web pages, on-line publications, or databases of experimental and simulation data). This trend has prompted a need for standards and tools so that users can effectively interact with and synthesize information from heterogeneous collections of data sources and data types. XML and the Web Services Description Language (WSDL) have provided a means of representing the structure of resources, data elements, and datasets so that they can be accessed, retrieved, and exchanged programmatically. In addition to syntactic representation, the information content of a data source should be expressed in formal semantics so that it can be interpreted and processed correctly. A number of standards for representation and storage of semantic information have been developed over the years, including the Resource Description Framework (RDF) [9], RDF Schema (RDFS) [7], and the Web Ontology Language (OWL) [5]. RDF provides a syntax [8] to represent labeled, directed graphs (with labeled edges) and has associated semantics [6]. An RDF graph can also be expressed as a set of statements, or triples, of the form (subject, predicate, object), to express facts about inter-related resources. RDFS and OWL are ontology languages that enable greater functionality for expressing domain knowledge in machine-interpretable form. OWL has three flavors of increasing expressivity: OWL Lite, OWL DL, and OWL Full. Ontology languages are standardizations of Description Logics [11]. A dataset represented in OWL can be viewed as consisting of two components. The TBox (terminological box) component forms the ontology, which describes concepts and properties. It may relate concepts and properties using axioms/constructs with well-defined semantics. The ABox (assertion box) component, on the other hand, contains assertions and statements about individuals. In this paper, we also refer to the ABox component as instance data. Instance data is an RDF graph that adheres to an ontology, i.e., it uses concepts and properties defined in the ontology. Instance data, in conjunction with the corresponding ontology, can cause additional statements to be asserted based on the semantics of the ontology language. These assertions may be queried using a query language such as SPARQL [10]. Queries against instance data are called extensional queries. Most earlier work on semantic information management and querying focused on management of ontologies and efficient TBox reasoning.

∗This research was supported in part by the National Science Foundation under Grants #CNS-0203846, #ACI-0130437, #CNS-0403342, #CNS-0406384, #CCF-0342615, #CNS-0509326, #ANI-0330612, #CNS-0426241, and by NIH NIBIB BISTI #P20EB000591, Ohio Board of Regents BRTTC #BRTT02-0003 and #ODOD-AGMT-TECH-04049. Copyright is held by the author/owner(s). WWW2007, May 8–12, 2007, Banff, Canada.
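As a small illustration of these notions (with made-up resource names, not data from the paper), an ABox can be viewed as a set of (subject, predicate, object) triples, and a single TBox subsumption axiom entails additional type statements:

```python
# A tiny ABox as (subject, predicate, object) triples -- hypothetical data.
abox = {
    ("paper1", "rdf:type", "ConferencePaper"),
    ("paper1", "publishedIn", "WWW2007"),
}

# One TBox axiom: ConferencePaper is a subclass of Publication.
subclass_of = {"ConferencePaper": "Publication"}

# Entailment: every instance of a subclass is also an instance of the superclass.
inferred = {
    (s, p, subclass_of[o])
    for (s, p, o) in abox
    if p == "rdf:type" and o in subclass_of
}

print(sorted(abox | inferred))
```

An extensional query for all instances of Publication would return paper1, even though that fact is never stated explicitly.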
In recent years, there has been an increasing interest in efficient storage and querying of instance data and ABox reasoning [22]. The availability of tools for managing ontologies [30], automatic extraction of semantic information from data sources [20], and semantic integration [15] will result in large instance data. Furthermore, eScience applications generate large observational datasets [28] using high-throughput instruments. It is conceivable that in a few years datasets with a billion statements will become a reality. Scalable tools and techniques are needed to store and query such large datasets. In this paper, we develop a semantic store to handle very large amounts of data that may not fit in main memory.
We target simple extensional instance retrieval queries of the form: “Give me all instances of concept A”, where A is defined in the ontology. While this is a subset of extensional queries, it still offers a good starting point to evaluate system performance and scalability for handling very large datasets. Our framework uses relational databases for storing OWL instance data and executing queries with reasoning. The contributions of our work are:
• We propose a mapping algorithm that creates database views for a given ontology and uses the view definitions to perform partial OWL Lite ABox inferencing.
• We compare our performance with existing solutions including Jena [35], Sesame [17] and DLDB-OWL [32] (now subsumed by the Hawk project1). Our experimental results show that data loading times with our approach are much smaller. This performance improvement is especially beneficial when new datasets are added to the system dynamically.
• Our results also show that the framework can achieve good query execution performance, comparable to that of memory-based systems for small datasets and better than other database-based systems for very large datasets.
• We demonstrate that the proposed approach can handle a dataset consisting of one billion triples. To the best of our knowledge, this is the largest instance dataset managed by a database-based OWL store.
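To make the targeted query class concrete, the translation from an instance retrieval question to SQL can be sketched as follows (a hypothetical helper, not the paper's actual code):

```python
def instance_retrieval_sql(concept: str) -> str:
    """Translate "give me all instances of concept" into a query against
    the concept's view. Minimal sketch: a real implementation would
    validate the concept name against the ontology rather than
    interpolating an arbitrary string."""
    if not concept.isidentifier():
        raise ValueError(f"unexpected concept name: {concept}")
    return f"SELECT * FROM {concept}"

print(instance_retrieval_sql("Publication"))  # → SELECT * FROM Publication
```

All reasoning is hidden behind the view definition, so the client-side query stays this simple regardless of how many axioms contribute to the concept.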
2. RELATED WORK
RDF/XML [8] and N-TRIPLES, which is a subset of Notation 3 [1], are popular file formats for representing RDF graphs. An informal list of potential database schemas to represent RDF graphs can be found at the Stanford University InfoLab site [4]. García and Montes [19] compare semantic stores on the expressiveness of the supported ontology language, storage models, query languages, and reasoning mechanisms. We identify design aspects of some of these systems that can have a bearing on performance for large datasets. A comparison for taxonomic RDFS queries using databases is done by Theoharis et al. [34]. The authors generate a variety of class hierarchies and instance data using a Zipf distribution. We propose a mechanism for databases to perform inferences beyond RDFS. We also compare the performance of memory-based and database-based stores using very large datasets generated from realistic ontologies. Sesame [17], Jena [35], 3store [25], and Redland [13] are application frameworks that provide an API to manipulate RDF graphs. These are useful for developing applications that use RDF. They also provide mechanisms to interface with databases such as MySQL and PostgreSQL to store RDF graphs. RDF graphs may be mapped to different table structures depending on query patterns [21]. Inferencing in these systems, however, is done using in-memory algorithms and outside the database engine. This restricts the scalability of these tools for large data. Another issue is that their APIs involve fine-grained manipulations of an RDF graph, such as adding and deleting nodes and edges. These operations do not translate efficiently to database operations, since insertion,
1 http://swat.cse.lehigh.edu/projects/index.html#hawk
deletion, and update of a single tuple is relatively expensive in a database system. We aim to match the performance of such systems for small datasets while scaling well for larger datasets. We have used Jena and Sesame in our experimental performance evaluation and comparison. A few projects have also utilized database-specific features to store RDF graphs. The earlier version of Sesame [17] used the table inheritance feature of PostgreSQL to implement subsumption reasoning for RDFS. A modified Oracle database system is used to access RDF graphs in [18]. In our approach, we use views based on standard SQL constructs to achieve a greater degree of portability across different database management systems. Our work is, in part, inspired by DLDB-OWL [32], where SQL-based views are created using subsumptions calculated by a TBox reasoner like FaCT [26]. In DLDB-OWL, reasoning mechanisms are restricted to those that can be reduced to subsumption. Unlike our current implementation, DLDB-OWL employs sophisticated TBox reasoning and can handle a larger set of KIF-like queries. However, we can handle certain inferences, such as those involving restrictions and domain and range, that are not currently supported by DLDB-OWL. We believe a union of both approaches would be more powerful than either one. We have used Hawk, which is built on DLDB-OWL by the same team and subsumes its functionality, in our experimental performance comparison in Section 4, where we discuss how some aspects of DLDB-OWL can affect performance for extremely large instance datasets. BigOWLIM2 is a successor to the OWLIM system [29]. It uses a special disk-based data structure called the TRREE to support very large-scale ABoxes. Our framework builds on existing relational database technology. This facilitates greater portability of our framework and allows it to leverage optimizations developed in the database community.
Databases also have additional benefits, such as support for transactions. We plan to do a performance comparison with BigOWLIM in future work. Much of the Semantic Web is inspired by work in the Description Logics community. However, that work tackled problems related to handling expressive logics and developing complete reasoning algorithms. FaCT [26] and Racer [24] are reasoners that can test for satisfiability of TBoxes. They use expensive algorithms that cannot be applied to large ABoxes, which are the focus of our work. Beeri et al. [14] explain how queries may be rewritten using views. Borgida and Brachman [16] and Hustadt et al. [27] describe how to map the SHIQ− description logic to disjunctive datalog programs, providing the theoretical foundation for our work.
3. DBOWL: HANDLING VERY LARGE OWL DATASETS
In this section, we present the DBOWL framework and describe how large OWL instance datasets are handled in this framework. In the rest of the paper, we use the terms ontology and TBox interchangeably; similarly, the ABox is also denoted by the term instance data.
3.1 Framework and Mapping Algorithm
2 http://www.ontotext.com/owlim/big/index.html
Figure 1: The DBOWL framework

Figure 1 illustrates the overall framework. A key component of this framework is the mapping module, which generates a set of views in a relational database from a given ontology. The relational database backend is used to support the storage and querying of large instance data. By layering the semantic storage and query functionality on a relational database management system, we seek to leverage database optimizations for efficient data loading, management, and retrieval of large datasets, view creation and materialization, and query execution. Approaches to using relational databases for storage of semantic information can be grouped into two main categories [34]: schema-aware and schema-oblivious. A schema-aware representation uses knowledge about the ontology to create tables in a relational database. These tables may correspond to concepts or properties defined in the ontology. A schema-oblivious scheme is blind to the ontology and simply stores triples. Our approach can be categorized as schema-aware. ABox inference support in semantic stores can be categorized as forward-chaining versus backward-chaining. In forward-chaining, all inferences are pre-computed and stored. The backward-chaining strategy, on the other hand, produces inferences dynamically in response to queries. The inferencing mechanism in DBOWL can be viewed as a hybrid of the two. The prototype DBOWL is written entirely in Java and builds on several existing libraries. Specifically, it uses Jena 2.3 to parse and manipulate ontology files and talks to the relational database via a JDBC driver. Initially, DBOWL creates a single table in the backend database. This table is called the base table and contains three attributes: subject (s), predicate (p), and object (o). All three attributes are varchar and store Uniform Resource Identifiers (URIs). The mapping algorithm parses the ontology and identifies concepts and properties. It creates a view for each concept and property defined in the ontology. A database view is equivalent to a named SQL query, and defining a view involves specifying a query that produces the rows of the view. A view corresponding to a concept has a column named id that refers to a resource URI. Similarly, a view corresponding to a property has columns named s and o that refer to the subject and object URIs in a statement with that property. The rest of the algorithm performs the task of building view definitions. First, a directed graph DG is instantiated by the mapping module. We should note that this graph is different from the RDF graph of the instance data or the graph representation of the ontology. The vertices of DG correspond to concepts and properties in the ontology. The edges represent the dependencies between the vertices, hence between the respective labels. Each vertex is assigned a label. Initially, the following labels are assigned to the vertices corresponding to each concept A and each property P in DG:

l(A) ← SELECT s FROM base WHERE p=rdf:type AND o=A
l(P) ← SELECT s,o FROM base WHERE p=P
As per these definitions, the label of node A corresponds to an SQL query that returns all explicit instances of concept A. Domain knowledge is captured in an ontology using axioms; in description logic terms, this corresponds to a TBox. The ontology may contain several axioms (or constructs) that express relationships between concepts and properties. For example, a subsumption axiom of the form C ⊑ D denotes that any resource of type C is also of type D. This captures inheritance (subclass relationships) among concepts, much like the subclass notion in object-oriented programming. A full discussion of the various axioms, constructs, and their semantics can be found in [2]. The mapping algorithm starts by creating the vertices and proceeds by processing the axioms in the ontology one by one. As each axiom is processed, it adds edges between the vertices and updates the vertex labels. Axioms/constructs in the ontology may result in additional resources belonging to a concept or additional properties asserted between resources. As an example, consider the axiom A ⊑ ∀R.B. Its semantics say that if a is an element of A (a ∈ A) and a is related to b through property R (aRb), then b is an element of B (b ∈ B). This type of inference potentially causes a new element to be added to the concept B. It is captured by appending the query “SELECT R.o FROM R, A WHERE R.s = A.id” to the label of vertex B. To perform this type of ABox inferencing, the vertex labels in DG are extended. The label generation and assignment rules are detailed in Table 1 under the Label Action column. When an axiom is encountered in the ontology, the corresponding rule in Table 1 is executed to update the vertex labels. In Table 1, A and B denote concepts and P, Q, R denote properties. For each axiom, directed edges are added as specified by the Edge(s) Added column in the table.
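The effect of the A ⊑ ∀R.B rule can be checked directly in SQL. The following sketch uses SQLite in place of the MySQL/PostgreSQL backends used in the paper, with hypothetical names A, R, B and a made-up instance a1:

```python
import sqlite3

# Base table and explicit facts: a1 is of type A, and a1 R b1 holds.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO base VALUES (?, ?, ?)", [
    ("a1", "rdf:type", "A"),
    ("a1", "R", "b1"),
])

# Initial labels for concept A and property R, as in Section 3.1.
conn.execute("CREATE VIEW A AS SELECT s AS id FROM base "
             "WHERE p='rdf:type' AND o='A'")
conn.execute("CREATE VIEW R AS SELECT s, o FROM base WHERE p='R'")

# View for B: explicit instances, plus the label appended by the
# A ⊑ ∀R.B rule in Table 1.
conn.execute("CREATE VIEW B AS "
             "SELECT s AS id FROM base WHERE p='rdf:type' AND o='B' "
             "UNION SELECT R.o FROM A, R WHERE A.id = R.s")

print(conn.execute("SELECT id FROM B").fetchall())
```

Querying view B returns b1 even though no rdf:type statement for b1 appears in the base table; the inference lives entirely in the view definition.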
The purpose of each edge is to keep track of dependencies between the concepts, between the properties, and between the concepts and properties. An edge (A, B) in DG denotes that view B's definition depends on view A. This determines an order in which views can be created in the database. If, at any time, a cycle is detected in graph DG, an exception is noted and the corresponding axiom is ignored, because databases do not allow cyclic view definitions. This aspect is discussed further in Section 3.3. When all axioms have been considered, the vertices of DG are processed in a dependency-preserving order and a view corresponding to each vertex (concept/property) is created. The order of creation of views is important: if view V1 depends on V2, then V2 must be created first.

Axiom/Construct  | Label Action                                                              | Edge(s) Added
A ⊑ B            | l(B) ← l(B) ⊕ UNION SELECT A.id FROM A                                    | (A, B)
A ⊑ ∀R.B         | l(B) ← l(B) ⊕ UNION SELECT R.o FROM A, R WHERE A.id = R.s                 | (R, B), (A, B)
∃R.B ⊑ A         | l(A) ← l(A) ⊕ UNION SELECT DISTINCT R.s FROM R, B WHERE R.o = B.id        | (R, A), (B, A)
P ⊑ Q            | l(Q) ← l(Q) ⊕ UNION SELECT P.s, P.o FROM P                                | (P, Q)
P has-domain A   | l(A) ← l(A) ⊕ UNION SELECT P.s FROM P                                     | (P, A)
P has-range A    | l(A) ← l(A) ⊕ UNION SELECT P.o FROM P                                     | (P, A)
P is symmetric   | l(P) ← l(P) ⊕ UNION SELECT base.o, base.s FROM base WHERE base.p = 'P'    | —
P inverseOf Q    | l(P) ← l(P) ⊕ UNION SELECT o, s FROM Q                                    | (Q, P)

Table 1: Mapping of axioms supported in the DBOWL framework to labels.

At the end of the mapping phase, there will be several views defined in the database. Some views are simple and access the base table only; others access several other views. Once the mapping of a given ontology to the relational database has been computed, instance data can be loaded as triples into the base table using the bulk loading facility of the underlying database system. Most database systems have highly optimized custom operations, such as COPY in PostgreSQL or LOAD DATA INFILE in MySQL, for bulk loading datasets into a database. To take advantage of the optimized bulk loading support, instance data is assumed to be in the N-TRIPLES format in DBOWL. N-TRIPLES is a subset of the Notation 3 [1] format and maps directly to the base table. Rows in the base table, therefore, correspond to statements in the instance data. The RDF/XML [8] format is another widely accepted format for OWL instance data. We developed a pre-processing utility program that converts datasets represented in RDF/XML to N-TRIPLES. Queries to instance data may be expressed in an RDF query language such as SPARQL [10]. In this context, a SPARQL query is mapped onto an SQL query against the set of views defined in the database. Our current query mapping module is implemented to handle extensional instance retrieval queries only, i.e., queries of the form: Give me all resources of type X. Such a query is translated into a SELECT query against a view named X in the database.
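The dependency-preserving view-creation order described above can be sketched with Python's standard graphlib (the paper's implementation is in Java; the edge names here are hypothetical):

```python
from graphlib import TopologicalSorter

# Edges (A, B) from DG mean view B depends on view A, so A must be
# created first. Hypothetical edges for a small publication ontology.
edges = [
    ("Article", "Publication"),
    ("ConferencePaper", "Article"),
    ("orgPublication", "Publication"),
]

ts = TopologicalSorter()
for dep, view in edges:
    ts.add(view, dep)  # declare that `view` depends on `dep`

# static_order() yields each view only after all of its dependencies;
# it raises CycleError on a cycle, mirroring how DBOWL drops axioms
# that would create cyclic view definitions.
order = list(ts.static_order())
print(order)
```

Any order in which each view follows its dependencies is acceptable; only the relative constraints matter.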
3.2 A Mapping Example
We now illustrate the ontology mapping algorithm using a simple example, a subset of the Lehigh University Benchmark [23] ontology. The example ontology defines a small set of concepts, properties, and axioms related to publications.
Article ⊑ Publication
ConferencePaper ⊑ Article
orgPublication has-range Publication
Here, Publication, Article, and ConferencePaper are concepts and orgPublication is a property. The axioms in the ontology define Article to be a subclass of Publication, ConferencePaper to be a subclass of Article (hence, by inference, ConferencePaper is also a subclass of Publication), and orgPublication to have a range of Publication. The initial labels for the concepts and properties are:
Figure 2: The directed graph DG for the example university ontology.
l(Publication) ← SELECT s FROM base WHERE p=rdf:type AND o='Publication'
l(Article) ← SELECT s FROM base WHERE p=rdf:type AND o='Article'
l(orgPublication) ← SELECT s,o FROM base WHERE p='orgPublication'
After all the axioms are considered, the system creates the dependency graph DG shown in Figure 2. This graph is traversed to create the final views. For example, the following query represents the final view definition of the Publication concept:

SELECT s FROM base WHERE p='rdf:type' AND o='Publication'
UNION SELECT id FROM Article
UNION SELECT o FROM orgPublication

Note that Article is a concept view and orgPublication is a property view. The Publication view uses the Article view which, in turn, refers to the ConferencePaper view.
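This example can be exercised end to end. The following sketch uses SQLite in place of the MySQL/PostgreSQL backends used in the paper, with made-up instance URIs:

```python
import sqlite3

# Base table with three explicit statements (hypothetical instances).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO base VALUES (?, ?, ?)", [
    ("ex:p1", "rdf:type", "Publication"),       # explicit Publication
    ("ex:p2", "rdf:type", "ConferencePaper"),   # Publication via subclass chain
    ("ex:org1", "orgPublication", "ex:p3"),     # Publication via range axiom
])

# Views generated by the mapping algorithm for the example ontology.
conn.executescript("""
CREATE VIEW ConferencePaper AS
  SELECT s AS id FROM base WHERE p='rdf:type' AND o='ConferencePaper';
CREATE VIEW Article AS
  SELECT s AS id FROM base WHERE p='rdf:type' AND o='Article'
  UNION SELECT id FROM ConferencePaper;
CREATE VIEW orgPublication AS
  SELECT s, o FROM base WHERE p='orgPublication';
CREATE VIEW Publication AS
  SELECT s AS id FROM base WHERE p='rdf:type' AND o='Publication'
  UNION SELECT id FROM Article
  UNION SELECT o FROM orgPublication;
""")

result = sorted(r[0] for r in conn.execute("SELECT id FROM Publication"))
print(result)  # → ['ex:p1', 'ex:p2', 'ex:p3']
```

The query for Publication returns all three resources: one explicit and two inferred, one through the subclass chain and one through the range axiom.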
3.3 Discussion
Our implementation is layered on top of JDBC and does not employ any database-system-specific operations. This facilitates greater portability across database systems; we have been able to use MySQL and PostgreSQL without any database-specific tweaking. Using a database backend to perform ABox inferences scales to large instance data better than in-memory reasoning algorithms, as will be shown in the experimental evaluation. In this work, we also examine the performance impact of view materialization and indexing optimizations in databases (see Section 4). The mapping algorithm creates a set of views that are used to answer queries. This multi-view approach makes it possible to trade off storage space and query
execution performance. When a view is materialized, it corresponds to a partial pre-computation of inference: a persistent table is created with the same rows as the view, using the view's query. Materialized views present opportunities for speeding up query execution, especially if the view is complex. Materializing a view not only improves the performance of queries that map directly to that view, but also speeds up queries that use other views which depend on the materialized view. A view may be materialized the first time it is accessed by a query, thus hiding the view creation cost. In Section 4, the experimental results show that significant performance improvements can be obtained by view materialization. One disadvantage of materializing views is that a view has to be rematerialized when the base table is updated. Moreover, any changes to the ontology affect view definitions; monotonic changes, however, may be incorporated gracefully using the same mapping algorithm. Another disadvantage of view materialization is the additional storage space needed to store the materialized view. Depending on the query workload and resource availability, one could selectively materialize views in our framework to trade off query performance against storage space requirements. Our current implementation does not employ this optimization; we plan to investigate how to select the best set of views to materialize under space constraints for a given workload in future work. Indexing is a commonly used optimization in databases. Note that most queries that can be answered using the base table specify the predicate and object columns; a multi-column index on these two columns can benefit query execution. Indexing, however, has an additional cost associated with it. We examine the costs and benefits of creating such an index in our experiments. There are a few limitations in our approach.
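A minimal sketch of materialization and the multi-column (p, o) index, again using SQLite as a stand-in for the paper's backends and the running example's view names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (s TEXT, p TEXT, o TEXT)")
conn.execute("INSERT INTO base VALUES ('ex:p1', 'rdf:type', 'Publication')")
conn.execute("CREATE VIEW Publication AS "
             "SELECT s AS id FROM base WHERE p='rdf:type' AND o='Publication'")

# Materialize: persist the view's rows as a table, pre-computing the
# inference once instead of re-running it for every query.
conn.execute("CREATE TABLE Publication_mat AS SELECT id FROM Publication")

# Multi-column index on (p, o): most base-table queries filter on
# exactly these two columns.
conn.execute("CREATE INDEX idx_base_po ON base (p, o)")

count = conn.execute("SELECT COUNT(*) FROM Publication_mat").fetchone()[0]
print(count)
```

After a base-table update, Publication_mat would have to be dropped and recreated, which is the maintenance cost discussed above.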
Using databases to perform ABox inferences is complicated by the fact that databases operate under a closed-world assumption, while description logics use an open-world assumption. We do not perform any sophisticated TBox reasoning; we believe coupling a TBox reasoner like FaCT [26] would augment the system. Our reasoning method does not handle equivalence axioms of the form A ≡ B. Such an axiom reduces to two axioms, A ⊑ B and B ⊑ A, which causes a cycle in our dependency graph DG. Currently, cycles are not supported in the framework because a database system will not allow cyclic view definitions. Another axiom that does not map well in the proposed scheme is the TransitiveProperty axiom, which also causes a cycle in DG. Even some existing ABox inferences, for inverseOf and some other axioms, are incomplete, once again due to cycles. A possible workaround for such cycles is to define stored procedures that iteratively compute the views involved in a cycle until a fixed point is reached; the scalability of such mechanisms needs to be investigated. Our current implementation is best suited for moderately sized and relatively static ontologies, since the view definitions depend on the ontology and existing materializations may become invalid when it changes. It is also well suited to very large instance data. On the other hand, it is not well suited to supporting fine-grained graph manipulation APIs, because manipulating a single row in a database is an expensive process. A system like DBOWL can be useful in workflows where a task may require access to a subset of the instance data to run an analysis algorithm.
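The fixed-point workaround for cyclic definitions such as a transitive property can be sketched as follows; the property name partOf is hypothetical, and a Python loop over SQLite stands in for the stored procedure:

```python
import sqlite3

# Explicit pairs of a transitive property: a -> b -> c -> d.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE partOf (s TEXT, o TEXT)")
conn.executemany("INSERT INTO partOf VALUES (?, ?)",
                 [("a", "b"), ("b", "c"), ("c", "d")])

# Iterate: join the property with itself, insert pairs not yet present,
# and stop when no new pairs appear (the fixed point).
while True:
    new_pairs = conn.execute(
        "SELECT p1.s, p2.o FROM partOf p1, partOf p2 WHERE p1.o = p2.s "
        "EXCEPT SELECT s, o FROM partOf").fetchall()
    if not new_pairs:
        break
    conn.executemany("INSERT INTO partOf VALUES (?, ?)", new_pairs)

total = conn.execute("SELECT COUNT(*) FROM partOf").fetchone()[0]
print(total)  # → 6 (the transitive closure of the 3-element chain)
```

The loop terminates because each iteration only adds pairs drawn from a finite set; the open question raised above is how this scales when the closure is large.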
4. EXPERIMENTAL EVALUATION
In this section, we demonstrate the performance of our system and compare it with existing solutions. We employed MySQL-5.1.11-beta3 and PostgreSQL-8.1.4 as the backend database systems. All of our experiments were carried out on a PC with dual Opteron 250 (single core) processors, 8GB of main memory, and two 250GB SATA disks. In our evaluation, we compared the performance of the DBOWL framework against memory-based and database-based semantic storage systems. For experiments with memory-based configurations, we chose Jena-2.3 and Sesame-2.0-alpha, two popular systems for memory-based storage and querying of semantic data. For the database-based systems, we used Hawk-1.5-beta (which subsumes DLDB-OWL [32]) and Sesame-1.2.54. We used PostgreSQL as the backend database system for both. In addition to evaluating our basic approach, we also examined the performance impact of indexing and view materialization. In the graphs, the experimental configurations with these optimizations are denoted by the idx and mat suffixes. We performed experiments to measure how long it takes 1) to load a given dataset into each system (Loading Phase) and 2) to execute an extensional query once the data has been loaded (Querying Phase). In all the graphs, both axes use logarithmic scales.
4.1 Applications
We performed experiments using datasets generated from two applications: the Lehigh University Benchmark [23] (LUBM) and the Biological Pathway Exchange (BioPAX) ontology for storage and exchange of biological pathway information [12]. The LUBM was developed to facilitate the evaluation of Semantic Web repositories in a standard and systematic way. The benchmark is intended to evaluate the performance of those repositories with respect to extensional queries over a large dataset that commits to a university domain ontology. The LUBM also includes several benchmark queries. Since our work focuses on instance retrieval queries, we only used the dataset-generation part of the LUBM. The benchmark program generates instance data corresponding to x universities (including faculty and student information), where x is a parameter. Larger values of x lead to a larger dataset. We used this parameter to scale up dataset sizes and compare the scalability of the different systems. A billion-triple dataset is generated for x = 8000. BioPAX is a community effort to create a standard format for biological pathway data. In biology, the chains of reactions that take place in the operation of biological systems are referred to as pathways. Establishing and studying pathways is essential to understanding how organisms function at different scales and how diseases such as cancer may affect these functions. To ascertain the role of pathways, scientists often perform a series of experiments and try to formulate a hypothesis that will explain the data. The effective representation and management of pathway information is critical for scientists to analyze the data and integrate information from multiple pathway data sources. This problem was taken up by the BioPAX group [12], whose goal is to develop a common exchange format for biological pathway data [33].

3 We should note that, at the time of this work, support for views had only recently been added to MySQL.
4 The database backend was not yet implemented in Sesame-2.0-alpha at the time of writing this paper.

The BioPAX format is defined in OWL and currently consists of two levels (the Level 1 and Level 2 ontologies). These ontologies include support for expressing metabolic pathways, signaling pathways, protein-protein interactions, and molecular interactions. Several data providers, such as INOH5, BioCyc6, and Reactome7, have adopted the BioPAX ontology and provide their data in that format. With the increasing availability of high-throughput instruments for analyzing molecular reactions, it can be anticipated that such datasets will quickly exceed main memory space on most systems. There have also been efforts to combine pathway data from different data providers to form larger datasets [31].
4.2 LUBM
The first set of experiments examines data loading performance. Data loading involves retrieving the instance data from disk and creating appropriate data structures in memory or on disk so that the instance data can be queried. Instance data is usually represented in the RDF/XML scheme or the N-TRIPLES scheme [8]. The N-TRIPLES format maps well to database bulk loading for populating the base table in our framework. The LUBM generator produces instance data in the RDF/XML format. We implemented a small tool to convert the instance data from the RDF/XML format to the N-TRIPLES format. The Hawk system accepts instance data in the RDF/XML format. In comparisons with that system, we also measured the time to convert from the RDF/XML format to the N-TRIPLES format. The performance of the other systems was measured using the N-TRIPLES format, as it provided better performance. Figure 3(a) shows that memory-based semantic stores start suffering from thrashing effects once physical memory is exceeded, as expected. We observed that DBOWL with PostgreSQL (DBOWL-PSQL) and MySQL (DBOWL-MySQL) achieves better data loading times than Jena and Sesame. The bulk loading feature of databases is highly optimized and much faster than creating the in-memory data structures. Figure 3(b) shows the data loading times for the database-based systems. The data loading times are much shorter with DBOWL-PSQL and DBOWL-MySQL than with Sesame-1.2.5-PSQL and Hawk. Sesame-1.2.5-PSQL results in the longest data loading times: it pre-computes inferred triples (forward-chaining) and stores both the explicit and inferred triples in a triples table with subject, predicate, and object columns. Both Hawk and Sesame use indirection via integer IDs and map URIs to these IDs using a separate mapping table. This design minimizes redundancy in storage and can lead to the creation of better indexes. However, mapping URIs to IDs becomes problematic when there are a huge number of URIs.
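The URI-to-integer indirection described above can be sketched as follows (illustrative only; the actual schemas of Hawk and Sesame differ):

```python
# In-memory URI -> integer ID mapping, as a stand-in for the separate
# mapping table used by Hawk and Sesame.
uri_to_id: dict[str, int] = {}

def intern(uri: str) -> int:
    # Each distinct URI receives a small integer ID. With millions of
    # distinct URIs, this dictionary (or its database-backed equivalent,
    # one lookup query per URI) comes to dominate loading cost.
    return uri_to_id.setdefault(uri, len(uri_to_id))

triple = ("ex:paper1", "rdf:type", "ex:Publication")
encoded = tuple(intern(t) for t in triple)
print(encoded)  # → (0, 1, 2)
```

Triples are then stored as compact integer tuples, which is what makes the indexes smaller; the cost is the extra lookup on every load and every query result.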
Maintaining memory-based data structures such as hash tables will eventually make the loading process memory-bound (which we observed with Hawk). Maintaining the mapping in the database can result in millions of queries if there are millions of URIs in the dataset. We believe that while using local IDs provides benefits for smaller datasets, it hampers performance for larger ones. Another design aspect is when to create indexes. Sesame and Hawk use indexes extensively. However, they define the indexes
5 http://www.inoh.org/
6 http://www.biocyc.org/
7 http://www.reactome.org/
on tables at table-creation time and insert data tuple by tuple. Both aspects can hamper performance greatly: databases perform better when data is bulk loaded first and indexes are created afterwards. These appear to be the main factors behind the longer data loading times of Sesame-1.2.5-PSQL and Hawk. We should note that translating instance data from the RDF/XML format for loading into the database is an expensive process; as Figure 3(b) shows, it is an order of magnitude more expensive than bulk loading triples into the base table in the DBOWL system. However, the total data loading times (including the cost of the RDF/XML-to-N-TRIPLES translation) are still lower for DBOWL than for Hawk. We feel the N-TRIPLES format, while verbose, is better suited to representing large datasets. Between the two versions of DBOWL, the one with MySQL as the backend achieved better performance.

In the second set of experiments, we evaluate the query execution performance of the various systems. In our current work, we focus on simple instance retrieval queries, so the questions we ask translate to: “Give me all resources of type A,” where A is a concept from the ontology. In the LUBM ontology, we chose the concepts Publication and Professor for the comparison; we observed similar trends with other concepts in the ontology. From the ontology, the algorithm derived the definition of the Publication view shown in Figure 4. As discussed in Section 3, this view refers to other views in its definition. In the DBOWL approach, a query for all objects of type Publication translates to a “SELECT * FROM Publication”. In the experiments, we varied the dataset size from a few thousand explicit triples to 140 million triples. To negate the effect of file system or database caching, we cleared all caches before running these queries (cold cache). Figure 5 shows execution times for a query for instances of the concept Publication.
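The view-based instance retrieval described above can be sketched end to end with a toy ontology. This is an illustrative Python/sqlite3 sketch, with a made-up two-axiom hierarchy (Article and Book as subclasses of Publication) rather than the actual LUBM ontology:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (s TEXT, p TEXT, o TEXT)")
conn.executemany("INSERT INTO base VALUES (?, ?, ?)", [
    ("p1", "rdf__type", "Publication"),
    ("a1", "rdf__type", "Article"),   # Article is a subclass of Publication
    ("b1", "rdf__type", "Book"),      # Book is a subclass of Publication
])

# Each concept becomes a view; a subclass axiom C1 ⊑ C2 contributes a
# "UNION SELECT id FROM C1" branch to C2's view (cf. Figure 4).
conn.executescript("""
CREATE VIEW Article AS
  SELECT s AS id FROM base WHERE p = 'rdf__type' AND o = 'Article';
CREATE VIEW Book AS
  SELECT s AS id FROM base WHERE p = 'rdf__type' AND o = 'Book';
CREATE VIEW Publication AS
  SELECT s AS id FROM base WHERE p = 'rdf__type' AND o = 'Publication'
  UNION SELECT id FROM Article
  UNION SELECT id FROM Book;
""")

# Instance retrieval is just a SELECT against the concept's view.
print(sorted(r[0] for r in conn.execute("SELECT * FROM Publication")))
# → ['a1', 'b1', 'p1']
```

Because Publication's definition references the Article and Book views, the database expands the hierarchy at query time; materializing the view replaces this repeated expansion with a table scan.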
Among the memory-based systems, Sesame performs best overall for small datasets, as seen in Figure 5(a); Jena performs worse than Sesame on average. Without any optimizations, the performance of DBOWL is about 10 times worse than that of Sesame for in-memory datasets. With the view materialization optimization (DBOWL-PSQL-mat in the graphs), the performance of DBOWL improves significantly and becomes comparable to (albeit still lower than) that of Sesame. It also scales almost linearly as the dataset size increases. Our results suggest that Sesame is the best choice for small datasets that fit in memory, although DBOWL-PSQL-mat performs reasonably in these cases as well. Thrashing sets in at about 4 million triples for the memory-based systems, while the DBOWL approaches continue to scale linearly. As seen in Figure 5(b), the materialization optimization yields execution times about 10 times faster than those of the other database-based systems. While Sesame-1.2.5-PSQL performs well among those implementations, we could not run experiments on some of the bigger datasets because of its extremely long loading times. Hawk performs better than unoptimized DBOWL for smaller queries because its view definitions do not all access the base table; instead, resources are partitioned into separate concept tables during the loading phase. This partitioning adds cost to the loading phase but provides good query performance compared to the basic version of DBOWL. We also noted that all database
Figure 3: Data loading performance with the LUBM datasets. (a) Comparison with memory-based semantic stores. (b) Comparison with database-based systems. [Both panels plot loading time (ms) against dataset size in explicit triples; panel (a) compares Jena and Sesame-2.0-mem, panel (b) Sesame-1.2.5-PSQL, Hawk-PSQL, DBOWL-PSQL-loadtriples, OWL-to-NTriples, and DBOWL-PSQL-total, each against DBOWL-PSQL and DBOWL-MySQL.]

SELECT base.s AS id FROM base
  WHERE base.p = 'rdf__type' AND base.o = 'Publication'
UNION SELECT specification.id FROM specification
UNION SELECT unofficialpublication.id FROM unofficialpublication
UNION SELECT article.id FROM article
UNION SELECT book.id FROM book
UNION SELECT software.id FROM software
UNION SELECT manual.id FROM manual
UNION SELECT publicationdate.s AS id FROM publicationdate
UNION SELECT publicationauthor.s AS id FROM publicationauthor
UNION SELECT softwaredocumentation.o AS id FROM softwaredocumentation
UNION SELECT publicationresearch.s AS id FROM publicationresearch
UNION SELECT orgpublication.o AS id FROM orgpublication;
Figure 4: The view definition for the Publication concept.

implementations benefit substantially from database and file system caching. Similar results are seen when querying for instances of Professor, as shown in Figure 6.
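The derivation of a view like the one in Figure 4 from ontology axioms can be sketched as follows. The helper below is hypothetical (the paper's algorithm also handles other axiom types and recursion through other views); it only shows how each contributing axiom adds one UNION branch, with a subclass view contributing its `id` column and a property domain/range axiom contributing the `s` or `o` column of the property's view:

```python
def concept_view_sql(concept, subview_columns):
    """Assemble a Figure 4-style UNION view for `concept`.

    `subview_columns` is a list of (view_name, id_column) pairs, one per
    axiom that implies membership in `concept`. Hypothetical sketch, not
    the paper's actual view-derivation algorithm.
    """
    parts = ["SELECT base.s AS id FROM base\n"
             f"  WHERE base.p = 'rdf__type' AND base.o = '{concept}'"]
    for view, col in subview_columns:
        alias = " AS id" if col != "id" else ""
        parts.append(f"SELECT {view}.{col}{alias} FROM {view}")
    return f"CREATE VIEW {concept} AS\n" + "\nUNION ".join(parts) + ";"

ddl = concept_view_sql("Publication", [
    ("article", "id"),           # subclass axiom: Article ⊑ Publication
    ("publicationauthor", "s"),  # domain axiom on publicationAuthor
])
print(ddl)
```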
4.3 BioPAX
Using the BioPAX datasets, we compare the performance impact of the various optimizations and of the different backend database systems in the DBOWL framework. For these experiments, we took a sample dataset from the Reactome database [3] and scaled it up to create increasingly larger datasets. The dataset is scaled by replicating the explicit statements in the database and applying a renaming algorithm to the URIs to create new nodes in the RDF graph. More specifically, we used the data corresponding to the Synechococcus organism and scaled it up to 2000 times, or up to 33 million explicit triples. This is, admittedly, a rudimentary way of scaling data; however, the purpose of these experiments is to show that our system performs well for other ontologies. In this section, we only present results for the DBOWL approaches, and we provide the performance numbers for the memory-based Sesame version to supply context for comparison. Figure 7 shows data loading times as the size of the data is scaled. In this graph, DBOWL-PSQL-idx includes the time to copy the data and to create an index. As the performance figures show, the version with the MySQL backend achieves the lowest data loading times; creating an index to speed up query execution increases the loading time. Note that at about 10^7 triples, Sesame starts thrashing and takes an unacceptable amount of time to finish.

Turning to query performance, we focus on two extensional queries. Figure 8(a) corresponds to finding all instances of the openControlledVocabulary concept. For small, in-memory datasets, Sesame performs very well, as expected. Interestingly, with the materialization optimization (marked as DBOWL-PSQL-mat in the figures), DBOWL is able to match the performance of Sesame even for small datasets. We see similar results in Figure 8(b), which corresponds to queries for instances of externalReferenceUtilityClass. In our experimental evaluation, we observed an interesting issue when MySQL is employed as the backend database system. The externalReferenceUtilityClass concept has a complex view definition, and for this view MySQL performed very poorly, even though it produced correct results. The cost increases linearly with dataset (and result) size, but the cost per triple is unexpectedly high. Support for views is a recent addition to MySQL; we believe this implementation is not yet optimized, which explains the poor performance of queries involving complex views. We must mention, however, that for simpler views MySQL outperforms PostgreSQL, as shown in the LUBM benchmark. Indexing can reduce querying time by around a factor
Figure 5: LUBM: Publication query. (a) Comparison with memory-based semantic stores. (b) Comparison with database-based systems. [Both panels plot query time (ms) against dataset size in explicit triples; panel (a) compares Jena and Sesame-2.0-mem, panel (b) Sesame-1.2.5-PSQL and Hawk-PSQL, each against DBOWL-PSQL, DBOWL-PSQL-idx, DBOWL-PSQL-mat, and DBOWL-MySQL.]
Figure 6: LUBM: Professor query. (a) Results comparing memory-based stores. (b) Results involving database-backed stores. [Both panels plot query time (ms) against dataset size in explicit triples, with the same series as Figure 5.]
of 2. Indexing benefits a larger set of queries, whereas materialization has a greater impact on the performance of specific queries; both optimizations require additional storage space. In our experiments, we also found that materializing a view does not take much more time than querying it. Materialization may therefore be performed the first time a view is accessed, hiding its cost. In summary, our experimental results show that, with these database optimizations, DBOWL provides performance comparable to Sesame for smaller datasets while scaling linearly for large datasets.
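The materialize-on-first-access strategy can be sketched as follows. This is an illustrative Python/sqlite3 sketch; the `mat_` naming convention and the in-process cache are hypothetical, not part of the DBOWL implementation:

```python
import sqlite3

def query_concept(conn, concept, _materialized=set()):
    """Materialize a concept view the first time it is queried, then serve
    later queries from the materialized table (hiding the one-time cost)."""
    mat = f"mat_{concept}"
    if concept not in _materialized:
        # Materialization costs roughly one evaluation of the view, about
        # the same as answering the query once, so do it on first access.
        conn.execute(f"CREATE TABLE {mat} AS SELECT * FROM {concept}")
        _materialized.add(concept)
    return [r[0] for r in conn.execute(f"SELECT id FROM {mat}")]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE base (s TEXT, p TEXT, o TEXT)")
conn.execute("INSERT INTO base VALUES ('p1', 'rdf__type', 'Publication')")
conn.execute("CREATE VIEW Publication AS "
             "SELECT s AS id FROM base WHERE p='rdf__type' AND o='Publication'")
print(query_concept(conn, "Publication"))  # materializes, then answers
print(query_concept(conn, "Publication"))  # served from mat_Publication
```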
4.4 Handling One Billion Statements
In the last set of experiments, we examine the application of the DBOWL framework to extremely large datasets. To this end, we conducted an experiment with over a billion explicit triples: we generated LUBM data corresponding to 8000 universities and loaded it into the PostgreSQL database. The performance results are shown in Figure 9.
Loading took approximately 4.67 hours. Note that this refers to loading the N-TRIPLES file and does not include the cost of converting the dataset representation to N-TRIPLES. Materializing the Publication view took much longer, about 22.65 hours, and a subsequent query against the materialized view took 123 seconds. We also noted the effect of database and file system caching by running the same query multiple times and recording the results; since the materialized views fit in memory in this case, caching improves performance. For large datasets, materializing views can take a significant amount of time, much more than loading. Which views to materialize, and in what order, should therefore be considered carefully; this may depend on several factors such as space constraints and query workloads.
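How one might weigh space constraints and query workloads when choosing views to materialize can be sketched with a simple greedy heuristic. The cost model below (per-view size, expected query frequency, and per-query saving) is entirely hypothetical and not part of the DBOWL implementation; it only illustrates the kind of trade-off discussed above:

```python
def pick_views_to_materialize(views, space_budget):
    """Greedily pick views to materialize under a storage budget.

    Hypothetical sketch: rank views by expected benefit per unit of
    storage (query frequency x per-query saving / size in rows), then
    take views in that order while they fit in the budget.
    """
    ranked = sorted(views, key=lambda v: v["freq"] * v["saving"] / v["size"],
                    reverse=True)
    chosen, used = [], 0
    for v in ranked:
        if used + v["size"] <= space_budget:
            chosen.append(v["name"])
            used += v["size"]
    return chosen

# Made-up numbers for illustration only.
views = [
    {"name": "Publication",  "size": 80, "freq": 10, "saving": 5.0},
    {"name": "Professor",    "size": 30, "freq": 20, "saving": 2.0},
    {"name": "Organization", "size": 60, "freq": 1,  "saving": 1.0},
]
print(pick_views_to_materialize(views, space_budget=100))
# → ['Professor', 'Organization']
```

A real policy would also account for materialization order, since one materialized view can speed up building another that references it.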
5. CONCLUSION AND FUTURE WORK
In this work, we have mapped the semantic storage problem to a database view problem and shown that it can offer
Figure 7: BioPAX data: Loading performance. [Loading time (ms) against dataset size in explicit triples; series: Sesame-2.0-mem, DBOWL-PSQL, DBOWL-PSQL-idx, DBOWL-MySQL.]

Figure 8: BioPAX data: Query performance. (a) openControlledVocabulary query and (b) externalReferenceUtilityClass query. [Query time (ms) against dataset size in explicit triples; series: Sesame-2.0-mem, DBOWL-PSQL, DBOWL-PSQL-idx, DBOWL-PSQL-mat, DBOWL-MySQL.]

Figure 9: LUBM(8000,0): 1 billion triples. [Times (s) for load, materialize, cold query, and hot query; series: Publication, Professor.]
good performance. We have identified several factors that have a major bearing on performance when using databases in semantic stores. We have shown performance up to a billion triples using the LUBM benchmark and pathway data and compared it to some existing solutions. We believe databases offer a good foundation on which to build semantic stores, and they also raise several research problems. Some aspects of ABox inference do not map well to SQL; we plan to address this in future work. We will also consider more complex SPARQL queries and investigate how such queries may be mapped to the existing views. Our current implementation handles extensional queries of the form “Give me all resources of type X.” Mapping more complex queries to views in the presence of database optimizations is a difficult problem: it requires cost models for view access and a query mapping algorithm that uses them to optimize query execution. We will investigate the execution of more complex semantic queries using databases in a future study.
6. REFERENCES
[1] Notation3 (N3): A readable RDF syntax. http://www.w3.org/DesignIssues/Notation3.
[2] OWL web ontology semantics and abstract syntax. http://www.w3.org/TR/owl-semantics/.
[3] Reactome: a curated knowledgebase of biological pathways. http://www.reactome.org/.
[4] Storing RDF in a relational database. http://infolab.stanford.edu/~melnik/rdf/db.html.
[5] OWL web ontology language overview. http://www.w3.org/TR/owl-features/, 2004.
[6] RDF Semantics. http://www.w3.org/TR/rdf-mt, 2004.
[7] RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema/, 2004.
[8] RDF/XML syntax specification (revised). http://www.w3.org/TR/rdf-syntax-grammar/, 2004.
[9] Resource Description Framework (RDF): Concepts and Abstract Syntax. http://www.w3.org/TR/rdf-concepts/, 2004.
[10] SPARQL query language for RDF. http://www.w3.org/TR/rdf-sparql-query/, 2006.
[11] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003.
[12] G. Bader et al. BioPAX: Biological pathway exchange language. http://www.biopax.org.
[13] D. J. Beckett. The design and implementation of the Redland RDF application framework. In International World Wide Web Conference, pages 449–456, 2001.
[14] C. Beeri, A. Y. Levy, and M.-C. Rousset. Rewriting queries using views in description logics. In PODS, pages 99–108, New York, NY, USA, 1997. ACM Press.
[15] S. Bergamaschi, S. Castano, and M. Vincini. Semantic integration of semistructured and structured data sources. SIGMOD Rec., 28(1):54–59, 1999.
[16] A. Borgida and R. J. Brachman. Loading data into description reasoners. In SIGMOD, pages 217–226, New York, NY, USA, 1993. ACM Press.
[17] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A generic architecture for storing and querying RDF and RDF Schema. In International Semantic Web Conference, number 2342 in Lecture Notes in Computer Science, pages 54–68. Springer Verlag, July 2002.
[18] E. I. Chong, S. Das, G. Eadon, and J. Srinivasan. An efficient SQL-based RDF querying scheme. In VLDB, pages 1216–1227. ACM, 2005.
[19] M. del Mar Roldán García and J. F. A. Montes. A survey on disk oriented querying and reasoning on the semantic web. In Semantic Web and Databases Workshop, held jointly with the 22nd International Conference on Data Engineering (ICDE 2006), page 58. IEEE Computer Society, Apr 2006.
[20] S. Dill, N. Eiron, D. Gibson, D. Gruhl, R. Guha, A. Jhingran, T. Kanungo, S. Rajagopalan, A. Tomkins, J. A. Tomlin, and J. Y. Zien. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation.
In WWW ’03: Proceedings of the 12th International Conference on World Wide Web, pages 178–186, New York, NY, USA, 2003. ACM Press.
[21] L. Ding, K. Wilkinson, C. Sayers, and H. A. Kuno. Application-specific schema design for storing large RDF datasets. In 1st International Workshop on Practical and Scalable Semantic Web Systems held at ISWC 2003, volume 89 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[22] Y. Guo, Z. Pan, and J. Heflin. An evaluation of knowledge base systems for large OWL datasets. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 274–288. Springer, 2004.
[23] Y. Guo, Z. Pan, and J. Heflin. LUBM: A benchmark for OWL knowledge base systems. Journal of Web Semantics, 3(2-3):158–182, 2005.
[24] V. Haarslev and R. Möller. Racer: A core inference engine for the semantic web. In 2nd International Workshop on Evaluation of Ontology-based Tools (EON 2003), volume 87 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[25] S. Harris and N. Gibbins. 3store: Efficient bulk RDF storage. In 1st International Workshop on Practical and Scalable Semantic Web Systems held at ISWC 2003, volume 89 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[26] I. Horrocks. The FaCT system. In Proceedings of the International Conference on Automated Reasoning with Analytic Tableaux and Related Methods (TABLEAUX-98), volume 1397 of LNAI, pages 307–312, Berlin, May 5–8 1998. Springer.
[27] U. Hustadt, B. Motik, and U. Sattler. Reducing SHIQ− description logic to disjunctive datalog programs. In Principles of Knowledge Representation and Reasoning, pages 152–162. AAAI Press, Menlo Park, California, 2004.
[28] C. S. Jr, H. C. Causton, and C. A. Ball. Microarray databases: standards and ontologies. Nature Genetics, pages 469–473, 2002.
[29] A. Kiryakov, D. Ognyanov, and D. Manov. OWLIM: A pragmatic semantic repository for OWL. In WISE Workshops, volume 3807 of Lecture Notes in Computer Science, pages 182–192. Springer, 2005.
[30] H. Knublauch, R. W. Fergerson, N. F. Noy, and M. A. Musen. The Protégé OWL plugin: An open development environment for semantic web applications. In International Semantic Web Conference, volume 3298 of Lecture Notes in Computer Science, pages 229–243. Springer, 2004.
[31] N. Kotecha, K. Bruck, W. Lu, and N. Shah. Pathway Knowledge Base: Integrating BioPAX compliant pathway knowledgebases. In Workshop for W3C Semantic Web Health Care & Life Sciences, held jointly with the 5th International Semantic Web Conference (ISWC 2006), Nov 2006.
[32] Z. Pan and J. Heflin. DLDB: Extending relational databases to support semantic web queries. In 1st International Workshop on Practical and Scalable Semantic Web Systems held at ISWC 2003, volume 89 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.
[33] A. Ruttenberg, J. A. Rees, and J. S. Luciano. Experience using OWL DL for the exchange of biological pathway information. In OWL: Experiences and Directions, 2005.
[34] Y. Theoharis, V. Christophides, and G. Karvounarakis. Benchmarking database representations of RDF/S stores. In Semantic Web Conference, volume 3729 of Lecture Notes in Computer Science, pages 685–701. Springer, 2005.
[35] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF storage and retrieval in Jena2. In Proceedings of VLDB Workshop on Semantic Web and Databases, pages 131–150, 2003.