RDF vs. NoSQL databases for the Semantic Web applications

Peter Bednar*, Martin Sarnovsky**, Viktor Demko**

* Technical University of Kosice, Department of Banking and Investment, Kosice, Slovakia
** Technical University of Kosice, Department of Cybernetics and Artificial intelligence, Kosice, Slovakia
[email protected], [email protected], [email protected]

Abstract: The main objective of the presented paper is to compare and analyze the performance of semantic and NoSQL storage on selected datasets. The paper focuses on a theoretical analysis of the problem and details the performance testing of selected semantic repositories and NoSQL databases. The practical part is focused on the testing of the selected systems; our main aim was to simulate repeated querying with a diverse mix of queries with different criteria. The results of the performed experiments are reported and analyzed.

I. INTRODUCTION

Traditionally, the data of Semantic Web applications are stored in RDF data stores, and one of the main outputs of the Semantic Web initiative was the specification of standard query languages for RDF data, such as SPARQL [1]. Various techniques for indexing RDF data have been proposed and implemented for efficient retrieval and query evaluation. Most of the implemented techniques decompose RDF data into triplets consisting of a subject, predicate and object, and then store indexes for various combinations of the triplet constituents. The indexes are usually stored on disk and/or in memory in a native format, but some triplet stores use traditional relational databases as the persistence backend. Decomposing data into triplets is flexible but can be less efficient, since queries over multiple RDF properties require multiple joins [3]. For this reason, some semantic repositories store data at a higher level, where higher-level semantic constructs such as RDF Schema classes and relationships are mapped to a single relational table.

Requirements on application design, including simplification of the design process, horizontal scaling and finer control over availability, have led to new types of databases for unstructured data, commonly denoted as NoSQL databases. A NoSQL database provides a mechanism for storage and retrieval of data that employs less constrained consistency models than traditional relational databases (i.e. the ACID properties of atomicity, consistency, isolation and durability do not have to be fully supported). NoSQL databases are often highly optimized key-value stores intended for efficient retrieval and appending operations, which provides benefits in terms of latency and throughput. NoSQL databases are also referred to as "Not only SQL" to emphasize that they do in fact allow SQL-like query languages to be used [10].
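
To make the triplet decomposition discussed above more concrete, the following minimal Python sketch builds one index over (predicate, object) pairs and answers a query with two property constraints. It is an illustration only and does not reproduce the indexing of any particular triplet store; the sample data and the names `build_pos_index` and `find_subjects` are hypothetical. Each constraint costs one index lookup, and the partial results have to be joined on the shared subject, which is the overhead referred to above for queries with multiple RDF properties.

```python
from collections import defaultdict

# Hypothetical toy data; the identifiers are illustrative only.
TRIPLES = [
    ("product1", "type", "Product"),
    ("product1", "label", "Laptop"),
    ("product1", "feature", "feature7"),
    ("product2", "type", "Product"),
    ("product2", "label", "Phone"),
    ("product2", "feature", "feature9"),
]


def build_pos_index(triples):
    """Index triplets by (predicate, object) -> set of subjects."""
    index = defaultdict(set)
    for subject, predicate, obj in triples:
        index[(predicate, obj)].add(subject)
    return index


def find_subjects(index, constraints):
    """Return the subjects that satisfy every (predicate, object) constraint.

    Each constraint is a single index lookup. Combining the lookups is a
    join on the subject, implemented here as a set intersection.
    """
    result = None
    for constraint in constraints:
        matches = set(index.get(constraint, set()))
        result = matches if result is None else result & matches
    return result if result is not None else set()


pos_index = build_pos_index(TRIPLES)
matching = find_subjects(pos_index, [("type", "Product"), ("feature", "feature7")])
print(matching)
```

A store that keeps several such indexes (for different orderings of subject, predicate and object) can pick the most selective one for each constraint, but the join step remains for every additional property in the query.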

The term NoSQL database is used for various approaches, which overlap in many cases, but the implemented approaches can generally be divided into key-value stores, document-based databases and graph databases. The main goal of this paper is to compare RDF databases with NoSQL databases in terms of efficiency and simplicity of application design. The main question is whether NoSQL databases provide additional benefits over RDF triplet stores for the implementation of Semantic Web and Linked Data applications. In our comparison, we have included only representatives of document-based NoSQL databases and graph databases, since typical key-value stores do not support complex queries. The investigation of the presented technologies is also relevant to our future research. Our main motivation is to use a suitable NoSQL repository in tasks of generating generalized concept lattices [14], theoretically introduced in [11, 13], as well as in data and text mining tasks within different domains [15, 16].

The paper is organized as follows: the next chapter describes the methodology used for the comparison. Chapter 3 describes the benchmark data and queries used for testing performance and scalability, and chapter 4 describes the tested RDF and NoSQL databases. The results of the tests are reported in chapter 5, followed by the conclusion.

II. METHODOLOGY

In our comparison, we have included both functional and non-functional requirements for the database storage. Functional requirements include effectiveness and scalability. Non-functional requirements include flexibility and simplicity of design and implementation. For the testing of functional requirements, we have followed these steps:

1. Benchmark data modeling: the model for the benchmark data is based on a real case. It contains various types of properties and entities, including data properties, relations between entities and transitive relations.
2. Benchmark data population: we have generated a large set of data according to the benchmark model. The generated data can be customized with various parameters, such as the number of entities of each type, the density of relations and the size of the vocabularies for data properties. Parameterized customization of the benchmark data allows the scalability of particular query types to be tested gradually.
3. Measuring: for functional properties, we have conducted various tests and measured quantitative performance indicators such as request times and memory and disk space usage. For testing, we have used the native client APIs provided for each repository in order to reflect real operating conditions, as they would be implemented in application logic (i.e. our benchmark operates at a higher level and also measures the overhead caused by encoding data into the communication protocol, etc.).
4. Interpretation of the results: quantitative measures are presented in tables and charts, which allow an overall comparison of the databases or a comparison for a particular type of data or query. For each repository, we have also analyzed the overall scalability trends and provided charts showing how the request times of the database scale with the increasing size of the data.

For non-functional requirements, we have followed a less formal approach. For each repository, we have evaluated how difficult it was to create and populate the benchmark data using the native client APIs. In addition, we have evaluated the code complexity for each type of query. We have compared the code complexity in the Java language, which was supported by all databases.

III. BENCHMARK DATA AND QUERIES

Our benchmark data is based on the real case from the e-Business domain. Data describe products, their properties, features and types, information about the product producers and information about the customers buying the products and their product reviews. Data contains:
• Various data properties of numerical type, string or date/time type
• Direct relationship between entities (Product to Product Type, Product to Product Feature, etc.)
• Relations are represented as the standalone entities with additional metadata properties about the relation (Product Offer entity, which connects Product and Vendor or Review, which connects Product and Customer)
• Transitive relations (e.g. Product Types are organized in the hierarchical structure with various depth)

The following table summarizes main data entities and their relations/properties.

Number of generated entities    Properties
37 Product types                Type, Label, Comment, subClassOf, Publisher, Date
1954 Features                   Type, Label, Publisher, Date, Comment
6 Producers                     Type, Label, Comment, Homepage, Country, Publisher, Date
255 Products                    Type, Label, Comment, Type, productPropertyNumeric, productPropertyTextual, productFeature, Publisher, Date
3 Sellers                       Type, Label, Comment, Homepage, Country, Publisher, Date
5100 Offers                     Type, Product, Vendor, Price, ValidFrom, ValidTo, DeliveryDays, OfferWebpage, Date
1 Review page                   Type, Name, Mbox, Country, Publisher, Date
2550 Reviews                    Type, ReviewFor, Reviewer, ReviewDate, Title, Text, Rating, Publisher, Date

Based on the data model, we have proposed various benchmark queries covering all modeled aspects. Queries can be divided according to their complexity into:
• type D: queries filtering data according to data properties or direct relations;
• type R: queries dereferencing relations and filtering data according to the properties of the referenced entities;
• type T: transitive queries dereferencing transitive relations.

The following list summarizes the benchmark queries, each followed by its query type.
• Q1: Find the products with specific properties and features [D]
• Q2: Get selected properties for a specific product [D]
• Q3: Find products of a specific type which have one required feature and one optional feature, with some data property constraints [T]
• Q4: Find products with two features constrained by a property of the features [R]
• Q5: Find products with data properties similar to those of a specified product [R] (the specified product is first dereferenced in order to get the compared values)
• Q6: Find products with a specified text description [R] (labels of products are standalone entities which have to be dereferenced)
• Q7: Get information about all offers and reviews for a specified product [R]
• Q8: Get information about all reviews and their authors for a specific product, constrained by the properties of the review [R]
• Q9: Find information about the reviewer and all of his/her reviews according to one specified review [R]
• Q10: Get all offers for a specified product, filtered according to the offer properties [R]
• Q11: Get selected properties for a specified offer [D]
• Q12: Get selected properties of the offer, product and producer for a specified offer [R]

For the benchmark data, we have implemented a generator, which generates a data set according to the specified parameters (such as the number of products, the number of product types and the depth of the product type hierarchy).
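
A parameterized generator of this kind could be sketched in Python as follows. This is not the generator used in the experiments: the function names, the `productType` predicate, the value range of the numeric property and the way parents are assigned in the hierarchy are all assumptions made for illustration. The sketch only shows how the three parameters named above (number of products, number of product types, depth of the type hierarchy) can drive the generated triplets, and how a fixed seed keeps the data set reproducible.

```python
import random


def generate_product_types(count, depth, rng):
    """Generate product types arranged in a hierarchy with at most `depth` levels.

    The first type is the root. Every other type is placed on a level below
    the root in round-robin order and linked by subClassOf to a random type
    on the level directly above it. Requires depth >= 1.
    """
    levels = [[] for _ in range(depth)]
    triples = []
    for i in range(count):
        name = f"ProductType{i}"
        level = 0 if i == 0 or depth == 1 else 1 + (i - 1) % (depth - 1)
        levels[level].append(name)
        triples.append((name, "type", "ProductType"))
        if level > 0:
            triples.append((name, "subClassOf", rng.choice(levels[level - 1])))
    return triples


def generate_products(count, type_names, rng):
    """Generate products, each with one product type and one numeric property."""
    triples = []
    for i in range(count):
        name = f"Product{i}"
        triples.append((name, "type", "Product"))
        # "productType" is a hypothetical predicate name used in this sketch.
        triples.append((name, "productType", rng.choice(type_names)))
        triples.append((name, "productPropertyNumeric", rng.randint(1, 2000)))
    return triples


def generate_dataset(n_products, n_types, depth, seed=42):
    """Generate a reproducible list of triplets; requires n_types >= 1."""
    rng = random.Random(seed)
    type_triples = generate_product_types(n_types, depth, rng)
    type_names = [s for s, p, _ in type_triples if p == "type"]
    return type_triples + generate_products(n_products, type_names, rng)


dataset = generate_dataset(n_products=10, n_types=5, depth=3)
print(len(dataset), "triplets generated")
```
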
Output of the generator can be stored in various formats (N-Triples, Turtle, RDF/XML, TriG and SQL dump). We have also implemented a tool that reads the data from the triplet representation and maps them to objects in JSON notation for import into the NoSQL databases. For benchmark testing, we have implemented an environment that mixes the list of queries in a specified or random order for repeated measurements, so that we can better simulate the behavior of various application clients and measure perturbations in query times.

IV. EVALUATED DATABASES

For the testing of RDF databases, we have selected the Sesame framework. Sesame supports two query languages: standard SPARQL [8] and SeRQL [7]. Sesame's RDF database API differs from comparable solutions in that it offers a stackable interface through which functionality can be added, and the storage engine is abstracted from the query interface [4]. Many other triplet stores can be used through the Sesame API, including Mulgara and AllegroGraph. For testing purposes, we have selected the following implementations of the Sesame API:
• Sesame native store: the reference implementation of the Sesame API, which uses a native format to store data in the disk file system.
• Sesame SQL store: uses a relational database system (MySQL or PostgreSQL) as the underlying persistence engine.
• BigOWLIM store: an enterprise-grade store designed to scale to billions of RDF statements. In addition to indexing and efficient query evaluation, it supports rule-based reasoning, used for example for transitive relations [5].

For the testing of NoSQL databases, we have included two representatives: one document-based database and one graph database. MongoDB is a cross-platform document-based database system. The main concept of document-based databases is the document, which encapsulates and encodes data and metadata in some standard format or encoding [12]. MongoDB stores documents in BSON, a binary format for documents written in the JavaScript Object Notation (JSON). BSON allows modeling of data properties (single or multivalued), embedded entities, arrays and pointers to other documents. The query language of MongoDB is based on the BSON notation. Besides querying of data, MongoDB also supports map-reduce operations, where user-specified code is executed on the database server for batch processing and aggregation of the data. MongoDB also supports indexing of document fields, similarly to relational database systems, as well as replication and load balancing for reliability and scalability.

As a graph database, we have selected OrientDB. The data model provided by OrientDB is document-oriented, as in MongoDB, but relations between documents are indexed and can be traversed in the complex transitive queries used in graph algorithms. The query language is SQL with some extensions to handle relations without SQL joins and to manage trees and graphs of connected documents. For reliability, it supports multi-master replication; however, the current version is not fully distributed for horizontal scaling.
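
The difference between storing pointers to other documents and traversing relations can be illustrated with a small in-memory Python sketch. It does not use the MongoDB or OrientDB client APIs; the documents, the field names `kind`, `subClassOf` and `productType`, and both functions are hypothetical. Every hop along a `subClassOf` pointer is a separate document lookup, which is the work a graph database can perform inside the engine over indexed relations, and which application code has to perform itself when the store only keeps the pointers.

```python
# Hypothetical documents keyed by id. "subClassOf" and "productType" hold
# pointers (ids) to other documents rather than embedded copies.
DOCS = {
    "type:electronics": {"kind": "ProductType", "subClassOf": None},
    "type:computers": {"kind": "ProductType", "subClassOf": "type:electronics"},
    "type:laptops": {"kind": "ProductType", "subClassOf": "type:computers"},
    "type:furniture": {"kind": "ProductType", "subClassOf": None},
    "product:1": {"kind": "Product", "productType": "type:laptops"},
    "product:2": {"kind": "Product", "productType": "type:computers"},
    "product:3": {"kind": "Product", "productType": "type:furniture"},
}


def is_subtype_of(docs, type_id, ancestor_id):
    """Follow subClassOf pointers upward; each hop is one document lookup."""
    seen = set()
    current = type_id
    while current is not None and current not in seen:
        if current == ancestor_id:
            return True
        seen.add(current)  # guards against cycles in malformed data
        current = docs[current].get("subClassOf")
    return False


def products_of_type(docs, type_id):
    """Return ids of products whose type is type_id or a transitive subtype."""
    return sorted(
        doc_id
        for doc_id, doc in docs.items()
        if doc["kind"] == "Product"
        and is_subtype_of(docs, doc["productType"], type_id)
    )


print(products_of_type(DOCS, "type:electronics"))
```

With the sample data, the query for `type:electronics` returns products 1 and 2, because their types reach `type:electronics` after two and one hops respectively, while product 3 belongs to an unrelated root type.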

V. EVALUATION RESULTS

For testing, we have generated data sets with 100 000, 250 000 and 500 000 triplets. The RDF data were then converted to JSON format for the NoSQL databases. During batch uploading of the data, all tested repositories scaled linearly with the number of triplets. Uploading times and memory/disk requirements were similar, with a small deviation for MongoDB, where the uploading time depends on the number of indexes, which are optional for each data/relation property.
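
The conversion from triplets to JSON documents mentioned above can be sketched in Python as follows. This is only one plausible mapping, not the conversion tool used for the experiments: the function name, the `_id` field and the choice to keep references to other entities as plain id strings are assumptions. All triplets with the same subject are grouped into one document, and a predicate that occurs more than once for a subject becomes an array.

```python
import json
from collections import defaultdict


def triples_to_documents(triples):
    """Group triplets by subject into JSON-like documents.

    Single-valued predicates become plain fields, repeated predicates become
    arrays. References to other entities stay as id strings (an assumption of
    this sketch), so they still have to be dereferenced at query time.
    """
    docs = defaultdict(dict)
    for subject, predicate, obj in triples:
        doc = docs[subject]
        doc["_id"] = subject
        if predicate not in doc:
            doc[predicate] = obj
        elif isinstance(doc[predicate], list):
            doc[predicate].append(obj)
        else:
            doc[predicate] = [doc[predicate], obj]
    return list(docs.values())


# Hypothetical sample triplets.
sample = [
    ("offer1", "price", 10),
    ("offer1", "vendor", "vendor1"),
    ("product1", "feature", "feature1"),
    ("product1", "feature", "feature2"),
]
print(json.dumps(triples_to_documents(sample), indent=2))
```

Embedding the referenced entity in the document instead of keeping its id would avoid the later dereferencing step at the cost of duplicated data; which variant is preferable depends on the queries.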

Figure 1. 6 slow queries for all technologies

The analysis of queries was divided according to time complexity into two groups: fast queries with small result sets and simpler constraints, and slow queries with more complex constraints. Each query evaluation time was measured 10 times. Results for both groups are presented in Figure 1 and Figure 2.
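
A measurement loop of the kind described (a mix of queries executed in a specified or random order, with each query measured repeatedly) could look roughly like the Python sketch below. It is a generic harness and not the benchmark environment used for the reported results; `run_benchmark`, `summarize` and the placeholder queries are hypothetical, and a real run would replace the placeholders with calls to the native client API of each repository.

```python
import random
import statistics
import time


def run_benchmark(queries, repetitions=10, shuffle=True, seed=None):
    """Execute every query `repetitions` times and record the elapsed times.

    `queries` maps a query name to a zero-argument callable. With shuffle=True
    the execution order is re-randomized in every round, so that no query
    always benefits from the same preceding query.
    """
    rng = random.Random(seed)
    timings = {name: [] for name in queries}
    for _ in range(repetitions):
        order = list(queries)
        if shuffle:
            rng.shuffle(order)
        for name in order:
            start = time.perf_counter()
            queries[name]()
            timings[name].append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Return (mean, population standard deviation) in seconds per query."""
    return {
        name: (statistics.mean(values), statistics.pstdev(values))
        for name, values in timings.items()
    }


# Placeholder workloads standing in for real database queries.
placeholder_queries = {
    "Q1": lambda: sum(range(1000)),
    "Q2": lambda: sorted(range(1000), reverse=True),
}
for name, (mean, spread) in summarize(run_benchmark(placeholder_queries, seed=1)).items():
    print(f"{name}: mean {mean:.6f} s, std dev {spread:.6f} s")
```

Reporting the spread next to the mean makes the perturbations between repeated runs visible, which a single averaged value would hide.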

Figure 2. 6 fastest queries for all technologies

Figure 3.

In general, the most problematic queries are relational queries. Dereferencing of relations was especially problematic for NoSQL databases that do not support direct relations (e.g. MongoDB) or nested sub-queries. In this case, to implement some queries, it was necessary to fetch the entities referenced by the relation and then dynamically construct a second query to fetch the final result set. The overall evaluation time of these queries depends mainly on the number of referenced entities. The situation was similar for the transitive relations; however, this was less problematic because our hierarchies were shallow (up to 3 levels).

VI. CONCLUSION

The general conclusion of this paper is that both technologies (RDF triplet stores and NoSQL databases) are quite competitive for semantic applications. Both technologies are flexible and allow simple implementation of data with a non-static, open schema. The main advantage of RDF triplet stores is the standard declarative query language (i.e. SPARQL). Both tested NoSQL databases provide a declarative language which can express dynamic queries; however, the capabilities and syntax are very heterogeneous across various NoSQL databases. On the other hand, NoSQL databases provide better support for transactions, reliability via replication, load balancing and more flexible indexing of properties, similar to relational databases. Another specific feature is the map-reduce framework, which can be used for more complex analysis of large datasets. However, many benefits of NoSQL databases depend on proper data modeling, which together with the heterogeneity of the query interfaces complicates the selection of a data store for a specific problem and the future maintenance of the applications. For these reasons, we can expect some effort towards the standardization of NoSQL database programming interfaces.

ACKNOWLEDGMENT

This work was supported by the Slovak VEGA Grant No. 1/1147/12.

REFERENCES

[1] T. Berners-Lee, J. Hendler, O. Lassila, "The Semantic Web", Scientific American, May 2001, pp. 28-37.
[2] R. Studer, V. R. Benjamins, D. Fensel, "Knowledge Engineering: Principles and Methods", Data & Knowledge Engineering 25, 1998, pp. 161-197.
[3] RDF Specification. Available online (01.11.2011).
[4] SESAME developer and user community led by Aduna. Available at http://www.openrdf.org/
[5] OWLIM technology developed by Ontotext. Available at http://www.ontotext.com/owlim
[6] J. Broekstra, Sesame RQL: a Tutorial [online]. Version 1.2. Published 1997-2004, last updated 10. 2. 2004. Available from UNESCO. The Guide to Electronic Theses & Dissertations [online]. Paris: UNESCO, c2001 [cited 2004-11-10]. Available online: .
[7] The SeRQL query language. User Guide for Sesame, updated for Sesame release 1.2.3 [online]. Aduna B.V., Sirma AI Ltd., 2005. Revision 1.2. Chapter 6. Available at http://www.openrdf.org/doc/sesame/users/ch06.html
[8] E. Prud'hommeaux, A. Seaborne, SPARQL Query Language for RDF. W3C Recommendation, 15 January 2008. Available at http://www.w3.org/TR/2008/REC-rdf-sparql-query-20080115/
[9] RDF Store Benchmarking – W3C. Available at http://www.w3.org/wiki/RdfStoreBenchmarking
[10] C. Strauch, U. Sites, W. Kriha, "NoSQL databases", Lecture Notes, Stuttgart Media University, 2011.
[11] P. Butka, J. Pocsova, J. Pocs, "Comparison of standard and sparse-based implementation of GOSCL algorithm", 2012, CINTI 2012 - 13th IEEE International Symposium on Computational Intelligence and Informatics, Proceedings, art. no. 6496735, pp. 67-71.
[12] T. Hawkins, E. Plugge, P. Membrey, The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing (1st ed.), Apress, p. 350, ISBN 978-1-4302-3051-9.
OrientDB technology developed by NuvolaBase Ltd. Available at http://www.orientdb.org
[13] P. Butka, J. Pócs, J. Pócsová, "Distributed version of algorithm for generalized one-sided concept lattices", 2014, Studies in Computational Intelligence 511, pp. 119-129.
[14] P. Butka, J. Pócs, "Generalization of one-sided concept lattices", 2013, Computing and Informatics 32 (2), pp. 355-370.
[15] J. Paralic, C. Richter, F. Babic, J. Wagner, M. Racek, "Mirroring of Knowledge Practices based on User-defined Patterns", Journal of Universal Computer Science 17(10), pp. 1474–1491, 2011.
[16] F. Babic, P. Bednar, F. Albert, J. Paralic, J. Bartok, L. Hluchy, "Meteorological phenomena forecast using data mining prediction methods", in Proceedings of 3rd International Conference on Computational Collective Intelligence, ICCCI 2011, LNAI 6922, pp. 458–467, 2011.