RDF Data in Property Graph Model
Dominik Tomaszuk
Institute of Informatics, University of Bialystok
[email protected]
Abstract. This paper proposes a formalization of the Property Graph (PG) model, which currently lacks a commonly agreed-upon formal definition. It shows how to store Resource Description Framework (RDF) triples in a form that PG databases can process easily, and it proposes methods for mapping from one model to the other. This matters because many graph databases exist, and our approach makes it possible to load RDF data into them. Moreover, we propose a new serialization for RDF, called YARS, that is compatible with PG solutions.
1 Introduction and Motivations
Graphs are useful for understanding a wide variety of datasets in areas such as government, science, social networks, life sciences, media and geography. The real world is interlinked: in some parts it is uniform, in others it is irregular. Graphs can represent this kind of structure precisely. Two main models allow it: the Property Graph (PG) model and the Resource Description Framework (RDF) model. This paper shows how to store RDF triples in a form that can be easily processed in PG databases. We propose methods for interoperability between PG and RDF data stores. Our proposals enable a user who is familiar with PG databases to load and access RDF data. To accomplish this, we propose a new serialization for PG databases that is compatible with RDF.
The paper is organized as follows. Section 2 is devoted to related work. Section 3 presents RDF concepts. In Section 4 we formalize the PG data model. Section 5 proposes a new RDF serialization that complies with the PG model, introduces an algorithm for transforming RDF graphs to property graphs, and shows an example of our serialization. Section 6 gives detailed results of our implementation and experiments. The paper ends with conclusions.
2 Related Work
In this section we present serializations and data stores from the Property Graphs area and RDF area.
2.1 Serializations
In the Property Graph area there are a few solutions for serializing graphs. They may be divided into two groups: those that use XML and those that are text-based. The first group includes GraphML [2] and DotML¹. Unfortunately, XML does not allow certain characters in attributes that can be used in RDF. The second group includes GraphSON², which uses JSON syntax, and GML [11], which uses a hierarchical textual file format. Both formats have limitations. GraphSON holds vertices and edges in different places, which makes it difficult for humans to read. GML supports only 7-bit ASCII characters.
On the other hand, there are several solutions for RDF serialization. They may be divided into three groups: Turtle-family languages, XML-based serializations and JSON-based serializations. The first group includes Turtle [6], N-Triples [20], TriG [21] and N-Quads [5]. RDF/XML [8] and RDFa [1] are the most important ones in the second group. The third group includes JSON-LD [14] and RDF/JSON [23,22]. Unfortunately, none of these syntaxes matches property graph databases. Turtle* [10] extends the Turtle grammar and can support property graphs, but this proposal goes beyond the RDF standard and does not have many implementations.
There are some papers [9,12,19] that formalize parts of the PG model. In [9] Hartig proposes a formalization of the PG model and introduces transformations between PGs and RDF* [10]. In [12] Jouili et al. propose another definition of PG based on Blueprints³. In [19] Schätzle et al. present a formalization of PG in the RDF context.
2.2 Data Stores
There are a few data stores in the Property Graph world [13,12]. Neo4j [13] is a native graph database purpose-built to leverage not only data but also its relationships. It supports both Cypher and Gremlin. Titan [12] is another graph data store; it is distributed and transactional, and it supports the Gremlin query language. Dex/Sparksee⁴ is yet another data store that uses Gremlin.
On the other hand, there are a lot of RDF data stores [16,4,17]. Jena [16] is a framework that supports SPARQL. It can store RDF in memory and in a relational database. Sesame [4] is another framework for querying and analyzing RDF data. RDF-3X [17] is an implementation of SPARQL that uses a relational database to store RDF triples. Other RDF store proposals are discussed in [15]. There are also data stores that support both Property Graph and RDF: Oracle database [7] and Bigdata/Blazegraph [9].
¹ http://martin-loetzsch.de/DOTML/
² https://github.com/tinkerpop/blueprints/wiki/GraphSON-Reader-and-Writer-Library
³ https://github.com/tinkerpop/blueprints/wiki
⁴ http://sparsity-technologies.com/
3 RDF Basics
The RDF data model rests on the concept of creating web-resource statements in the form of subject-predicate-object expressions, which in RDF terminology are referred to as triples (or statements). Following [24], we provide definitions of RDF triples below. The elemental constituents of the RDF data model are RDF terms, which can be used in reference to resources: anything with identity. The set of RDF terms is divided into three disjoint subsets: IRIs, literals, and blank nodes.
Definition 1 (IRIs). IRIs serve as global identifiers that can be used to identify any resource.
Definition 2 (Literals). Literals are a set of lexical values.
Definition 3 (Blank nodes). Blank nodes are defined as existential variables used to denote the existence of some resource for which an IRI or literal is not given.
Definition 4 (RDF triple). An RDF triple $t$ is defined as a triple $t = \langle s, p, o \rangle$ where $s \in I \cup B$ is called the subject, $p \in I$ is called the predicate and $o \in I \cup B \cup L$ is called the object. $I$ is the set of all Internationalized Resource Identifier (IRI) references, $B$ is an infinite set of blank nodes, and $L$ is the set of RDF literals.
Example 1. The example presents an RDF triple consisting of subject, predicate and object.
$\langle$ http://example.net/me#j, foaf:name, John Smith $\rangle$
A collection of RDF triples intrinsically represents a labeled directed multigraph. The nodes are the subjects and objects of their triples. RDF is often referred to as graph-structured data, where each $\langle s, p, o \rangle$ triple can be interpreted as an edge $s \xrightarrow{p} o$.
Definition 5 (RDF graph). Let $O = I \cup B \cup L$ and $S = I \cup B$; then $G \subset S \times I \times O$ is a finite subset of RDF triples, which is called an RDF graph.
Example 2. The example in Fig. 1 presents an RDF graph of a FOAF⁵ profile in Turtle syntax. This graph includes the following elements:
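@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
<http://example.net/me#j> rdf:type foaf:Person .
<http://example.net/me#j> foaf:name "John Smith" .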
Definition 6 (RDF data store). An RDF data store is any storage system that uses RDF graphs to represent data.
⁵ http://xmlns.com/foaf/spec/
[Figure 1: the node #j with an rdf:type edge to foaf:Person and a foaf:name edge to the literal John Smith.]
Fig. 1. An RDF graph with two triples
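Definitions 4 and 5 can be made concrete with a minimal sketch. The Python below is illustrative only: the classes `IRI`, `BlankNode` and `Literal` and the helper `is_valid_triple` are names introduced here for the example and do not come from RDF 1.1 or from any library. The sketch checks the positional constraints $s \in I \cup B$, $p \in I$, $o \in I \cup B \cup L$ and builds a two-triple graph like the one in Example 2; the namespace IRIs used are the standard RDF and FOAF ones.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    lexical: str


Term = Union[IRI, BlankNode, Literal]
Triple = Tuple[Term, Term, Term]


def is_valid_triple(s: Term, p: Term, o: Term) -> bool:
    """Check the positional constraints of Definition 4."""
    return (
        isinstance(s, (IRI, BlankNode))
        and isinstance(p, IRI)
        and isinstance(o, (IRI, BlankNode, Literal))
    )


RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
me = IRI("http://example.net/me#j")

# An RDF graph is a finite set of triples (Definition 5).
graph: Set[Triple] = {
    (me, IRI(RDF + "type"), IRI(FOAF + "Person")),
    (me, IRI(FOAF + "name"), Literal("John Smith")),
}

all_valid = all(is_valid_triple(*t) for t in graph)
```

Frozen dataclasses are used so that terms are hashable and an IRI never compares equal to a literal with the same text, which keeps the three sets of terms disjoint.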
4 Formalization of the PG Model
The PG data model rests on the concept of creating directed and key/value-based graphs. This means that each edge has a tail and a head, and both vertices and edges can have properties associated with them. Following [18], a property graph has the following characteristics:
– A property graph contains vertices⁶ and edges⁷.
– Vertices can be labeled with one or more labels.
– Vertices contain key-value pairs called properties.
– Edges are named and directed.
– Edges have a start vertex and an end vertex.
– Edges can also contain properties.
– Properties are in the form of arbitrary key-value pairs.
– The keys are strings and the values are arbitrary datatypes.
Following the above characteristics, we provide a formal definition below.
Definition 7 (Property Graph). A Property Graph is a tuple $PG = \langle V, E, S, P, h_e, t_e, l_v, l_e, p_v, p_e \rangle$, where:
1. $V$ is a non-empty set of vertices,
2. $E$ is a set of edges,
3. $S$ is a set of strings,
4. $P$ contains the properties, each of the form $p = \langle k, v \rangle$, where $k \in S$ and $v \in S$,
5. $h_e : E \to V$ is a function which yields the source of each edge (head),
6. $t_e : E \to V$ is a function which yields the target of each edge (tail),
7. $l_v : V \to S$ is a function mapping each vertex to a label,
8. $l_e : E \to S$ is a function mapping each edge to a label,
9. $p_v : V \to 2^P$ is a function used to assign vertices to their multiple properties,
10. $p_e : E \to 2^P$ is a function used to assign edges to their multiple properties.
Note that $\langle V, E, h_e, t_e, l_e \rangle$ is an edge-labeled directed multigraph. Properties can be implemented as an associative array, that is, an unordered list of attributes with associated values.
⁶ Another name for a vertex is a node.
⁷ Another name for an edge is an arc.
[Figure 2: vertices alice (name: Alice, age: 22) and bob (name: Bob), joined by two knows edges, each with the property since: 2001.]
Fig. 2. A property graph with two vertices and two edges
Example 3. The example in Fig. 2 presents a Property Graph. This graph includes the following elements:
– $S = \{name, Alice, Bob, age, 22, since, 2001, knows, alice, bob\}$,
– $V = \{v_1, v_2\}$,
– $p_v(v_1) = \{\langle name, Alice \rangle, \langle age, 22 \rangle\}$,
– $p_v(v_2) = \{\langle name, Bob \rangle\}$,
– $l_v(v_1) = alice$,
– $l_v(v_2) = bob$,
– $E = \{e_1, e_2\}$,
– $h_e(e_1) = v_2$,
– $t_e(e_1) = v_1$,
– $l_e(e_1) = knows$,
– $p_e(e_1) = \{\langle since, 2001 \rangle\}$,
– $h_e(e_2) = v_1$,
– $t_e(e_2) = v_2$,
– $l_e(e_2) = knows$,
– $p_e(e_2) = \{\langle since, 2001 \rangle\}$.
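The tuple of Definition 7 maps naturally onto a small data structure. The following is a minimal sketch, not an implementation used in this paper: the class `PropertyGraph` and its methods `add_vertex`, `add_edge` and `strings` are hypothetical names chosen for illustration. Edge endpoints are given as vertex identifiers because $h_e$ and $t_e$ map into $V$, and $S$ is derived from the labels, keys and values rather than stored separately. The graph built at the end is the one shown in Fig. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Set, Tuple

Property = Tuple[str, str]  # p = <k, v> with k and v drawn from S


@dataclass
class PropertyGraph:
    """Container mirroring the components of Definition 7."""
    V: Set[str] = field(default_factory=set)
    E: Set[str] = field(default_factory=set)
    h_e: Dict[str, str] = field(default_factory=dict)  # edge -> source vertex
    t_e: Dict[str, str] = field(default_factory=dict)  # edge -> target vertex
    l_v: Dict[str, str] = field(default_factory=dict)  # vertex -> label
    l_e: Dict[str, str] = field(default_factory=dict)  # edge -> label
    p_v: Dict[str, FrozenSet[Property]] = field(default_factory=dict)
    p_e: Dict[str, FrozenSet[Property]] = field(default_factory=dict)

    def add_vertex(self, v: str, label: str, props: Dict[str, str]) -> None:
        self.V.add(v)
        self.l_v[v] = label
        self.p_v[v] = frozenset(props.items())

    def add_edge(self, e: str, source: str, target: str, label: str,
                 props: Dict[str, str]) -> None:
        # h_e and t_e are functions E -> V, so both endpoints must exist.
        if source not in self.V or target not in self.V:
            raise ValueError("edge endpoints must be vertices in V")
        self.E.add(e)
        self.h_e[e], self.t_e[e] = source, target
        self.l_e[e] = label
        self.p_e[e] = frozenset(props.items())

    def strings(self) -> Set[str]:
        """Derive S: every label, key and value used in the graph."""
        s = set(self.l_v.values()) | set(self.l_e.values())
        for props in list(self.p_v.values()) + list(self.p_e.values()):
            for key, value in props:
                s.update((key, value))
        return s


# The property graph of Fig. 2.
pg = PropertyGraph()
pg.add_vertex("v1", "alice", {"name": "Alice", "age": "22"})
pg.add_vertex("v2", "bob", {"name": "Bob"})
pg.add_edge("e1", "v2", "v1", "knows", {"since": "2001"})
pg.add_edge("e2", "v1", "v2", "knows", {"since": "2001"})
```

Storing properties as frozen sets of pairs follows $p_v, p_e : \cdot \to 2^P$ literally; a dictionary per vertex or edge would be the more common choice in practice.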
Definition 8 (Property graph data store). A property graph data store is any storage system that uses property graph structures with vertices, edges, and properties to represent data.
5 Serializing RDF in Property Graphs Style
In this section we present an RDF serialization in PG style and propose an algorithm that transforms RDF to our serialization. We propose Yet Another RDF Serialization (YARS), which prepares RDF data for exchange with property graph data stores. Our serialization is textual. It has three different parts:
1. prefix directives – a part where prefixes are defined,
2. vertex declarations – parts where vertices are created,
3. relationship declarations – parts where edges and properties are created.
Prefix directives should be written in the starting lines. Vertex and relationship declarations can be defined in different places. Values of subjects and objects are stored in vertex properties. The same vertices have the same names. Predicates are edge labels.
Example 4. The example presents a YARS serialization that represents the same triples as in Example 2. This property graph includes the following triples:
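:rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
:foaf: <http://xmlns.com/foaf/0.1/>
(a {value:<http://example.net/me#j>})
(b {value:<http://xmlns.com/foaf/0.1/Person>})
(a)-[:rdf:type]->(b)
(c {value:"John Smith"})
(a)-[:foaf:name]->(c)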
We also provide a method for transforming an RDF graph to our serialization. As input, our proposal requires an RDF graph in the abstract syntax, so there is no need to provide a specific RDF serialization. Algorithm 1 presents the creation of a YARS serialization. The algorithm takes the subject (subj() function), predicate (pred() function) and object (obj() function) of each triple in the RDF graph and divides the predicate into two parts. The first part is used to shorten the IRI. The second part is an edge label. Hash strings of the subject and object are vertex names, with the values stored in properties. The next step is the creation of vertices (createVertex() function) and relationships (createRel() function).

input : RDF Graph G
output: YARS Y
P ← ∅;
foreach g ∈ G do
    s ← subj(g);
    p ← pred(g);
    o ← obj(g);
    (p_id, p_name) ← generatePrefix(p);
    if p_id ∉ P then
        addPrefix(p_id, P);
    s_id ← hash(s)    ▷ md5(), sha512(), ...;
    o_id ← hash(o)    ▷ md5(), sha512(), ...;
    s_rel ← createVertex(s_id, s);
    o_rel ← createVertex(o_id, o);
    Y ← createRel(s_rel, p_name, o_rel);
moveBackToBeginning(Y);
Y ← addPrefixes(P);
return Y;
Algorithm 1: YARS generation

YARS can have more than one possible representation at the syntax level. For example, vertex declarations and relationship declarations can be mixed with each other. This feature is desirable for humans because it improves readability. To make processing by property graph data stores easier, we also propose a canonical form of our serialization. A canonical YARS (YARSC) has the following additional constraints:
– prefix directives do not exist; all IRIs are stored in the absolute form,
– edges have a key called iriref and a value, which is a vocabulary IRI or ontology IRI namespace,
– edge labels have the predicate name without a prefix,
– vertex declarations should be at the top of the file,
– relationship declarations should be at the bottom of the file.
The grammar (see Appendix A) for the language is the same. The new feature is the iriref key, which defines the vocabulary IRI or ontology IRI namespace for an edge label.
Example 5. The example presents a canonical YARS serialization that is equivalent to the YARS in Example 4.
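(a {value:<http://example.net/me#j>})
(b {value:<http://xmlns.com/foaf/0.1/Person>})
(c {value:"John Smith"})
(a)-[type {iriref:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>}]->(b)
(a)-[name {iriref:<http://xmlns.com/foaf/0.1/>}]->(c)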
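Algorithm 1 can be sketched in Python roughly as follows. This is an illustrative reading of the pseudocode, not the implementation evaluated in Section 6. The function names `generate_prefix` and `to_yars`, the `KNOWN_PREFIXES` table and the fallback prefix names `ns0`, `ns1`, ... are assumptions made for the example. Triples are assumed to arrive as strings already in lexical form (IRIs in angle brackets, literals quoted), and the namespace of a predicate is assumed to end at its last `#` or `/`. As in the pseudocode, vertices are emitted once per triple, so repeated declarations are possible.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # terms already in their lexical form

# Hypothetical prefix table; unknown namespaces get generated names.
KNOWN_PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://xmlns.com/foaf/0.1/": "foaf",
}


def generate_prefix(predicate: str, prefixes: Dict[str, str]) -> Tuple[str, str]:
    """Split a predicate IRI into (prefix id, local name) and record the prefix."""
    iri = predicate.strip("<>")
    cut = max(iri.rfind("#"), iri.rfind("/")) + 1
    namespace, local = iri[:cut], iri[cut:]
    if namespace not in prefixes:
        fallback = "ns%d" % len(prefixes)
        prefixes[namespace] = KNOWN_PREFIXES.get(namespace, fallback)
    return prefixes[namespace], local


def to_yars(graph: Iterable[Triple], algorithm: str = "md5") -> str:
    """Serialize triples as prefix directives followed by declarations."""
    prefixes: Dict[str, str] = {}
    body: List[str] = []
    for s, p, o in graph:
        p_id, p_name = generate_prefix(p, prefixes)
        # Vertex names are hash strings of the subject and object.
        s_id = hashlib.new(algorithm, s.encode("utf-8")).hexdigest()
        o_id = hashlib.new(algorithm, o.encode("utf-8")).hexdigest()
        body.append("(%s {value:%s})" % (s_id, s))
        body.append("(%s {value:%s})" % (o_id, o))
        body.append("(%s)-[:%s:%s]->(%s)" % (s_id, p_id, p_name, o_id))
    # Prefix directives are written in the starting lines.
    header = [":%s: <%s>" % (name, ns) for ns, name in prefixes.items()]
    return "\n".join(header + body)


graph = [
    ("<http://example.net/me#j>",
     "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
     "<http://xmlns.com/foaf/0.1/Person>"),
    ("<http://example.net/me#j>",
     "<http://xmlns.com/foaf/0.1/name>",
     '"John Smith"'),
]
yars_text = to_yars(graph)
```

Collecting the body first and prepending the header at the end plays the role of moveBackToBeginning() and addPrefixes() in the pseudocode. Passing `algorithm="sha512"` switches the hash function.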
Algorithm 2 presents the canonicalization of a YARS serialization. As input, the algorithm requires YARS. In the first step, prefix directives are removed. In the second step, the algorithm divides the content into vertex declarations (addVertexDeclaration() function) and relationship declarations (addRelationshipDeclaration() function). In the next step, edges are enriched with a property that holds the vocabulary namespace, and the prefix is stripped from the label. The last step merges the vertex declarations and relationship declarations (createCanonical() function). YARS canonicalization has a worst-case space complexity of $O(|Y| \cdot |P|)$.
input : YARS Y
output: Canonical YARS Y_c
P ← getPrefixes(Y)    ▷ prefixes structure;
Y_c ← removePrefixDirectives(Y);
foreach y ∈ Y_c do
    if y is vertex declaration then
        D_v ← addVertexDeclaration(y);
    else
        D_r ← addRelationshipDeclaration(y);
        foreach p ∈ P do
            Y_c ← ContextEnrich(y, p);
            Y_c ← removePrefix(y, p);
Y_c ← createCanonical(D_v, D_r);
return Y_c;
Algorithm 2: YARS canonicalization
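The canonicalization steps of Algorithm 2 can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the evaluated implementation: the function `canonicalize` and the two regular expressions are hypothetical, each declaration is assumed to sit on its own line, relationship labels are assumed to have the form `:prefix:name`, and vertex values are assumed to be absolute already. The IRIs in the sample input are likewise assumptions made for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical line patterns for prefix directives and relationships.
PREFIX_RE = re.compile(r"^:(\w+):\s*<([^>]*)>$")
REL_RE = re.compile(r"^\((\w+)\)-\[:(\w+):(\w+)\]->\((\w+)\)$")


def canonicalize(yars: str) -> str:
    """Turn line-oriented YARS into canonical YARS (YARSC)."""
    prefixes: Dict[str, str] = {}
    vertices: List[str] = []
    relationships: List[Tuple[str, str, str, str]] = []

    for raw in yars.splitlines():
        line = raw.strip()
        if not line:
            continue
        directive = PREFIX_RE.match(line)
        if directive:
            # Prefix directives are collected and dropped from the output.
            prefixes[directive.group(1)] = directive.group(2)
            continue
        rel = REL_RE.match(line)
        if rel:
            source, prefix, label, target = rel.groups()
            relationships.append((source, prefix, label, target))
        else:
            vertices.append(line)

    # Enrich each edge with iriref and strip the prefix from its label.
    canonical_rels = [
        "(%s)-[%s {iriref:<%s>}]->(%s)" % (s, label, prefixes[prefix], t)
        for s, prefix, label, t in relationships
    ]
    # Vertex declarations first, relationship declarations last.
    return "\n".join(vertices + canonical_rels)


yars = """\
:rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
:foaf: <http://xmlns.com/foaf/0.1/>
(a {value:<http://example.net/me#j>})
(b {value:<http://xmlns.com/foaf/0.1/Person>})
(a)-[:rdf:type]->(b)
(c {value:"John Smith"})
(a)-[:foaf:name]->(c)
"""
canonical = canonicalize(yars)
```

Prefixes are resolved only after the whole input has been read, so the sketch does not depend on the directives appearing before the relationships that use them. An undeclared prefix raises `KeyError`.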
6 Implementation and Evaluation
In this section we evaluate YARS generation based on our implementation of Algorithm 1 and Algorithm 2. All experiments were executed on an Intel Core i7-4770K CPU @ 3.50GHz (4 cores, 8 threads), 8GB of RAM (clock speed: 1600 MHz), and an HDD with a reading speed rated at ∼160 MB/sec⁸. We used Linux Mint 17.3 Rosa (kernel version 3.13.0) and Python 3.4.3 with RDFLib 4.2.1⁹.
To test our serialization we implemented an N-Triples generator and a tool that transforms its output into YARS and Turtle. We prepared 10 datasets in YARS, N-Triples and Turtle. The YARS generation times are presented in Fig. 3. We consider two versions of the implementation, with the MD5 and SHA512 algorithms. The plot shows the arithmetic mean encoding time from 10 runs. It shows that the times are nearly quadratic in the number of RDF triples in both cases. In this case we assume that we do not know how many prefixes should be shortened, so we use the RDF graph abstract syntax. The results can be improved when we consider a specific RDF serialization with prefixes at the beginning of the file, such as Turtle¹⁰.
In the next step we tested serialization file size. We define the size ratio $r^x_y$ as $\frac{nt_{size}}{y_{size}}$, where $nt_{size}$ is the size of an N-Triples file, $y_{size}$ is the size of Turtle and YARS files, and the superscript $x$ marks the compression applied, if any. We also test ZLib¹¹ compressed serializations. Fig. 4 analyzes the size ratios of YARS and YARSC. It shows that the plain YARS serialization ($r_{yars}$) has similar ratios to Turtle ($r_{ttl}$) and better ratios compared to N-Triples. In zlib compressed serialization the YARS results ($r^{zlib}_{yars}$) are similar to N-Triples ($r^{zlib}_{nt}$) and Turtle ($r^{zlib}_{ttl}$). Both YARSC ($r_{yarsc}$) and YARSC with ZLib compression ($r^{zlib}_{yarsc}$) have the worst ratios, because they include additional data to make processing easier and faster.
⁸ We tested it with hdparm -t.
⁹ https://github.com/RDFLib/rdflib
¹⁰ If we assume that this serialization has all prefixes shortened.
¹¹ http://www.zlib.net/
[Figure 3: generation time (Time, s) plotted against the number of triples (0.1 to 1, ×10⁴) for three series: Abstract syntax ↦ YARS (md5), Abstract syntax ↦ YARS (sha512), Turtle ↦ YARS (md5).]
Fig. 3. The YARS generation time result
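The size ratio used in this section is simple to compute. The sketch below is illustrative: the function name `size_ratio` and the sample byte strings are made up for the example. It assumes that the numerator is always the uncompressed N-Triples size and that only the compared serialization is compressed, which is one reading of the legend of Fig. 4 (it lists a ratio for compressed N-Triples itself).

```python
import zlib


def size_ratio(nt_data: bytes, y_data: bytes, compress: bool = False) -> float:
    """Return nt_size / y_size, with y optionally zlib-compressed first."""
    y_size = len(zlib.compress(y_data)) if compress else len(y_data)
    return len(nt_data) / y_size


# Made-up sample data: the same statement in a long and a short spelling.
nt_data = b"<http://example.net/me#j> <http://xmlns.com/foaf/0.1/name> \"John Smith\" .\n" * 1000
y_data = b"<http://example.net/me#j> foaf:name \"John Smith\" .\n" * 1000

plain_ratio = size_ratio(nt_data, y_data)
zlib_ratio = size_ratio(nt_data, y_data, compress=True)
```

A ratio above 1 means the compared serialization is smaller than the N-Triples file; highly repetitive data such as the sample above compresses far better than real datasets would.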
7 Conclusions and Future Work
Graphs are used in many areas of our lives. There are two main graph models: the RDF model and the PG model. The first one is well studied and formalized. This paper proposes a formalization of the PG model. Moreover, we present how to store RDF triples in a form that can be easily processed. We propose a new serialization for RDF that is compatible with PG databases and based on Cypher syntax.
Future work will focus on preparing algorithms for mapping SPARQL into PG query languages such as Cypher and Gremlin. Another challenge is reducing repeated nodes in our serialization to reduce the size of a document and speed up processing. Support for RDF named graphs should also be considered.
[Figure 4: two panels plotting Ratio against the number of triples (up to 1, ×10⁴); one panel shows $r_{ttl}$, $r_{yars}$ and $r_{yarsc}$, the other shows $r^{zlib}_{nt}$, $r^{zlib}_{ttl}$, $r^{zlib}_{yars}$ and $r^{zlib}_{yarsc}$.]
Fig. 4. Format size ratios
Acknowledgements The author gratefully acknowledges the members of the Neo4j team. We thank Olaf Hartig for comments that greatly improved the paper.
References
1. Ben Adida, Mark Birbeck, Shane McCarron, and Ivan Herman. RDFa Core 1.1 Third Edition. W3C recommendation, World Wide Web Consortium, March 2015. http://www.w3.org/TR/2015/REC-rdfa-core-20150317/.
2. Ulrik Brandes, Markus Eiglsperger, Ivan Herman, Michael Himsolt, and M Scott Marshall. GraphML progress report structural layer proposal. In Graph drawing, pages 501–512. Springer, 2001.
3. Tim Bray, Jean Paoli, C. M. Sperberg-McQueen, Eve Maler, and François Yergeau. EBNF Notation. W3C recommendation, World Wide Web Consortium, November 2008. https://www.w3.org/TR/2008/REC-xml-20081126/#sec-notation.
4. Jeen Broekstra, Arjohn Kampman, and Frank Van Harmelen. Sesame: A generic architecture for storing and querying rdf and rdf schema. In The Semantic Web – ISWC 2002, pages 54–68. Springer, 2002.
5. Gavin Carothers. RDF 1.1 N-Quads. W3C recommendation, World Wide Web Consortium, February 2014. http://www.w3.org/TR/2014/REC-n-quads-20140225/.
6. Gavin Carothers and Eric Prud'hommeaux. RDF 1.1 Turtle. W3C recommendation, World Wide Web Consortium, February 2014. http://www.w3.org/TR/2014/REC-turtle-20140225/.
7. Souripriya Das, Jagannathan Srinivasan, Matthew Perry, Eugene Inseok Chong, and Jayanta Banerjee. A Tale of Two Graphs: Property Graphs as RDF in Oracle. In EDBT, pages 762–773, 2014.
8. Fabien Gandon and Guus Schreiber. RDF 1.1 XML Syntax. W3C recommendation, World Wide Web Consortium, February 2014. http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/.
9. Olaf Hartig. Reconciliation of RDF* and property graphs. arXiv preprint arXiv:1409.3288, 2014.
10. Olaf Hartig and Bryan Thompson. Foundations of an alternative approach to reification in RDF. arXiv preprint arXiv:1406.3399, 2014.
11. Michael Himsolt. GML: A portable graph file format. Html page under http://www.fmi.uni-passau.de/graphlet/gml/gml-tr.html, Universität Passau, 1997.
12. Salim Jouili and Valentin Vansteenberghe. An empirical comparison of graph databases. In Social Computing (SocialCom), pages 708–715. IEEE, 2013.
13. Mahesh Lal. Neo4j Graph Data Modeling. Packt Publishing, 2015.
14. Markus Lanthaler, Manu Sporny, and Gregg Kellogg. JSON-LD 1.0. W3C recommendation, World Wide Web Consortium, January 2014. http://www.w3.org/TR/2014/REC-json-ld-20140116/.
15. Norbert Martínez-Bazan, Victor Muntés-Mulero, Sergio Gómez-Villamor, Jordi Nin, Mario-A Sánchez-Martínez, and Josep-L Larriba-Pey. Dex: high-performance exploration on large graphs for information retrieval. In Proceedings of the sixteenth ACM conference on information and knowledge management, pages 573–582. ACM, 2007.
16. Brian McBride. Jena: Implementing the RDF Model and Syntax Specification. In SemWeb, 2001.
17. Thomas Neumann and Gerhard Weikum. RDF-3X: a RISC-style engine for RDF. Proceedings of the VLDB Endowment, 1(1):647–659, 2008.
18. Ian Robinson, Jim Webber, and Emil Eifrem. Graph databases. O'Reilly Media, Inc., 2013.
19. Alexander Schätzle, Martin Przyjaciel-Zablocki, Thorsten Berberich, and Georg Lausen. S2X: Graph-Parallel Querying of RDF with GraphX, pages 155–168. Springer International Publishing, 2016.
20. Andy Seaborne and Gavin Carothers. RDF 1.1 N-Triples. W3C recommendation, World Wide Web Consortium, February 2014. http://www.w3.org/TR/2014/REC-n-triples-20140225/.
21. Andy Seaborne and Gavin Carothers. RDF 1.1 TriG. W3C recommendation, World Wide Web Consortium, February 2014. http://www.w3.org/TR/2014/REC-trig-20140225/.
22. Dominik Tomaszuk. Flat triples approach to RDF graphs in JSON. In W3C Workshop – RDF Next Steps. World Wide Web Consortium, 2010.
23. Dominik Tomaszuk. Named graphs in RDF/JSON serialization. Zeszyty Naukowe Politechniki Gdańskiej, pages 273–278, 2011.
24. David Wood, Markus Lanthaler, and Richard Cyganiak. RDF 1.1 Concepts and Abstract Syntax. W3C recommendation, World Wide Web Consortium, February 2014. http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/.
A Appendix: YARS grammar
In this appendix we present the grammar of YARS in EBNF [3].
doc         ::= elem*
elem        ::= directive | declaration
directive   ::= ":" alnum ":" S ""
declaration ::= vertex | rel
prop        ::= "{" S alnum ":" S alnum S "}"
vertex      ::= alnum prop
node        ::= "(" (vertex | alnum) ")"
rel         ::= node "-[" alnum ( prop )+ "]->" node
alnum       ::= (ALPHA | DIGIT | "_")+
S           ::= (#x20 | #x9 | #xD | #xA)+
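A line-level recognizer in the spirit of this grammar can be sketched with regular expressions. The sketch is deliberately looser than the EBNF above and rests on several assumptions drawn from the examples in Section 5 rather than from the grammar itself: whitespace is optional, property values may be an IRI in angle brackets, a quoted string or an alnum token, an edge label may carry a `:prefix:` part, and properties on an edge are optional. The names `classify` and `RULES` are hypothetical.

```python
import re
from typing import Optional

ALNUM = r"[A-Za-z0-9_]+"
# Assumed value forms: <IRI>, "quoted string", or a bare alnum token.
VALUE = r'(?:<[^>]*>|"[^"]*"|' + ALNUM + r")"
PROP = r"\{\s*" + ALNUM + r"\s*:\s*" + VALUE + r"\s*\}"
# Assumed label form: an optional :prefix: followed by the name.
LABEL = r"(?::" + ALNUM + r":)?" + ALNUM
NODE = r"\(\s*" + ALNUM + r"(?:\s*" + PROP + r")?\s*\)"

RULES = [
    ("directive", re.compile(r":" + ALNUM + r":\s*<[^>]*>")),
    ("rel", re.compile(NODE + r"-\[" + LABEL + r"(?:\s*" + PROP + r")*\]->" + NODE)),
    ("vertex", re.compile(NODE)),
]


def classify(line: str) -> Optional[str]:
    """Return 'directive', 'rel' or 'vertex', or None if nothing matches."""
    text = line.strip()
    for name, pattern in RULES:
        if pattern.fullmatch(text):
            return name
    return None


kinds = [
    classify(":foaf: <http://xmlns.com/foaf/0.1/>"),
    classify('(c {value:"John Smith"})'),
    classify("(a)-[:foaf:name]->(c)"),
]
```

A recognizer of this kind is enough to split a file into the three parts of the serialization; a full parser would also need to extract names, labels and property values.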