Li-ming LIN, Guang-cao LIU*, Yan WANG, Wei LU
Star-shaped SPARQL Query Optimization on Column-family Overlapping Storage
Abstract: Column families are widely used to store unstructured or semi-structured data. Traditional RDF storage methods, however, divide data into independent triples, which leads to poor SPARQL query performance. Here, we propose a JoinFirst SPARQL-translation strategy that improves the performance of star-shaped SPARQL queries. Experiments demonstrate that the strategy helps when joins are necessary in star-shaped queries: with three or four joins, queries run roughly 10 to 100 times faster than with the compared solutions.
Keywords: RDF; Star-shaped Query; SPARQL; Column family
1 Introduction
In the first decade of the 21st century, the W3C adopted the RDF model as a specification for the conceptual description and modeling of knowledge. An RDF statement is composed of a subject, a predicate and an object, and a set of RDF statements can be represented as a labeled, directed multi-graph in which each statement corresponds to an edge. The model is therefore capable of representing the semi-structured or unstructured data that are common on the Web and the Semantic Web. There are many RDF data sources, such as Bio2RDF [1], which contains more than 10 billion RDF statements, and the need to query RDF data sets has been increasing as well. The most widely used RDF query specification is SPARQL, which is recommended by the W3C.
Figure 1 depicts an example of an RDF graph and a corresponding SPARQL query, where an instance is represented as a node and a relationship between two instances as an edge. Besides a relationship (foaf:knows) between two persons (foaf:person1 and foaf:person2), the example also describes information about each person, such as foaf:firstName and foaf:surname. From this example, we can see that the unit of an RDF dataset is an edge, which is also called an RDF statement. In traditional RDF storage solutions [2-5], data are therefore stored as triples, each of which corresponds to one RDF statement, and SPARQL optimization techniques are restricted by such storage solutions.
*Corresponding author: Guang-cao LIU, Xiamen Great Power Geo Info. Tech. Co. Ltd., State Grid Information & Telecommunication Group, Xiamen, China, E-mail: [email protected]
Li-ming LIN, Xiamen Great Power Geo Info. Tech. Co. Ltd., State Grid Information & Telecommunication Group, Xiamen, China
Yan WANG, School of Computer & Information Engineering, Xiamen University of Technology, Xiamen, China
Wei LU, School of Information, Renmin University of China, Beijing, China
Unauthenticated Download Date | 12/31/17 6:01 PM
Figure 1. An Example of an RDF Dataset’s Graph Representation and a SPARQL Query
Researchers have proposed many solutions for SPARQL optimization, such as matrix set operations on a bitmap to reduce the number of joins or intermediate results [6], an index technique for group-by queries [7], and data partitioning strategies based on semantic hashing [8]. However, the key problem remains: the RDF statements about one instance are stored in scattered places, which increases the cost of merging the information about a requested instance. In [9], we therefore proposed an RDF storage strategy that extracts RDF statements with frequent predicate pairs and stores them in column families. Tables 1 to 3 illustrate column families extracted from RDF statements. Here, we introduce the extraction rule we use.
Rule 1 (Overlapping Rule): Overlapping of column families exists only in their structure, not in their content, which means that there is no redundant information among column families.
Because the proposed strategy is based on mining frequent predicate pairs, overlapping can occur among column families. Overlapping is seldom seen in traditional database theory, so query optimizers do not work well in such circumstances. In this paper, we design an efficient SPARQL optimization technique for databases with column-family overlapping that takes full advantage of column storage.

Table 1. A column family with two predicates
ID            foaf:firstName  foaf:job
foaf:person5  Alex            Dispatcher
foaf:person2  Konstantinos    Customer service staff
Table 2. Another column family with one predicate
ID            foaf:surName
foaf:person6  Valarakos
foaf:person4  Stergiou
Table 3. Another column family with two predicates
ID            foaf:firstName  foaf:surName
foaf:person1  Page            Charles
foaf:person3  David           Smith
2 Optimizations for Star-shaped Query
2.1 Description of a Star-shaped Query
The purpose of a star-shaped query is to get as much information as possible about one instance, which typically appears as the subject of several RDF statements. The example in Figure 2 shows a query that obtains the firstName and surName of foaf:person1, where an identifier beginning with the symbol "?" represents a variable. Here, we call a query star-shaped only if it queries at least two predicates.
Figure 2. An Example of a Star-shaped Query Represented as a Graph
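The definition above can be sketched as a small check over triple patterns. This is a minimal illustration, not part of the original system: the tuple representation and the function name `is_star_shaped` are assumptions made for the example.

```python
from typing import List, Tuple

# A triple pattern as (subject, predicate, object); this representation is
# an assumption made for illustration only.
TriplePattern = Tuple[str, str, str]


def is_star_shaped(patterns: List[TriplePattern]) -> bool:
    """Return True if all patterns share one subject and at least two
    distinct predicates are queried."""
    subjects = {s for s, _, _ in patterns}
    predicates = {p for _, p, _ in patterns}
    return len(subjects) == 1 and len(predicates) >= 2


# The query of Figure 2: firstName and surName of foaf:person1.
figure2_query = [
    ("foaf:person1", "foaf:firstName", "?firstName"),
    ("foaf:person1", "foaf:surName", "?surName"),
]
print(is_star_shaped(figure2_query))
```

A query with a single predicate, or with patterns on different subjects, is rejected by this check.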
2.2 Optimization
The optimization can be divided into two stages. In the first stage, the star-shaped query is transformed into tuple filterings and projections, which avoids costly edge-join operations on column storage. In the second stage, every column family that does not contain all predicates of the star-shaped query is filtered out. Scans on these column families are thereby avoided, and performance improves considerably.
π_{foaf:firstName, foaf:surName}(
    ρ_{t1}( π_{foaf:firstName, ID}(T1) ∪ π_{foaf:firstName, ID}(T3) )
    ⋈_{t1.ID = t2.ID}
    ρ_{t2}( π_{foaf:surName, ID}(T2) ∪ π_{foaf:surName, ID}(T3) ) )
Figure 3. SPARQL Query on Column-Families Represented as Relation Algebra
Suppose there are several column families in the database, T1={ID, foaf:firstName}, T2={ID, foaf:surName} and T3={ID, foaf:firstName, foaf:surName}, such as those in Table 1 to Table 3. A typical UnionFirst query, shown in Figure 3, aims to find all pairs of foaf:firstName and foaf:surName. It first takes the union of firstName over all tables and the union of surName over all tables, then joins firstName and surName together. The symbol σ_{ID=foaf:person1}(T1) selects the rows in which ID=foaf:person1, π_{foaf:firstName, ID} filters columns, leaving the foaf:firstName and ID columns, ∪ is the union operation, and ρ_{t1} renames a relation as t1, etc. Because join distributes over union, we can use the distributive law to transform the original query into the one shown in Figure 4. The original UnionFirst query with one join is thus transformed into a JoinFirst query.
  π_{foaf:firstName, foaf:surName}( π_{foaf:firstName, ID}(ρ_{t1}(T1)) ⋈_{t1.ID = t2.ID} π_{foaf:surName, ID}(ρ_{t2}(T2)) )
∪ π_{foaf:firstName, foaf:surName}( π_{foaf:firstName, ID}(ρ_{t3}(T1)) ⋈_{t3.ID = t4.ID} π_{foaf:surName, ID}(ρ_{t4}(T3)) )
∪ π_{foaf:firstName, foaf:surName}( π_{foaf:firstName, ID}(ρ_{t5}(T3)) ⋈_{t5.ID = t6.ID} π_{foaf:surName, ID}(ρ_{t6}(T2)) )
∪ π_{foaf:firstName, foaf:surName}( π_{foaf:firstName, ID}(ρ_{t7}(T3)) ⋈_{t7.ID = t8.ID} π_{foaf:surName, ID}(ρ_{t8}(T3)) )
Figure 4. Result of Using Join Distribution Law On Original Query
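The equivalence of the UnionFirst and JoinFirst plans can be checked on the toy data of Tables 1 to 3. The following is a minimal sketch: the in-memory row representation and the helper names `proj` and `join_on_id` are assumptions made for illustration and do not reflect how a column store would execute these plans.

```python
# Toy column families mirroring Tables 1 to 3 (rows as dicts keyed by column).
T1 = [
    {"ID": "foaf:person5", "foaf:firstName": "Alex", "foaf:job": "Dispatcher"},
    {"ID": "foaf:person2", "foaf:firstName": "Konstantinos",
     "foaf:job": "Customer service staff"},
]
T2 = [
    {"ID": "foaf:person6", "foaf:surName": "Valarakos"},
    {"ID": "foaf:person4", "foaf:surName": "Stergiou"},
]
T3 = [
    {"ID": "foaf:person1", "foaf:firstName": "Page", "foaf:surName": "Charles"},
    {"ID": "foaf:person3", "foaf:firstName": "David", "foaf:surName": "Smith"},
]

FN, SN = "foaf:firstName", "foaf:surName"


def proj(table, column):
    """Projection onto (ID, column), returned as a set of tuples."""
    return {(row["ID"], row[column]) for row in table}


def join_on_id(left, right):
    """Equi-join on ID, followed by projection onto the two value columns."""
    return {(lv, rv) for (lid, lv) in left for (rid, rv) in right if lid == rid}


# UnionFirst (Figure 3): union per predicate, then a single join.
union_first = join_on_id(proj(T1, FN) | proj(T3, FN),
                         proj(T2, SN) | proj(T3, SN))

# JoinFirst (Figure 4): one join per pair of column families, then union.
join_first = (join_on_id(proj(T1, FN), proj(T2, SN))
              | join_on_id(proj(T1, FN), proj(T3, SN))
              | join_on_id(proj(T3, FN), proj(T2, SN))
              | join_on_id(proj(T3, FN), proj(T3, SN)))

print(union_first == join_first)
```

On this data only the join of T3 with itself contributes rows; the other three joins are empty because no ID is shared between different column families.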
According to Rule 1, if the structure of a column family (such as T1) is a subset of that of another column family (such as T3), then the content of T3 does not appear again in T1. So the join result of T1 and T3 is empty, and the join result of T2 and T3 is empty. Additionally, the join result of T1 and T2 is empty in this case. Three of the joins in Figure 4 can therefore be filtered out, and only the join of T3 with itself remains, as shown in Figure 5. Furthermore, because the join condition is equality of ID, this self-join can be simplified to a projection on T3, which is shown in Figure 6. Rule 2 can be concluded from this process, and it can be extended to scenarios with many joins.
π_{foaf:firstName, foaf:surName}( π_{foaf:firstName, ID}(ρ_{t7}(T3)) ⋈_{t7.ID = t8.ID} π_{foaf:surName, ID}(ρ_{t8}(T3)) )
Figure 5. Reduction
π_{foaf:firstName, foaf:surName}(T3)
Figure 6. The Final Result
Rule 2 (Heuristic Rule of Star-shaped Query): When RDF data are stored in overlapping column families, a star-shaped SPARQL query can be equivalently transformed into a union of projections on the column families that contain all the queried predicates:
{?x Pred1 ?V1}. {?x Pred2 ?V2}. … {?x Predn ?Vn} → π_{ID, Pred1, …, Predn}(T1) ∪ … ∪ π_{ID, Pred1, …, Predn}(Tm)
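Rule 2 can be illustrated with a short sketch over the toy data of Tables 1 to 3. The dictionary layout `FAMILIES` and the function name `star_query` are hypothetical and chosen only for this example; a real column store would apply the same schema filter before scanning.

```python
# Each column family is a pair (set of predicates in its schema, list of rows).
# The layout is an assumption made for illustration only.
FAMILIES = {
    "T1": ({"foaf:firstName", "foaf:job"}, [
        {"ID": "foaf:person5", "foaf:firstName": "Alex",
         "foaf:job": "Dispatcher"},
        {"ID": "foaf:person2", "foaf:firstName": "Konstantinos",
         "foaf:job": "Customer service staff"},
    ]),
    "T2": ({"foaf:surName"}, [
        {"ID": "foaf:person6", "foaf:surName": "Valarakos"},
        {"ID": "foaf:person4", "foaf:surName": "Stergiou"},
    ]),
    "T3": ({"foaf:firstName", "foaf:surName"}, [
        {"ID": "foaf:person1", "foaf:firstName": "Page",
         "foaf:surName": "Charles"},
        {"ID": "foaf:person3", "foaf:firstName": "David",
         "foaf:surName": "Smith"},
    ]),
}


def star_query(families, predicates):
    """Union of projections (ID, Pred1, ..., Predn) over the column families
    whose schema contains every queried predicate; no join is executed."""
    wanted = set(predicates)
    result = set()
    for schema, rows in families.values():
        if not wanted <= schema:
            continue  # this family lacks a queried predicate: skip its scan
        for row in rows:
            result.add((row["ID"],) + tuple(row[p] for p in predicates))
    return result


rows = star_query(FAMILIES, ["foaf:firstName", "foaf:surName"])
for r in sorted(rows):
    print(r)
```

Only T3 is scanned for this query, which corresponds to the single projection of Figure 6.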
3 Experiment and analysis
3.1 Setup
The dataset for the experiments is a subset of Yago that contains 1,000,000 triples, 741,165 subjects, 35 predicates and 494,512 objects. The effectiveness of the optimization is verified by increasing the number of star joins one at a time. Three solutions are compared in our experiments. The first, called the Triple solution, stores RDF triples in column families in which each row contains only one predicate. The second, called UnionFirst, translates a star-shaped query into a join of several unions, like the query shown in Figure 3. The last, called JoinFirst, translates the query into a union of joins, like the query shown in Figure 4, and then uses Rule 2 to reduce it to a union of projections only, as in the example shown in Figure 6. To treat the three solutions fairly, an index on ID is added in the Triple solution.
3.2 Experiment
The star-shaped query template used in the experiment can be represented as SELECT ?a WHERE {?a Pred1 ?v1}. {?a Pred2 ?v2}. … {?a Predn ?vn}. The purpose is to query the instances that have values for all predicates Pred1, Pred2, …, Predn. At the beginning, we only look for instances that have a value for Pred1, so there is no join. Then instances that have values for both Pred1 and Pred2 are looked for, so the number of joins is 1. The queried predicates are added one by one, and the number of joins increases accordingly.
The comparison of the three solutions is shown in Figure 7. The x-axis is the number of joins, and the y-axis is the query execution time. It should be pointed out that the y-axis uses an exponential (logarithmic) scale. Figure 7 shows that when the number of joins is 0 (that is, there is no join), the execution time of JoinFirst is the longest, because joins are unnecessary. As queried predicates are added, the relative performance of JoinFirst improves gradually. In particular, when the number of joins is 3 or 4, JoinFirst is 10 or 100 times faster than the other two solutions.
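The way the query grows with the number of predicates can be sketched as follows. This is a hypothetical helper, not the experiment's actual driver: the names `build_star_query` and `join_count` are assumptions, the predicate names are placeholders, and the triple patterns are wrapped in a single group, which is an assumption about how the printed template would be written for a SPARQL engine.

```python
def build_star_query(predicates):
    """Build a star-shaped SELECT query for the given predicates.

    All triple patterns share the subject variable ?a; the i-th predicate
    binds its value to ?v<i>.
    """
    if not predicates:
        raise ValueError("at least one predicate is required")
    body = " ".join(
        f"?a {pred} ?v{i} ." for i, pred in enumerate(predicates, start=1)
    )
    return f"SELECT ?a WHERE {{ {body} }}"


def join_count(predicates):
    """Number of joins on ?a implied by n triple patterns (n - 1)."""
    return len(predicates) - 1


# Placeholder predicate names; the experiment's actual Yago predicates are
# not listed in the text.
preds = ["Pred1", "Pred2", "Pred3"]
for n in range(1, len(preds) + 1):
    print(join_count(preds[:n]), build_star_query(preds[:n]))
```

Each additional predicate adds one triple pattern and therefore one more join on ?a, matching the x-axis of Figure 7.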
Figure 7. Performance of Star-shaped Query on Yago
A second point to note is that the worst case for the UnionFirst and Triple solutions occurs when the number of joins is 3, because the number of intermediate results is largest at that point. When the fourth join is added, the intermediate results are reduced. However, the inflection point of these two solutions is not easy to estimate for different queries, and even at that point JoinFirst is much faster than the others.
4 Conclusion
In this paper, we propose an optimization strategy for star-shaped SPARQL queries executed on RDF data stored in column families. When translating SPARQL into execution plans, the strategy first runs the join operations and then takes the union of the join results. Unnecessary joins can then be omitted, and the execution plans can be changed into scans on the column families that contain all the required predicates. In our experiments, queries with three or four joins run 10 or 100 times faster than with the compared solutions.
Acknowledgment: The work is supported by Science and Technology Project of State Grid Corporation of China under Grant SGITG-KJ-JSKF[2015]0012, National Natural Science Foundation of China under Grant 61502504, Fujian’s Education & Scientific Research Program (Scientific) of Young & Middle-age Teachers under Grant JA15365.
References
[1] F. Belleau, M. A. Nolin, N. Tourigny, P. Rigault, J. Morissette, “Bio2RDF: towards a mashup to build bioinformatics knowledge systems,” J. Biomed Inform., vol. 41, Oct. 2008, pp. 706-716, doi: 10.1016/j.jbi.2008.03.004.
[2] D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach, “SW-Store: a vertically partitioned DBMS for Semantic Web data management,” VLDB Journal, vol. 18, Apr. 2009, pp. 385-406, doi: 10.1007/s00778-008-0125-y.
[3] A. Harth, J. Umbrich, A. Hogan, S. Decker, “YARS2: A federated repository for querying graph structured data from the web,” Proc. ISWC/ASWC 2007, Springer-Verlag, 2007, pp. 211-224, doi: 10.1007/978-3-540-76298-0_16.
[4] T. Neumann, G. Weikum, “RDF-3X: a RISC-style engine for RDF,” Proc. VLDB Endowment, 2008, pp. 647-659, doi: 10.14778/1453856.1453927.
[5] C. Weiss, P. Karras, A. Bernstein, “Hexastore: sextuple indexing for semantic web data management,” Proc. VLDB Endowment, 2008, pp. 1008-1019, doi: 10.14778/1453856.1453965.
[6] P. Yuan, P. Liu, B. Wu, H. Jin, W. Zhang, “TripleBit: a fast and compact system for large scale RDF data,” Proc. VLDB Endowment, 2013, pp. 517-528, doi: 10.14778/2536349.2536352.
[7] L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, “gStore: a graph-based SPARQL query engine,” VLDB Journal, vol. 23, Aug. 2014, pp. 565-590, doi: 10.1007/s00778-013-0337-7.
[8] K. Lee, L. Liu, “Scaling queries over big RDF graphs with semantic hash partitioning,” Proc. VLDB Endowment, 2013, pp. 1894-1905, doi: 10.14778/2556549.2556571.
[9] Y. Wang, X. Du, J. Lu, X. Wang, “FlexTable: using a dynamic relation model to store RDF data,” 15th international conference on DASFAA, Springer-Verlag, 2010, pp. 580-594, doi: 10.1007/978-3-642-12026-8_44.