Query Execution for RDF Data on Row and Column Store
Trupti Padiya, Minal Bhise, Sandeep Vasani, and Mohit Pandey
DA-IICT, Gandhinagar
{padiya_trupti, minal_bhise, sandeep_vasani, mohit_pandey}@daiict.ac.in
Abstract. This paper presents an experimental comparison of data storage techniques for managing RDF data. Query performance is evaluated in terms of query execution time and data scalability, using a row store and a column store. FOAF (Friend Of A Friend) data is used to demonstrate these ideas. The paper contributes an experimental and analytical study of partitioning techniques applied to FOAF data, which make queries on average 168 times faster than on a traditional triples table. Materialized views over vertically partitioned data show an additional 8 times improvement in query performance over partitioned data for frequently occurring queries. Vertical partitioning is also executed on a column store, and as the FOAF data size scales, an order of magnitude improvement in performance is observed over row store execution.
Keywords: Data Partitioning, Data Scalability, FOAF, Materialized Views, Query Execution, Semantic Web.
1 Introduction
Due to the tremendous increase in RDF data, RDF applications need to retrieve data efficiently at web scale, which makes performance and scalability increasingly significant. Efficient query processing and management of RDF data are therefore important for achieving highly interactive semantic web applications. SQL query execution over relational data can be simpler and faster than SPARQL, and tools are available to convert RDF data to relational data. RDF data can be stored in a relational model using various data stores. The triple store, a simple and flexible three-column representation, suffers from serious performance issues [1]: as the number of joins increases, query complexity and execution time increase, and growth in data size makes query performance even worse. We measure and analyze query performance for the triples table and for other data storage techniques, namely property tables and horizontally and vertically partitioned tables, using a row store. We also use materialized views to increase query performance over vertically partitioned data. In addition, we repeat the experiment for vertically partitioned tables on a column store to analyze query performance against the row store [3]. Section 2 describes the experimental details and Section 3 presents the analysis and discussion of the experimental results.

R. Natarajan et al. (Eds.): ICDCIT 2015, LNCS 8956, pp. 403–408, 2015. © Springer International Publishing Switzerland 2015
2 Experiment
FOAF is a project devoted to linking people and information using the Web. Among the kinds of network that FOAF integrates are social networks of human collaboration, friendship and association. We use the FOAF dataset [4] from the University of Maryland as a benchmark for our experiment. This section describes the dataset and the implementation of the experiment. The FOAF dataset consists of 406540 triples. It has 550 properties, out of which we found 234 unique properties. We designed a query set of 15 real-life, frequently occurring queries on the social web. These queries consist of multiple subject-object joins. We studied and analyzed query performance based on join types and aggregate operations, as depicted in Fig. 1.
Fig. 1. Join Analysis
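To make the notion of a subject-object join concrete, the following is a minimal sketch of such a query on a triples table. It uses Python's built-in sqlite3 module as a stand-in for the relational stores used in the experiment; the sample triples, the property names `knows`, `name` and `mbox`, and the query itself are hypothetical and are not taken from the FOAF query set.

```python
import sqlite3

# In-memory database standing in for a row store (assumption: the
# experiment itself used PostgreSQL and MonetDB, not SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")

# Hypothetical sample data, not taken from the FOAF dataset.
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("alice", "name", "Alice"),
        ("alice", "knows", "bob"),
        ("bob", "name", "Bob"),
        ("bob", "mbox", "bob@example.org"),
    ],
)

# One subject-object join: the object of a "knows" triple is matched against
# the subject of a "name" triple, so the triples table is joined with itself.
query = """
SELECT k.subject, n.object
FROM triples AS k
JOIN triples AS n ON k.object = n.subject
WHERE k.property = 'knows' AND n.property = 'name'
"""
friend_names = conn.execute(query).fetchall()
print(friend_names)
```

Every additional subject-object join adds another self-join of the full triples table, which is the cost that the partitioned layouts try to avoid.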
We used the Jena parser to convert RDF data into triples and developed a tool to insert the triples into a relational model: PostgreSQL, a row store, and MonetDB [5], a column store. These RDBMS tools were installed on a machine running Fedora, with 1 GB RAM, 0.6 GB swap memory, and a 250 GB hard disk. The experiment uses the Eclipse IDE, Java 1.6 SDK, Jena Parser 2.3, Postgres 8.2, and MonetDB. FOAF data is stored in a triples table, a three-column table consisting of subject, property and object. We use two clustered property tables and one leftover triples table to experiment with two specific queries. We study the application of partitioning techniques on FOAF data and use vertical and horizontal data partitioning. For vertical partitioning, data is partitioned based on the properties of a person. The dataset has 234 uniquely identified properties, and therefore there are 234 tables in total for vertically partitioned FOAF data. For subject-specific queries, FOAF data is partitioned horizontally based on subject names: there are 26 tables for names starting with a letter (a-z) and 10 tables for names starting with a digit (0-9), giving 36 tables in total for horizontally partitioned FOAF data. A tool was developed using Eclipse and Java to partition the FOAF data and load it into the respective data stores.

We create materialized views over vertically partitioned FOAF data and compare their query performance against vertically partitioned FOAF data without materialized views. The effect of a larger dataset on query performance is also studied by scaling the FOAF data: the initial size is gradually increased to 2, 4, 8 and 10 times the actual data size. The query set of 15 queries is fired on all the data stores, and query performance is measured in terms of query execution time. Hot and cold runs are taken, each averaged over three runs. For cold runs we restarted the database and flushed memory before every run. The same experiment for vertically partitioned data is repeated on the column store.
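The partitioning schemes described above can be sketched as follows. This is an illustration under stated assumptions, not the Java tool used in the experiment: it uses Python's sqlite3 module, hypothetical sample triples, and hypothetical table names (`vp_<property>`, `hp_<character>`, `mv_friend_names`). SQLite has no materialized views, so a table built with CREATE TABLE ... AS SELECT stands in for one.

```python
import sqlite3

# Hypothetical sample triples (subject, property, object); not FOAF data.
TRIPLES = [
    ("alice", "name", "Alice"),
    ("alice", "knows", "bob"),
    ("bob", "name", "Bob"),
    ("bob", "knows", "9lives"),
    ("9lives", "name", "Nine Lives"),
]


def vertical_partition(conn, triples):
    """Create one two-column (subject, object) table per distinct property."""
    tables = {}
    for prop in sorted({p for _, p, _ in triples}):
        table = f"vp_{prop}"  # assumes property names are safe SQL identifiers
        conn.execute(f"CREATE TABLE {table} (subject TEXT, object TEXT)")
        tables[prop] = table
    for s, p, o in triples:
        conn.execute(f"INSERT INTO {tables[p]} VALUES (?, ?)", (s, o))
    return tables


def horizontal_bucket(subject):
    """Return the bucket (a-z or 0-9) given by the subject's first character."""
    first = subject[0].lower()
    if not (first.isascii() and first.isalnum()):
        raise ValueError(f"no bucket for subject {subject!r}")
    return first


def horizontal_partition(conn, triples):
    """Create one triples table per bucket in use (at most 26 + 10 = 36)."""
    buckets = set()
    for s, p, o in triples:
        bucket = horizontal_bucket(s)
        if bucket not in buckets:
            conn.execute(
                f"CREATE TABLE hp_{bucket} (subject TEXT, property TEXT, object TEXT)"
            )
            buckets.add(bucket)
        conn.execute(f"INSERT INTO hp_{bucket} VALUES (?, ?, ?)", (s, p, o))
    return sorted(buckets)


def materialize_friend_names(conn):
    """Precompute a subject-object join over two vertically partitioned tables.

    Stand-in for a materialized view: the join result is stored as a table.
    """
    conn.execute(
        "CREATE TABLE mv_friend_names AS "
        "SELECT k.subject AS person, n.object AS friend_name "
        "FROM vp_knows AS k JOIN vp_name AS n ON k.object = n.subject"
    )


conn = sqlite3.connect(":memory:")
vp_tables = vertical_partition(conn, TRIPLES)
hp_buckets = horizontal_partition(conn, TRIPLES)
materialize_friend_names(conn)
rows = conn.execute(
    "SELECT person, friend_name FROM mv_friend_names ORDER BY person"
).fetchall()
print(vp_tables)
print(hp_buckets)
print(rows)
```

In this sketch a query over the stored join result needs no join at run time, while a query on a single property touches only that property's table rather than the whole triples table.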
3 Results and Discussion
We execute the query set of 15 queries for the discussed data storage techniques on a row store. The same set of queries is executed for vertically partitioned data on a column store. We have taken hot and cold runs; however, the comprehensive analysis is based on cold runs, as they help us estimate the scenario in which the database is up for the first time and give the worst-case query execution time. Details of cold runs vs. hot runs are given later in this section.
Query   Triples   PropertyTable   VerticalPartition   HorizontalPartition
1       2677      3654            254                 1463
2       1303      2772            53                  2376
3       2929      1767            216                 1287
4       1271      677             33                  4970
5       783       564             22                  149
6       1358      16473           2972                1434
7       3736      2279            271                 4075
8       893       647             51                  169
9       734       532             11                  289
10      1600      848             35                  2452
11      7952      2185            13                  509
12      3562      1042            52                  3854
13      1533      964             394                 4595
14      14        2034            13                  12
15      7714      2158            617                 4325

Fig. 2. Query execution time (in ms) on various data stores
Fig. 2 plots the query execution time for the various data storage techniques, and Table 1 shows the query performance gain of the partitioning techniques over the triples table for all queries. The data storage techniques occupy different amounts of disk space. FOAF data at the initial size occupied around 65 MB; property tables occupied 60 MB, and the horizontally and vertically partitioned data stores occupied 53 MB and 35 MB respectively.
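The per-query gains can be recomputed from the Fig. 2 timings. The sketch below assumes that the gain is the triples-table time divided by the faster of the two partitioned stores, rounded half up to an integer; this formula is an assumption that is not stated in the text, although it reproduces the published Table 1 values. The function names are illustrative.

```python
import math

# Cold-run execution times in ms for queries 1-15, copied from Fig. 2.
TRIPLES = [2677, 1303, 2929, 1271, 783, 1358, 3736, 893,
           734, 1600, 7952, 3562, 1533, 14, 7714]
VERTICAL = [254, 53, 216, 33, 22, 2972, 271, 51,
            11, 35, 13, 52, 394, 13, 617]
HORIZONTAL = [1463, 2376, 1287, 4970, 149, 1434, 4075, 169,
              289, 2452, 509, 3854, 4595, 12, 4325]


def round_half_up(value):
    """Round to the nearest integer, with .5 rounded up (unlike round())."""
    return math.floor(value + 0.5)


def partition_gains(triples, vertical, horizontal):
    """Gain of the faster partitioned store over the triples table (assumed formula)."""
    return [
        round_half_up(t / min(v, h))
        for t, v, h in zip(triples, vertical, horizontal)
    ]


gains = partition_gains(TRIPLES, VERTICAL, HORIZONTAL)

# Queries (1-based) for which horizontal partitioning beats vertical partitioning.
horizontal_wins = [
    query
    for query, (v, h) in enumerate(zip(VERTICAL, HORIZONTAL), start=1)
    if h < v
]

print(gains)
print(horizontal_wins)
```

Under this assumption, horizontal partitioning is the faster of the two partitioned stores only for queries 6 and 14, and vertical partitioning is faster for the remaining 13 queries.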
Table 1. n times gain using partitioning technique over triples table

Query   n x gain
1       11
2       25
3       14
4       39
5       36
6       1
7       14
8       18
9       67
10      46
11      612
12      69
13      4
14      1
15      13

3.1 Query Analysis for Scaled Data on Row Store
Data scaling is performed progressively by increasing the number of triples to 2, 4, 8 and 10 times the actual data size, so the resulting data stores contain 813080, 1626160, 3252320 and 4065400 triples. The same experiment and analysis were performed for each data size on all the data stores discussed above. We found that the partitioning techniques performed an order of magnitude better than all other data stores on scaled data.
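The scaled data sizes follow directly from the base size of 406540 triples. The sketch below checks that arithmetic and shows one possible way to produce scaled data. The text does not say how the additional triples were generated, so the replication scheme in `scale_triples` (copying every triple and suffixing the subject) is purely an assumption for illustration.

```python
BASE_TRIPLES = 406540  # size of the FOAF dataset reported in Section 2
SCALE_FACTORS = (1, 2, 4, 8, 10)


def scaled_sizes(base=BASE_TRIPLES, factors=SCALE_FACTORS):
    """Map each scale factor to the resulting number of triples."""
    return {factor: base * factor for factor in factors}


def scale_triples(triples, factor):
    """Return `factor` copies of the data (hypothetical scheme).

    Copies after the first get a suffix on the subject so that the
    replicated rows stay distinct from the originals.
    """
    scaled = []
    for copy in range(factor):
        suffix = "" if copy == 0 else f"_copy{copy}"
        scaled.extend((s + suffix, p, o) for s, p, o in triples)
    return scaled


sizes = scaled_sizes()
sample = [("alice", "knows", "bob"), ("bob", "name", "Bob")]
scaled = scale_triples(sample, 4)
print(sizes)
print(len(scaled))
```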
Fig. 3. Performance comparison for one subject-object join for Query 1
Fig. 4. Performance comparison for two subject-object joins for Query 12
Fig. 5. Performance comparison for three subject-object joins for Query 14
(Figs. 3-5 plot execution time in ms against data scale 1-5 for the triples table and the vertically partitioned (VP) store.)
Real-life queries on the semantic web generally have a larger number of subject-object (Type 2) joins. In our query set, queries 1, 2, 3, 4 and 11 contain one subject-object join, queries 7, 12 and 15 contain two subject-object joins, and query 14 contains three subject-object joins. We analyze query performance against scaled data based on the number of joins. Queries with one subject-object join, when fired on the vertically partitioned data store, show a 43 times average improvement over the triples table. Queries with two subject-object joins show a 36 times average improvement, and the query with three subject-object joins shows nearly equal performance on the triples table and on vertically partitioned data. Figs. 3, 4 and 5 depict query performance for the vertically partitioned table against the triples table for queries 1, 12 and 14, which have one, two and three subject-object joins respectively. The numbers 1, 2, 3, 4 and 5 on the x-axis indicate the actual data size and 2, 4, 8 and 10 times the actual data size respectively.

Since vertically partitioned data outperforms the other data storage techniques, we implemented materialized views [6] over partitioned data to gain even better performance. Materialized views were created for every query in the query set. Based on the kind of joins, we studied queries and their performance with various data sizes on various data storage techniques. Query 3 has one subject-object join, whereas query 6 has no joins. For query 3, vertical partitioning gives the best-case performance compared to all other storage techniques, and performance improved further after using materialized views over the vertically partitioned data. Horizontal partitioning gives the best-case performance for query 6. Since query 6 has no subject-object joins, materialized views did not improve its query execution time. Vertical and horizontal partitioning scale linearly, which shows that even as data grows, partitioning leads to better performance than current storage techniques.

3.2 Query Analysis for Scaled Data on Column Store
The vertical partitioning experiment performed on the row store, PostgreSQL, is repeated on a column store, MonetDB [3]. The query set of 15 queries is fired on the column store, and hot and cold runs are observed, each averaged over three runs. Data scaling is performed in the same way as in the row store experiment. Table 2 shows the performance comparison (for cold runs) between row store and column store results for vertically partitioned data with 4065400 triples. The column store is found to give an order of magnitude better performance for queries 1, 2, 3, 5, 6, 7, 10, 12, 13 and 15 (the positive gains in Table 2). For queries 4, 8, 9, 11 and 14 (the negative gains in Table 2) it gives nearly equivalent performance compared to the row store.

Table 2. Performance comparison between row store and column store for vertically partitioned data

Query   N x Gain
1       1.72
2       1.17
3       3.64
4       -0.97
5       2.18
6       18.12
7       6.58
8       -0.55
9       -0.28
10      1.41
11      -0.51
12      1.58
13      3.46
14      -0.01
15      57.45

3.3 Analysis – Cold Runs vs. Hot Runs
The query set of all fifteen queries was fired on all four data stores, and both hot and cold runs were taken for each of them. All results are averaged over three runs, for both hot and cold runs. Cold runs result in 1%, 4%, 17% and 32% fluctuation in query execution time for the triples table, property table, vertical partitioning and horizontal partitioning respectively, whereas for hot runs the fluctuations were 47%, 46%, 43% and 92%. To check the repeatability of these results, we performed the same experiment again and found that the cold runs remained around the same percentage of fluctuation, while the hot runs gave fluctuations of 47%, 58%, 65% and 91%. The data show that execution time for hot runs was not stable. In a real-life environment, the execution time of hot runs depends on the history of data accesses, and understanding this phenomenon requires understanding the physical configuration and design of such systems. We therefore cannot rely entirely on execution times from hot runs, as they fluctuate more than those from cold runs, which was clearly visible in the observations. Cold runs, on the other hand, represent the worst-case scenario and show remarkable repeatability.
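A measurement loop of this kind can be sketched as below. This is an illustration only: the text does not define how the fluctuation percentage was computed, so the spread metric used here, (max - min) / mean, is an assumption, and the `before_each` hook is a hypothetical placeholder for whatever restarts the database and flushes memory before a cold run.

```python
import sqlite3
import statistics
import time


def timed_runs(run_query, runs=3, before_each=None):
    """Execute run_query `runs` times and return the elapsed times in ms."""
    timings = []
    for _ in range(runs):
        if before_each is not None:
            before_each()  # e.g. restart the database and flush caches (cold run)
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings):
    """Return (mean in ms, spread in %), where spread = (max - min) / mean.

    The spread formula is an assumed stand-in for the fluctuation metric.
    """
    mean = statistics.mean(timings)
    spread = (max(timings) - min(timings)) / mean * 100.0 if mean else 0.0
    return mean, spread


# Hot runs against a small in-memory table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])
hot = timed_runs(lambda: conn.execute("SELECT SUM(x) FROM t").fetchall())

# Summary of three fixed example timings, to show the arithmetic.
mean_ms, spread_pct = summarize([10.0, 12.0, 14.0])
print(hot)
print(mean_ms, spread_pct)
```

For hot runs the hook is simply omitted, so later runs benefit from whatever the earlier runs left in memory.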
4 Conclusion
We demonstrated that the triples table performs and scales poorly compared to partitioned data because of its larger number of self-joins. Property tables are inefficient because of their complexity. In our query set, vertical partitioning gives the best-case execution time for 13 of the 15 queries, whereas horizontal partitioning gives the best-case execution time for the other 2 queries. Queries that are user oriented and involve no joins execute faster on horizontally partitioned data, and the rest of the queries execute faster on vertically partitioned data. Queries on partitioned data executed on average 168 times faster than on the triples table. Queries that used to take thousands of milliseconds now take tens of milliseconds on partitioned data, which can help make semantic web applications interactive. Real-life queries consist of subject-object joins, and we have shown that, depending on the number of joins, vertically partitioned scaled data gives a 43 times and a 36 times performance improvement over the scaled triples table for queries with one and two subject-object joins respectively. Queries with subject-object joins executed on average 8 times faster after using materialized views on vertically partitioned data. Frequent queries with more joins can be executed in even less time by creating materialized views on partitioned data. Cold runs showed remarkable repeatability, whereas hot runs showed considerable fluctuation.
References
1. Abadi, D.J., Marcus, A., Madden, S.R., Hollenbach, K.: SW-Store: a vertically partitioned DBMS for Semantic Web data management. The VLDB Journal 18(2), 385–406 (2009)
2. Abadi, D.J., Madden, S.R., Hachem, N.: Column-stores vs. row-stores: how different are they really? In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, Vancouver, Canada, June 09-12 (2008)
3. Sidirourgos, L., Goncalves, R., Kersten, M., Nes, N., Manegold, S.: Column-store support for RDF data management: not all swans are white. Proceedings of the VLDB Endowment 1(2), 1553–1563 (2008)
4. FOAF Dataset (February 23, 2013), http://ebiquity.umbc.edu/blogger/2005/01/25/foaf-dataset-available/
5. MonetDB (March 1, 2014), http://www.monetdb.org/Home
6. Vasani, S., Pandey, M., Bhise, M., Padiya, T.: Faster Query Execution for Partitioned RDF Data. In: Hota, C., Srimani, P.K. (eds.) ICDCIT 2013. LNCS, vol. 7753, pp. 547–560. Springer, Heidelberg (2013)