2015 IEEE Region 10 Symposium
Data Management for Internet of Things
Trupti Padiya, Minal Bhise, Prashant Rajkotiya
DA-IICT Gandhinagar, India
supports data interoperability and provides semantics to data. Diverse data from different devices can be integrated easily for inference and decision making, which helps trigger an appropriate action by devices on the network for the required business model. The most popular serialization of RDF is RDF/XML, which is represented in XML (eXtensible Markup Language) format; hence it is easy to convert SenseML to RDF using XSLT (eXtensible Stylesheet Language Transformations). RDF Schema is used for defining classes, properties, collections, hierarchies, documentation, reification, and basic implication for reasoning. To make a network of devices interactive in real time, we need a way to process RDF data faster.
Abstract—Internet of Things is projected to connect uniquely identifiable devices over the network to build an interactive system with high velocity and volume of data, which poses a challenge of interoperability between such devices. RDF provides a common standard for communication among the devices of a network and supports powerful data inference. The paper addresses the challenge of handling huge sensor data interactively using RDF. The experiment covers various RDF storage mechanisms such as triple store, property table, vertically and horizontally partitioned tables, column store, and data aware hybrid storage. It also compares the vertical partitioning approach with the data aware hybrid storage approach for faster data retrieval in IOT systems. The experiment shows a 12% performance improvement using the hybrid approach over the vertical partitioning approach. The paper also presents a set of metrics designed to help decide beforehand which RDF data storage technique is appropriate for an IOT system.
The digital universe has doubled every two years; with IOT, however, it may increase ten-fold between 2013 and 2020, from 4.4 trillion to 44 trillion gigabytes [5]. An emerging principle of big data science says that the more data history one has, the better. Since we do not know what questions we are going to ask of our data in the future, it is best to retain history perpetually, which gives visibility over larger data and the flexibility to answer any question [4]. IOT will transform the data center completely, and that will require different storage management techniques [6].
Keywords—Internet of Things, RDF, Vertical Partitioning, Data Aware Hybrid Storage
I. INTRODUCTION
Internet of Things (IOT) is an evolution in which objects interact competently with other objects on the network. These are uniquely identifiable objects that offer connectivity of devices, systems and services, which together behave as a smart grid. IOT is expected to generate a large amount of data from diverse locations, aggregated at very high velocity. This demands better methods for indexing, storing and processing such data, which in turn requires a technique that converts this data into a knowledge base [2]. Most of the devices comprising IOT services need to operate using standardized protocols, which is a growing challenge, especially with the emergence of big data and the growing need to tie the data together. There is no shortage of standards; however, these standards remain disconnected. The strength of RDF-based technology is in connectivity. Smaller, modular vocabularies from different sources can be combined, linked, and built upon to create information models for different domains [3]. Open source tools such as RDFLib are available, which provide a robust linked endpoint.
In this paper, we present RDF data storage techniques that enable an interactive device network that can process RDF data faster. These techniques are applicable to e-governance, e-health, e-commerce, and various mobile applications contributing to the IOT community. The suggested RDF data storage techniques [8], [9] are demonstrated on the DBLP [10] dataset, a computer science bibliography dataset. The paper also lists a set of metrics designed for evaluating a suitable storage mechanism for IOT systems.
Data for IOT will grow at a huge pace, and it is necessary to store and retrieve data efficiently for interactive IOT systems. There are various mechanisms for storing RDF data effectively for faster retrieval. RDF data consists of triples of the form <subject, property, object>. The simplest way to store RDF in a table is a triple store: a three-column table with the fields subject, property and object. However, it suffers from a lot of performance issues,
RDF (Resource Description Framework) models are directly usable, query-able and can include reference data. RDF
978-1-4799-1782-2/15 $31.00 © 2015 IEEE DOI 10.1109/TENSYMP.2015.26
II. RDF STORAGE MECHANISMS
The clustering phase identifies the sets of properties in the RDF dataset that tend to be defined together and hence become candidates to be stored together. The partitioning phase takes the clusters from the clustering phase and balances the trade-off between storing as many properties together as possible and keeping null storage to a minimum. The partitioning phase also tries to remove overlapping properties from the clusters. Properties in a cluster are stored as a property table, and properties that are not in any cluster are stored as vertically partitioned tables. The RDF storage approach [12] can be used for several IOT applications, such as e-health and e-commerce systems, where data is frequently accessed and faster query execution is required. The paper shows an experimental performance comparison of RDF data aware hybrid storage against existing RDF storage techniques. A new set of metrics is designed for evaluating the appropriate storage technique for IOT systems.
because almost all real-life queries involve many self-joins, which result in poor query execution time [9]. As the number of joins increases, query complexity increases, which results in serious performance issues in terms of execution time, and an increase in data size makes query performance even worse.
A. Property Tables
Another approach to storing RDF data is the property table, which has one column for the subject and other columns for the properties associated with it. However, it suffers from a few performance issues, such as occupying extra space for null values. It is also difficult to store multi-valued properties using property tables. Two main types of property tables are used: clustered property tables and property-class tables [7]. A clustered property table contains clusters of properties that tend to be defined together. A property-class table exploits the type property of subjects to cluster similar sets of subjects together in the same table. Property tables can achieve faster data retrieval when the same set of properties is defined for all the subjects, in which case a subject can be retrieved directly without joins, or with fewer joins in the case of very complex queries.
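The contrast between a triple store and a property table can be illustrated with a minimal sketch using Python's standard `sqlite3` module. The toy triples, the table names `triples` and `paper`, and the chosen columns are assumptions made for illustration only; they are not the schema or data used in the experiment.

```python
import sqlite3

# Hypothetical toy triples; not taken from DBLP.
TRIPLES = [
    ("paper1", "title", "RDF Storage"),
    ("paper1", "year", "2015"),
    ("paper1", "creator", "Alice"),
    ("paper2", "title", "IoT Data"),
    ("paper2", "year", "2014"),
]


def build_db(triples):
    conn = sqlite3.connect(":memory:")
    # Triple store: a single three-column table for every statement.
    conn.execute("CREATE TABLE triples (subject TEXT, property TEXT, object TEXT)")
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    # Property table: one row per subject, one column per property.
    conn.execute(
        "CREATE TABLE paper (subject TEXT PRIMARY KEY, title TEXT, year TEXT, creator TEXT)"
    )
    rows = {}
    for s, p, o in triples:
        rows.setdefault(s, {})[p] = o
    for s, props in rows.items():
        # Properties a subject lacks are stored as NULL.
        conn.execute(
            "INSERT INTO paper VALUES (?, ?, ?, ?)",
            (s, props.get("title"), props.get("year"), props.get("creator")),
        )
    return conn


# Fetching two properties from the triple store needs a self-join.
TRIPLE_STORE_QUERY = """
    SELECT t.object, y.object
    FROM triples AS t JOIN triples AS y ON t.subject = y.subject
    WHERE t.property = 'title' AND y.property = 'year'
    ORDER BY t.subject
"""
# The same question on the property table needs no join.
PROPERTY_TABLE_QUERY = "SELECT title, year FROM paper ORDER BY subject"

conn = build_db(TRIPLES)
via_triples = conn.execute(TRIPLE_STORE_QUERY).fetchall()
via_property_table = conn.execute(PROPERTY_TABLE_QUERY).fetchall()
null_cells = conn.execute(
    "SELECT COUNT(*) FROM paper WHERE creator IS NULL"
).fetchone()[0]
```

Both queries return the same rows, but the triple store needs one self-join per additional property, while the property table pays with a NULL cell for every property a subject does not define (here, the missing creator of `paper2`).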
III. IMPLEMENTATION
This section describes the DBLP dataset and the query set used, the experimental setup, and the implementation of the various data storage approaches.
B. Vertically Partitioned Storage
The vertically partitioned storage approach is an efficient way of storing RDF data. It uses n two-column tables, where n is the number of unique properties in the RDF data [9]. The vertically partitioned approach has shown better performance than triple store and property tables in most cases for the datasets used in some of the experiments [7], [9]. Vertically partitioned storage has shown around a 108-fold performance improvement over triple store. Materialized views created over vertically partitioned data gained around an 8-fold performance improvement over the vertically partitioned data alone [7]. Materialized views store partial query results beforehand and help gain better performance. Horizontal partitioning is carried out subject-wise and has shown performance improvement when subject-specific queries are fired. Vertically partitioned storage has also been implemented on a column store, which has shown better performance for most of the queries compared to the vertically partitioned approach on a row store [7].
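As a rough illustration of vertical partitioning, the sketch below splits a handful of triples into one two-column (subject, object) table per unique property, again using `sqlite3` from the standard library. The sample triples, the generated table names (`vp_0`, `vp_1`, ...) and the helper names are hypothetical and chosen only for this example.

```python
import sqlite3

# Hypothetical toy triples; "creator" is multi-valued for paper1.
SAMPLE_TRIPLES = [
    ("paper1", "title", "RDF Storage"),
    ("paper1", "year", "2015"),
    ("paper1", "creator", "Alice"),
    ("paper1", "creator", "Bob"),
    ("paper2", "title", "IoT Data"),
    ("paper2", "year", "2014"),
]


def vertically_partition(conn, triples):
    """Create one (subject, object) table per unique property.

    Returns a mapping from property name to the generated table name.
    """
    tables = {}
    for s, p, o in triples:
        if p not in tables:
            name = f"vp_{len(tables)}"  # generated name, safe to interpolate
            conn.execute(f"CREATE TABLE {name} (subject TEXT, object TEXT)")
            tables[p] = name
        conn.execute(f"INSERT INTO {tables[p]} VALUES (?, ?)", (s, o))
    return tables


def titles_with_year(conn, tables):
    """Join only the two property tables the query actually needs."""
    sql = (
        f"SELECT t.object, y.object FROM {tables['title']} AS t "
        f"JOIN {tables['year']} AS y ON t.subject = y.subject "
        "ORDER BY t.subject"
    )
    return conn.execute(sql).fetchall()


def creators_of(conn, tables, subject):
    """A multi-valued property is simply several rows in its table."""
    sql = f"SELECT object FROM {tables['creator']} WHERE subject = ? ORDER BY object"
    return [row[0] for row in conn.execute(sql, (subject,))]


vp_conn = sqlite3.connect(":memory:")
vp_tables = vertically_partition(vp_conn, SAMPLE_TRIPLES)
```

With this layout no NULL cells are stored and multi-valued properties need no special handling; the cost is one join for every additional property a query retrieves.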
A. Dataset and Queryset
The DBLP dataset has 107 million triples and 27 unique properties. It has 103 million unique triples. We implemented four categories of queries for the DBLP dataset: queries for encoded properties or all subjects as Type 1, queries for encoded subjects or all properties as Type 2, subject-specific queries as Type 3, and administrative queries as Type 4. Four different frequent queries were chosen for each category. The average query execution time for each category is discussed in the Results and Discussion section. The experimental test-bed has a 1 TB hard disk, 4 GB of RAM and 0.5 GB of cache memory. The DBLP dataset occupies 16.8 GB on disk.
B. Vertically Partitioned Storage and Property Table
The vertically partitioned approach is implemented for the DBLP dataset as a set of 27 vertically partitioned tables, as there are 27 unique properties in the dataset. It takes processing time of O(n*n), where n is the number of triples in the dataset. Every vertically partitioned table consists of a list of subjects and the objects associated with them through the property of that table. All types of queries are fired on the database of 27 tables and the query execution time is recorded. Both hot and cold runs are observed, each averaged over three runs. However, we use cold runs for the experiment as they give the worst-case scenario. Materialized views over partitioned data are also implemented by storing partial query results beforehand to gain query execution performance over the vertically partitioned tables.
C. Data Aware Hybrid Storage
Many researchers suggest query aware storage for frequently accessed data, provided the queries are known in advance. For IOT systems it is hard to know all the frequently occurring queries initially, as they depend on the IOT application; hence we focus on the data aware storage approach in this paper. The data aware hybrid storage approach combines the vertically partitioned storage mechanism and the property table approach, giving a data-centric view of storing RDF data [8]. It combines the best of both approaches while eliminating their respective performance issues. It follows a two-phase approach: clustering and partitioning.
Property table implementation is carried out using clustered property tables for the queries to be fired over the dataset. We have 16 clusters, each specific to a query in the query set. Each property table consists of the properties required by its respective query.
we are presenting data generated at a certain point of time in order to check query performance against various data storage techniques for faster data retrieval among devices on the network. The paper compares the results of the vertical partitioning technique against the data aware hybrid storage approach. We selected the four categories of queries discussed in the previous section. For each category we wrote four frequently occurring queries. The query execution time of vertical partitioning is compared against the total query execution time for data aware hybrid storage, which is the sum of the cluster lookup time and the time for storage. Fig 1 depicts the comparison of vertical partitioning (VP) and data aware hybrid storage (DAHS). We observed an average improvement of 12% in query execution time for data aware hybrid storage over vertically partitioned storage. We found that the vertical partitioning technique performed nearly equivalently for Type 1 queries, which fetch a smaller number of properties.
C. Data Aware Hybrid Storage
The data aware hybrid storage technique combines the vertical partitioning technique and the property table approach. Based on the strength of association between properties and the percentage of null storage in the RDF schema, property tables and a set of vertically partitioned tables are built. The RDF data is scanned once and two data structures are prepared: the property-use listing and the subject-property bin. The property-use listing is a list structure that stores the name of each property and the number of times it is used in the RDF dataset, which gives the frequency of its occurrence in the dataset. The subject-property bin stores each subject together with its associated properties, which gives the number of properties associated with a particular subject. In our experiment, the technique yields 6 cluster tables and 11 vertically partitioned tables. It is implemented using the property-use listing and the subject-property bin. The property-use listing is created by finding the number of subjects associated with each property, i.e., the frequency of the property in the RDF dataset. It took processing time of O(n), where n is the number of triples in the RDF dataset. The subject-property bin is created by putting the associated properties together for each subject, with processing time of O(s*p), where s is the number of unique subjects and p is the average number of properties defined for a subject in the RDF dataset. The occurrence of similar property lists is calculated, which gives the support threshold and helps reduce the number of joins if such properties are kept together. The support threshold is used in the clustering phase. Null storage is kept below the provided null threshold in the partitioning phase. Subject-property bins are sorted in descending order of the number of properties associated with a subject, so that bins with the highest number of associated properties come first. All bins are then scanned one by one and checked against the support threshold.
A bin that satisfies the support threshold is then checked against the null threshold. Bins that satisfy the null threshold criterion are considered final clusters; otherwise they are sent to the partitioning phase. In the partitioning phase, the property-use listing is pruned against the bins: if a property exists in any of the bins it is dropped from the listing, otherwise it is stored in a vertically partitioned table. All clusters are materialized by creating one table for each bin, and the tables are loaded with RDF triples.
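The two-phase idea described above can be sketched in a few lines of Python. This is a simplified interpretation under stated assumptions, not the implementation used in the experiment: support is taken as the fraction of subjects that define every property of a candidate cluster, null storage is taken as the fraction of empty cells if the cluster were stored as one property table, and the function names, the toy data and the default thresholds are all hypothetical.

```python
from collections import Counter, defaultdict


def scan(triples):
    """Single pass building the property-use listing and subject-property bins."""
    property_use = Counter()
    bins = defaultdict(set)
    for s, p, _ in triples:
        if p not in bins[s]:
            property_use[p] += 1  # number of subjects that use property p
        bins[s].add(p)
    return property_use, bins


def null_fraction(cluster, bins):
    """Share of empty cells if `cluster` were stored as one property table."""
    rows = [props for props in bins.values() if props & cluster]
    if not rows:
        return 1.0
    filled = sum(len(props & cluster) for props in rows)
    return 1.0 - filled / (len(rows) * len(cluster))


def hybrid_layout(triples, support_threshold=0.5, null_threshold=0.3):
    """Return (clusters, vertically_partitioned_properties)."""
    property_use, bins = scan(triples)
    n_subjects = len(bins)
    # Candidate clusters: distinct property sets, largest first.
    candidates = sorted(
        {frozenset(props) for props in bins.values()},
        key=lambda c: (-len(c), sorted(c)),
    )
    clusters, clustered = [], set()
    for cand in candidates:
        cand = cand - clustered  # remove properties already placed (no overlap)
        if len(cand) < 2:
            continue
        support = sum(1 for props in bins.values() if cand <= props) / n_subjects
        if support >= support_threshold and null_fraction(cand, bins) <= null_threshold:
            clusters.append(cand)
            clustered |= cand
    # Every property left outside a cluster gets its own two-column table.
    vertical = sorted(p for p in property_use if p not in clustered)
    return clusters, vertical


# Hypothetical toy data: four subjects with overlapping property sets.
TOY = [
    ("s1", "title", "a"), ("s1", "year", "1"), ("s1", "creator", "x"),
    ("s2", "title", "b"), ("s2", "year", "2"), ("s2", "creator", "y"),
    ("s3", "title", "c"), ("s3", "year", "3"),
    ("s4", "title", "d"), ("s4", "year", "4"), ("s4", "editor", "z"),
]
relaxed = hybrid_layout(TOY, support_threshold=0.5, null_threshold=0.3)
strict = hybrid_layout(TOY, support_threshold=0.5, null_threshold=0.1)
```

On this toy data the relaxed null threshold keeps title, year and creator in one property table and leaves editor vertically partitioned, while the stricter threshold clusters only title and year. This mirrors the trade-off between storing many properties together and keeping null storage low.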
Fig 1. Query performance comparison for VP and DAHS (execution time in ms)
Query type    VP         DAHS
Type 1        44.30      40.05
Type 2        80.00      67.50
Type 3        1159.43    949.00
Type 4        86.00      66.00
There are two queries in category Type 1 of the query set that satisfy the OPR metric, which means these two queries fit the vertical partitioning approach. The table retrieval factor, averaged over all the queries, is 1.33, derived from (3). SoD is 67970, which is 0.68% of total subjects, derived from (1). We set the multi-valued property threshold to 0.69 using (2). We found 11 queries out of 16 in our query set that are suitable for the data aware hybrid storage approach by considering the breakeven point analysis given in (4). The paper presents a set of metrics to evaluate the storage approach for a dataset contributing to IOT applications. The metric SoD (Structuredness of Dataset) identifies the structuredness of a dataset. Structuredness is categorized as well-structured or semi-structured. Well-structured data are denser and have a higher number of properties defined per subject, whereas semi-structured data are less dense and have a lower number of properties defined per subject. It can be found using the following equation:
IV. RESULTS AND DISCUSSION
The DBLP dataset is stored using the various listed mechanisms: triple store, property tables, vertically and horizontally partitioned tables, materialized views on the vertically partitioned data, and the data aware hybrid storage technique. We fired various queries on all the data storage techniques to find a suitable kind of RDF data storage for IOT systems. Data is generated continuously in IOT systems; however, here
SoD = nT / nUP    (1)
which has given a 12% improvement over the vertical partitioning approach in the experiment. The set of metrics presented in the paper provides sufficient information to decide on an appropriate storage mechanism depending on the dataset and queries. Main memory databases [11] can be used to achieve the goal of highly interactive IOT systems. A combination of disk-based and memory-based storage mechanisms can also be used in IOT systems where some portion of the data is frequently used and some portion is stored for history or seldom used. In such cases, frequently queried data can be stored in a main memory database system, and the data that is not frequently queried can be stored on disk-resident systems. We look forward to using main memory systems for IOT, which will help achieve the aim of a highly interactive and interoperable network of devices. It is necessary to develop algorithms that work with main memory databases to achieve the vision of extremely interoperable devices.
where nT is the number of triples in the dataset and nUP is the number of unique properties in the dataset. A higher structuredness ratio indicates a well-structured dataset, and a lower structuredness ratio indicates a semi-structured dataset. The metric MPT is the multi-valued property threshold, which gives the percentage of multi-valued properties in the dataset.
MPT = nMP / nP    (2)
where nMP is the number of multi-valued properties and nP is the total number of properties in the dataset. A proper multi-valued property threshold needs to be defined in order to implement the data aware hybrid storage technique. The Table Retrieval Factor (TRF) is a metric that helps decide whether to store the data for a particular dataset using vertical partitioning or the data aware hybrid storage approach. TRF is given as
TRF = nTvp / nThdas    (3)
REFERENCES
where nTvp is the number of tables retrieved in VP and nThdas is the number of tables retrieved in data aware hybrid storage. TRF can be calculated for every type of query to be executed in order to obtain the best possible execution time.
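The three ratios in (1), (2) and (3) are straightforward to compute once the triples are available. The sketch below is illustrative only: it assumes that nP in (2) counts unique properties and that a property is multi-valued if at least one subject has more than one object for it. The function names, the toy triples and the table counts passed to `trf` are hypothetical.

```python
from collections import defaultdict


def sod(triples):
    """Equation (1): number of triples divided by number of unique properties."""
    return len(triples) / len({p for _, p, _ in triples})


def mpt(triples):
    """Equation (2): multi-valued properties divided by all (unique) properties."""
    objects = defaultdict(set)
    for s, p, o in triples:
        objects[(s, p)].add(o)
    all_props = {p for _, p, _ in triples}
    # A property counts as multi-valued if any subject has several objects for it.
    multi = {p for (_, p), objs in objects.items() if len(objs) > 1}
    return len(multi) / len(all_props)


def trf(tables_vp, tables_hybrid):
    """Equation (3): tables retrieved in VP over tables retrieved in the hybrid layout."""
    return tables_vp / tables_hybrid


# Hypothetical toy triples: "creator" is multi-valued for paper1.
METRIC_TRIPLES = [
    ("paper1", "title", "RDF Storage"),
    ("paper1", "year", "2015"),
    ("paper1", "creator", "Alice"),
    ("paper1", "creator", "Bob"),
    ("paper2", "title", "IoT Data"),
    ("paper2", "year", "2014"),
]

toy_sod = sod(METRIC_TRIPLES)   # 6 triples over 3 unique properties
toy_mpt = mpt(METRIC_TRIPLES)   # 1 multi-valued property out of 3
toy_trf = trf(4, 2)             # hypothetical query: 4 VP tables vs 2 hybrid tables
```

By the definition in (3), a TRF above 1 for a query means the hybrid layout touches fewer tables than vertical partitioning for that query, and a TRF of 1 means both layouts touch the same number of tables.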
[1] B. Violino, "The 'Internet of things' will mean really, really big data", [online] July 2013, http://www.infoworld.com/article/2611319/computerhardware/the--internet-of-things--will-mean-really--really-big-data.html [Accessed: December 2, 2014]
[2] J. A. Stankovic, "Research directions for the Internet of Things", IEEE Internet Things J., vol. 1, no. 1, pp. 3-9, Feb. 2014
[3] I. Polikoff, "RDF is critical to a Successful Internet of Things", [online document] May 2014, http://semanticweb.com/rdf-critical-successfulinternet-things_b42994 [Accessed: December 2, 2014]
[4] M. Matchett, "Internet of Things will Boost Data Storage", [online document] June 2014, http://searchstorage.techtarget.com/opinion/Internet-of-Things-data-willboost-storage [Accessed: December 1, 2014]
[5] S. Kar, "Internet of Things will Multiply the Digital Universe Data to 44 Trillion GBs by 2020", [online document] April 2014, http://cloudtimes.org/2014/04/17/internet-of-things-will-multiply-thedigital-universe-data-to-44-trillion-gbs-by-2020/ [Accessed: December 1, 2014]
[6] J. Rivera, R. Meulen, "Gartner says the Internet of Things will transform the Data Center", March 2014, http://www.gartner.com/newsroom/id/2684915 [Accessed: December 3, 2014]
[7] T. Padiya, M. Bhise, S. Vasani, M. Pandey, "Query Execution for RDF Data on Row and Column Store", Natarajan et al. (Eds.): ICDCIT 2015, LNCS 8956, pp. 403-408. Springer International Publishing Switzerland (2015) [To be published in Feb 2015]
[8] J. J. Levandoski, M. F. Mokbel, "RDF Data-Centric Storage", Proceedings of the 2009 IEEE International Conference on Web Services, pp. 911-918, July 06-10, 2009
[9] D. J. Abadi, A. Marcus, S. R. Madden, K. Hollenbach, "Scalable semantic web data management using vertical partitioning", Proceedings of the 33rd International Conference on Very Large Data Bases, September 23-27, 2007, Vienna, Austria
[10] "DBLP dataset", [online document] datahub.io/package/l3s-dblp [Accessed: 3 September 2014]
[11] H. Garcia-Molina, K. Salem, "Main Memory Database Systems: An Overview", IEEE Transactions on Knowledge and Data Engineering, vol. 4, no. 6, pp. 509-516, December 1992
[12] C. C. Aggarwal, "The Internet of Things: A survey from the data-centric perspective", in Managing and Mining Sensor Data. New York, USA: Springer, 2013, pp. 383-428.
OPR is the One Property Retrieval metric, which says that if only one property is retrieved in a query then it is always better to opt for vertical partitioning. We have devised a generalized breakeven point N, presented in (4), beyond which queries benefit from the data aware hybrid storage approach compared to vertical partitioning.
n * vpq > n * hdasq + L    (4)
where n is the number of queries, vpq is the querying time for vertical partitioning, hdasq is the querying time for the data aware hybrid storage approach, and L is the time for cluster lookup.
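Rearranging (4) gives n > L / (vpq - hdasq) whenever vpq exceeds hdasq, so the breakeven point is the smallest whole number of queries above that ratio. The sketch below works this out; the function names and the timing values in the example are hypothetical and are not measurements from the experiment.

```python
import math


def hybrid_pays_off(n, vpq, hdasq, lookup):
    """Inequality (4): True if n queries are faster on the hybrid layout."""
    return n * vpq > n * hdasq + lookup


def breakeven_queries(vpq, hdasq, lookup):
    """Smallest whole n satisfying (4), or None if the hybrid layout never wins."""
    gain_per_query = vpq - hdasq
    if gain_per_query <= 0:
        return None  # hybrid queries are not faster, so lookup cost is never recovered
    return math.floor(lookup / gain_per_query) + 1


# Hypothetical timings in ms: 80 per VP query, 60 per hybrid query, 100 for lookup.
n_star = breakeven_queries(vpq=80, hdasq=60, lookup=100)
```

With these assumed numbers each hybrid query saves 20 ms, so the 100 ms lookup cost is recovered only once more than five queries have been run.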
V. CONCLUSION AND FUTURE WORK
The paper shows that RDF can be applied to IOT systems by using various storage mechanisms. Data can be converted to XML-based RDF form. On top of that, RDF provides data interoperability and deals with semantics, which serves as a basic entailment for reasoning. RDF is a standard that can fill the interoperability gap between devices, as it supports powerful reasoning and decision making. We have seen that partitioning RDF data gives faster data retrieval, leading towards highly interactive IOT systems. Vertical partitioning has shown improvement over property tables. We have shown a hybrid combination of the vertically partitioned approach and property tables, taking the best of both approaches,