2015 12th Web Information System and Application Conference
A Unified Storage and Query Optimization Framework for Sensor Data

Jun Fang
Cloud Computing Research Center, North China University of Technology; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100041, P.R. China
E-mail: [email protected]

Ting Lu
Cloud Computing Research Center, North China University of Technology; Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100041, P.R. China
E-mail: [email protected]

Cong Liu
College of Information Science and Engineering, Shandong University of Science and Technology, Qingdao 266590, P.R. China
E-mail: [email protected]
Abstract—With the development of the Internet of Things (IoT), large amounts of sensor data, characterized by multiple sources, high speed, and large volume, are generated. Traditional data storage and query approaches cannot handle such data properly. To deal with these limitations, a unified storage and query optimization framework, named DeCloud-RealBase, is proposed for the management of large volumes of sensor data. An adaptation layer connects the bottom-layer database clusters, so that basic database operations can be performed by top-layer applications through a unified interface. Furthermore, caching technology, a data partition strategy, and multi-threading are applied to realize high-performance data storage and query. Experiments demonstrate that DeCloud-RealBase is capable of handling the storage and query of sensor data with excellent performance.

Keywords—sensor data, data management, database cluster, adaptation layer

I. INTRODUCTION

Sensor data refers to information acquired by electronic devices such as cameras, sensors, or similar location-sensing techniques. Consider, for example, a Road Vehicles Real-time Monitoring System, which collects and stores the vehicle information of a city through various sensor devices (such as cameras, coils, GPS, etc.) in a real-time manner. With the rapid accumulation of sensor data, users may become unsatisfied with poor performance, as queries can no longer be answered within an acceptable response time. In this setting, it is extremely important to improve the user experience in storage performance and retrieval efficiency by selecting database storage methods that meet the actual demands [1]. Traditional information systems tend to choose a relational database as the persistent storage medium, as it is relatively mature and easy to use and maintain. Unfortunately, a relational database has limitations in concurrent reads and writes, scalability, high availability, etc., especially for big data management. Compared with traditional relational databases, NoSQL databases are characterized by high storage speed, high scalability, and flexible data structures; HBase is one of their representative implementations. However, NoSQL databases such as HBase also reveal some problems in real applications, including (1) no direct support for SQL query operations; and (2) query performance that largely depends on the quality of the row key design, with relatively low performance for non-row-key queries [2]-[5].

On the other hand, considering construction cost, data scale, application complexity, etc., real-world sensor data management systems are required to deal with different storage databases.

To resolve the preceding limitations, a unified storage and query optimization framework, named DeCloud-RealBase, is proposed for the management of large volumes of sensor data. By building an adaptation layer that connects the bottom-layer database clusters, basic database operations can be performed by top-layer applications through unified SQL interfaces. Its main contributions involve the following three perspectives: (1) DeCloud-RealBase builds an adaptation layer to connect the bottom-layer relational and non-relational database clusters, so that top-layer applications can perform database operations through unified SQL interfaces; (2) DeCloud-RealBase supports the majority of SQL commands, including add, delete, update, and query at both the table and data levels; and (3) with optimization strategies such as caching [5], a data partition strategy, and multi-threading, DeCloud-RealBase can meet general demands for concurrent read and write, scalability, and high availability of sensor data.
II. RELATED WORK
For the management of large volumes of sensor data, Ding and Gao [6] proposed an IoT database cluster system framework, named IoT-ClusterDB, to deal with massive sensor data while fully considering its features, such as being massive, heterogeneous, spatio-temporally sensitive, and dynamic. Their experiments showed that IoT-ClusterDB enjoys satisfactory sensor data uploading and query processing performance, and therefore provides an effective solution for massive sensor data management.

Hu et al. [7] constructed a dual-core cloud-based data system, TaijiDB for short, which leverages the advantages of cloud storage based on master-slave and P2P structures. TaijiDB is capable of supporting big data in the cloud using SQL. Unfortunately, this approach ignores the user's need for a relational database in situations with complicated queries. Moreover, it also fails to fully use caching to improve query efficiency.

Zhong et al. [8] proposed a NoSQL-based LaSQL unstructured data management system, LaUD-MS for short, to resolve problems such as massive data storage, fast read and write response, and massive data analysis. Using this architecture, a free table model with multiple column families was designed. Experimental analysis showed that this solution can meet the demand of massive monitoring data storage. Zhang [9] designed a JUOB middleware which stands between JDBC and real applications; it shields the differences between heterogeneous databases behind different JDBC drivers.

Our system architecture is motivated by [8]-[9]; it is capable of handling the storage and query of sensor data with excellent performance. Moreover, the architecture also provides a useful solution for managing multi-source, high-speed, and large-volume sensor data.

III. A UNIFIED STORAGE AND QUERY OPTIMIZATION FRAMEWORK FOR SENSOR DATA: DECLOUD-REALBASE

Aiming at the multi-source, high-speed, and large-volume characteristics of sensor data, a unified storage and query optimization framework, named DeCloud-RealBase, is proposed for the management of large volumes of sensor data. The core function of DeCloud-RealBase is that it can call different types of database interfaces in accordance with users' requirements, so that different insertion performance and query efficiency demands can be met. In the following, we introduce the DeCloud-RealBase platform in terms of its framework, core implementation, and optimization strategies.

A. DeCloud-RealBase Framework

The framework of DeCloud-RealBase is shown in Fig. 1 and contains the following components or layers.

(1) User Layer: Users initiate SQL requests, e.g., CREATE, INSERT, SELECT, etc.

(2) Unified Interface Layer: This layer provides unified data query and write interfaces, based on which database operations are realized by calling these interfaces.

(3) Syntax and Semantic Validation Service Layer: This layer checks the syntax and semantics of SQL statements; the integrity of the received sensor data is also validated here.

(4) Scheduling Layer: This layer analyzes and merges data write requests, parses, decomposes, and rewrites data queries, and finally generates the appropriate scheduling operations for the adaptation layer.

(5) Adaptation Layer: This layer is motivated by the adapter pattern in classical design patterns, i.e., it supports interface conversion by creating an interface adapter and allows users to be decoupled from the interface implementation. Both relational databases and NoSQL databases are connected, and the corresponding database operations are dispatched depending on the user requests.

(6) Database Cluster Layer: It serves as the persistent storage medium. In the DeCloud-RealBase framework, Oracle and MySQL are adopted as relational databases and HBase is used as the non-relational database.

Fig. 1. DeCloud-RealBase Framework

B. Core Implementation of DeCloud-RealBase

In this subsection, we first discuss the implementation of the adaptation layer of DeCloud-RealBase. The structure of the adaptation layer is shown in Fig. 2. By constructing the Oracle master-slave database, the MySQL cluster database, and the HBase cluster database, the relevant operations in the relational database modules (Oracle and MySQL) and the NoSQL database module (HBase) are called according to the database type.
Fig. 2. Structure of Adaptation Layer
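The paper does not list the adaptation layer's code. As a minimal Java sketch of the adapter-pattern idea in Fig. 2, the interface and classes below are our own illustration: the names DatabaseAdapter, RelationalAdapter, HBaseAdapter, and AdapterFactory, and their method signatures, are hypothetical, with the database-specific work left as placeholder comments.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical unified interface exposed by the adaptation layer: upper layers
// issue the same calls regardless of which bottom-layer cluster serves them.
interface DatabaseAdapter {
    List<Map<String, Object>> query(String sql) throws Exception;
    boolean write(String table, List<Map<String, Object>> rows) throws Exception;
}

// Relational adapter: would hand the (possibly rewritten) SQL to a JDBC connection pool.
class RelationalAdapter implements DatabaseAdapter {
    public List<Map<String, Object>> query(String sql) throws Exception {
        // placeholder: execute the SQL via JDBC and map the ResultSet
        return new ArrayList<>();
    }
    public boolean write(String table, List<Map<String, Object>> rows) throws Exception {
        // placeholder: batch INSERT via JDBC
        return true;
    }
}

// Non-relational adapter: would first convert the SQL into HBase scans/puts (Fig. 3),
// then execute them against the HBase cluster.
class HBaseAdapter implements DatabaseAdapter {
    public List<Map<String, Object>> query(String sql) throws Exception {
        // placeholder: SQL -> Scan conversion, then table.getScanner(scan)
        return new ArrayList<>();
    }
    public boolean write(String table, List<Map<String, Object>> rows) throws Exception {
        // placeholder: build Put objects and call table.put(puts)
        return true;
    }
}

// The scheduling layer selects an adapter by database type and stays decoupled
// from any concrete driver API.
class AdapterFactory {
    static DatabaseAdapter forType(String databaseType) {
        return "Non-relational".equals(databaseType) ? new HBaseAdapter() : new RelationalAdapter();
    }
}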
Since sensor data has the characteristic of being written once and read many times, the structure uses Memcached to cache the results of the first query in order to improve query efficiency under similar query conditions. The query is implemented as follows. In the data query part, the rewritten query statement is first parsed and then encoded using a hash function. Next, the result is compared with the keys in the cache database. If the cache is hit, the query result is obtained directly from the cache database. Otherwise, the database type is determined according to the parsing results, and the appropriate database server is chosen for further querying. The query results are also written to the cache database so that they can be reused by future queries. The rewritten query statement can be mapped directly to basic operations in relational databases, while for non-relational databases such as HBase, the SQL statements must first be converted to a format that HBase can recognize. The detailed conversion procedure is shown in Fig. 3.
Fig. 3. SQL Statements Conversion Module of HBase Cluster
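The conversion module in Fig. 3 is not listed in the paper; the following Java sketch shows one way such a mapping could look with the HBase 1.x client, assuming a row key of the form <cameraNo>_<timestamp> and an assumed column family "cf". The class and helper names are hypothetical; the column names come from the queries in Section IV.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

class SqlToScanConverter {
    // A SELECT of CAMERANO, RECORDTIME over a time range is translated into an
    // HBase Scan whose start/stop row keys encode that range for one camera.
    static Scan toScan(String cameraNo, long startMillis, long endMillis) {
        Scan scan = new Scan();
        scan.setStartRow(Bytes.toBytes(cameraNo + "_" + startMillis));
        scan.setStopRow(Bytes.toBytes(cameraNo + "_" + (endMillis + 1)));  // stop row is exclusive
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("CAMERANO"));
        scan.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("RECORDTIME"));
        scan.setCaching(1000);  // fetch rows in batches to reduce RPC round trips
        return scan;
    }
}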
For the data write part, the adaptation layer analyzes, schedules, and integrates the multi-source sensor data according to the parsed data type. These processing procedures are performed on the condition that the integrity and reliability of the data are not affected. Finally, by combining multi-threading with the row key design strategy, the data is stored to the persistent storage medium in a parallel and balanced manner.
The scheduling of a data query is implemented as follows:

QueryRequestScheduler:
Input: SQL, databaseType
Output: ResultSet /* query result set */
Begin
  determine database type;
  If databaseType equals Relational then
    If the partition field is contained in the query then rewrite SQL;
    If the hashed query matches the cache then
      get ResultSet from Memcached;
    Else
      get ResultSet from database;
      write ResultSet into Memcached;
    return ResultSet;
  Else if databaseType equals Non-relational then
    If the hashed query matches the cache then
      get ResultSet from Memcached;
    Else
      set start row key;
      set end row key;
      set columns needed to get;
      set Filters;
      ...
      get ResultSet from HBase cluster database;
      write ResultSet into Memcached;
    return ResultSet;
End

The scheduling of data writes is implemented as follows:

WriteRequestScheduler:
Input: List messages, databaseType
Output: writeStatus
Begin
  If the messages' size reaches the threshold then
    determine database type;
    If databaseType equals Relational then
      for i = 0 to dataList.size() do
        write data into database;
    Else if databaseType equals Non-relational then
      get HTableInterface from HBase cluster database with tableName;
      get ColumnDescriptor from HBase cluster database;
      define List puts;
      for i = 0 to dataList.size() do
        design rowkey;
        p.add(...);
        ...
        puts.add(p);
  return writeStatus;
End
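As a concrete illustration of the write scheduling above, here is a minimal Java sketch of a threshold-triggered batch write with the HBase 1.x client. The table name SBSN is taken from the queries in Section IV, while the row key layout <sensorId>_<timestamp>, the column family, and the threshold value are assumptions rather than the paper's actual configuration.

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class SensorRecord {
    String sensorId; long timestamp; String value;   // simplified sensor message
    SensorRecord(String id, long ts, String v) { sensorId = id; timestamp = ts; value = v; }
}

class HBaseBatchWriter {
    private static final int THRESHOLD = 1000;        // assumed flush threshold
    private final List<SensorRecord> buffer = new ArrayList<>();

    synchronized void add(SensorRecord r) throws Exception {
        buffer.add(r);
        if (buffer.size() >= THRESHOLD) {              // flush only when the threshold is reached
            flush(new ArrayList<>(buffer));
            buffer.clear();
        }
    }

    private void flush(List<SensorRecord> batch) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("SBSN"))) {
            List<Put> puts = new ArrayList<>();
            for (SensorRecord r : batch) {
                // designed row key: sensorId + timestamp keeps one sensor's data contiguous
                Put p = new Put(Bytes.toBytes(r.sensorId + "_" + r.timestamp));
                p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("RECORDTIME"), Bytes.toBytes(r.timestamp));
                p.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("VALUE"), Bytes.toBytes(r.value));
                puts.add(p);
            }
            table.put(puts);                           // one batched RPC instead of row-by-row writes
        }
    }
}

In the full system, each partition queue described in Section III.C.3 would be drained by its own thread, so several such batches can be written in parallel.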
C. Optimization Strategies of DeCloud-RealBase

1) MemCached Caching Strategy
Considering that queries may cover different ranges and that the results returned by different queries may differ, a query may only partially match the cache. The cache is therefore divided into a three-level cache buffer, built on top of the key-value distributed caching technique Memcached, and the three levels are designed as follows: (1) the parsing result of the rewritten query, after removal of its time information, is hash-encoded, and the outcome is used as the cache key denoted L1-key; (2) by concatenating the L1-key with its corresponding time range, the L2-key is obtained; and (3) taking the L2-key as input for a further hash encoding, the L3-key is obtained. It is worth noting that (1) the value of the first-level cache is the time range, the result range, and the query result set; (2) the value of the second-level cache is the result range and the query result set; and (3) the value of the third-level cache is the query result set. The detailed caching structure is drawn in Fig. 4.

Fig. 4. Structure of Caching Layer
When matching against the cache, the first-level cache is consulted first. If no item matches, a database query is needed. If some item matches, its time range is compared with the one stored in the first-level cache. If the requested time range is not contained in the cached one, a database query is performed. Otherwise, the request is compared with the second-level cache: if no item matches, a database query is used; otherwise, the request is compared with the third-level cache. If no item matches in the third-level cache, a database query is used; otherwise, the final result is returned directly from the cache.
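The cache code is not given in the paper; the Java sketch below shows one way the three keys could be derived and used with the spymemcached client. The MD5 hash, the "~" and "|" separators, the expiry time, and the placeholder result set are assumptions, and the lookup is simplified to a single level.

import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import net.spy.memcached.MemcachedClient;

class ThreeLevelCacheKeys {
    // Hash helper: hex-encoded MD5 of the input string (the paper only says "hash").
    static String hash(String s) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // L1 key: hash of the rewritten query with its time information removed.
        String queryWithoutTime = "select CAMERANO,RECORDTIME from SBSN";  // illustrative
        String timeRange = "t1~t2";              // assumed serialization of the query's time range
        String l1Key = hash(queryWithoutTime);
        // L2 key: L1 key concatenated with the time range.
        String l2Key = l1Key + "|" + timeRange;
        // L3 key: hash of the L2 key.
        String l3Key = hash(l2Key);

        MemcachedClient mc = new MemcachedClient(new InetSocketAddress("localhost", 11211));
        Object firstLevel = mc.get(l1Key);       // level 1 holds time range, result range, result set
        if (firstLevel == null) {
            Object resultSet = "...";            // placeholder: the database is queried instead
            // populate all three levels so later queries can match at different granularities
            mc.set(l1Key, 3600, timeRange + "|" + resultSet);
            mc.set(l2Key, 3600, resultSet);
            mc.set(l3Key, 3600, resultSet);
        }
        mc.shutdown();
    }
}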
2) Usage of Filter and Co-processor

HBase filters provide a powerful means of improving data processing efficiency. A couple of filters are used in DeCloud-RealBase when querying the HBase cluster database; their detailed usage is illustrated in the query scheduling procedure above. There are two types of coprocessors: one is called Observer, which is similar to the trigger in a traditional relational database, and the other is named Endpoint, which is like a stored procedure in a relational database. The Endpoint coprocessor provided by HBase can perform computation tasks on the cluster and then return the results to the client. Compared with retrieving the data from the server and computing on the client side, the network transport is reduced and the data transmission time is shortened greatly. Therefore, this method can save a lot of time and improve performance when dealing with large amounts of data.
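The paper does not name the specific filters it uses. As one plausible example with the HBase 1.x client, the Java sketch below restricts a scan on a non-row-key column with SingleColumnValueFilter and bounds the result size with PageFilter; the column family and qualifier are assumptions.

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

class FilteredScanExample {
    static Scan buildScan(String cameraNo) {
        // Keep only rows whose CAMERANO column equals the requested camera,
        // and stop after 10,000 matching rows per region.
        SingleColumnValueFilter byCamera = new SingleColumnValueFilter(
                Bytes.toBytes("cf"), Bytes.toBytes("CAMERANO"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes(cameraNo));
        byCamera.setFilterIfMissing(true);        // skip rows that lack the column
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
                byCamera, new PageFilter(10000));
        Scan scan = new Scan();
        scan.setFilter(filters);
        return scan;
    }
}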
Fig. 5. Scheduling Process of Coprocessor

In DeCloud-RealBase, aggregation queries, for example count(*), are implemented using the coprocessor by placing the computation on the servers that hold the data. The detailed process is shown in Fig. 5. The computation over the queried data is conducted on the servers and only the computed results are sent back to the client, which saves network transmission time and greatly simplifies the client-side operation.
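HBase ships an example Endpoint coprocessor for aggregation; the Java sketch below uses its client-side AggregationClient (HBase 1.x) to run count(*) on the region servers, assuming the AggregateImplementation coprocessor is loaded on the table and an assumed column family "cf". Whether DeCloud-RealBase relies on this built-in endpoint or a custom one is not stated in the paper.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.coprocessor.AggregationClient;
import org.apache.hadoop.hbase.client.coprocessor.LongColumnInterpreter;
import org.apache.hadoop.hbase.util.Bytes;

class ServerSideCount {
    public static void main(String[] args) throws Throwable {
        Configuration conf = HBaseConfiguration.create();
        AggregationClient aggregation = new AggregationClient(conf);
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes("cf"));      // assumed column family
        // The row counting runs inside the region servers; only the final
        // count crosses the network, instead of every matching row.
        long rows = aggregation.rowCount(TableName.valueOf("SBSN"),
                new LongColumnInterpreter(), scan);
        System.out.println("count(*) = " + rows);
        aggregation.close();
    }
}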
3) Data Partition Strategy

The data partition strategy is also applied to process sensor data in DeCloud-RealBase. As introduced in the framework, received sensor data is partitioned into different parts to improve its write speed and efficiency. As shown in Fig. 6, data objects are stored in memory using a list structure: QueueList represents the data objects sent per time unit, and List1, List2, List3, ... stand for the sub-objects in each list. According to the features of the sensor data, it is partitioned into different data queues based on a unique identifier.
Consider, for example, the vehicle sensor data. We first obtain the timestamp of each data item, convert it with a hash (modulo) function, and push the item into a queue according to the result. Suppose there are three queues waiting for data and the data receiving time is 2012-10-17 00:00:52. We convert it into milliseconds, 1350403252000; its remainder modulo 3 is 1, so we push this data item into the first queue. This procedure is shown in Fig. 6.
Fig. 6. Structure of Data Partitioning
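A minimal Java sketch of this timestamp-hash partitioning, assuming three in-memory queues as in the example above; the record representation and queue count are illustrative only.

import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

class PartitionedBuffer {
    private final List<Queue<long[]>> queues = new ArrayList<>();  // each entry: {timestampMillis, payloadId}

    PartitionedBuffer(int queueCount) {
        for (int i = 0; i < queueCount; i++) queues.add(new ConcurrentLinkedQueue<>());
    }

    // Route a record to a queue by the remainder of its timestamp.
    void push(long timestampMillis, long payloadId) {
        int index = (int) (timestampMillis % queues.size());
        queues.get(index).add(new long[] {timestampMillis, payloadId});
    }

    public static void main(String[] args) {
        PartitionedBuffer buffer = new PartitionedBuffer(3);
        long ts = 1350403252000L;          // 2012-10-17 00:00:52, as in the example above
        buffer.push(ts, 1L);               // 1350403252000 % 3 == 1, so it lands in queue index 1
    }
}

Once a queue reaches the write threshold discussed next, its contents would be handed to a writer thread, for instance a batch writer like the earlier sketch.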
When a queue contains enough data to meet the predefined threshold, multi-threading is used to send it to the server for persistent storage. In the data partitioning strategy, the data object corresponding to each queue is sent to the data write area; the data write area parses each data object and
then stores them simultaneously on different data servers at a high rate.

4) Data Replication Strategy

To maintain the integrity and stability of the MySQL cluster database, a data replication strategy is applied. More specifically, two servers are assigned as master servers and the other two servers are designated as slave servers. Database changes are recorded in the logs by the master servers, and these changes are also sent to the corresponding slave servers. Once the master servers break down, the slave servers can provide the relevant services instead. For more discussion of data replication strategies, one may refer to [10]-[11].
IV. EXPERIMENTAL ANALYSIS

Experiment Objective: DeCloud-RealBase is proposed for the management of large volumes of sensor data. In this section we perform a set of experiments to test its data write and query performance on different database types. Our query experiments were performed on 280 million data records. The detailed experiment environment is shown in Table I.

TABLE I. EXPERIMENT EQUIPMENT AND ENVIRONMENT
Equipment                       Environment
LoadRunner Server (4)           Dual-core 3.0 GHz CPU, 4 GB memory
DeCloud-RealBase Server (1)     24-core 2.4 GHz CPU, 16 GB memory
MySQL Database Server (4)       24-core 2.4 GHz CPU, 8 GB memory
HBase Database Server (4)       24-core 2.4 GHz CPU, 8 GB memory

Fig. 7. Data Writing Test for DeCloud-RealBase

Experiment 2: We perform the data query test for the DeCloud-RealBase framework. Five sets of test data are used, each containing 10 query statements, and their corresponding query response times are measured. For example, we query the results from 2012-10-17 to 2012-10-26 by setting the five test time ranges to 1 minute, 5 minutes, 10 minutes, 30 minutes, and 60 minutes. Each test issues ten concurrent query statements, such as Query Statements 1-2 below; Query Statements 3-10 are obtained accordingly.
Query Statement 1: select CAMERANO,RECORDTIME from SBSN where RECORDTIME >= to_date('2012-10-17 12:12:01','yyyy-mm-dd hh24:mi:ss') and RECORDTIME = to_date('2012-10-18 12:12:01','yyyy-mm-dd hh24:mi:ss') and RECORDTIME