Raw Data Processing Framework for IoT
Mayank Patel
Distributed Databases Group, DA-IICT, Gandhinagar, India
[email protected]
Minal Bhise
Distributed Databases Group, DA-IICT, Gandhinagar, India
[email protected]

Abstract—Internet of Things (IoT) sensors continuously generate large amounts of data, and storing and managing these streaming sensor data is a significant challenge. IoT applications require real-time analytical query processing on the stored data. Traditionally, making IoT data queriable requires loading it into a database management system, where the high data insertion time (DIT) becomes the first bottleneck. Storing data in its raw format is the most convenient solution for scientific experiments and a variety of other applications where data velocity is high and the schema of the incoming data is unknown or varying. Raw data storage and management techniques therefore need to be explored to exploit their very low DIT for handling streaming IoT data. Most existing work on raw data processing deals with static data and with improving query performance on stored data. This work explores the possibility of applying raw data processing techniques to streaming IoT sensor data. A raw data processing framework is proposed to handle streaming IoT sensor data, exploiting an open source system setup. The experimental setup uses PostgresRAW (PgRAW), the first mature NoDB system. PgRAW showed a 99.5% improvement in DIT compared to PostgreSQL (PgSQL), while query execution time (QET) improved by 57% over a data scaling experiment from 1M to 4M records.
Keywords—Internet of Things, Query Processing, Raw data, Sensors, Streaming Data

I. INTRODUCTION

Nowadays smart devices are equipped with several sensors. Android/iOS mobile applications read multiple sensor observations to provide users a personalized experience. Internet of Things (IoT) applications such as smart homes, health monitoring and smart cities use a variety of sensors which continuously observe users' surroundings and generate data streams. The data transferred from these devices needs to be stored securely and efficiently to answer the queries fired at the application end. IoT applications use these data to make decisions that enable autonomous systems (Abu-Elkheir, Hayajneh and Ali, 2013). For example, heart monitoring of a patient is a critical use case, where the system must immediately report any irregularity in the observations to the patient's doctor and call an ambulance if needed. Smart city applications can also manage critical city services such as electricity, gas and traffic signals, and perform real-time event detection using CCTV cameras and other sensors. To satisfy the data security needs of such applications, most companies rely on private servers rather than shared cloud services. However, as users and data grow, the servers need to be scaled accordingly, which can significantly increase the overall cost of the system. The server used to store the application data needs to be able to handle heterogeneous streams of data from various sources (Balazinska et al., 2007). Researchers have developed several techniques to store and process the data streams coming from sources spread across the world. Storing the entire data in main memory is one of the fastest techniques to store and process data streams, but it requires an expensive server setup. Bulk loading techniques collect data in buffer storage for a fixed time interval, or until a query arrives, and then load all the collected data into the main database at once. This saves a lot of data insertion time (DIT) compared to one-by-one insertion. Bulk loading needs to parse and convert the data into the proper database schema format. Although bulk loading takes much less time, it is not the fastest option, and a query must wait until the bulk load completes. The technique of storing data without any preprocessing, as comma separated values (CSV) or lists of strings, has long been used for event logging. Scientists also store most of their data in raw formats, for two main reasons. First, the complexity of scientific experiments influences the output data; characteristically, the generated data schema is not known in advance. Second, the volume and velocity of experimental data do not allow the use of traditional database systems due to their high loading time. For example, the Large Hadron Collider at CERN can produce terabytes of data every second, which is impossible for any database management system to load in real time. Storing data in raw format has many other benefits. Raw data has better accessibility: many applications can access the same data concurrently, and even a single application can easily access multiple raw data sources for analytical queries. In addition, many database management systems natively support bulk loading from raw data formats. The recent trend of processing raw files in place does not require any data loading into a database to execute queries. Many in-situ engines parse only the part of the data needed by the query. This saves time by not parsing unnecessary data.
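The selective-parsing idea can be illustrated with a short sketch. This is not PgRAW's actual implementation, only a minimal illustration of the principle: to answer a query touching a single column, an in-situ engine can tokenize each CSV line only as far as the needed field and skip the rest.

```java
import java.util.ArrayList;
import java.util.List;

public class SelectiveParser {
    // Extract only the field at position `col` from each CSV line,
    // scanning characters just far enough to locate it. Fields after
    // `col` are never tokenized, saving parsing work. Assumes simple
    // CSV without quoted fields (illustrative only).
    static List<String> projectColumn(List<String> lines, int col) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            int start = 0;
            for (int c = 0; c < col; c++) {
                start = line.indexOf(',', start) + 1; // jump to next field
            }
            int end = line.indexOf(',', start);
            out.add(end == -1 ? line.substring(start) : line.substring(start, end));
        }
        return out;
    }
}
```

A real in-situ engine would additionally cache the discovered field offsets (a positional map, as in NoDB) so that repeated queries over the same file avoid even this partial scan.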
The processed data may allow faster execution of future queries accessing the same part of the dataset. Existing in-situ raw data processing engines can completely eliminate the time used to load data into a database, and the same holds for streaming data loading. However, this increases the query execution time (QET) of initial queries, which then improves quickly for subsequent ones. The technique is useful when a part of the data is accessed multiple times while the parsed data is still in memory, or when only a part of the complete data is needed to answer the queries. The goal of the proposed work is to explore the possibilities of applying raw data processing techniques to streaming data applications. Here data is generated continuously, so the discussed setup repeatedly benefits from DIT savings. Queries may require only a small part of the data, so the system can also save time by keeping unused data raw and parsing only the useful part. Challenges arise when modern smart city or IoT applications require historical as well as current data for analytical queries. Identifying and processing historical data is a complex task; even well-matured traditional database systems need to scan the entire database to answer time travel queries. An IoT application may need current and historical data to take certain decisions in real time. This work aims at managing data in raw files and improving QET for applications with multiple data streams.

II. LITERATURE SURVEY

A huge amount of scientific, astronomical, genomic and streaming sensor data is collected in raw formats. The benefits and use cases of raw data storage have been discussed intensively by multiple researchers. Earlier work on raw data processing scanned the entire file to answer each query. For example, Oracle and MySQL allow querying CSV data files directly without loading the data, but they scan the entire file each time a query is fired. NoDB (Alagiannis et al., 2012) is the first mature raw data processing system, from a research group at EPFL. The core idea behind this work was to remove the major bottleneck of data-to-query time, which is data loading. The work proposed extending traditional query processing architectures to work directly on raw files. The researchers converted a traditional row store, PostgreSQL, into a NoDB system known as PostgresRAW, and included multiple techniques to work robustly with raw files. Subsequently, Slalom (Olma et al., 2017) partitioned raw data logically and built lightweight indexes on the fly based on the accessed data; it creates zone maps, bloom filters and B+ tree indexes in parallel by sharing the scan operator of the query. The group's most recent work on raw data chooses the optimal data caching mode automatically, based on query history, cache behavior and cache size, for faster query execution (Azim et al., 2018).
SCANRAW (Cheng and Rusu, 2015) is another in-situ processing engine over raw files. It focuses on making query execution faster by giving first priority to query execution and loading data only when I/O bandwidth or CPU is available. In contrast to invisible loading (Abouzied, Abadi and Silberschatz, 2013), it may not load data during query processing, because if the I/O bandwidth is fully utilized, data loading can slow down query execution. OLA-RAW (Cheng, Zhao and Rusu, 2017) applied query-driven sampling over raw data to estimate whether a query result is of interest; if not, the query can be terminated midway to save time. This approach achieves roughly 95-99% accuracy on astronomical data by executing queries on samples of 0.05% to 0.3% of the actual data, with 10X lower query execution time. Similar to sampling, a hot and cold data distribution based on the query workload and partitions can also improve QET by keeping hot data in main memory; Jain, Padiya and Bhise (2017) show that only 8% of the total data can answer 64% of queries with an 83% time gain. A vertical partitioning heuristic is discussed by Zhao, Cheng and Rusu (2015): it partitions raw data files using a two-stage heuristic algorithm that combines query coverage and attribute usage frequency of the workload to decide the best attributes to load. Many data partitioning techniques have been used on RDF and relational data with traditional databases to improve QET for semantic web data analysis (Vasani et al., 2013), (Padiya and Bhise, 2017). Modern techniques used to process streaming data include bulk loading, in-memory databases and the cloud (Abu-Elkheir, Hayajneh and Ali, 2013), (Alagiannis et al., 2012). CityPulse (Puiu et al., 2016) discusses the data processing requirements of smart cities and proposes a framework with all the components required to handle the generated data and answer queries. All the discussed related work has limitations in handling streaming raw data for real-time query processing. NoDB does not load existing data into the database; it only caches data in memory as queries arrive. SCANRAW loads data into the database whenever the CPU or disk is underutilized, but when applied to IoT applications with streaming data sources, it may never find the disk or CPU underutilized due to the constant data and query flow. To the best of our knowledge, there is no prior work that processes sensor data with raw methods in real time.

III. RAW DATA MANAGEMENT FRAMEWORK
The system includes the following components to manage streaming data using raw data management techniques. Fig. 1 shows the basic structure of the proposed raw data management framework. Clients or sensors on smart devices continuously send observations and queries to the system. The system stores the streaming data and queries it through the Streaming Data & Query Management component. An in-situ engine or a traditional database can be used to execute SQL queries on the CSV files or the database, respectively.
Fig. 1. Raw Data Management Framework
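The central storage path shown in Fig. 1 — buffering incoming sensor records and appending them to a raw CSV file once a bulk number of records has accumulated — can be sketched as follows. The class, method and field names are illustrative, not the paper's actual code:

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

public class StreamToRaw {
    private final Path rawFile;
    private final int bulkNumber;                        // records buffered before a flush
    private final List<String[]> bulkList = new ArrayList<>();

    StreamToRaw(Path rawFile, int bulkNumber) {
        this.rawFile = rawFile;
        this.bulkNumber = bulkNumber;
    }

    // Buffer one sensor record; append the whole buffer to the raw
    // CSV file once the bulk number is reached.
    void onRecord(String[] record) throws IOException {
        bulkList.add(record);
        if (bulkList.size() >= bulkNumber) flush();
    }

    // Write all buffered records at once, comma separated, one per line.
    void flush() throws IOException {
        try (Writer w = Files.newBufferedWriter(rawFile,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String[] rec : bulkList) {
                w.write(String.join(",", rec));
                w.write(System.lineSeparator());
            }
        }
        bulkList.clear();
    }
}
```

The CSV file produced this way is the "database" that the in-situ engine queries directly; on the DBMS side the same buffer would instead feed a batched SQL insert.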
A. In-situ Query Engine + DBMS
Any open source state-of-the-art in-situ query engine can be used; it executes SQL queries directly on the generated raw data files. A traditional DBMS can be used in parallel to load data into a database, forming a hybrid system if the application's data management requirements demand it. The experimental setup discussed in Section IV includes an example of such a system: a combination of the in-situ engine PgRAW and the traditional DBMS PgSQL.
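Since PostgresRAW is derived from PostgreSQL, both engines speak the same wire protocol and can be reached through standard JDBC, so a hybrid setup can simply route each query to whichever instance holds the relevant data. A minimal routing sketch follows; the ports, connection URLs and the table-to-engine mapping are hypothetical:

```java
import java.util.Map;

public class HybridRouter {
    // Hypothetical connection strings for the two engines running side by side.
    static final String PGRAW_URL = "jdbc:postgresql://localhost:5433/rawdb";
    static final String PGSQL_URL = "jdbc:postgresql://localhost:5432/iotdb";

    // Tables backed by raw CSV files are served by PgRAW;
    // tables loaded into the database are served by PgSQL.
    private final Map<String, Boolean> rawBacked;

    HybridRouter(Map<String, Boolean> rawBacked) {
        this.rawBacked = rawBacked;
    }

    // Pick the JDBC URL of the engine that should execute a query on `table`.
    String urlFor(String table) {
        return rawBacked.getOrDefault(table, false) ? PGRAW_URL : PGSQL_URL;
    }
}
```

The management component would then obtain a java.sql.Connection from the selected URL and submit the SQL unchanged, since PgRAW accepts standard SQL over raw files.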
B. In-memory Storage
In-memory storage can be used to hold current or frequently accessed data for faster query execution. The in-situ query engine can store parsed frequent data here, while the Streaming Data & Query Management component can use it to hold the current sensor data. This storage can also serve as buffer storage to gather data for bulk loading.

C. Raw Data Files
For better accessibility, data can be stored in raw data files in CSV, JSON or XML format. These files can hold relational or unstructured data in raw form. For the experiments discussed in Section IV-B, the data has been stored in a CSV file.

D. Streaming Data & Query Management
This component is the core of the framework. It needs to handle multiple tasks smoothly to allow storage and query execution on both raw data and a traditional database system. The code of this component is written in Java using Eclipse. It stores the streaming sensor data directly into CSV files and can manage data arriving at different velocities into a given raw file in comma separated format. The raw file generated this way becomes the database for the in-situ query engine, from which all queries are answered. The component also manages data loading and query execution for the in-situ engine and the database system. Based on data frequency, either one-by-one or bulk loading can be selected. The bulk loading number decides how many records are held in buffer storage before being loaded into the raw file or database; it can be set according to the streaming data velocity or query frequency. For the streaming experiments discussed in Section IV-C, the bulk loading number is set to a default value.

IV. SETUP & EXPERIMENTS

PostgresRAW (PgRAW) is an open source implementation of NoDB (Alagiannis et al., 2012). PgRAW is used as the basic in-situ raw data processing engine for multiple reasons. 1) It is the first mature raw data processing system and can execute SQL queries directly on raw data. 2) PostgresRAW is an extension of PostgreSQL (PgSQL) that queries raw files following the NoDB philosophy; the base PgSQL system can therefore be used in parallel with PostgresRAW as a single system. 3) The raw data files can be manipulated directly by other algorithms in parallel.

A. Experimental Setup
The machine ran a 64-bit Ubuntu 14.04 OS on an Intel Core i3-2100 CPU clocked at 3.10GHz, backed by 16GB of RAM. A 500GB SATA hard disk drive with 7200RPM rotation speed was used as the permanent storage medium. The NoDB extension of the PostgreSQL system included PgSQL and PgRAW. Eclipse with the JRE was used to run the Java code that loads data and fires queries on the implemented data management setup.
A Java program was written to replicate stream data and bulk data loading on both the PgRAW and PgSQL systems. The same program executes queries and records the QET for both. For PgRAW, data streams were collected in a list of string arrays, named the bulk-list in the rest of the paper, where all the attributes of a record are stored in one array of strings. The generated list is then written to the CSV file all at once. PgSQL used traditional bulk loading to save the data to disk: SQL insert statements were added to a batch statement one by one, which was then executed to bulk load the data into the database. The time needed to extract data from an object or parameters to create the SQL statements was not counted, in order to obtain the exact DIT of the traditional DBMS system.
The benchmark sensor dataset used for the experiments consists of LOD RDF triples, 2.9GB in size with 10M rows. 16 random queries with different numbers of joins were used; 9 of the 16 are analytical queries that access a large number of records. The actual LOD triples dataset has a total size of 26GB with more than 100M records; 10M triples were extracted from it such that different queries access different parts of the dataset and different numbers of records satisfy the query conditions.

B. Experiments
A number of preliminary experiments were performed on the setup to make calculated assumptions without favoring either system in the comparison. The first experiment, on data insertion time (DIT), showed that one-by-one loading of streaming data into a traditional database results in the worst performance of the entire system, while storing raw data into a CSV file one record at a time performs thousands of times better. Therefore the best-performing bulk loading configuration of PgSQL was chosen for the following experiments. The queries were also executed in different sequences to check how much QET differs when a query has to access the data from the CSV file first in the PgRAW system; the difference was negligible, although this may vary across datasets depending on the number of columns and their data types. Similarly, the difference between cold and hot runs was checked for PgRAW and PgSQL, and it likewise showed no noticeable difference.
1) Experiment 1. Aggregate Execution Time: This experiment identifies the difference in the total execution time, DIT plus QET, between PgRAW and PgSQL. Here the DIT of PgRAW is the time taken to store comma separated data into the CSV file.
2) Experiment 2. Static Data Scaling: For this experiment, the data in the system was scaled to 4 million records to observe the difference in DIT and QET. How DIT and QET are affected in both systems shows which system handles larger datasets better.
3) Experiment 3. Streaming Data Scaling: This experiment scaled the data by inserting 1M records on top of the existing records to roughly replicate the actual streaming
data workloads. Here the bulk loading number defaults to 1M because of its optimal performance across bulk loading numbers ranging from 100 to 5M.

C. Results and Analysis
Fig. 2 shows the results of the first experiment, Exp1. The framework component described in Section III-D creates a CSV file linked to a table in PgRAW to store the streaming data coming from the sensors. It also inserts the data into PgSQL using the bulk load technique to obtain the lowest possible DIT. Once the data is loaded into the CSV file and PgSQL, the component executes the 16 queries on the stored data using PgRAW and PgSQL to measure QET. The experiment shows the difference in aggregate execution time between PgRAW and PgSQL. The graph shows a single bar for each system, where the blue part is DIT and the red part is the total QET of the 16 queries. The DIT of PgRAW is invisible in the graph because creating the CSV file used by PgRAW took less than 1 sec.

Fig. 2. Aggregate Execution Time comparison

The results of the next experiment, shown in Fig. 3, describe how QET changes with data size. The DIT of PgRAW stays very low compared to the DIT of PgSQL. In addition to the DIT benefit, PgRAW performs better in all the analytical queries executed on more than 1M records.

Fig. 3. Static Data Scaling time comparison

PgSQL performance decreased mainly due to the analytical queries. Fig. 4 compares the total query execution time taken by the 9 analytical queries for different workloads on PgRAW and PgSQL. All the analytical queries had 5 self-joins, and almost all of them returned more than 0.1M records. To answer such queries PgSQL needs to access the disk very often, which decreases its performance, while PgRAW performed better because it needed to access the disk only once: PgRAW caches parsed data in main memory, so all the records were already there for the output.

Fig. 4. QET for Analytical Queries in PgRAW vs PgSQL

Fig. 5 compares the streaming data insertion mode with the static data load for PgRAW. The "S" in "S_1M" denotes the streaming data results of PgRAW. The DIT of PgRAW improved at S_5M in streaming mode, because the bulk-list needs to handle only 1M records to append at the end of the raw file, whereas in static mode organizing 5M records in the bulk-list took more time due to additional memory allocations and reads from the larger bulk-list. No major differences were seen in the QET of queries executed after the different DIT methods, because QET depends on the data in the accessed file, not on how the data was stored in that file; the data is inserted into the same file, and once the file changes the cached data is removed from memory, so PgRAW performance stays the same. PgSQL results are not displayed here because they likewise show no difference in DIT or QET.

Fig. 5. Streaming Data Scaling experiment

To summarize the results, data insertion into CSV files took just 0.3-0.6% of the PgSQL DIT. The QET at 1M was almost equal in both systems; however, to complete execution of all 16 queries for 2M, 3M and 4M records, PgRAW took only 22%, 8% and 3% of the time taken by PgSQL, respectively. On average, DIT improved by 99.5% and QET by 57% for PgRAW over PgSQL.

V. CONCLUSION
This paper presented a raw data processing framework to process streaming IoT data. The framework obtains the benefit of the very low DIT of PgRAW, while PgSQL suffered from high data loading time. The query execution of PgRAW is also better than that of PgSQL for analytical queries. The data scaling results showed that inserting 1M-4M records into PgRAW took only 0.3-0.6% of the time taken by PgSQL, while executing all 16 queries on 1M-4M records improved QET by 57% for PgRAW over PgSQL. The streaming experiment showed that streaming raw data can be handled easily by storing it in CSV files and processing it with tools like PgRAW; these tools completed the overall workload in just 7.7% of the total time taken by the DBMS system. This shows that the raw data processing framework discussed in this paper can be used for streaming data and can save a huge amount of time. However, the dataset used, and the number of records in the results, were smaller than the available memory of the machine, which could easily hold the entire dataset in main memory; datasets larger than main memory will be analyzed for a better evaluation. The framework can be further improved to support spatio-temporal queries for better usage in IoT applications.

REFERENCES
[1]
Abouzied, A., Abadi, D. J. and Silberschatz, A. (2013) 'Invisible Loading: Access-Driven Data Transfer from Raw Files into Database Systems', Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13), pp. 1-10. doi: 10.1145/2452376.2452377.
[2] Abu-Elkheir, M., Hayajneh, M. and Ali, N. A. (2013) 'Data management for the Internet of Things: Design primitives and solution', Sensors, 13(11), pp. 15582-15612. doi: 10.3390/s131115582.
[3] Alagiannis, I. et al. (2012) 'NoDB: efficient query execution on raw data files', Proceedings of the 2012 International Conference on Management of Data (SIGMOD '12), pp. 241-252. doi: 10.1145/2213836.2213864.
[4] Azim, T. et al. (2018) 'Adaptive Cache Mode Selection for Queries over Raw Data', Ninth International Workshop on Accelerating Analytics and Data Management Systems Using Modern Processor and Storage Architectures (ADMS).
[5] Balazinska, M. et al. (2007) 'Data management in the worldwide sensor web', IEEE Pervasive Computing, 6(2), pp. 30-40. doi: 10.1109/MPRV.2007.27.
[6] Cheng, Y. and Rusu, F. (2015) 'SCANRAW: A Database Meta-Operator for Parallel In-Situ Processing and Loading', ACM Trans. Database Syst., 40(3), pp. 19:1-19:45. doi: 10.1145/2818181.
[7] Cheng, Y., Zhao, W. and Rusu, F. (2017) 'OLA-RAW: Scalable Exploration over Raw Data', pp. 1-23. Available at: http://arxiv.org/abs/1702.00358.
[8] Jain, A., Padiya, T. and Bhise, M. (2017) 'Log Based Method for Faster IoT Queries', IEEE Region 10 Symposium (TENSYMP), pp. 1-4.
[9] Olma, M. et al. (2017) 'Slalom: Coasting Through Raw Data via Adaptive Partitioning and Indexing', Proceedings of the VLDB Endowment, 10(10), pp. 1106-1117.
[10] Padiya, T. and Bhise, M. (2017) 'DWAHP: Workload Aware Hybrid Partitioning and Distribution of RDF Data', Proceedings of the 21st International Database Engineering & Applications Symposium. ACM, pp. 235-241.
[11] Puiu, D. et al. (2016) 'CityPulse: Large Scale Data Analytics Framework for Smart Cities', IEEE Access, 4.
[12] Vasani, S. et al. (2013) 'Faster Query Execution for Partitioned RDF Data', ICDCIT, 7753. doi: 10.1007/b104418.
[13] Zhao, W., Cheng, Y. and Rusu, F. (2015) 'Vertical partitioning for query processing over raw data', Proceedings of the 27th International Conference on Scientific and Statistical Database Management, pp. 1-12. doi: 10.1145/2791347.2791369.