International Journal of Parallel, Emergent and Distributed Systems
ISSN: 1744-5760 (Print) 1744-5779 (Online) Journal homepage: http://www.tandfonline.com/loi/gpaa20
NoSQL real-time database performance comparison
Diogo Augusto Pereira, Wagner Ourique de Morais & Edison Pignaton de Freitas
To cite this article: Diogo Augusto Pereira, Wagner Ourique de Morais & Edison Pignaton de Freitas (2017): NoSQL real-time database performance comparison, International Journal of Parallel, Emergent and Distributed Systems, DOI: 10.1080/17445760.2017.1307367
To link to this article: http://dx.doi.org/10.1080/17445760.2017.1307367
Published online: 30 Mar 2017.
NoSQL real-time database performance comparison
Diogo Augusto Pereira, Wagner Ourique de Morais and Edison Pignaton de Freitas
Institute of Informatics, Universidade Federal do Rio Grande do Sul, Porto Alegre, Brazil
ABSTRACT
The amount of data being produced is increasing constantly, as the number and variety of connected devices grow and advances in data storage and mining support this evolution. However, storing and handling large quantities of data challenges current Relational Database Management Systems. Big Data and its related products emerged to help in this matter, and NoSQL databases arose with the purpose of offering better solutions and features to handle massive amounts of data with higher performance, sometimes near real-time. This study presents the NoSQL database scenario and background, and provides a detailed analysis of the characteristics, a feature comparison and a performance evaluation of three NoSQL databases extensively used in the market nowadays: Couchbase, MongoDB and RethinkDB. Tests were performed in two different scenarios: single thread and multiple threads. The results reveal that Couchbase performed better in most operations, except for retrieving multiple documents and inserting documents with multiple threads, operations in which MongoDB scored better.
Results of the POST operation tests using a multiple threads scenario. The graphic presents the response time of each database.
CONTACT Diogo Augusto Pereira
© 2017 Informa UK Limited, trading as Taylor & Francis Group
ARTICLE HISTORY
Received 16 January 2017
Accepted 13 March 2017
KEYWORDS
Big data; NoSQL; database; Couchbase; MongoDB; RethinkDB; document store; real-time; performance comparison
1. Introduction
The increasing demand for higher-capacity and faster data storage is transforming the database market. The number of applications demanding a high volume of data is growing, and data-intensive applications are increasingly used to support decisions [1]. The term Big Data was coined to refer to large and complex volumes of data from which it is possible to extract more meaning [2]. These storage demands have motivated the appearance of NoSQL databases, which bring high availability, scalability and performance to the handling of massive amounts of data [3]. There are many examples of applications in this scenario, some of them used by billions of people, such as social networks and GPS navigation tools with real-time traffic data. Other Big Data applications are expected to see increased usage in the next few years, in areas like healthcare, education, autonomous machines, smart cities and many others, because of the advances in data storage and mining and the increasing number and variety of devices producing data [1,4].
The capacity to store a high volume of data is critical for applications in the scenario described above, but performance is also important, because systems that demand real-time data need to store and retrieve data in a highly responsive manner. Driven by this need for responsiveness and by the increasing demand for storage of large amounts of data, database systems with Big Data and near real-time characteristics are emerging to meet these requirements. However, choosing the most appropriate database solution for a given deployment is not a trivial task because of their different features, characteristics and provided results. Each product could provide better results in a given kind of scenario, but might not be recommended for a different scenario with other requirements. For example, one database could be recommended in a scenario with a high volume of insert or update operations, while not being the best option for retrieving data.
In order to help and support the choice and adoption of NoSQL databases, this paper presents the characteristics and features of three selected NoSQL databases: Couchbase [5], MongoDB [6] and RethinkDB [7]. Moreover, an objective and detailed performance comparison is provided, in which the databases' response time and throughput are evaluated for both single and multiple threads scenarios. The remainder of this paper is organized as follows: Section 2 describes and discusses related work. Section 3 provides a short background review on NoSQL databases. Section 4 presents a qualitative comparison among the three selected databases. The methodology of the performance experiment is provided in Section 5, while Section 6 presents the results of the quantitative analysis. Section 7 critically discusses the obtained results, while Section 8 concludes the paper and presents directions for future work.
2. Related work
The work presented in [8] provides a classification and the characteristics of different NoSQL databases. In addition, a feature comparison and some adoption trends and numbers are also presented. That study provides a method to categorize the databases according to their characteristics and presents an extensive list of features that are important in a NoSQL database, helping to select a suitable product according to the desired functionalities. The present paper is complementary to the study reported in [8] by presenting a quantitative evaluation of NoSQL databases. The study presented here is not as comprehensive in terms of the number of databases evaluated, but provides a meaningful perspective since it evaluates three databases that are representative of the most used ones in the market nowadays. A performance comparison between PostgreSQL [9] (relational) and MongoDB (non-relational) databases is presented in [10]. The main CRUD (Create, Read, Update and Delete) database operations are evaluated. In that research, MongoDB, with an unstructured data model (in which there is no strict schema), had a better performance in general. That study performed five iterations of tests for each data
operation, with a different number of records, and then presented the execution time of each one. In comparison to [10], the current paper goes further and provides a comparison among NoSQL databases. A similar study is reported in [11], in which a performance measurement was performed for the relational database MySQL [12] and the object database DB4o [13]. In that case, the object database proved to be more efficient on insert operations, but less efficient when executing delete and select commands. The paper used sets of data with four different sizes and measured the time taken to process each of them. A model to evaluate the Velocity, Volume and Variety (scalability) of different in-memory database systems is presented in [14]. Using this model, the CRUD operations of four different database systems were compared against each other. VoltDB [15] performed better than MongoDB and MySQL in most of the tests, and SQLite [16] was significantly better than MongoDB and MySQL on select, update and delete operations.
3. NoSQL databases
The impedance mismatch between object-oriented and relational models motivated the appearance of NoSQL (commonly read as 'Not only SQL') databases. The data model of NoSQL databases differs from that of traditional RDBMS systems, like Oracle and SQL Server, because it is usually non-relational and schema free [3]. Besides the data model, the whole design, architecture and even the transaction model of NoSQL databases are usually distinct from those of relational databases [3]. The data model of a NoSQL database may differ depending on the application's requirements. Moreover, the data model can be used to classify NoSQL databases. Some of the most used data models are:
• Key-Value store: the data is stored in the form of a dictionary, in which a key is used to identify the record. The 'value' part has no imposed structure or limitation; it is treated as an opaque, variable-length string.
• Document store: in this kind of database, all of the data related to an object is stored in a single instance, or document. In addition, every document could have a different structure from every other. Usually the format of the document is JSON or XML [17].
• Graph: this database uses graph structures for semantic queries, with nodes, edges (relationships) and properties to store data. Each node could be related to one or more other nodes.
• Column: stores the data in columns, and each column is a tuple consisting of three elements: unique name, value and timestamp. A ColumnFamily could be used to group the data, but each group could have different columns.
• Multi-model: supports more than one data model in the same database.
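To make the data models listed above more concrete, the following minimal Python sketch represents the same kind of record under each of them. All keys, field names and values are hypothetical and chosen only for illustration; they do not come from any of the evaluated products.

```python
from typing import NamedTuple

# All records below are hypothetical and exist only to contrast the models.

# Key-value store: the value is an opaque string looked up by its key.
key_value_store = {"order:1": '{"customer": "ABCDE", "amount": "10.50"}'}

# Document store: each document holds all data of one object, and two
# documents in the same collection may have different structures.
documents = [
    {"id": "1", "customer": "ABCDE", "amount": "10.50"},
    {"id": "2", "customer": "FGHIJ", "items": ["a", "b"]},
]


class Column(NamedTuple):
    """Column store unit: unique name, value and timestamp."""
    name: str
    value: str
    timestamp: int


# Column store: a row is a group of columns; rows may hold different columns.
column_row = [
    Column("customer", "ABCDE", 1476094210),
    Column("amount", "10.50", 1476094210),
]

# Graph: nodes carry properties, edges carry the relationships between nodes.
nodes = {"c1": {"type": "customer"}, "o1": {"type": "order"}}
edges = [("c1", "placed", "o1")]

print(sorted(documents[0]) == sorted(documents[1]))  # False: structures differ
```

The three products compared in this paper all fall into the document store category, which is why the tests in Section 5 use a single JSON document as the data model.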
4. Feature comparison
Three different products are compared in this work: Couchbase Server 4.6.0, MongoDB Enterprise 3.4 and RethinkDB 2.3.5. These databases were selected because they are representative of the most used NoSQL databases nowadays according to different studies and DB usage rankings [18–21]. These products share the same classification: NoSQL database, JSON document storage, high performance and scalability. They are listed in Forrester's Document Stores report among the viable document stores to choose from [18]. MongoDB and Couchbase also appear in Gartner's Magic Quadrant for Operational Database Management Systems [19]. Besides their similarity and representativeness among the most used NoSQL databases, they have native support for the Ubuntu Linux distribution, which provides the same baseline for a homogeneous deployment. Table 1 provides a feature comparison among the selected NoSQL databases. The most complete version of each product was used to analyse their features. In general, all databases have the main expected features. However, some differences can be noticed:
Table 1. Comparison of NoSQL databases features.
Feature | Couchbase Server | MongoDB Enterprise | RethinkDB
Storage type | Document store (JSON), Key/value store | Document store (JSON) | Document store (JSON)
Open source | Yes | Yes | Yes
Operating systems | Linux, OS X, Windows | Linux, OS X, Windows, Solaris | Linux, OS X, Windows
Drivers | C, .NET, Java, JavaScript, PHP, Python | C, C++, .NET, Java, JavaScript, Motor, Perl, PHP, Python, Ruby, Scala | Java, JavaScript, Python, Ruby
Query language | Yes – N1QL | Yes – JSON | Yes – ReQL
Replication | Yes | Yes | Yes
Scalability / Load balancing | Yes | Yes | Yes
Sharding | Yes | Yes | Yes
Aggregation | Yes | Yes | Yes
MapReduce | Yes | Yes | Yes
Administrative UI | Yes | Yes (needs to be installed separately) | Yes
Security | Authentication, TLS and data encryption | Authentication, TLS and data encryption | Authentication and TLS
• Only Couchbase Server supports the key/value data structure.
• Couchbase and RethinkDB support fewer programming languages than MongoDB.
• Couchbase offers a declarative query language called N1QL [22]. N1QL allows users to query documents in a language similar to SQL, with the power to sort, filter and group data in a single query.
• MongoDB does not have an Administration UI by default. This feature is part of a separate product called Compass, which is available only for paid versions of MongoDB [23].
• RethinkDB does not support encryption of data at rest, only authentication and encryption of data in transit (TLS).
5. Performance experiments design
This section presents the details of the performance evaluation experiments, considering the real-time (or near real-time) demands of newer database applications, as well as the characteristics of big data that need to be handled by these systems [1]. In terms of infrastructure, all databases were installed and tested on the same server. This approach avoids network latency and results in a fairer comparison, since the databases are installed in the same environment with the same characteristics. Apache JMeter was used to run the tests and to collect the results. It was also installed on the same machine. JMeter is open-source software designed to execute load tests and measure performance [24].
5.1. Environment configuration
The machine used in the tests has the following configuration:
• CPU: Intel® Xeon® CPU 2.20 GHz, 64-bit, 8 cores
• RAM: 30 GB
• HD: 10 GB SSD
• Operating system: Ubuntu 16.04.1 LTS
This server was hosted on Google Cloud Platform infrastructure, in the form of IaaS (Infrastructure as a Service). However, all tests were executed and the results collected locally, so that non-deterministic network problems did not affect the tests.
5.2. Tests methodology
The tests include the performance assessment of insert, update, delete and retrieve operations using the same data model for all databases. The data schema used in the tests consists of a JSON document with the following structure:
{
  "id": "999999",
  "number": "9999999",
  "date": "10/10/2016 10:10:10",
  "customer": "ABCDE ABCDE ABCDE ABCDE",
  "amount": "999999.99"
}
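For readers who want to reproduce input of this shape, the sketch below builds documents following the structure above, with a sequential ID and random values for the remaining fields. In the study itself the request data was produced by JMeter, so this is only an approximation: the function name `make_document`, the value ranges and the day/month order of the date format are assumptions, not details taken from the experiments.

```python
import random
import string
from datetime import datetime, timedelta


def make_document(seq_id: int, rng: random.Random) -> dict:
    """Build one test document: sequential ID plus random field values.

    Every value is a string, mirroring the sample structure in the paper.
    """
    # Four random five-letter words, like "ABCDE ABCDE ABCDE ABCDE".
    customer = " ".join(
        "".join(rng.choices(string.ascii_uppercase, k=5)) for _ in range(4)
    )
    # Random time of day on a fixed date; dd/mm/yyyy order is an assumption.
    when = datetime(2016, 10, 10) + timedelta(seconds=rng.randrange(86400))
    return {
        "id": str(seq_id),
        "number": str(rng.randrange(10_000_000)),
        "date": when.strftime("%d/%m/%Y %H:%M:%S"),
        "customer": customer,
        "amount": f"{rng.uniform(0, 999999.99):.2f}",
    }


if __name__ == "__main__":
    rng = random.Random(2017)  # fixed seed so runs are repeatable
    for i in range(1, 4):
        print(make_document(i, rng))
```

Using a seeded generator keeps the documents identical across runs, which helps when the same data set has to be sent to several databases.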
In order to have the same test scenario for all databases, a JavaScript REST API was created for each product. Node.js [25] was used as the runtime environment and Express [26] as the web framework. The APIs use the product SDK to connect to and run commands against each database system. The APIs' methods expect the data model above, and the following methods were created in all APIs:
• POST: inserts a single document into the database
• PATCH: updates a single document
• GET by Id: retrieves a single document according to its Id
• GET: retrieves all documents (limited to 100 documents)
• DELETE: removes a single document
The APIs do not have any special implementation, such as data validation or transformation; they only receive the data from the HTTP requests, and send the data to or retrieve it from the database. The requests are sent to the APIs by Apache JMeter, which generates a sequential ID and random data for each request. Four indicators are extracted from the tests:
• Average: the average response time over all requests;
• Median: the statistical median of the requests' response times;
• 90th percentile: the response time within which 90% of the performed requests completed; and,
• Throughput: the number of requests that can be handled by the server. This indicator shows how many requests the server supports in a given period.
The performance of each database was evaluated in two test iterations. The first one is a single thread scenario aiming to check how fast the server responds to a certain operation. In this case, one user sends one request per second, repeating this process 100 times, totalling 100 requests per operation and product. Considering that the goal of this iteration is only to evaluate the database's responsiveness, it uses only one thread. The second iteration is a multiple threads scenario, which evaluates the database performance in a load test scenario. In this case, 1000 users send requests within a 1-second period, and this process is repeated five times, totalling 5000 requests per operation and product. The goal of this second test scenario is to verify how the databases behave with several requests being sent to the server in a short period of time. For this reason, multiple threads are used in order to have a scenario closer to real-world usage, where several users can be requesting data at the same time.
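As an illustration of how the four indicators relate to the raw measurements, the following Python sketch computes them from a list of response times. It is not the code used in the study, where JMeter produced these values; the function names are hypothetical, and the nearest-rank percentile method used here is an assumption that may differ slightly from JMeter's own calculation.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(response_times_ms, total_time_ms):
    """Return the four indicators for one test run.

    response_times_ms: one response time per request, in milliseconds.
    total_time_ms: wall-clock duration of the whole run, in milliseconds.
    """
    return {
        "average_ms": statistics.mean(response_times_ms),
        "median_ms": statistics.median(response_times_ms),
        "p90_ms": percentile(response_times_ms, 90),
        "throughput_rps": len(response_times_ms) / (total_time_ms / 1000),
    }


if __name__ == "__main__":
    # Made-up sample: nine fast requests and one slow outlier.
    samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(summarize(samples, total_time_ms=2000))
```

In this made-up sample the single slow request raises the average to 14.5 ms while the median stays at 5.5 ms and the 90th percentile at 9 ms, which shows why a percentile is less sensitive to outliers than the average.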
6. Performance results
In this section, the performance results of the two test iterations are presented. The results are separated by data operation and by the kind of executed test.
6.1. POST operation
The tests for the POST operation consist of inserting documents into the database. Each document receives a unique and sequential ID, which is used in subsequent tests for retrieving, updating and deleting the same documents. Test results are presented in Figures 1–3. Figure 1 presents the response time results for the single thread scenario, while Figure 2 presents the results for the multiple threads scenario. Figure 3 reports the throughput results for the multiple threads scenario.
Figure 1. Response time for POST operation in a single thread scenario.
Figure 2. Response time for POST operation in a multiple threads scenario.
Figure 3. Throughput (number of requests per second) for POST operation in a multiple threads scenario.
6.2. PATCH operation
The tests for the PATCH operation consist of updating the documents previously inserted into the database. As each document received a unique and sequential ID, the same IDs are used to update all documents (one by one). The results for these tests are presented in Figures 4–6, following the same sequence as in the previous subsection.
Figure 4. Response time for PATCH operation in a single thread scenario.
Figure 5. Response time for PATCH operation in a multiple threads scenario.
Figure 6. Throughput (number of requests per second) for PATCH operation in a multiple threads scenario.
6.3. GET by ID operation
In the GET by ID operation tests, the documents existing in the database are retrieved by their IDs. Following the same sequence of presentation, the results for this operation are presented in Figures 7–9.
Figure 7. Response time for GET by ID operation in a single thread scenario.
Figure 8. Response time for GET by ID operation in a multiple threads scenario.
Figure 9. Throughput (number of requests per second) for GET by ID operation in a multiple threads scenario.
6.4. GET operation
In the GET operation test, a set of documents is retrieved from the database. Unlike the GET by ID test, this one does not query for a specific document; instead, the API returns all database items, limited to 100 documents. The obtained results are presented in Figures 10–12.
Figure 10. Response time for GET operation in a single thread scenario.
Figure 11. Response time for GET operation in a multiple threads scenario.
Figure 12. Throughput (number of requests per second) for GET operation in a multiple threads scenario.
6.5. DELETE operation
The tests for the DELETE operation consist of removing the previously inserted documents from the database. As each document received a unique and sequential ID, the same IDs are used to delete all documents (one by one). The results for the DELETE operation are presented in Figures 13–15.
Figure 13. Response time for DELETE operation in a single thread scenario.
Figure 14. Response time for DELETE operation in a multiple threads scenario.
Figure 15. Throughput (number of requests per second) for DELETE operation in a multiple threads scenario.
7. Discussion
This section discusses the main findings from the results reported in Section 6. For response time, the discussion is based on the 90th percentile (lower is better), in order to remove outliers while still considering most of the sampled data. For server throughput, the number of requests per second is considered (higher is better).
First, the single thread scenario is analyzed. When inserting documents (POST operation), Couchbase and MongoDB were the fastest, taking 4 ms to process each request. The same pattern was observed when updating documents (PATCH operation): Couchbase and MongoDB took 2 ms per request. The retrieval tests produced different results depending on whether one or multiple documents were read per request. When reading one document per request (GET by ID operation), the fastest database was Couchbase, with a 1 ms response time; when reading multiple documents per request (GET operation), MongoDB scored best, with a 3 ms response time. In contrast, Couchbase was far worse at returning multiple documents, with a 22 ms response time. In the removal test (DELETE operation), Couchbase and MongoDB showed the best response time: 2 ms.
The analysis then moves to the multiple threads scenario, aiming to verify how the databases behave in a high-demand scenario. In this test, both the response time and the throughput are analyzed. Throughput is calculated as the number of requests divided by the test's total time. For example, the POST operation test took 2913 ms for Couchbase, so 5000 requests / 2.913 s ≈ 1716 requests per second. In the multiple threads scenario, MongoDB presented the best results when inserting documents (POST operation): 1882 requests/s and 388 ms response time. However, in the update tests (PATCH operation), Couchbase was better, with 1868 requests/s and 428 ms response time, while MongoDB was far worse, with 839 requests/s and 1052 ms response time. The retrieval operations again produced different results for one and for multiple documents. When retrieving one document (GET by ID operation), Couchbase and RethinkDB scored the best results, with Couchbase slightly ahead: 2798 requests/s and 195 ms response time. However, the retrieval of several documents (GET operation) was faster with MongoDB, at 2212 requests/s and 307 ms response time, while Couchbase was much worse, at 361 requests/s and 2992 ms response time. A possible explanation for the poor Couchbase results when reading multiple documents is the use of Couchbase's custom query language, N1QL. It makes queries much easier to write, as it is close to regular SQL, but its performance depends on the index configuration, and query tuning techniques could perhaps improve the results [22]. Moreover, the removal tests (DELETE operation) showed the highest throughput for Couchbase: 2043 requests/s and 392 ms response time. RethinkDB scored a better 90th-percentile response time, 357 ms, because it had a few requests with very low response times; at the 99th percentile, however, the response time of RethinkDB was 607 ms.
With these results, it is possible to conclude that Couchbase had a better performance in most of the tests, for both single and multiple threads scenarios. The exceptions were the retrieval of multiple documents and the POST operation with multiple threads, tests in which MongoDB had much better results than Couchbase. Depending on the specific requirements of the user application, i.e. depending on which type of operation will occur more often, these results can guide the choice of the most suitable NoSQL database.
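The throughput arithmetic used in this discussion, and the way a 90th percentile can look favourable while the 99th percentile does not, can be reproduced with the short Python sketch below. The throughput call uses the Couchbase POST figures reported above; the list of response times is synthetic, invented only to illustrate the percentile effect, and is not measured data from the study. The function names and the nearest-rank percentile method are assumptions.

```python
import math


def throughput_rps(requests: int, total_time_ms: float) -> float:
    """Requests per second, given a request count and a duration in ms."""
    return requests / (total_time_ms / 1000.0)


def percentile(samples, pct):
    """Nearest-rank percentile of a list of response times."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Reported figures: 5000 POST requests handled by Couchbase in 2913 ms.
    print(round(throughput_rps(5000, 2913)))  # 1716

    # Synthetic data (not from the study): 92 requests at 300 ms and
    # 8 slower requests at 600 ms.
    synthetic = [300] * 92 + [600] * 8
    print(percentile(synthetic, 90))  # 300: the slow tail is not visible
    print(percentile(synthetic, 99))  # 600: the slow tail shows up
```

When fewer than 10% of the requests are slow, the 90th percentile does not reflect them at all, so a higher percentile gives a complementary view of the same run.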
8. Conclusion
This paper presented a feature and performance comparison among three NoSQL databases that are extensively used in the market nowadays: Couchbase, MongoDB and RethinkDB. Regarding the performance comparison, two iterations of tests were performed: single thread and multiple threads scenarios. For both iterations, tests were performed to insert (POST operation), update (PATCH operation), retrieve (GET by ID and GET operations) and delete (DELETE operation)
documents. The test results were evaluated using the requests' response time and the server's throughput. JMeter was the tool used to run the tests and to collect the results. With the acquired results, the conclusion is that Couchbase had a better performance for most of the operations in the test scenarios, except for retrieving multiple documents and the POST operation with multiple threads, in which MongoDB was faster. For future work, the tests could be applied to other NoSQL database solutions, other operations or features could be tested, or a different setup could be used, such as a distributed database or environment. Another interesting direction for future investigation is the impact of security features on the performance of NoSQL databases.
Disclosure statement
No potential conflict of interest was reported by the authors.
ORCID Diogo Augusto Pereira
http://orcid.org/0000-0002-1235-408X
References
[1] Abawajy J. Comprehensive analysis of big data variety landscape. Int J Parallel, Emergent Distrib Syst. 2015;30:5–14.
[2] Harrison G. “Google, big data, and hadoop” in next generation databases. New York (NY): Apress; 2015. p. 21–37.
[3] Harrison G. The “NewSQL” in next generation databases. New York (NY): Apress; 2015. p. 16.
[4] Michael K, Miller KW. Big data: new opportunities and new challenges. IEEE Comput Soc; 2013;46. Available from: http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=6527234&punumber=2.
[5] Couchbase. NoSQL database. [cited 2016 Nov]. Available from: http://www.couchbase.com.
[6] MongoDB. MongoDB for GIANT ideas. [cited 2016 Nov]. Available from: https://www.mongodb.com.
[7] RethinkDB. The open-source database for the realtime web. [cited 2016 Nov]. Available from: https://www.rethinkdb.com.
[8] Moniruzzaman ABM, Hossain SA. NoSQL database: new era of databases for big data analytics – classification, characteristics and comparison. Int J Database Theory Appl. 2013;6. Available from: http://www.sersc.org/journals/IJDTA/vol6_no4.php.
[9] PostgreSQL. The world’s most advanced open source database. [cited 2016 Nov]. Available from: https://www.postgresql.org.
[10] Jung MG, Youn SA, Bae J, et al. A study on data input and output performance comparison of MongoDB and PostgreSQL in the Big Data environment. 8th International Conference on Database Theory and Application (DTA), Jeju, 2015, p. 14–17.
[11] Roopak KE, Rao KSS, Ritesh S, et al. Performance comparison of relational database with object database (DB4o). 2013 5th International Conference on Computational Intelligence and Communication Networks (CICN), Mathura, 2013, p. 512–515.
[12] MySQL. [cited 2016 Nov]. Available from: https://www.mysql.com.
[13] Paterson J, Edlich S, Hörning H, et al. The definitive guide to db4o. New York (NY): Apress; 2006.
[14] Wang Y, et al. The performance survey of in memory database. 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), Melbourne, VIC, 2015, p. 815–820.
[15] VoltDB. The world’s fastest, in-memory operational database. [cited 2016 Nov]. Available from: https://www.voltdb.com.
[16] SQLite. [cited 2016 Nov]. Available from: https://sqlite.org.
[17] Harrison G. “Document databases” in next generation databases. New York (NY): Apress; 2015. p. 53–63.
[18] Forrester. The Forrester Wave™: document stores, Q3 2016. [cited 2016 Nov]. Available from: https://reprints.forrester.com/#/assets/2/363/'RES125581'/reports.
[19] Gartner. Magic quadrant for operational database management systems. [cited 2016 Nov]. Available from: https://www.gartner.com/doc/reprints?id=1-3JFMOQ2&ct=161006&st=sb.
[20] DB-Engines. Knowledge base of relational and NoSQL database management systems. [cited 2016 Nov]. Available from: http://db-engines.com/en/ranking.
[21] TechWorm. Top 5 NoSQL databases of the last year. [cited 2016 Nov]. Available from: http://www.techworm.net/2016/04/top-nosql-databases-last-year.html.
[22] Couchbase N1QL. N1QL (SQL for JSON) – database query language. [cited 2016 Nov]. Available from: http://www.couchbase.com/n1ql.
[23] MongoDB. MongoDB compass. [cited 2016 Nov]. Available from: https://www.mongodb.com/products/compass.
[24] Apache JMeter. User’s manual glossary. [cited 2016 Nov]. Available from: https://jmeter.apache.org/usermanual/glossary.html.
[25] Node.js. [cited 2016 Nov]. Available from: https://nodejs.org.
[26] Express. Node.js web application framework. [cited 2016 Nov]. Available from: http://expressjs.com.