Performance Evaluation of Unstructured NoSQL data ...

Performance Evaluation of Unstructured NoSQL data over distributed framework

Mr. Suyog S. Nyati
Dept. of Computer Engineering, PICT, Pune, Maharashtra, India. [email protected]

Mr. Shivanand Pawar
Augment IQ data science, Pune, Maharashtra, India. [email protected]

Dr. Rajesh Ingle
Dept. of Computer Engineering, PICT, Pune, Maharashtra, India. [email protected]

Abstract— Many organizations depend on structured databases such as Oracle and MySQL, but these systems do not meet every organization's requirements for scalability, availability and so on. As the volume of data and the number of requests grow, a structured database cannot handle them efficiently. One solution is to move data centers to unstructured NoSQL databases. In this paper we describe a few unstructured NoSQL databases and present a performance analysis of MongoDB. We compare insertion time across databases, and search time for different numbers of threads on databases with different numbers of entries. This work also studies the importance of sharding and cluster configuration for MongoDB.

Keywords— Distributed Database, Clustering, NoSQL.

I. INTRODUCTION

Today the structured database is the most widely used kind of database. Most new organizations choose a structured database for their production data center when they start out. A structured database is the best solution for critical transaction management, but many companies that do not deal with critical transactions still use one. They then face problems of scalability, availability, robustness, etc., which structured databases do not address efficiently. Additionally, a structured database cannot automatically partition data, which further limits scalability. Unstructured NoSQL databases offer a solution to these problems. There are many of them, such as Hive, HBase, Hypertable, Lucene, HSearch, Pig, Cassandra and MongoDB [1]. A few of these, such as Hive, Cassandra and Pig, support indexing, joins, aggregation and collation, and are additionally SQL-like.

In this paper we give a brief overview of unstructured and semi-structured NoSQL databases, with their strengths and weaknesses and a comparison with RDBMS, and then present our search results on MongoDB. The rest of the paper is organized as follows. Section II describes NoSQL databases. Section III explains the benefits of MongoDB. Section IV gives our experimental setup. Section V presents the evaluation results. Finally, Sections VI and VII give future work and the conclusion, respectively.

II. NOSQL DATABASES

NoSQL systems have been attracting growing interest. NoSQL databases work in distributed environments and are typically open source, which makes it possible to customize the data management framework. They also provide distributed-system features such as scalability, replication and robustness, which structured databases do not satisfy efficiently. In addition, they are often faster than traditional databases. Most NoSQL databases use the map-reduce model, in which a map function processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function merges all intermediate values associated with the same intermediate key.

NoSQL systems implement a key-to-value map at their core, so they behave much like a hash table: every type of NoSQL database consists of keys and their values. There is no standard form of storage in NoSQL databases, so no exhaustive classification exists. Nevertheless, there are some common storage types: key-value stores, column stores, document stores and graph databases.

In a key-value store, each value is stored under a key, much as in a hash table. Data is accessed through methods such as PUT, GET and DELETE. This is a simple form of storage with a simple replication scheme, which makes such databases faster than an RDBMS. Access is strictly limited to the primary key. Redis, Facebook's Cassandra, Amazon's Dynamo and Voldemort [1] are popular key-value store implementations.

A column store is a more structured storage format. A single data entry, called a row, is addressed by its key. Unlike in an RDBMS, rows can have different sets of columns, and these can change dynamically. The keys are kept in sorted order. Google's BigTable [2], Hypertable, Hadoop's HBase [1] and Yahoo's PNUTS [3] are popular column store implementations.
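The map-reduce model mentioned earlier in this section can be illustrated with a small in-memory sketch. It is not taken from any of the databases discussed here: the names `map_fn`, `reduce_fn` and `map_reduce` and the word-count task are assumptions chosen only to show how intermediate key/value pairs are grouped by key before being merged.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: emit an intermediate (word, 1) pair for every word in a document."""
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share the same key."""
    return key, sum(values)


def map_reduce(records, map_fn, reduce_fn):
    # Shuffle step: group intermediate values by their intermediate key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce step: one call per distinct intermediate key.
    return dict(reduce_fn(k, vs) for k, vs in groups.items())


# Hypothetical input: (document id, text) pairs.
docs = [("doc1", "NoSQL scales out"), ("doc2", "NoSQL replicates data")]
counts = map_reduce(docs, map_fn, reduce_fn)
```

In a real distributed system the map and reduce calls would run on different nodes and the grouping step would move data across the network; here everything runs in one process.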

A document store supports the storage of (semi-)structured data. It uses the JavaScript Object Notation (JSON) format to store data. This allows complex data structures such as arrays and hashes to be represented as a single database entry and retrieved in one read operation. CouchDB [4] and MongoDB [5] are popular document store databases.

Graph databases use nodes and edges as their main storage elements. They can store sets of element references in a single entry field value, and they use typical graph algorithms to optimize performance. Neo4J and HyperGraphDB [6] are popular graph databases.

As discussed earlier, these unstructured NoSQL databases support some RDBMS-like features but also have weaknesses compared to RDBMS. James P. McGlothlin et al. [1] describe the support of unstructured NoSQL databases for indexing, caching, joins and aggregation, and additionally show their similarity to SQL. We adapt their work in Table I. In Table I, X* indicates that the feature is not natively supported but can be simulated in the implementation.

TABLE I.

COMPARISON OF DISTRIBUTED DATABASES

[Table: rows list the databases Hive, Pig, Redis, Hypertable, Project Voldemort, Riak Core + Search, HSearch, HBase, Lucene, Cassandra 0.75, Cassandra 1.1.6 and MongoDB; columns list the features Indexing, Caching, SQL-Like, Joins and Aggregations; each cell marks the feature as supported, not supported (X), or not natively supported but possible to simulate (X*).]

A few of these databases resemble structured databases. They support ACID properties to some extent, which can be helpful for managing less critical tasks, and those ACID properties can be achieved without compromising scalability: Zhou Wei et al. [7] briefly describe how ACID properties can be achieved together with scalability in a distributed environment. In cloud or distributed databases, data consistency is usually relaxed, i.e. they disallow any consistency rule across data partitions, which provides little support for transactions. For example, Amazon's SimpleDB [8] and Cassandra provide eventual consistency, in which data updates become visible after a nondeterministic amount of time. Bigtable, SimpleDB and PNUTS support transactions over single data items only, which may not be sufficient to provide strong consistency. The Scalaris system was one of the transaction management systems for distributed databases, but it cannot support durability because it is a purely in-memory system. CloudTPS [7] overcomes this by checkpointing data updates into a cloud data service, which makes the data durable. Because CloudTPS checkpoints data and stores it in the cloud, the I/O time reduces performance. For this case, Giuseppe Ottaviano [9] proposed semi-indexing of semi-structured documents, which may help to speed up access to stored data.

III. BENEFITS OF MONGODB

Of all these databases we chose MongoDB for evaluation. MongoDB is a document store database that stores data as JSON objects. It is written in C++ and provides support for queries. We chose MongoDB because it can manage small as well as large data efficiently, whereas in Hadoop a single node performs faster than a multi-node cluster for small data and the reverse holds for large data [10], which means a single-node or multi-node cluster has to be set up manually. In addition, MongoDB is a semi-structured database. It can hold structured data in JSON format: tables are called collections, and rows are stored as JSON objects called documents, so MongoDB can be used to store a structured database. MongoDB also supports queries on this stored data, so it can be a good replacement for a structured database.

IV. EXPERIMENTAL SETUP

A. Setup for performance evaluation of MongoDB

In the evaluation we perform searches by running single as well as multiple threads against a MongoDB cluster. The cluster consists of four machines: machine 1 has 16 GB RAM and a Core i3 processor, and machines 2, 3 and 4 have 12 GB RAM and 2.2 GHz Core 2 Duo processors. The OS used for the evaluation is Red Hat. MongoDB supports sharding at the replica set level, and each secondary node holds a full copy of the primary node's data [11]. We create five shards over the four machines; each shard has a replica that holds the whole data of that shard. Sharding helps us achieve scalability.

B. Setup for comparison of MongoDB with MySQL

Here we use a single machine, i.e. a single node: an Intel Core i5 at 2.90 GHz with 8 GB RAM, running Fedora with Linux kernel 3.7.9-104.fc17.x86_64.

V. EVALUATION RESULTS

First we evaluate the performance of MongoDB on the cluster with 5 crore (50 million) entries in the database; then we compare the performance of MongoDB with MySQL on a single node.

A. Performance of MongoDB over large dataset

The following tables give the results of tests run on the cluster as well as on individual machines, with a number of threads running in parallel. Tables II to VII show the search time for queries over 5 crore entries. A call is a repeated execution of the same query with a different value. Table II shows the result for a single thread on individual nodes; Table III for a single thread on the complete cluster with 2000 and 25000 calls; Table IV for 3 threads on the cluster with 2000 calls; Table V for 5 threads with 2000 calls; Table VI for 6 threads with 5000 calls; and Table VII for 20 threads with 5000 and 7000 calls.

TABLE II. SEARCHING TIME OF SINGLE THREAD ON SINGLE MACHINE

Used nodes    No. of times performed    No. of calls    Average time (ms)
Machine 1     140                       5000            923
Machine 2     140                       5000            1735

In Table II we use an individual machine and perform 5000 calls with a single thread. We do this 140 times and take the average search time.

TABLE III. SEARCHING TIME OF SINGLE THREAD ON ALL NODES

Used nodes    No. of times performed    No. of calls    Average time (ms)
All           163                       2000            188
All           22                        25000           5997

In Table III we use all machines and perform 2000 and 25000 calls with a single thread, 163 and 22 times respectively. The search time is the average over all performed calls.

TABLE IV. SEARCHING TIME OF 3 THREADS ON ALL NODES

Used nodes    No. of times performed    No. of calls    Average time (ms)
All           163                       2000            221

In Table IV we use all machines and perform 2000 calls with three threads, 163 times. The search time is the average over all performed calls.

TABLE V. SEARCHING TIME OF 5 THREADS ON ALL NODES

Used nodes    No. of times performed    No. of calls    Average time (ms)
All           163                       2000            260

In Table V we use all machines and perform 2000 calls with five threads, 163 times. The search time is the average over all performed calls.

TABLE VI. SEARCHING TIME OF 6 THREADS ON ALL NODES

Used nodes    No. of times performed    No. of calls    Average time (ms)
All           610                       5000            331

In Table VI we use all machines and perform 5000 calls with six threads. The search time is the average over all performed calls.

TABLE VII. SEARCHING TIME OF 20 THREADS ON ALL NODES

Used nodes    No. of times performed    No. of calls    Average time (ms)
All           610                       5000            687
All           455                       7000            3287

In Table VII we use all machines and perform 5000 and 7000 calls with 20 threads, 610 and 455 times respectively. The search time is the average over all performed calls.

Summarizing the tables above in a single table gives the values in Table VIII.

TABLE VIII. SUMMARY OF ALL SEARCHES

No. of calls    Threads    Time required (ms)
2000            1          188
2000            3          221
2000            5          260
5000            6          331
5000            20         687
7000            20         3287
25000           1          5997
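The measurement procedure behind these tables (a number of threads issuing repeated calls, timed and averaged over several runs) could be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how calls are divided among threads, so the sketch splits them round-robin, and `run_once`, `benchmark` and `fake_query` are hypothetical names, with a dictionary lookup standing in for a real database search.

```python
import threading
import time
from statistics import mean


def run_once(run_query, calls, threads):
    """Time one run in which `threads` workers together issue `calls` queries."""
    def worker(values):
        for v in values:
            run_query(v)

    # Assumption: the calls are split round-robin across the threads.
    chunks = [range(i, calls, threads) for i in range(threads)]
    workers = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    start = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return (time.perf_counter() - start) * 1000.0  # elapsed milliseconds


def benchmark(run_query, calls, threads, repetitions):
    """Average elapsed time in ms over `repetitions` runs."""
    return mean(run_once(run_query, calls, threads) for _ in range(repetitions))


# Stand-in for a database search: a lookup in an in-memory dictionary.
data = {i: str(i) for i in range(10_000)}
seen = []


def fake_query(value):
    seen.append(data.get(value))  # list.append is thread-safe in CPython


avg_ms = benchmark(fake_query, calls=2000, threads=3, repetitions=5)
```

Replacing `fake_query` with a function that queries the cluster would reproduce the shape of the experiment, though not necessarily the authors' exact timing method.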

In Table VIII, for 5000 calls (i.e. with the number of calls held constant), we require 331 ms and 687 ms for 6 and 20 threads respectively. The difference between 687 and 331 is 356, and the number of threads increases from 6 to 20, i.e. by a factor of 20/6 ≈ 3.33. Dividing the increase in time by the increase in threads gives 356/(20/6) = 106.8, i.e. approximately 107. We call this 107 the impact of threads.

Similarly, for 20 threads (i.e. with the number of threads held constant), 5000 and 7000 calls require 687 ms and 3287 ms respectively. The difference between 3287 and 687 is 2600, and the number of calls increases by a factor of 1.4 (7000/5000). Dividing the increase in time by the increase in calls gives 2600/1.4 = 1857.14, i.e. approximately 1857. We call this 1857 the impact of calls.

If we take the thread impact and the call impact together as the total impact, the thread impact is approximately 5% of the total and the call impact approximately 95%. So, giving more importance to calls than to threads, we show the result set in Fig. 1.
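The impact calculation can be checked with a few lines of Python. The helper `impact` is a hypothetical name for the ratio used in the text (increase in time divided by the factor by which the load grew), and the inputs are the values from Table VIII. The code computes the thread factor directly as 20/6 ≈ 3.33, so the thread impact comes out near 107; the 5% and 95% shares hold regardless.

```python
def impact(time_before_ms, time_after_ms, load_before, load_after):
    """Increase in search time divided by the factor by which the load grew."""
    return (time_after_ms - time_before_ms) / (load_after / load_before)


# Values from the summary table (Table VIII).
thread_impact = impact(331, 687, 6, 20)      # 5000 calls, 6 -> 20 threads
call_impact = impact(687, 3287, 5000, 7000)  # 20 threads, 5000 -> 7000 calls

# Share of each impact in the combined total, in percent.
total = thread_impact + call_impact
thread_share = 100 * thread_impact / total
call_share = 100 * call_impact / total

print(f"thread impact: {thread_impact:.1f} ({thread_share:.1f}% of total)")
print(f"call impact:   {call_impact:.1f} ({call_share:.1f}% of total)")
```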

Fig. 1. No. of calls vs. average search time required

The above analysis shows that performance degrades rapidly after 5000 calls. We conclude that as the number of calls, i.e. the load, increases, performance on big data degrades rapidly beyond some threshold. Additionally, Tables II to VII show that searching takes more time on a single machine than on the cluster, so sharding and cluster configuration also play an important role in the performance of MongoDB. We can reduce search time and improve performance by sharding and configuring the cluster well.

B. MySQL vs MongoDB

Here we compare two databases, MongoDB and MySQL, on insertion and searching. We insert 500,000 records, each with 28 columns, into MongoDB and MySQL. MySQL takes almost 1,606,499 ms, i.e. 1,606 seconds, whereas MongoDB takes only 17,860 ms, i.e. approximately 18 seconds, as shown in Table IX. This shows that MongoDB is much faster than MySQL at insertion.

TABLE IX. MYSQL VS. MONGODB INSERTION TIME

No. of entries    Time required by MySQL (ms)    Time required by MongoDB (ms)
500,000           1,606,499                      17,860

Fig. 2. Comparison of insertion time required for 500,000 records

We also perform searches on the inserted data. When searching 500,000 entries, MySQL takes on average 4 ms, 1374.5 ms and 621.75 ms to search on the primary key, on other columns without an index, and on other columns with an index respectively, as shown in Table X. MongoDB takes only 210.5 ms and 26.25 ms to search on other columns without an index and with an index respectively, as shown in Table XI. Figs. 2 and 3 summarize these tables.

TABLE X. SEARCHING TIME OF QUERY ON MYSQL

Searched on                    No. of entries    No. of individual queries    Average time (ms)
Primary key                    500,000           4                            4
Other columns without index    500,000           4                            1374.5
Other columns with index       500,000           4                            621.75

TABLE XI. SEARCHING TIME OF QUERY ON MONGODB OVER 500,000 ENTRIES

Searched on                    No. of entries    No. of individual queries    Average time (ms)
Other columns without index    500,000           4                            210.5
Other columns with index       500,000           4                            26.25
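The relative speedups implied by the insertion and search measurements can be worked out directly. The sketch below uses the in-text insertion time for MySQL (1,606,499 ms) and the average search times reported for the two databases; the `speedup` helper and the dictionary layout are illustrative choices, not part of the original experiment.

```python
# Times in milliseconds, taken from the text and Tables IX to XI.
insert_ms = {"mysql": 1_606_499, "mongodb": 17_860}
search_ms = {
    "without index": {"mysql": 1374.5, "mongodb": 210.5},
    "with index": {"mysql": 621.75, "mongodb": 26.25},
}


def speedup(times):
    """How many times faster MongoDB is than MySQL for one measurement."""
    return times["mysql"] / times["mongodb"]


insert_speedup = speedup(insert_ms)
search_speedups = {kind: speedup(t) for kind, t in search_ms.items()}

print(f"insertion: {insert_speedup:.1f}x")
for kind, factor in search_speedups.items():
    print(f"search {kind}: {factor:.1f}x")
```

Under these figures the insertion speedup is roughly 90x, against roughly 6.5x and 24x for searches without and with an index, which matches the observation that the gain is far greater for insertion than for searching.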

Fig. 3. No. of entries vs. average search time required for 500,000 records

This shows that MongoDB is much faster at insertion and also faster at searching; the speedup of MongoDB with respect to MySQL is far greater for insertion than for searching.

VI. FUTURE WORK

From the above, MongoDB may be a better replacement for a structured database like MySQL if we configure the cluster properly and provide proper joins on MongoDB. We are therefore working on joins for MongoDB, which will help to use MongoDB instead of structured databases like MySQL for better performance.

VII. CONCLUSION

This paper gives an overview of distributed databases and frameworks with their supported features. It also includes evaluation results for search time on a MongoDB cluster and shows how the configuration of MongoDB makes a difference in search performance. The analysis also leads us to conclude that increasing the number of query calls degrades performance rapidly beyond a particular threshold, which depends on the configuration. Sharding and configuration in MongoDB are therefore highly important for achieving higher throughput. In addition, the paper compares MongoDB with MySQL and shows that MongoDB is better than MySQL at insertion as well as searching.

REFERENCES

[1] James P. McGlothlin, Latifur Khan, "Scalable Queries For Large Datasets Using Cloud Computing: A Case Study", Proceedings of the 15th ACM Symposium on International Database Engineering & Applications, September 2011.
[2] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: a distributed storage system for structured data", in Proc. OSDI, 2006.
[3] B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni, "PNUTS: Yahoo!'s hosted data serving platform", in Proc. VLDB, 2008.
[4] The Apache Software Foundation, "CouchDB", http://couchdb.apache.org, 2012.
[5] 10gen, "MongoDB", http://www.mongodb.org/about/introduction, 2013.
[6] Christian von der Weth, Anwitaman Datta, "Multiterm Keyword Search in NoSQL Systems", Internet Computing, IEEE Computer Society, Jan 2012.
[7] Zhou Wei, Guillaume Pierre, Chi-Hung Chi, "CloudTPS: Scalable Transactions for Web Applications in the Cloud", Services Computing, IEEE Transactions, 2011.
[8] Amazon.com, "Amazon SimpleDB", http://aws.amazon.com/simpledb, 2010.
[9] Giuseppe Ottaviano, Roberto Grossi, "Semi-Indexing Semi-Structured Data in Tiny Space", Proceedings of the 20th ACM International Conference on Information and Knowledge Management, Oct 2011.
[10] Jeong Hyun Lee, "Log analysis system using Hadoop and MongoDB", http://www.cubrid.org/blog/dev-platform/log-analysis-system-using-hadoop-and-mongodb, Dec 2011.
[11] Jian Fang, "MongoDB features and comparisons with Cassandra and HBase", http://johnjianfang.blogspot.in/2012/04/mongodb-features-and-comparisons-with.html, April 2012.
