2015 1st International Conference on Next Generation Computing Technologies (NGCT-2015) Dehradun, India, 4-5 September 2015
An Efficient Storage Framework Design for Cloud Computing: Deploying Compression on a De-duplicated No-SQL DB using HDFS

Ajay Jangra (CSE Department, UIET, Kurukshetra University, Kurukshetra, India), Vandna Bhatia (CSE Department, Thapar University, Patiala, India), Upasana Lakhina (CSE Department, UIET, Kurukshetra University, Kurukshetra, India) and Niharika Singh (CSE Department, UIET, Kurukshetra University, Kurukshetra, India)

Abstract—Cloud computing is an alluring technology that provides elasticity, scalability and cost efficiency over a network. In recent years the database layer has become a crucial component of cloud computing. Rapid data growth and the need to keep data safe require organizations to rethink how they manage and use it. To meet these requirements, No-SQL has proven better suited than RDBMS because of its high scalability; to provide better availability, however, No-SQL databases hold a huge amount of redundant data. To address this emerging challenge and achieve a dependable and secure cloud storage service, several storage efficiency approaches are in use today. This paper proposes a novel design of an efficient storage framework for No-SQL databases in the cloud. The framework tackles the challenge of managing the huge amount of data at the back end of the cloud service provider. It uses the MapReduce paradigm of Hadoop together with the schema-less, document-oriented MongoDB. The outcome is improved storage efficiency, saving considerable storage space and network bandwidth during data transfer among multiple nodes.

Keywords: Cloud Computing; Storage Efficiency Technology; No-SQL; Deduplication; Compression; MapReduce.

I. INTRODUCTION

Cloud computing is a propitious Web-based mechanism that allows scaling and virtualization of IT resources, provided as a service over a network. The five essential characteristics that must be provisioned in cloud computing applications are: on-demand self-service, ubiquitous network access, resource pooling, rapid elasticity, and pay-per-use, utility-based computing [1].

By using cloud virtualized storage that is available on demand over a network, cloud users do not need to worry about storage capacity. The result is a substantial saving of money otherwise spent on storage, effectively unlimited capacity and simplicity of use: users pay only for the storage they consume, on a pay-as-you-go basis [2]. Handling this huge data storage and backup is a growing challenge that all cloud computing vendors face. There are many ways to surmount it, including on-premise storage or network devices, backup software, datacenter-based offsite services, and cloud-based or online solutions [3].

Though backup and storage are different cloud services addressing tangential needs, both can scale storage space to meet the amplifying requirements of users. They improve content management and can be integrated efficiently to back up business data, from servers to personal systems [4].

All the live unstructured or semi-structured data added to the cloud each day cannot be handled by a traditional RDBMS. To manage this enormous data, a new concept known as No-SQL (Not only SQL) has come into existence, with the objective of providing better scalability, elasticity and availability in the cloud network. A No-SQL database has a de-normalized structure with eventual consistency (BASE) instead of the traditional ACID properties, so new attributes can be added to data records dynamically [5]. It serves a significant and growing industry for big data and real-time web applications. No-SQL databases replicate and distribute partitioned data over many servers, resulting in a significant amount of redundant data.

II. STORAGE EFFICIENCY

Storage efficiency is a measure of the ability of a storage system to manage and store a large amount of data in the least possible space, with little or no impact on performance and at lower operational cost [7]. The storage efficiency percentage indicates how effectively the system addresses real-world requirements by managing the cost of handling data, reducing complexity and limiting risk.
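Although the paper does not state it as a formula, a common way to express this quantity, consistent with how the results in Section VII are computed, is

  \mathrm{savings}(\%) = \left(1 - \frac{S_{\mathrm{after}}}{S_{\mathrm{before}}}\right) \times 100

where S_before is the size of the data as received and S_after is the size actually stored after the efficiency techniques (deduplication and compression) have been applied.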
A. Storage Efficiency Technologies
For any organization, managing the growing amount of accumulated data is difficult, and the right management tools must be identified to increase the efficiency of the management environment [5]. Data storage administrators can apply several technologies to reduce and consolidate their data, as explained below.

Snapshot technology: A snapshot captures the state of the storage device at a particular instant and serves as a guide for restoring the device to that instant if a failure occurs later. It creates a copy of all the data at a point in time, which is treated as the original copy. The snapshot copy is created instantly and is made available to other applications for uses such as data analysis, data protection, error reporting and data replication [10].
Thin provisioning technology: The thin provisioning model allows storage to be consumed on demand by eliminating the pre-allocation of storage capacity. It improves storage utilization by occupying only the space actually used rather than the space allocated; the remaining space can be used by other applications [11].

Single instance storage: To provide better availability in the cloud, many identical files persist in the network, which increases the load on the file system. Single instance storage addresses this by identifying duplicate files and storing just a single copy on disk, along with references for all the duplicate entries. It is also known as file-level deduplication.

Data de-duplication technology: Data de-duplication takes single instance storage to a finer level. It identifies and removes duplicate blocks of data in cloud storage more efficiently. Blocks containing redundant data share storage blocks, yet the data remains in the same format as before deduplication [12], so the storage system can serve it with no extra processing before transferring it to the requesting host.

Data compression: Data compression improves storage efficiency by eliminating redundant binary information within a block of data. Compression applies algorithms such as deflate on write and inflate on read, so data is stored in a denser form than the original [10]. We use compression in daily life in formats such as JPEG for images and MP3 for audio; a small deflate/inflate sketch follows.
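As a stand-alone illustration (not taken from the paper) of deflate-on-write and inflate-on-read, the JDK's java.util.zip classes can be used. The sample input string is hypothetical, and a single 1 KB buffer is assumed to be enough for this tiny input.

import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class DeflateDemo {
  public static void main(String[] args) throws Exception {
    byte[] input = "redundant redundant redundant data".getBytes(StandardCharsets.UTF_8);

    // Compress (deflate) on write.
    Deflater deflater = new Deflater();
    deflater.setInput(input);
    deflater.finish();
    byte[] buf = new byte[1024];                 // one buffer suffices for this small input
    int compressedLen = deflater.deflate(buf);

    // Decompress (inflate) on read.
    Inflater inflater = new Inflater();
    inflater.setInput(buf, 0, compressedLen);
    byte[] restored = new byte[input.length];
    inflater.inflate(restored);

    System.out.println(input.length + " bytes -> " + compressedLen + " bytes compressed");
  }
}

Repeated byte patterns within the block are what DEFLATE exploits, which is why compression pairs well with deduplication across many users.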
Of the technologies above, deduplication and compression are the most effective when applied across multiple users, as is the case in the cloud.

III. LITERATURE SURVEY
Much work has been done on data management in cloud computing. Abadi [11] discussed the limitations and opportunities of deploying data management on emerging cloud computing platforms and concluded that a new DBMS, designed specifically for cloud environments, is needed. No-SQL databases overcome the limitations of SQL for handling large data. Hammes et al. [12] compared No-SQL and SQL on structured and unstructured data, testing both approaches side by side in the same cloud environment, and found No-SQL to be the better fit for cloud storage. However, No-SQL keeps a great amount of redundant data in order to provide better availability to cloud users.

Data deduplication is widely employed in cloud-based storage to improve storage efficiency and cost. Tseng et al. [13] applied deduplication to cluster the stored chunks, reducing the time spent excluding the false positives induced by a Bloom filter. Fu et al. [18] discussed a scalable inline cluster-based deduplication framework for cloud data centers, designed to exploit data similarity and locality in the backup stream using a handprint-based stateful routing algorithm. Compression is also used in the cloud for storage savings. Balachandran et al. [14] proposed using runs of hashes for file-recipe compression: if a run of hashes occurs twice in the data stream, it is replaced with the fingerprint of the first chunk and the length of the repeated sequence.

IV. USE OF NO-SQL DB STORAGE FOR CLOUD COMPUTING

Many applications throw their data into the cloud as flat text files. Storing this data in raw form is not efficient, both because of its sheer size and because it increases operating-system overhead in procuring and storing the data. A traditional RDBMS is not a suitable platform for such bulk data, whereas the MapReduce paradigm works well in these conditions. With the expansion and growing popularity of cloud computing, No-SQL databases are becoming the preferred choice for storing data in the cloud, and platforms such as Hadoop, Cassandra and MongoDB are well suited for storing and handling such databases. The main types of No-SQL database in use today are key-value stores, column-oriented DBs, document DBs and graph databases [14].

In this paper we deal with key-value stores and document-oriented databases. In the document-oriented data model, documents are stored with a key, and this key is used to retrieve the document. In the proposed framework, storage is done in document form to better handle larger objects, since key-value DBs were designed primarily for smaller objects.

V. PROPOSED FRAMEWORK

Fig. 1 shows the flow of the proposed framework in multiple layers. The data is first fetched from the Hadoop Distributed File System (HDFS), which is designed for storing very large data files with streaming data access patterns, running on clusters of many nodes and handling files that are hundreds of megabytes, gigabytes or terabytes in size [16], [30].

A typical data deduplication system can be divided into two parts: data segmentation and duplicate-data comparison. The system segments files into chunks and compares those chunks to find duplicate data (a small chunk-hashing sketch is given below). In the proposed framework, the redundant data in the file is first removed by applying deduplication using the MapReduce paradigm of Hadoop.

To store the original data in one place and produce a pointer table that locates its copies, the key-value pairs received as output of the MapReduce phase are collected. The key-value pairs, together with the pointer table, are then stored as a single document in the document-oriented database MongoDB.
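The segmentation step can be sketched as below. This is illustrative only: the paper derives variable-size chunks from metadata, whereas this sketch uses fixed 4 KB chunks, and the input path is hypothetical; the HDFS classes used are part of the standard Hadoop client API.

import java.io.InputStream;
import java.security.MessageDigest;
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChunkHasher {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);                 // connects to the configured HDFS
    Path input = new Path("/data/records.txt");           // hypothetical input file

    MessageDigest md = MessageDigest.getInstance("SHA-256");
    byte[] chunk = new byte[4096];                        // fixed-size chunks for simplicity

    try (InputStream in = fs.open(input)) {
      int n;
      while ((n = in.read(chunk)) > 0) {
        md.reset();
        byte[] fingerprint = md.digest(Arrays.copyOf(chunk, n));
        // Identical chunks yield identical fingerprints; the comparison step
        // uses these fingerprints to detect duplicates.
        System.out.printf("chunk of %d bytes -> %s%n", n, hex(fingerprint));
      }
    }
  }

  static String hex(byte[] bytes) {
    StringBuilder sb = new StringBuilder();
    for (byte b : bytes) sb.append(String.format("%02x", b));
    return sb.toString();
  }
}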
Fig. 1. Flow chart of the proposed framework.

Fig. 2. Proposed framework.
The reason for using MongoDB is its better scalability: a document may contain further nested documents with different key-value and key-array pairs, and document-based No-SQL storage is efficient and easy to handle [17]. Deduplication already preserves a considerable amount of storage space; to make storage still more efficient, compression is applied afterwards and the compressed file is saved back in HDFS. The file, after applying compression on the de-duplicated No-SQL DB, is then ready to be transferred to other nodes, which also saves bandwidth because of its smaller size. A sketch of the pointer-table document layout is given below.
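As an illustration of how the key-value pairs and the pointer table can be kept in a single MongoDB document, the sketch below uses the classic 2.x Java driver that was contemporary with MongoDB 2.6; the host, database, collection and field names, as well as the example hash and offsets, are hypothetical and not taken from the paper.

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class PointerTableStore {
  public static void main(String[] args) throws Exception {
    MongoClient client = new MongoClient("localhost", 27017);   // hypothetical host/port
    DB db = client.getDB("dedupstore");                         // hypothetical database name
    DBCollection chunks = db.getCollection("chunks");

    // One document per unique chunk: _id is the chunk hash, "data" holds the chunk itself,
    // and "pointers" lists the (file, offset) positions where the chunk reappears.
    BasicDBObject doc = new BasicDBObject("_id", "9f86d081...")  // example hash value
        .append("data", "example chunk contents")
        .append("pointers", java.util.Arrays.asList(
            new BasicDBObject("file", "records.txt").append("offset", 0),
            new BasicDBObject("file", "records.txt").append("offset", 8192)));
    chunks.insert(doc);

    client.close();
  }
}

Reading such documents back and expanding each pointer to its stored chunk reconstructs the original file, as described above.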
VI. IMPLEMENTATION PHASE
Input from HDFS, in the form of metadata containing various text files, is fetched. The metadata is the key element used for chunking the records into data-sets and for identifying duplicates effectively by comparing only identical data-set types. The variable-size chunks are then passed to the MapReduce phase, where the actual processing takes place.
1) Deduplication: Hadoop hashes all keys in the map step and ensures that all values belonging to the same key end up on the same reducer. Deduplication is done here at the file and sub-file level. First, a hash value is calculated for each block of data. The hash values are then compared: if the hash value turns out to be the same for two blocks, one copy is stored as an image and the other duplicates are replaced with a pointer to the object that is already in the database. When the chunking takes place, an index is created from the results of the MapReduce phase to find the duplicate data. (A mapper/reducer sketch follows this list.)

2) MapReduce paradigm: The input data is read sequentially and partitioned into a set of data-sets. The map function parses each record and, based on the metadata information, produces a sequence of key-value pairs which is stored as metadata. The deduplication itself is performed by the reducers; the master process receives a message from the reducer workers when deduplication is complete. Once the first layer of reducer workers is done, the next layer of reducer workers receives the intermediate data. The pointer table consists of the unique data and its corresponding pointers. In the proposed framework, to obtain better storage efficiency in less processing time, the intermediate key-value pairs produced by each map task are not stored individually; instead the data is kept in memory and transferred directly to the reducers. The same principle is applied between two reducers: the data is not re-transformed into key-value pairs but transferred as is and manipulated later by the reducer. Eliminating these extra transformation and manipulation steps for each pair of data-sets gives better execution speed.

3) Integration of the key-value pairs: MongoDB is used to store the result of the MapReduce phase because of its high performance and easy scalability. It suits a No-SQL schema, as it works on the concepts of collections and documents: a collection in MongoDB is a group of documents, and a document-based No-SQL database is basically a key-value database in which each record is stored as a document or object identified by a unique ID, provided here by the hashing step. The pointer table based on all key-value pairs is created by fetching all key-value pairs into a MongoDB document. All key-value pairs, together with the pointer table, are stored in a document-oriented file so that the original file can be recreated by combining them.
4) Compression in Hadoop: The GZIP compression codec [32], one of Hadoop's native libraries, is used for additional savings in storage space. GZIP is based on the DEFLATE algorithm, a combination of Huffman coding and LZ77, and the file can be decompressed again with only a little extra execution time.
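The mapper/reducer pair referenced in steps 1) and 2) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the class names (DedupJob, DedupMapper, DedupReducer), the SHA-256 fingerprint and the simple "POINTER" marker are assumptions; in the actual framework the pointer entries are kept in the MongoDB pointer table described in step 3).

import java.io.IOException;
import java.security.MessageDigest;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map step: emit (hash of chunk, chunk) so that identical chunks meet at one reducer.
  public static class DedupMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      try {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        byte[] digest = md.digest(value.copyBytes());
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        context.write(new Text(hex.toString()), value);
      } catch (Exception e) {
        throw new IOException(e);
      }
    }
  }

  // Reduce step: keep one copy per hash; further occurrences become pointer entries.
  public static class DedupReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text hash, Iterable<Text> chunks, Context context)
        throws IOException, InterruptedException {
      int copies = 0;
      for (Text chunk : chunks) {
        if (copies == 0) {
          context.write(hash, chunk);          // unique chunk, stored once
        }
        copies++;
      }
      if (copies > 1) {
        context.write(hash, new Text("POINTER x" + (copies - 1)));  // pointer-table entry
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "chunk deduplication");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(DedupMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}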
VII. EXPERIMENT AND EVALUATION
Hadoop and MongoDB are open-source platforms. For execution, Hadoop 2.6.0 and MongoDB 2.6.1 were installed on four systems running 32-bit Ubuntu 14.04 on dual-boot machines with 4 GB RAM and a 1 TB hard disk. One system acts as the name node, a second as the secondary name node, and the other two as data nodes in the distributed file system architecture. After configuring Hadoop and MongoDB, the native Hadoop compression libraries are put in place; Hadoop GZip compression is used in this approach (a minimal sketch of writing GZIP output to HDFS follows). All execution is done in Eclipse Juno, with the dependencies for Hadoop and MongoDB added to the build. Since the proposed framework is a novel approach to storage savings, the dataset was generated so that its content and structure are known exactly: it was produced by varying size, redundancy and tuple complexity, resulting in 12 text files executed in 12 test runs. A consistent backup of the databases was taken before and after applying the deduplication and compression process to record the file sizes, and the sizes of the backup files were measured after running the experiments on records of varying sizes. Table I compares file sizes for different numbers of records with different numbers of attributes.
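The compression step under the setup above can be sketched as below. It is not the authors' code and the HDFS paths are hypothetical; GzipCodec and the helper classes used here ship with the standard Hadoop libraries. For MapReduce output, the same codec can instead be enabled on the job with FileOutputFormat.setCompressOutput(job, true) and FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class).

import java.io.InputStream;
import java.io.OutputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // GzipCodec is bundled with Hadoop; the native gzip library is used when available.
    CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

    Path deduplicated = new Path("/out/dedup/part-r-00000");   // hypothetical reducer output
    Path compressed   = new Path("/out/dedup.gz");             // hypothetical target file

    try (InputStream in = fs.open(deduplicated);
         OutputStream out = codec.createOutputStream(fs.create(compressed))) {
      IOUtils.copyBytes(in, out, 4096, false);   // copy; streams are closed by try-with-resources
    }
  }
}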
TABLE I. Comparison in file sizes.

It can be noticed clearly that for large files the results are better than for small files with similar contents. Since redundancy increases with the number of records, the proposed framework improves storage efficiency most for large files.

Storage savings: More than 50% of storage space can be freed when deduplication is used; the more duplicate data, the greater the savings. The average deduplication ratios were calculated from the average DB file sizes. The results in Fig. 3 show how the structural information of a file affects the deduplication ratio. A higher deduplication ratio is obtained with fewer, larger chunks than with more, smaller chunks, because the extra formatting for the pointers affects the backup space. However, the results based on chunk size show much smaller differences between the deduplication ratios than the results based on tuple complexity.

Fig. 3. Comparison of data de-duplication ratio.

A substantial amount of space is saved by de-duplication alone. When compression is then applied to the de-duplicated storage, Fig. 4 shows that up to 91% of space can be saved for No-SQL databases. Fig. 4 compares the file sizes of the 12 test runs after applying deduplication and then compression; it is clear that compression can save a significant amount of additional space on deduplicated No-SQL data stores.

Fig. 4. Comparison in terms of file size.

Fig. 5. Storage efficiency for files with varying structural information.

Storage efficiency: Taking the original file size and the file size after applying deduplication and compression, the storage efficiency of the proposed framework can be calculated from the storage savings. Fig. 5 depicts the storage efficiency of the proposed framework for the 12 test runs; it saves considerable storage space and approximately doubles the storage savings of the de-duplicated system alone. The total time consumed by the whole process depends on the size and number of records in the file.
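As a worked illustration of the 91% figure above, using the savings formula from Section II with hypothetical round numbers: a 100 MB input that occupies 9 MB after deduplication and compression gives savings = (1 - 9/100) x 100 = 91%.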
Network utilization: Data in HDFS needs to be transferred over the network, so better storage savings directly reduce the number of round trips required to read the compressed data over a wired or wireless medium. When data has to be replicated over multiple nodes, less bandwidth is consumed because the compressed file containing non-redundant data is sent, which results in better bandwidth utilization and faster transfer. Fig. 6 depicts the bandwidth preserved by the proposed framework; the time taken to transfer a file from one node to another is reported in seconds. The time consumed to transfer a file after applying compression on the de-duplicated file system is clearly much less than transferring the file at its original size.

Fig. 6. Bandwidth consumption during inter-node transfer.

VIII. CONCLUSION

In this paper we discussed various storage efficiency approaches that can minimize the storage of data in the cloud. In the proposed framework, data deduplication and compression are deployed on an HDFS data store, resulting in less storage space and better bandwidth utilization. For relatively large clusters and big jobs, compression can lead to substantial benefits, while deduplication helps improve the overall performance of No-SQL data stores by saving memory and network bandwidth.

This work has shown the effect of applying storage efficiency techniques on human-readable text data; how it performs in other controlled or uncontrolled environments remains to be examined. In the future the framework can be extended to other types of data, such as images, and to live data.

REFERENCES

[1] Compuquip Technologies, white paper, "Cloud Storage: The Issues and Benefits", available at http://www.compuquip.com/itservices-blog/wpcontent/uploads/2011/08/cq-i365-Cloud-Storage.pdf (accessed June 5, 2014).
[2] D. Agrawal, A. El Abbadi, S. Antony, and S. Das, "Data Management Challenges in Cloud Computing Infrastructures", in Proceedings of DNIS, Japan, Springer, March 2010.
[3] SNIA, "Implementing, Serving, and Using Cloud Storage", Cloud Storage Initiative, October 2010.
[4] SNIA, The 2013 SNIA Dictionary, p. 244 (accessed June 4, 2013).
[5] P. Mell and T. Grance, "The NIST Definition of Cloud Computing", Special Publication 800-145, September 2011 (accessed June 5, 2014).
[6] "Cloud Storage", available at http://blog.oxygencloud.com/2013/09/09/4reasons-why-cloudand-on-premises-storage-are-different/ (accessed June 4, 2014).
[7] L. Zhao, S. Sakr, A. Fekete, H. Wada, and A. Liu, "Application-Managed Database Replication on Virtualized Cloud Environments", in Data Management in the Cloud (DMC), ICDE Workshops, 2012.
[8] NetApp specifications, available at http://www.netapp.com/us/products/platform-os/dedupe.aspx (accessed June 5, 2014).
[9] Oracle, white paper, "Thin Provisioning with Pillar Axiom 600", September 2011, pp. 1-2.
[11] "Understanding Data Deduplication", Druva, 2009, available at http://www.druva.com/blog/understanding-data-deduplication/ (accessed June 4, 2014).
[12] G. Zhao, C. Rong, J. Li, F. Zhang, and Y. Tang, "Trusted Data Sharing over Un-trusted Cloud Storage Providers", in IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom), Indianapolis, Indiana, USA, December 2010, pp. 97-103.
[13] N. Mandagere, P. Zhou, M. A. Smith, and S. Uttamchandani, "Demystifying data deduplication", in Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion, Leuven, Belgium, 2008, pp. 12-17.
[14] Q. He, Z. Li, and X. Zhang, "Data deduplication techniques", in International Conference on Future Information Technology and Management Engineering, 2010, pp. 431-432.
[15] M. Armbrust, A. Fox, R. Griffith, A. D. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing", Technical Report 2009-28, UC Berkeley, 2009.
[16] P. Kulkarni, F. Douglis, J. LaVoie, and J. M. Tracey, "Redundancy elimination within large collections of files", in Proceedings of the USENIX Annual Technical Conference, Boston, MA, 2004, pp. 5-5.
[17] "Big-data trade-offs", available at http://breakinggov.com/2012/11/12/big-data-tradeoffs-whatagencies-need-to-know-nists-peter-me/ (accessed June 4, 2014).
[18] "Storage efficiency key to managing fast-growing data", available at http://www.crn.com/news/storage/231902774/storage-efficiencykey-to-managing-fast-growing-data.htm (accessed June 4, 2014).
[19] Cloud vendor details, available at http://computer.financialexpress.com/20100329/20thanniversary09.shtml (accessed June 5, 2014).
[20] H. Baer, "Partitioning in Oracle Database 11g", white paper, Oracle, June 2007.
[21] S. Balachandran and C. Constantinescu, "Sequence of Hashes Compression in Data De-duplication", in Proceedings of the 18th Data Compression Conference (DCC), 2008, pp. 671-682.
[22] P. Shilane, G. Wallace, M. Huang, and W. Hsu, "Delta Compressed and De-duplicated Storage Using Stream-Informed Locality", Backup Recovery Systems Division, EMC Corporation, pp. 2-3.
[23] "How to build an efficient data storage environment", available at http://searchstorage.techtarget.com/podcast/How-to-build-an-efficient-data-storage-environment (accessed June 6, 2014).
[24] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data", in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI '06), Berkeley, CA, USA, USENIX Association, 2006, pp. 15-15.
[25] "MongoDB Overview", Tutorialspoint, available at http://www.tutorialspoint.com/mongodb/mongodb_overview.htm (accessed June 10, 2014).
[26] D. J. Abadi, "Data Management in the Cloud: Limitations and Opportunities", IEEE Data Engineering Bulletin, 2009, pp. 3-12.
[27] Q. He, Z. Li, and X. Zhang, "Data deduplication techniques", in International Conference on Future Information Technology and Management Engineering (FITME), Changzhou, China, October 2010, pp. 430-433.
[28] B. Zhu, K. Li, and H. Patterson, "Avoiding the disk bottleneck in the data domain deduplication file system", in 6th USENIX Conference on File and Storage Technologies (FAST '08), 2008, pp. 269-271.
[29] N. Mandagere, P. Zhou, M. A. Smith, and S. Uttamchandani, "Demystifying data deduplication", in Proceedings of the ACM/IFIP/USENIX Middleware '08 Conference Companion, Leuven, Belgium, 2008, pp. 12-17.
[30] Cho Cho Khaing and Thinn Thu Naing, "The efficient data storage management system on cluster-based private cloud data centre", in Proceedings of IEEE CCIS 2011, pp. 235-239.
[31] Hadoop: The Definitive Guide, 3rd Edition, Chapter 3: The Hadoop Distributed File System, available at http://my.safaribooksonline.com/book/software-engineering-and-development/9781449328917/3dot-the-hadoop-distributed-file-system/id2412156 (accessed June 6, 2014).
[32] "Gzip", Wikipedia, the free encyclopedia, available at http://en.wikipedia.org/wiki/Gzip (accessed June 4, 2014).