ScienceDirect Procedia Computer Science 78 (2016) 224 – 232, doi:10.1016/j.procs.2016.02.037
International Conference on Information Security & Privacy (ICISP2015), 11-12 December 2015, Nagpur, INDIA
Critical Study of Performance Parameters on Distributed File Systems using MapReduce

Madhavi Vaidya, Department of Computer Science, VES College, Mumbai, India
Dr. Shrinivas Deshpande, Department of Computer Science, HVPM, Amravati, India
Abstract

The volume of data generated by networks grows every day. MapReduce is a promising parallel programming model for processing such large data. In this paper we survey several distributed storage and computation systems. We study parameters such as fault tolerance, replication, checkpointing, security, and the optimization of small-file access using MapReduce, and we review four open-source distributed file systems: GlusterFS, Lustre, Ceph and HDFS. Cloud computing, which is built on distributed file systems, plays a very important role in protecting applications' data and the related infrastructure with the help of policies, technologies, controls, and big data tools. On the basis of our study we propose that MapReduce is an efficient and scalable programming platform for data processing, providing computational capabilities and distributed storage on clusters of commodity hardware.

© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of the organizing committee of the ICISP2015.

Keywords: Lustre; Ceph; MapReduce; Small Files; HDFS; Security; Replication
1. Introduction

The amount of data generated by networks grows every day. In the face of such massive data, the processing and storage capacity of traditional centralized relational databases is inadequate. A distributed file system (DFS) stores data on multiple nodes and removes this bottleneck by allowing clients to access data in parallel from the storage nodes. The significant challenges for such a distributed file system are scaling to a large number of storage nodes and providing reasonably degraded operation when hardware failures occur1. There is thus a need to study and survey distributed systems; in this paper we survey several distributed storage systems.
* Corresponding author: Madhavi Vaidya. Tel.: +91-9869026553. E-mail address: [email protected]
Furthermore, DFSs that are variously called network, cluster, grid, cloud or parallel file systems, among others, are studied in this paper. Cloud computing also relies on distributed file systems to handle massive data at the terabyte and even petabyte level2. The Ceph file system, the Gluster file system, the Lustre file system and finally HDFS are elaborated here; all of them can use MapReduce for data processing. This paper also discusses security and privacy issues of the various distributed file systems. Other file systems, such as Quantcast and NoFS, are also part of our research and support MapReduce extensively, but because of space constraints they are not studied here. The above four file systems were selected after a literature survey, as part of our research on distributed file systems, for their popularity in usage. The authors also compare parallel DBMSs and MapReduce over these file systems and exemplify which mechanism processes data more efficiently.

The paper is organized as follows. First, by reviewing the research, the parameters of fault tolerance, replication and checkpointing, dealing with small files, and security are discussed. In the later part of the paper, parallel DBMSs and MapReduce are briefly compared on the basis of the said parameters. Towards the end of the paper, the file systems are compared on the basis of their use of MapReduce, and finally on the basis of the said parameters.

2. Study of Parameters

2.1. Fault Tolerance and Replication

A simple distributed file system with integrated fault tolerance is needed for efficient handling of small data records. Data is distributed over multiple machines, where there are chances of network failures. Fault tolerance is achieved through the division of a file into smaller fragments or chunks, which are managed and processed by a set of servers. Fault-tolerant data storage is becoming popular as data moves to the cloud; fault tolerance is an important aspect of cloud storage, where robustness of data is a major concern3. The question is then what criteria can be used to compare the fault tolerance of various distributed file systems. These criteria include, but are not limited to: (1) the ability to recover from the concurrent loss of multiple machines, (2) the ability to handle interruptions during read or write operations on a file, and (3) the time taken to recover and resynchronize a node after a network failure4.

Data replication is used to prevent data availability problems. In the face of node failures, techniques such as mirrored disks, data clustering, disk arrays with redundant check information, and chained declustering are used5. If one node is down, the table partitions stored on that node can be retrieved from other nodes. Several techniques have been proposed for data availability. File chunks are stored on servers in two groups, write and replica. The former group is composed of servers in charge of receiving or providing fragments (chunks) of a file, and is capable of both read and write operations. The latter group is composed of replicas of the first group and allows only read operations on file chunks6.
MapReduce avoids the data transfer and checkpointing overhead of traditional MPI applications. It keeps multiple replicas of each data block and re-executes failed tasks when failures occur7.
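As a concrete illustration of replication in one of the file systems reviewed later (HDFS), the minimal Java sketch below uses the Hadoop FileSystem API to raise the replication factor of an existing file; the file path and the factor of 5 are hypothetical values chosen for illustration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: raising the replication factor of an existing HDFS file
// so that its blocks survive the concurrent loss of several DataNodes.
// The path and the factor of 5 are hypothetical values for illustration.
public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    Path file = new Path("/data/records.dat");
    // Ask the NameNode to maintain 5 replicas of every block of this file;
    // the HDFS default is 3.
    boolean accepted = fs.setReplication(file, (short) 5);
    System.out.println("Replication change accepted: " + accepted);
    fs.close();
  }
}
```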
2.2. Checkpointing

Checkpointing is an indispensable fault tolerance mechanism adopted by long-running, data-intensive applications. At regular intervals, such applications alternate between compute and checkpoint operations. Checkpointing is a write-intensive I/O operation: checkpoint data is typically written once and, if a failure occurs, read rarely8. Checkpointing is generally done either on local-node storage or on a shared file system. Locally stored checkpoint data is lost when the node crashes; alternatively, compute nodes can checkpoint to a shared, central file server. It has been stated in9 that shared file servers are crowded with I/O requests and have limited space, and hundreds of nodes in a cluster performing simultaneous checkpointing I/O can flood the central server. Parallel file systems like GlusterFS and Lustre offer high I/O throughput. Some authors have observed that checkpointing, though critical for keeping data reliably intact, is an overhead on the time spent in useful computation. Checkpointing is done in various ways: on a central file server, on a parallel file system, or on a cloud of free disk space used as a temporary buffer.

Periodic checkpointing saves the state of the application. It is done frequently, verifying the health of the cluster to check progress; if a failure occurs, the process can be restarted from the most recent checkpoint. Because multiple asynchronous checkpoints can make it difficult to reconstruct a consistent image, HPC applications checkpoint synchronously (i.e. following a barrier). Synchronous checkpointing, however, merely shifts the burden to the parallel file system, which must coordinate simultaneous access from many compute nodes, and hence failures remain possible10. The stdchk system11 saves checkpoints into a cloud of free disk space, used as a temporary buffer12, collected from a cluster of workstations and compute nodes. Diskless checkpointing13 saves checkpoints into compute-node memory without transferring them to persistent storage; the checkpoint image is protected against faults by erasure coding across distributed nodes. Large parallel HPC applications jealously utilize all available memory and demand a high degree of determinism in order to avoid jitter14, so they are poorly served by techniques that reduce available memory or execute background processes during computation.

MapReduce detects failed nodes using a predefined timeout parameter: if a node does not respond within the specified timeout period, it is asserted to be dead. MapReduce executes in three stages: (1) intermediate results are saved to local storage as soon as a Map task finishes; (2) the local results are transferred to the Reduce tasks in the shuffle stage; (3) the final results are saved to the distributed file system once the Reduce tasks are done15.
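The periodic checkpointing pattern can be made concrete with a small, self-contained sketch. The Java program below (the directory layout and checkpoint interval are hypothetical choices, not taken from any system discussed here) writes its state to a fresh checkpoint file at a fixed interval and, after a restart, resumes from the most recent checkpoint.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Minimal sketch of periodic checkpointing: state is serialized to a fresh
// file at regular intervals, and on restart the newest checkpoint is replayed.
// The directory name and the interval are hypothetical.
public class PeriodicCheckpoint {
  static final Path DIR = Paths.get("checkpoints");

  public static void main(String[] args) throws Exception {
    Files.createDirectories(DIR);
    long counter = restore();              // resume from the latest checkpoint, if any
    for (long step = 0; step < 1_000_000; step++) {
      counter++;                           // "compute" phase (stand-in for real work)
      if (counter % 100_000 == 0) {        // "checkpoint" phase at a fixed interval
        Path tmp = DIR.resolve("ckpt.tmp");
        Files.write(tmp, Long.toString(counter).getBytes());
        // Atomic rename, so a crash never leaves a half-written checkpoint visible.
        Files.move(tmp, DIR.resolve("ckpt-" + counter), StandardCopyOption.ATOMIC_MOVE);
      }
    }
    System.out.println("final state: " + counter);
  }

  // Return the state encoded in the most recent checkpoint, or 0 if none exists.
  static long restore() throws IOException {
    try (Stream<Path> files = Files.list(DIR)) {
      Optional<Long> latest = files
          .map(p -> p.getFileName().toString())
          .filter(n -> n.startsWith("ckpt-"))
          .map(n -> Long.parseLong(n.substring("ckpt-".length())))
          .max(Comparator.naturalOrder());
      return latest.orElse(0L);
    }
  }
}
```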
2.3. Dealing with Small Files

Object-based storage systems such as Ceph, in which data is organized and accessed as objects, struggle with workloads that access large numbers of small files, such as those produced by software builds and user workspaces. There are two reasons for this: each file requires an interaction with the metadata server, and namespace locality is lost at the storage devices.

For a client accessing the contents of a large file, the metadata interaction creates a minor overhead amortized over many data accesses. For a small file, on the other hand, there may be as little as one data access. Since accessing each file's data requires first interacting with the metadata server, client latency is doubled, and the metadata server can be asked to service as many requests as all the storage devices combined16. Accessing a large number of small files on a parallel file system thus shifts the I/O challenge17 away from providing high aggregate I/O throughput for highly concurrent tasks. For a log-structured file system, it has been shown in18 that such a system exhibits a performance increase of an order of magnitude for small-file writes, while matching the large-file performance of contemporary non-log-structured file systems. Managing free space then becomes an important issue, since the problem is how to keep large extents of free space available for writing after many deletions and overwrites of small files.
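A common mitigation in the Hadoop ecosystem, offered here as an assumed illustration rather than a technique from the surveyed papers, is to consolidate many small files into one container file so that the metadata server handles a single large object. The sketch below packs a local directory of small files into a Hadoop SequenceFile keyed by filename; the input directory and output path are hypothetical.

```java
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Minimal sketch: pack many small local files into one SequenceFile so a
// MapReduce job (and the metadata server) deals with a single large object.
// The input directory and output path are hypothetical.
public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("/data/packed.seq");
    File[] smallFiles = new File("small-files").listFiles();
    if (smallFiles == null) return;        // directory missing: nothing to pack
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (File f : smallFiles) {
        byte[] bytes = Files.readAllBytes(f.toPath());
        // Key = original filename, value = raw file contents.
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    }
  }
}
```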
2.4. Security

Airavat is the first system to integrate mandatory access control with differential privacy, enabling many privacy-preserving MapReduce computations without the need to audit untrusted code. Recent efforts recognize the importance of self-protection in such big data management systems, but they mainly focus on data privacy and correctness19. Airavat is a relevant example of preventing unauthorized access to storage20: it integrates decentralized information flow control and differential privacy techniques, and provides rigorous security control over the computation on individual data in MapReduce frameworks21. The increasing popularity of storing big data and running analytics over it creates a need for secure and efficient data management mechanisms. One of the most relevant security topics for such big data is preventing users from damaging the stored data or from breaking security policies and data-access protocols. For the security of large data, techniques such as logging, privacy techniques and encryption are necessary.

IBM researchers have also explained that query processing using MapReduce should take place in a secured environment using Kerberos. Kerberos is an authentication system developed at MIT. It uses encryption together with a trusted third party, an arbitrator, to perform secure authentication on an open network22. To be more specific, Kerberos uses cryptographic tickets to avoid transmitting plain-text passwords over the wire, and is based upon the Needham-Schroeder protocol. Since distributed file systems underlie cloud computing systems, their security issues and technologies, including Kerberos, are applicable to cloud computing23,24.
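As a hedged illustration of Kerberos-secured access in Hadoop (the principal name and keytab path below are hypothetical placeholders), a client can authenticate against the KDC with a keytab before issuing any file system requests, so that subsequent operations carry cryptographic tickets instead of passwords:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;

// Minimal sketch: authenticate to a Kerberos-secured Hadoop cluster before
// issuing file system requests. The principal and keytab path are hypothetical.
public class KerberosClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client that the cluster requires Kerberos tickets.
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // Obtain a ticket from the KDC using a service keytab instead of a password.
    UserGroupInformation.loginUserFromKeytab(
        "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
    // Subsequent requests are authenticated with the acquired ticket.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("authenticated, home dir: " + fs.getHomeDirectory());
  }
}
```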
3. Study of Various Mechanisms

We have studied two mechanisms for handling large data and performing analytics: MapReduce programming and parallel databases. Both are analyzed against the parameters of fault tolerance, replication, checkpointing, dealing with small files, and security.

3.1. MapReduce Framework

MapReduce is a framework for easily writing applications that process large amounts of data in parallel on clusters of compute nodes. Generally, in a MapReduce environment, the compute and storage nodes are the same: computational tasks run on the same set of nodes that permanently hold the data required for the computations. The MapReduce algorithm breaks the input data into a number of limited-size splits to be processed in parallel. It converts the data in each split into a group of intermediate key-value pairs in a set of Map tasks. It then collects each key's values (a key may have multiple values) in what is called the "shuffle" stage, and processes the combined key/values into output data via a set of Reduce tasks25.
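To make this map/shuffle/reduce flow concrete, the sketch below reproduces the canonical Hadoop word-count program in Java. It is the standard tutorial example, included for illustration rather than taken from any of the surveyed systems; the input and output paths are supplied on the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Map: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts gathered for each word during the shuffle.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation before shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The Map tasks emit a (word, 1) pair per token, the shuffle stage groups the pairs by word, and the Reduce tasks sum each group, exactly the three-stage flow described above.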
3.2. Parallel Databases

A parallel DBMS uses horizontal partitioning: the rows of a relational table are distributed across the nodes of the cluster, which are then used to compute expensive tasks in parallel26.

• Handling Large Data - Parallel DBMSs were developed to improve the performance of database systems. As processor performance has improved, it has outstripped disk throughput, and critics have predicted from time to time that the I/O bottleneck would become a major problem. Both MapReduce and parallel DBMSs provide a means to process large volumes of data. As the volume of captured data continues to rise, questions have been asked as to whether the parallel DBMS paradigm can scale to meet demand: "there are no published deployments of parallel databases with nodes numbering into the thousands"27. As the number of nodes increases, so does the chance of a node failure, and parallel DBMSs do not handle node failure.
MapReduce, in contrast, has been designed to run on thousands of nodes and is inherently fault tolerant.
• Analytics - Both MapReduce and parallel DBMSs can be used to produce analytical results from big data. A parallel DBMS uses SQL as the retrieval method, while MapReduce uses general-purpose programming languages. Many data mining and data clustering applications use complex algorithms that require multiple passes over the data, which MapReduce performs very well28. Algorithms in which the output of one subprocess is the input to the next are difficult to implement in SQL, and performing such tasks in many steps reduces the performance benefits gained from a parallel DBMS; a minimal sketch of such multi-pass job chaining is given below.
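As a hedged illustration of a multi-pass computation (not code from the surveyed papers), the driver below chains two Hadoop MapReduce jobs so that the output directory of the first pass becomes the input of the second. Hadoop's base Mapper and Reducer classes serve as identity stand-ins for the real per-pass logic, and the three paths are hypothetical command-line arguments.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal sketch: two chained MapReduce passes. Hadoop's base Mapper and
// Reducer act as identity stand-ins for the real per-pass logic; the
// intermediate directory links the output of pass 1 to the input of pass 2.
public class TwoPassDriver {
  static Job makePass(Configuration conf, String name, Path in, Path out) throws Exception {
    Job job = Job.getInstance(conf, name);
    job.setJarByClass(TwoPassDriver.class);
    job.setMapperClass(Mapper.class);        // identity map (replace per pass)
    job.setReducerClass(Reducer.class);      // identity reduce (replace per pass)
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, in);
    FileOutputFormat.setOutputPath(job, out);
    return job;
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // output of pass 1, input of pass 2
    Path output = new Path(args[2]);
    Job pass1 = makePass(conf, "pass 1", input, intermediate);
    if (!pass1.waitForCompletion(true)) System.exit(1);   // passes run sequentially
    Job pass2 = makePass(conf, "pass 2", intermediate, output);
    System.exit(pass2.waitForCompletion(true) ? 0 : 1);
  }
}
```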
Several aspects of the MapReduce process are worth mentioning relative to parallel DBMSs:

• MapReduce is transparently scalable. The user does not need to manage data placement or the number of nodes used for a job, and there are no dependencies on the underlying hardware.
• Data flow is well defined and runs in one direction, from the Map to the Reduce, with no communication between independent mapper or reducer processes.
• Because processing is independent, failover is trivial: a failed process can simply be restarted, provided the underlying file system is redundant, as HDFS is.
• MapReduce, though powerful, does not fit all problem types29.

The various characteristics of distributed file systems have been studied. Based on the information above, MapReduce is the mechanism that can perform better on various distributed file systems such as Ceph, GlusterFS, Lustre and HDFS. The reason for studying and illustrating these four file systems is that all of them can handle large-data analytics using the MapReduce framework. Improving fault tolerance, checkpointing and the handling of small files, and better managing volume in big data, by deploying Ceph, GlusterFS or Lustre alongside the Hadoop Distributed File System, is one such attempt to increase the efficiency of HDFS using MapReduce. We explain here why MapReduce is more feasible for all of the above parameters than the other mechanism described in this paper, parallel databases: the MapReduce algorithm splits a data set to process it in parallel over a cluster, while inputs, scheduling, parallelization and machine failures are handled by the framework itself.

4. Discussion on various Distributed File Systems

As discussed at the beginning of this paper, various DFSs that are network, cluster, cloud or parallel file systems are studied here.

4.1. Ceph

Ceph is reliable, easy to manage, and free. Ceph can manage vast amounts of data and delivers extraordinary scalability: thousands of clients can access petabytes to exabytes of data. A Ceph node can run on commodity hardware, and clusters accommodate large numbers of nodes, which communicate with each other to replicate and redistribute data dynamically30. Ceph stores a client's data in Object-based Storage Devices (OSDs)31, as objects within storage pools. Using the CRUSH (Controlled Replication Under Scalable Hashing) algorithm, placement policies can separate object replicas across different failure domains while still maintaining the desired distribution; it may be desirable, for example, to ensure that replicas are placed on devices using different shelves, racks, power supplies, controllers, and/or physical locations to address the possibility of concurrent failures32. Ceph OSDs are intelligent enough to detect errors and recover from failures. Since Ceph is a multi-petabyte-scale parallel, network, and distributed file system, its OSDs must also be dynamic and autonomic enough to migrate data when the Ceph cluster expands33.
4.2. GlusterFS

GlusterFS distributes load using a distributed hash translation (DHT) of filenames to its subvolumes, which are replicated to provide load handling and fault tolerance in a scale-out distributed file system supporting thousands of clients34. In GlusterFS, the elementary storage units are called bricks. A server can have one or more bricks, which store data via translators on lower-level file systems. GlusterFS can work with three different types of volumes: distributed volumes, replicated volumes and striped volumes35. The GlusterFS architecture aggregates several disks and memory resources into a single global namespace with one common mount point on a Linux machine. Thousands of applications and clients can connect to the GlusterFS file system via this mount point and interact with the stored data36.

4.3. Lustre

Lustre is a file system with high performance computing (HPC) ability that is able to process big data. Here, system performance is of greater importance and the "moving computation" assumption no longer holds true: the computations are moved to the data. Lustre is a cluster file system based on a client/server model. Data is stored on Object Storage Servers (OSSs) and metadata is stored on Metadata Servers (MDSs). The Lustre file system achieves great scalability and performance through this separation of metadata operations from normal data operations, and its journaling mechanisms provide recoverability from failure conditions. Hadoop's Distributed File System (HDFS) and Lustre are similar in terms of performance and storage capabilities; the basic difference is that HDFS has the costs and benefits of local storage, while Lustre has the costs and benefits of centrally located storage37.

4.4. HDFS

A fundamental assumption the Hadoop Distributed File System (HDFS) is based on is that "moving computation is cheaper than moving data": it is more efficient to perform computations on the nodes that store the data locally than to move the large data. HDFS gives good performance on inexpensive "commodity" clusters with relatively slow network fabrics38. An HDFS cluster employs a master-slave architecture consisting of a single NameNode (the master) and multiple DataNodes (the slaves), usually one per node in the cluster, and favors high throughput of data access over low latency. The NameNode manages the file system namespace and regulates access to files by clients, whereas the DataNodes are responsible for serving read and write requests from the file system's clients. In traditional MapReduce environments, input and output data are stored on HDFS, as shown in Fig. 1, with intermediate data stored in a local, temporary file system on the Mapper nodes and shuffled as needed (via HTTP) to the nodes running the Reducer tasks39.
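The NameNode/DataNode split is visible even in a trivial client. The following minimal sketch (the file path is a hypothetical example) writes and then reads a file through the Hadoop FileSystem API: the create() and open() calls consult the NameNode for namespace and block-location metadata, while the bytes themselves are streamed to and from the DataNodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch of an HDFS client round trip. create() and open() go through
// the NameNode for namespace and block-location metadata; the actual bytes
// stream to and from the DataNodes holding the replicas. The path is hypothetical.
public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up fs.defaultFS etc.
    FileSystem fs = FileSystem.get(conf);
    Path p = new Path("/tmp/hello.txt");

    try (FSDataOutputStream out = fs.create(p, true)) {  // overwrite if present
      out.writeUTF("hello, distributed world");
    }
    try (FSDataInputStream in = fs.open(p)) {
      System.out.println(in.readUTF());
    }
    fs.close();
  }
}
```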
Fig. 1: MapReduce Architecture25

Two main methods are used to implement fault tolerance in HDFS: (i) data duplication and (ii) checkpoint and recovery40. Data duplication consists of replicating the data across multiple DataNodes. To write a file to HDFS, a client first contacts the NameNode, and the NameNode nominates a number of DataNodes (three by default) to replicate the data. The number of replicas can be increased, improving both fault tolerance and the bandwidth available when reading the file41. Checkpoint and recovery is similar to the concept of rollback: if a failure occurs, the system rolls back to the last saved synchronization point, and the transaction starts again. This method is slower than data duplication but, on the other hand, needs fewer additional resources42.

5. Analytical Study and Discussion on Various Distributed File Systems

MapReduce is a computational paradigm in which an application is divided into many small fragments of work, each of which may be executed on any node in the cluster. The cluster can be an HPC cluster, and the client can be a node on a distributed or centralized file system. Ceph is unable to provide coherent file chunk replicas and is thus bandwidth limited. Ceph's main goals are to be completely distributed without a single point of failure, scalable to the exabyte level, and freely available. Its data is replicated, making it fault tolerant. While Ceph avoids the problem of having a single metadata server, it is still limited in the number of file creates that can be performed per second43. Many researchers have written that MapReduce can run on GlusterFS and will give better performance than on HDFS, although no published execution of this claim has been found. A Hadoop plug-in has been made compatible with GlusterFS, on which MapReduce and Hadoop-based applications can be executed so that code rewrites are eliminated, providing a fault-tolerant file system. GlusterFS is a robust file system: fault tolerance is one of its key aspects, it has proved reliable for resilient data storage, and a metadata server is not part of the file system44,45. The overarching advantage that Gluster provides is its flexibility in terms of volume types, access methods, and integration with various other tools46.
Features           | GlusterFS                              | Lustre                     | Ceph                                   | HDFS
Architecture       | Scale-out network-attached file system | Parallel distributed       | Completely distributed                 | Distributed
Node Failure       | Node failure occurs                    | Chances of node failure    | No failure                             | Secondary NameNode takes charge
Placement Strategy | Manual                                 | No                         | Auto                                   | Auto
Fault Detection    | Detected                               | Manual                     | Fully connected                        | Fully connected
Replication        | Replication                            | No                         | Replication                            | Up to 3 replicas
Small Files        | Not efficient                          | Not efficient              | Suitable                               | Small quantity of big files
Checkpointing      | Favourable                             | High                       | Favourable                             | High
Security           | IP/port-based access controls          | POSIX access control lists | Object replication and erasure coding  | Kerberos
Fig. 2: Comparison of distributed file systems46

While MapReduce implementations provide a straightforward job submission process that involves the whole cluster, HPC users submit their jobs to a Resource Management System (RMS) and must specify the number of nodes and the amount of time to be allocated for the job. This approach may confuse typical MapReduce users, who are not used to it and may, in response, always try to allocate the whole cluster for as long as possible; as a consequence, the turnaround time of the MapReduce job will probably increase47. HDFS ensures reliability using a replication factor, whereas Lustre does not replicate any data (Fig. 2) but ensures reliability by having many Object Storage Servers connected to multiple Object Storage Targets48. Finally, we state that the MapReduce framework is the right mechanism to apply on distributed file systems for large-data analytics: it maintains the performance of the nodes, whether they are client/server nodes in the case of HDFS or HPC nodes in the case of the Lustre parallel distributed file system, as shown in Fig. 2. The analytical study above shows the use of MapReduce on distributed file systems and the performance of MapReduce against the parameters elaborated in this paper.

6. Conclusion

To conclude, MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It is a framework that facilitates writing arbitrary distributed processing frameworks and applications. Whatever the choice of backing file system or HPC cluster, MapReduce supports data availability, placement and data security concerns. The integration of MapReduce on distributed nodes and HPC clusters will be part of our future study and implementations; comparative testing on a much larger scale will also be undertaken in the future.

References

1. F.D. Sacerdoti, Performance and Fault Tolerance in the StoreTorrent Parallel Filesystem, arXiv Publications; 2010, pp. 1-13.
2. LI Jing-min, HE Guo-hui, Research of Distributed Database System Based on Hadoop, in: Proc. of Information Science and Engineering (ICISE); 2010, pp. 1417-1420.
3. M.C. Chan, J.R. Jiang, S.T. Huang, Fault tolerant and secure networked storage, 7th International Conference on Digital Information Management, ICDIM; 2012, pp. 186-191.
4. S. Verkuil, A Comparison of Fault-Tolerant Cloud Storage File Systems, 19th Twente Student Conference on IT; 2013.
5. A.J. Borr, Transaction Monitoring in ENCOMPASS: Reliable Distributed Transaction Processing, in: Proceedings of the 7th Intl. Conf. on Very Large Data Bases; 1981, pp. 155-165.
6. D.A. Patterson, G.A. Gibson, R.H. Katz, A Case for Redundant Arrays of Inexpensive Disks (RAID), in: Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data; 1988, pp. 109-116.
7. T. Kosar, Data Intensive Distributed Computing: Challenges and Solutions for Large Scale Information Management, IGI Publications; 2011.
8. A. Silberschatz, P. Baer Galvin, G. Gagne, Operating Systems, John Wiley & Sons.
9. L. Lamport, Time, clocks, and the ordering of events in a distributed system, Commun. ACM; 1978.
10. J. Bent, G. Gibson, G. Grider, B. McClelland, P. Nowoczynski, J. Nunez, M. Polte, M. Wingate, PLFS: A Checkpoint Filesystem for Parallel
Applications, in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis; 2009.
11. S.A. Kiswany, M. Ripeanu, S.S. Vazhkudai, A. Gharaibeh, stdchk: A Checkpoint Storage System for Desktop Grid Computing, in: Proceedings of the 28th International Conference on Distributed Computing Systems, ICDCS 2008; June 2008, pp. 613-624.
12. S.A. Kiswany, M. Ripeanu, S.S. Vazhkudai, Aggregate Memory as an Intermediate Checkpoint Storage Device, ORNL Technical Report 013521.
13. J.S. Plank, K. Li, M.A. Puening, Diskless Checkpointing, IEEE Transactions on Parallel and Distributed Systems; October 1998, pp. 972-986.
14. A.C. Arpaci-Dusseau, Implicit Coscheduling: Coordinated Scheduling with Implicit Information in Distributed Systems, PhD Thesis, University of California, Berkeley; 1998.
15. P. Hu, W. Dai, Enhancing Fault Tolerance based on Hadoop Cluster, International Journal of Database Theory and Application, Vol. 7, No. 1; 2014, pp. 37-48.
16. R.R. Sambasivan, S. Sinnamohideen, G.R. Ganger, J. Hendricks, Improving Small File Performance in Object-Based Storage, CMU-PDL-06-104; May 2006, pp. 1-19.
17. P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, T. Ludwig, Small-File Access in Parallel File Systems, Journal of Network and Computer Applications 35; 2012, pp. 1847-1862.
18. M. Rosenblum, J.K. Ousterhout, The Design and Implementation of a Log-Structured File System, ACM Transactions on Computer Systems; Feb 1992, pp. 26-52.
19. K.D. Bowers, A. Juels, A. Oprea, HAIL: A High-Availability and Integrity Layer for Cloud Storage, in: Proceedings of the 16th ACM Conference on Computer and Communications Security, CCS '09, New York, NY, USA; 2009, pp. 187-198.
20. I. Roy, S.T.V. Setty, A. Kilzer, V. Shmatikov, E. Witchel, Airavat: Security and Privacy for MapReduce; 2010, pp. 1-16.
21. G. Ateniese, R.D. Pietro, L.V. Mancini, G. Tsudik, Scalable and Efficient Provable Data Possession, in: Proceedings of the 4th International Conference on Security and Privacy in Communication Networks, SecureComm '08; 2008, ACM, pp. 9:1-9:10.
22. Q. Tran, A solution for privacy protection in MapReduce, Computer Software and Applications Conference (COMPSAC), IEEE 36th Annual Conference; Jul 2012, pp. 515-520.
23. V.N. Inukollu, S. Arsi, S. Rao Ravuri, Security Issues Associated with Big Data in Cloud Computing, International Journal of Network Security & Its Applications (IJNSA), vol. 6, no. 3; May 2014, pp. 45-56.
24. Source: https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
25. J. Dean, S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM; Jan. 2008, pp. 107-113.
26. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, A. Rasin, HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads, in: Proceedings of the VLDB Endowment, vol. 2, no. 1; Aug. 2009, pp. 922-933.
27. M. Stonebraker, D. Abadi, D.J. DeWitt, S. Madden, E. Paulson, A. Pavlo, A. Rasin, MapReduce and Parallel DBMSs: Friends or Foes?, Communications of the ACM, Vol. 53, no. 1; Jan. 2010, pp. 64-71.
28. A. McClean, R.C. Conceição, M. O'Halloran, A Comparison of MapReduce and Parallel Database Management Systems, ICONS: The Eighth International Conference on Systems; 2013.
29. D. Eadline, Is Hadoop the New HPC?, news article from Admin Magazine; 2015.
30. Source: http://docs.ceph.com/docs/master/architecture/; Ceph Architecture.
31. N. Ali, A. Devulapalli, D. Dalessandro, P. Wyckoff, P. Sadayappan, An OSD-Based Approach to Managing Directory Operations in Parallel File Systems; 2008, pp. 175-184.
32. S.A. Weil, S.A. Brandt, E.L. Miller, C. Maltzahn, CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data, IEEE Xplore; 2006.
33. S.A. Weil, Ceph: Reliable, scalable, and high-performance distributed storage; 2007.
34. J. Julian, Article on GlusterFS Replication Do's and Don'ts; 2013.
35. Source: About Gluster [Online: http://www.gluster.org/about/]; 2007.
36. Source: http://www.systemfabricworks.com/products/system-fabric-works-storage-solutions/lustre-hadoop
37. N. Rutman, White Paper on MapReduce on Lustre; 2011.
38. V.S. Patil, P.D. Soni, Hadoop Skeleton & Fault Tolerance in Hadoop Clusters, International Journal of Application or Innovation in Engineering & Management, Volume 2, Issue 2; February 2013, pp. 247-250.
39. J. Evans, Fault Tolerance in Hadoop for Work Migration, Technical Report CSCI B534 (Survey Paper), Indiana University; November 2011.
40. I. Goiri, F. Julià, J. Guitart, J. Torres, Checkpoint-Based Fault-Tolerant Infrastructure for Virtualized Service Providers, IEEE/IFIP Network Operations and Management Symposium, IEEE, Osaka, Japan; April 2010, pp. 455-462.
41. M.C. Srivas et al., Map-Reduce Ready Distributed File System, Patent Application No. 13/162,439; Dec 2011.
42. R. Buyya, J. Broberg, A. Goscinski, Cloud Computing: Principles and Paradigms, Wiley Publications; 2011.
43. A.T. Yimer, Investigating the Performance, Scalability & Reliability of a Distributed File System: Ceph; May 2011.
44. J. Edge, An introduction to GlusterFS; March 2015.
45. Source: http://www.gluster.org/community/documentation/index.php/Hadoop
46. B. Depardon, G.L. Mahec, C. Séguin, Analysis of Six Distributed File Systems; Feb 2013, pp. 2-37.
47. M.V. Neves, T. Ferreto, C.D. Rose, Scheduling MapReduce Jobs in HPC Clusters, in: Euro-Par '12: Proceedings of the 18th International Conference on Parallel Processing; 2012, pp. 179-190.
48. K. Singh, Learning Ceph; Jan 2015.