2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)
Evolution towards Distributed Storage in a Nutshell
Pistirica Sorin Andrei, Asavei Victor, Geanta Horia, Moldoveanu Florica, Moldoveanu Alin, Negru Catalin, Mocanu Mariana
Faculty of Automatic Control and Computers, University POLITEHNICA of Bucharest, Bucharest, Romania
{andrei.pistirica, victor.asavei, horia.geanta, florica.moldoveanu, alin.moldoveanu, catalin.negru, mariana.mocanu}@cs.pub.ro
Abstract—Distributed storage systems have greatly evolved due to the upsurge of cloud computing in the past several years. Distributed file systems inherit many components from centralized ones and use them in a distributed manner. There are two ways to grow storage capacity: by scaling up, or by scaling out and growing the number of storage devices in the system. The growth in the number of storage devices imposes many challenges related to interconnection protocols and topologies, error handling, data consistency, security and so on. In this article we study how distributed and parallel storages have evolved from direct-attached storages in terms of architecture, data management and organization, and how the new challenges imposed by data distribution have been solved. We have selected for study several of the most representative distributed storage solutions: Andrew File System, Google File System, General Parallel File System, Lustre and Ceph. First, we emphasize how the generic distributed storage layout was inspired by the structured disk layout (Berkeley Fast File System). Second, we describe the evolution path of distributed storages from a wide variety of perspectives, including distribution units, which are moving from blocks to objects due to their undeniable advantages, and distribution methods, which have evolved from lists much like inode maps to deterministic hash functions such as RUSH or CRUSH. Third, networks are evolving very fast in terms of topologies and protocols; using graph theory, researchers continuously improve different aspects of cluster networks. Fourth, storage security is a critical component, due to the demand for storing sensitive data for the long term and sharing it in a secure way, while impacting system performance as little as possible.
In terms of low level data organization, we have identified a general scheme that covers both centralized and distributed storages. Basically, distributed storages use the same main elements as centralized ones (i.e. metadata and data units), but in a distributed manner; therefore different data units are used to scatter data over the networked system: files, objects, blocks or data chunks. We have chosen to study a representative implementation for each method (e.g. the pioneering Andrew File System, which exports files, or Ceph, which uses objects to distribute data among the system nodes) and to emphasize their advantages and disadvantages from many architectural points of view. Another important aspect of distributed storages is the way everything is linked together, because it plays a very important role in the system's I/O performance. Hence, we have conducted studies of network topologies and network protocols, which continue to evolve very rapidly. Basically, network topologies are graphs, based on which we have identified several key characteristics such as maximum routing distance, network scalability, fault tolerance and cost. Obviously, these characteristics greatly influence the system performance. Storage systems often contain sensitive data, thus security methods are crucial. Because the data has to be shared and replicated in a secure way, such architectures pose many challenges. To identify these challenges, we have investigated the main security mechanisms used in distributed storage systems.
Keywords—File System, Distributed File System, Data Center Networks
I. INTRODUCTION
The main purpose of this article is to identify an evolutionary path towards distributed storage systems from an architectural point of view (starting from centralized storages), rather than by time of occurrence, and to serve as starting-point documentation for research on different topics related to distributed and parallel storages. This being the case, we focus on many important aspects of these storage systems designed for the infrastructures of complex systems such as clouds or high performance computing.
Firstly, it is very important to see how distributed storages were born based on scaling methods; thus we show how systems scale out, giving birth to different distributed storage types (e.g. SAN or NAS). Then, classifications help to identify the main characteristics of storages and the deficiencies of existing implementations. Thus, we propose a genuine and comprehensive classification of storage systems based on four main characteristics: locality, sharing, distribution level and semantics of concurrent access.
This article is organized in several sections: section II presents the storage system classification and scaling methods, section III identifies an evolution path of distributed storage systems, section IV presents the evolution of data center networks, section V studies the security of distributed storage systems and section VI concludes this work.
Fig. 2. Storage system classification

TABLE I. SHARING SEMANTICS
Semantics                Description
POSIX Semantics          Every operation on a file is instantly visible to all clients.
Session Semantics        No changes are visible until the file is closed.
Immutable Files          No updates are possible.
Transactional Semantics  Based on ACID properties – all or nothing property.
Fig. 1. Storage Systems Scale-out diagram
II. THE STORAGE SYSTEM
Storage systems comprise storage devices and a collection of software modules for managing data units. In a distributed system, both the storage devices and the software components are distributed among cluster nodes over a computer network. In the next subsections we study storage scaling methods and propose a storage systems classification.
A. Storage Scaling
Basically, there are three main types of data organization: structured, log-based and tree-based. Unix-like (*nix) systems often use a structured storage organization named the Unix File System (also called the Berkeley Fast File System). In 2007, the B-tree File System (BTRFS) was introduced by Oracle. It scales better than BFFS and also shows performance improvements, due to the fact that inodes and data are stored together. Linux is the most successful Unix-like system and supports a very large variety of file systems, including BFFS and BTRFS. The ext file system architecture (the Linux implementation of BFFS) has evolved to meet storage space needs up to the present version 4. Log-based file systems are not usually used in distributed storages; rather, they are found on optical or flash devices. In recent years, cloud computing has seen an upsurge and must therefore support a large number of applications requiring very big storage spaces of petascale or even exascale size.
In Fig. 1 we present the scale-out methods following an architectural evolution based on file system layers rather than time of occurrence. Basically, any storage system has several basic layers of abstraction: the storage device (the hardware), the block or object layer (an uninterpreted sequence of bits stored on storage devices), the file layer (typed data managed by the operating system) and the application layer. Depending on where the system is decoupled, different networked systems are built (e.g. SAN-FS scales at the device I/O level, while NAS-FS uses application-level protocols (CIFS, NFS etc.) to access files).
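As an illustration only (the enumeration and function names below are ours, not the paper's), the layer stack and the decoupling rule just described can be written down directly:

```python
# Hypothetical sketch of the abstraction layers described above and of how the point
# where the stack is decoupled over a network yields different system types.
from enum import Enum, auto

class Layer(Enum):
    STORAGE_DEVICE = auto()   # the hardware
    BLOCK_OR_OBJECT = auto()  # uninterpreted sequence of bits on the device
    FILE = auto()             # typed data managed by the operating system
    APPLICATION = auto()

# Decoupling point -> resulting networked storage type (per the text above).
NETWORKED_TYPE = {
    Layer.BLOCK_OR_OBJECT: "SAN-FS: network inserted at the device I/O level (e.g. SCSI over FC)",
    Layer.FILE: "NAS-FS: files accessed over application-level protocols (NFS, CIFS, ...)",
}

def classify(decoupled_at: Layer) -> str:
    return NETWORKED_TYPE.get(decoupled_at, "DAS: no network decoupling (direct-attached storage)")

if __name__ == "__main__":
    for layer in Layer:
        print(f"{layer.name:16s} -> {classify(layer)}")
```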
B. Sharing Semantics
One of the most important characteristics of any storage system is the file sharing semantics (i.e. the semantics of
concurrent reading and writing). We consider the taxonomy proposed by Tanenbaum in his book “Distributed Operating Systems” [11] (Table 1), because it is simple but comprehensive.
C. Storage Classification
A storage taxonomy is very important, because it helps to better understand the main characteristics of a storage system and to improve the existing solutions. We propose a storage systems classification based on four main characteristics: locality, sharing capability, distribution level and semantics. The main purpose of such a classification is to include any storage system regardless of its data organization. So, the system can be local or networked, data can be private or shared among tenants, and the distribution level depends on the scattered data unit (data container, file, chunk, object or block), Fig. 2. SNIA (Storage Networking Industry Association) proposes a different taxonomy [1] that splits file systems into three major categories: local, shared and networked. The two classifications are not disjoint, but in our opinion SNIA's taxonomy is a mixture of architectural considerations and intrinsic storage characteristics, while we propose a classification based only on characteristics, which includes systems with any kind of data organization or architecture.
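For illustration only (this code is not part of the original paper), the proposed four-axis classification can be encoded as a small data structure; the sample values anticipate the mapping of the surveyed systems given later in Table 3.

```python
# Hypothetical encoding of the proposed four-axis storage taxonomy; the sample entries
# mirror the classification of the surveyed systems discussed later (Table 3).
from dataclasses import dataclass

@dataclass(frozen=True)
class StorageClass:
    locality: str            # "local" or "networked"
    sharing: str             # "private" or "shared"
    distribution_level: str  # scattered data unit: container, file, chunk, block or object
    semantics: str           # POSIX, session, immutable or transactional (Table 1)

TAXONOMY = {
    "AFS":    StorageClass("networked", "shared", "file",   "session"),
    "GFS":    StorageClass("networked", "shared", "chunk",  "mimic POSIX"),
    "GPFS":   StorageClass("networked", "shared", "block",  "POSIX"),
    "Lustre": StorageClass("networked", "shared", "object", "POSIX"),
    "Ceph":   StorageClass("networked", "shared", "object", "POSIX"),
}

if __name__ == "__main__":
    for name, c in TAXONOMY.items():
        print(f"{name:6s} {c.locality:9s} {c.sharing:6s} {c.distribution_level:6s} {c.semantics}")
```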
III. STORAGE ARCHITECTURE EVOLUTION
In this section we identify an evolution path, from a technical perspective, of several of the most important storage systems, targeting different important aspects: low level organization of data units, architectural details and implemented sharing semantics. At the end of this section, we conclude with a comparison of their characteristics, emphasizing the advantages and disadvantages of each of them.
A. Low Level Organization
Regardless of the data organization, in essence any storage system has two main components: metadata and a data area.
Further, we consider the BFFS structured storage system and place these two components in two layers (Fig. 3). Basically, metadata is the means of storage organization, while the data area is a flat address space of units (e.g. blocks or objects). Each file is composed of a variable number of data units (blocks or objects), which are spread across the data area of a disk partition or across a number of storage nodes of a distributed storage system (e.g. RADOS).

Fig. 3. Centralized to Distributed Storage Systems Layout

Fig. 4. Andrew File System Architecture

B. Distributed Storage Architectures
There are many implementations of distributed storage systems, of which we have chosen several of the most representative ones to cover the most important topics and to emphasize how the architecture has evolved: AFS (Andrew File System), GPFS (General Parallel File System), GFS (Google File System), Lustre and Ceph.
1) Andrew File System
Andrew File System (AFS) is a pioneer and one of the most representative NAS (Network Attached Storage) systems; it uses files as the data unit for distribution. AFS organizes data in cells and volumes [12]. Each volume is managed by a server, while cells are logical divisions of the file space (in an analogy with centralized storages, a volume can be seen as a partition). Files are hierarchically organized in sub-trees grouped in cells, Fig. 4. Redundancy is achieved with a special type of volume named read-only copy, which is also used to obtain high availability and workload balancing.
2) Google File System
Google File System (GFS) [2] was built to manage and manipulate huge amounts of data, capitalizing on the strengths of expensive and reliable servers while compensating for the weaknesses of cheap hardware. It is not an open source system, but it serves as a model for other systems (e.g. Hadoop HDFS [13] is an open source implementation that follows the GFS architecture). In essence, the GFS architecture is composed of one active GFS master server and a few clones (for failover scenarios) that manage the file system's metadata, and hundreds of GFS chunk servers that store file chunks in a distributed manner. The system implements a series of mechanisms to support a huge number of clients that can access a consistent and shared file system in parallel (Fig. 5). GFS's main characteristics are:
• Files are divided into big data chunks for I/O performance optimization and distributed across a storage cluster built from cheap commodity hardware;
• It uses only one active server for metadata manipulation (atomic metadata mutations) to simplify the system design;
• Profiling showed that files are rarely modified in place and appends are far more frequent, therefore the system guarantees a simplified, POSIX-like sharing semantics model (Table 2);
• It rebalances the data in the cluster periodically;
• Logging and checkpointing provide chunk server fault tolerance;
• Snapshotting creates branch copies.

Fig. 5. Google File System

TABLE II. GFS SEMANTICS
Actions             Writes                     Appends
Serial success      Defined                    Defined intercalated with inconsistent
Concurrent success  Consistent and undefined   Defined intercalated with inconsistent
Failures            Inconsistent               Inconsistent
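As a rough, hypothetical illustration of chunk-based distribution: the 64 MB chunk size is the one reported in the GFS paper [2], while the round-robin placement, replica count and names below are our own simplification (the real GFS master chooses chunk handles and replica locations itself).

```python
# Simplified sketch of splitting a file into fixed-size chunks and assigning each chunk,
# with replicas, to chunk servers. Placement policy and names are illustrative only.
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size reported for GFS

def chunk_placement(file_size: int, servers: list[str], replicas: int = 3):
    """Return a list of (chunk_index, [servers]) pairs using a toy round-robin policy."""
    n_chunks = (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE
    layout = []
    for idx in range(n_chunks):
        chosen = [servers[(idx + r) % len(servers)] for r in range(replicas)]
        layout.append((idx, chosen))
    return layout

if __name__ == "__main__":
    servers = [f"chunkserver-{i}" for i in range(5)]
    for idx, where in chunk_placement(200 * 1024 * 1024, servers):
        print(f"chunk {idx}: {where}")
```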
Fig. 6. a) Lustre Architecture; b) GPFS Architecture; c) Ceph Architecture

3) General Parallel File System
General Parallel File System (GPFS) [7] is the distributed and parallel file system solution from IBM. Like GFS, it is closed source and many details of its architecture remain unpublished. In contrast with GFS, GPFS supports RAID systems over Fibre Channel SANs, but it also has the ability to use LAN systems through NSD (Network Shared Disk), which enables block access to data. The choice between these two access modes depends on the system's purpose: accessing data through a SAN is much faster, but requires more expensive network equipment, while LAN access is less expensive, but much slower.
GPFS supports deployments of thousands of storage devices, spreading files across them and achieving very good availability. It supports replication, logging and recovery capabilities, thus it is a fault-tolerant system as well. GFS has no published data placement policy, while GPFS distributes data by means of two types of policies: the file placement policy, which distributes data to a specific set of disks (a disk pool), and the file management policy, which moves or replicates data across the system's disks. The file system namespace can be split into small groups by means of file sets, achieving management at a smaller granularity. Another unique GPFS characteristic is that metadata management is spread across all GPFS clients, so a single point of failure is avoided. It is also highly configurable and supports three deployment types: shared-disk, networked I/O and multi-cluster (Fig. 6 b).
4) Lustre File System
Lustre [21] is an open source system (GPLv2 licensed) that has several unique particularities. Briefly, it is composed of three main entities: metadata servers (for namespace management), management servers (used to manage all Lustre file systems) and object storage servers (used to manage objects stored on targets), Fig. 6 a. In contrast with GFS and GPFS, it splits files into objects, which are more autonomous storage units. High availability and fault tolerance of the metadata and of the stored data are achieved through failover configurations (redundant systems). Each Object Storage Server (OSS) can have a failover configuration achieved by using more targets (OSTs); a similar mechanism is used for Metadata Servers (MDS) and Management Servers (MGS). Lustre also distributes namespaces over a cluster of MDSs and MDTs: the Distributed Namespace (DNE) cluster, formerly Clustered Metadata (CMD), in which directories are striped across several MDTs. The data objects are managed by keeping a map similar to the inode map of BFFS systems; this can be considered a drawback, since it puts pressure on the storage servers. Lustre supports configurations using commodity storage devices, but also SAN-based storage.
5) Ceph File System
Ceph [3] is a newer open source system available in the Linux kernel. It gathered many advantages from previously existing implementations and distributes everything (namespaces, monitoring and data storage), making it very scalable, but also very complex; it is considered the new dream distributed file system. Like Lustre, it splits files into objects and distributes them across a cluster of Object Storage Devices (OSDs). One of the main characteristics of this system is that it completely separates metadata management from data storage (RADOS – Reliable, Autonomic Distributed Object Store), Fig. 6 c.
a) RADOS
RADOS is composed of a large OSD cluster and a small cluster of monitors acting as supervisors (keeping the cluster synchronized via PAXOS, a quorum-based algorithm). The OSD cluster layout and state are described by a hierarchical structure named the cluster_map. RADOS [4] distributes and replicates objects (providing high availability and fault tolerance) by means of a deterministic hash function: CRUSH (Controlled Replication Under Scalable Hashing). The degree of replication declustering is controlled by means of Placement Groups (PGs): each object requiring a specific replication degree r is mapped to a certain PG (pgid ← (r, hash(oid))). The placement of objects within the OSD cluster is controlled by policies named placement_rules. To determine the list of object locations, CRUSH [5] generates this list using the cluster_map, the placement group id and the placement_rules. The big advantage of using a deterministic hash function is that it can run on any system entity, thus eliminating the pressure on a dedicated server.
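To make the object → placement group → OSD pipeline concrete, here is a deliberately simplified, hypothetical placement function in the spirit of CRUSH. It is not the published CRUSH algorithm (which descends the hierarchical cluster_map according to placement_rules), but it shows why any client can compute locations deterministically without asking a dedicated lookup server.

```python
# Toy deterministic placement in the spirit of RADOS/CRUSH: every client computes the
# same object -> placement group -> OSD mapping locally. Illustration only; real CRUSH
# walks a hierarchical cluster map under placement rules.
import hashlib

def _h(data: str) -> int:
    return int.from_bytes(hashlib.sha256(data.encode()).digest()[:8], "big")

def pg_for_object(oid: str, pg_count: int) -> int:
    """Map an object id to a placement group: pgid <- hash(oid) mod pg_count."""
    return _h(oid) % pg_count

def osds_for_pg(pgid: int, osds: list[str], replicas: int) -> list[str]:
    """Pick 'replicas' distinct OSDs for a PG by ranking OSDs with a deterministic hash."""
    ranked = sorted(osds, key=lambda osd: _h(f"{pgid}:{osd}"))
    return ranked[:replicas]

if __name__ == "__main__":
    cluster = [f"osd.{i}" for i in range(8)]
    pgid = pg_for_object("volume1/object42", pg_count=128)
    print("pgid:", pgid, "->", osds_for_pg(pgid, cluster, replicas=3))
```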
b) Namespace management
The namespace management is distributed across a metadata cluster, which uses a technique named adaptive workload distribution – dynamic sub-tree partitioning – to achieve scalable performance. Basically, Ceph migrates or replicates pieces of the tree management at directory fragment level (a smaller granularity), rather than entire directories, thus achieving better results. The replication and migration decisions are taken based on a popularity metric, and therefore metadata high availability is achieved. Despite its advantages, there is an important drawback of distributing namespace management, given by the cost of management when the number of nodes is large.
C. Comparison
Firstly, the mapping of the systems described in the previous sections onto our proposed taxonomy is depicted in Table 3.

TABLE III. STORAGE SYSTEM TAXONOMY
Storage System  Locality   Sharing  Distribution Level  Sharing Semantics
AFS             Networked  Shared   File                Session
GFS             Networked  Shared   Chunk               Mimic POSIX
GPFS            Networked  Shared   Block               POSIX
Lustre          Networked  Shared   Object              POSIX
Ceph            Networked  Shared   Object              POSIX

• The distribution address space (locating the pieces of data of each file in the storage system) is managed in basically two ways: by using maps (similar to the inode map of the extX file systems) or by deterministic hash functions (CRUSH [5] or RUSH [6]). The drawback of using maps is that they increase the pressure on dedicated servers (a Ceph advantage over Lustre).
• Distributing metadata management has pros and cons. In essence, distributed management comes with better I/O performance, but a more complex architecture. Ceph and GPFS handle metadata in a distributed way, while GFS/HDFS uses shadow servers only for high availability; the actual activities are handled only by the active server.
• The distribution level refers to the data unit used for distribution: file, object and so on. The distribution level of each file system is listed in Table 3. OSDs [8] have several important features, including better data security (per object), flexible size (which can be adjusted to increase I/O performance), a standard API [9] and local space management. Basically, Lustre and Ceph use OSDs, but the future leads to OSDs for AFS (OpenAFS) and GPFS as well.
• GPFS is sensitive to storage device failures, because it usually uses a declustered RAID approach. There is also a trend to use Object RAID [10], where PanFS is leading.
• By lowering the hardware price and using commodity hardware, failures become common behavior rather than an exception. So, file system solutions have to implement different replication mechanisms to overcome this issue. Basically, all considered systems have such mechanisms.
• Because AFS distributes files, it is more sensitive from the fault tolerance and high availability point of view. In fact, it distributes volume clones instead of small pieces of files. Nevertheless, OpenAFS has a solution to use OSDs and distribute objects.
• File sharing semantics is a very important characteristic. Each application must be aware of and handle different situations related to data integrity and coherency. The described systems have different implementations, as follows: AFS: Session Semantics (weak); GPFS, Lustre and Ceph: POSIX Semantics; GFS/HDFS: mimic POSIX Semantics (more relaxed semantics).
• Another important characteristic is the method used by clients to access the storage system. There are two main categories: using a conventional API (Berkeley socket API) or using a custom API. A conventional API has the advantage of transparency, while a custom API offers fine-tuning functionality to increase performance. Ceph mixes the two methods and has enriched the conventional API.
IV. DATA CENTER NETWORKS
Storage systems represent the foundation of data centers and, besides being storages, they are also network-based systems. In essence, data center networking has two main topics: the interconnection topology and the protocol stack.
A. Interconnection Topologies
Basically, the network topology of a data center is a graph used to connect its nodes [14]. We have identified several of the most important graphs suitable for such networks, as follows:
1) Complete graph: a complete graph (or full mesh) is an undirected graph where each pair of nodes is connected by one edge;
2) 2D Torus(k): a 2D torus is a grid of nodes where each node is connected to its nearest neighbors, with wrap-around links to the corresponding nodes on the opposite edges of the array;
3) Fat-tree(h): a fat-tree is a tree whose edges become fatter as one moves towards the root (in our research we consider fat-trees to be binary trees, but that is not a necessary requirement);
4) Hypertree(k,h): a hypertree is a type of 3D tree of height h and degree k. It can be viewed as a complete binary tree of height h from the bottom up and as a complete k-ary tree (k ≥ 2) from the top down;
5) Hypercube(k): a hypercube of degree k is a graph with 2^k nodes and a total of k·2^(k-1) edges, where each node has exactly k neighbors and the dimensions are mutually perpendicular;
6) Cube-connected-cycles(k): a cube-connected-cycles graph is obtained by replacing each node of a hypercube of degree d with a cycle of length k, where k ≥ d (if k = d, CCC(d,k) is simply written CCC(k)).
B. Interconnection Characteristics
The topology characteristics are very important, because they influence key aspects of a data center, including path length, system cost, bandwidth and scalability.
Path length is defined by the graph diameter (i.e. maximum routing distance) and scalability is defined by the node degree, number of nodes and edges.
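For reference, the node counts, edge counts and diameters of several of the topologies listed above follow standard closed-form graph-theoretic formulas; the small helper below is our own tabulation, not part of the original paper.

```python
# Closed-form characteristics (node count, edge count, diameter, node degree) for a few
# of the topologies discussed above. Formulas are standard graph theory.
def complete_graph(n):
    return {"nodes": n, "edges": n * (n - 1) // 2, "diameter": 1, "degree": n - 1}

def torus_2d(k):  # k x k grid with wrap-around links
    return {"nodes": k * k, "edges": 2 * k * k, "diameter": 2 * (k // 2), "degree": 4}

def hypercube(k):
    return {"nodes": 2 ** k, "edges": k * 2 ** (k - 1), "diameter": k, "degree": k}

def binary_fat_tree(h):  # complete binary tree of height h with fatter links towards the root
    return {"nodes": 2 ** (h + 1) - 1, "edges": 2 ** (h + 1) - 2, "diameter": 2 * h, "degree": "1..3"}

if __name__ == "__main__":
    print("Complete(16):", complete_graph(16))
    print("2D Torus(4): ", torus_2d(4))
    print("Hypercube(4):", hypercube(4))
    print("Fat-tree(4): ", binary_fat_tree(4))
```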
In our research we do not take into consideration the full/half duplex characteristic; basically, all links can be considered full duplex.
• The diameter of fat-trees and hypertrees is the same (2h), but the hypertree has the advantage that it uses a constant link speed on all levels, while fat-trees must provide fatter links (doubling the speed of each edge) toward the tree root.
• The node degree is constant for the torus, fat-trees and hypertrees. Nevertheless, hypertrees have the ability to scale horizontally by changing the k-ary dimension. The node degree of the hypercube grows faster with the number of nodes than in the other topologies.
• Fault tolerance is a very important requirement. Except for fat-trees, all the other topologies offer alternative routing paths. Hypertrees preserve their diameter in the presence of faulty nodes, a very important advantage. Cube-style topologies are the most fault-tolerant networks, since they have many links between any two nodes.
Researchers continuously seek better topologies and consider new graphs (e.g. De Bruijn graphs have been taken into consideration: a large number of nodes, few connections per node, and short distances are preserved). When designing a topology, the chosen graph depends on the traffic profile (east-west/north-south, low latency or high throughput) and, of course, on the balance between costs and performance.
C. Protocol Stacks
Basically, the protocol stack is imposed by the distributed storage system type (SAN/NAS). Most SAN file systems use SCSI (Small Computer System Interface) to communicate with disk devices, therefore the protocol stack encapsulates SCSI requests and carries them from applications to block devices. Since SCSI has no mechanism for contention or retransmission, the protocol stack has to ensure a lossless environment. Losslessness is guaranteed by two types of networks: Fibre Channel, and Ethernet enriched with a group of IEEE protocols named DCB (Data Center Bridging), comprising PFC (Priority Flow Control), ETS (Enhanced Transmission Selection), DCBX (Data Center Bridging Exchange) and QCN (Quantized Congestion Notification). Basically, DCB enables the convergence of the LAN and the storage network in a data center; it allows sharing the same infrastructure, which reduces capital costs and makes management much simpler. For NAS file systems, one of the oldest architectures, which became popular with AFS in 1988, the protocol stack is much simpler, because it does not require a lossless environment. The application layer communicates with the file servers through a group of protocols (NFS, CIFS, SMB, AFS and so on) encapsulated into the well-known TCP/IP stack.
D. Real World
In the real world, the cost of the data center network is one of the most important criteria, therefore the trade-off between price and performance plays a strategic role in designing a storage system. The most expensive parts of a data center network are the switches, mainly their ports; switches also determine the way the network scales, through the number of ports and their speed. Switch ports evolved in the past decade from 1 Gb/s to 100 Gb/s. Nowadays an L2/L3 switch offers a 10 Gb/s port at about $1250 and a 40 Gb/s port at about $3000, while 100 Gb/s ports are still very expensive, at about $12000 – faster links increase expenses.
Other than capital costs, the choice also depends on the data center performance requirements, such as latency and throughput. In this case the switch ports are chosen based on two criteria: to match the transfer speed of each storage node and to support the aggregate speed of a number of storage nodes. In a SAN environment, for 512 KB chunk transfers at 25 MB/s per device, a 1 Gb/s port supports up to 2-3 RAID devices and a 10 Gb/s port supports up to 25 RAID devices [15]. A fat-tree topology is expensive because of the high bandwidth needed while moving up the tree; nevertheless, there are methods to mitigate this, such as link aggregation (LAG). A butterfly topology requires fast links, because the topology must accommodate the entire traffic, and it does not have the fault tolerance property of the other topologies. Meshes and tori are not very expensive, but they need extra connections to the storages and are usually a good fit for storage systems built from storage bricks, as proposed by IBM and HP. Hypercubes are a special case of tori, but their bandwidth scales better [15]. In the last years the growth of data centers, along with cloud computing, has led to a very competitive market for network equipment, and companies are shifting from big and expensive equipment to low-cost equipment. The evolution of hardware has led to new software trends that help lower the capital costs by decoupling the data path from the management path: Software Defined Networking (SDN).
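As a back-of-the-envelope check of the figures above (our own arithmetic, not from [15]): dividing the port line rate by the quoted 25 MB/s per-device throughput gives a theoretical upper bound; the practical numbers reported in the text are lower because of protocol overhead and contention.

```python
# Rough estimate of how many storage devices a single switch port can feed at line rate,
# assuming the 25 MB/s per-device throughput quoted above and ignoring protocol overhead;
# the practical figures reported in [15] are lower for this reason.
def devices_per_port(port_gbps: float, device_mbps: float = 25.0) -> float:
    port_mb_per_s = port_gbps * 1000.0 / 8.0  # Gb/s -> MB/s (decimal units)
    return port_mb_per_s / device_mbps

if __name__ == "__main__":
    for speed in (1, 10, 40, 100):
        print(f"{speed:>3} Gb/s port: up to ~{devices_per_port(speed):.0f} devices at 25 MB/s each")
```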
V. STORAGE SECURITY
Distributed storage systems are inherently more open and exposed to attacks than centralized storage systems, the main reason being the insecure network of loosely connected nodes. In order to compare existing systems in terms of the level of security provided, Riedel et al. [19] defined a framework of six core security primitives: authentication, authorization, securing data on the wire, securing data on the disk, key distribution, and revocation. Kher and Kim [16] take a similar approach, the fundamental security services identified being: authentication and authorization, availability, confidentiality
and integrity, key sharing and key management, auditing and intrusion detection, usability, manageability and performance.
Authentication is a fundamental service. AFS uses a slight variation of Kerberos (version 4) for mutual client–server authentication [20]. Lustre [21] is more flexible and supports GSS-API backends like Kerberos, LIPKEY and OPEN. Process authentication groups (PAGs) are used to organize processes that are linked to an authentication event; this provides more fine-grained control than a uid-based mechanism. AFS and Lustre use PAGs, while NFS does not, being vulnerable to root setuid attacks.
Authorization, or access control, is performed by matching the identity of the requester against the ACL attached to the object being accessed. Although POSIX.1e ACLs have removed some of the limitations of the initial POSIX model and are supported on ext2, NFS etc., some of the DFSs implement their own non-standard access control. For example, AFS uses group inheritance for easier administration. Positive and negative access lists are associated with directories (not files). Venus clients emulate POSIX semantics for files, which are ignored by Vice servers. Negative access lists take precedence over the positive ones and speed up the access revocation process by being propagated ahead of the positive rights.
Securing data on the disk should be preferred to securing data on the wire for several reasons: securing data on the disk at the user endpoints provides confidentiality both for data at rest and in transit, allowing the storage services to be outsourced to untrusted servers (public cloud storage); it limits the damage done by data theft; it offers better performance when using optimizations like lazy or automatic revocation of authorization; and it makes the system scale more easily, since compute-intensive operations are performed at the endpoints. AFS provides message and data confidentiality on the wire – meaning encrypted communication between a Venus client and a Vice server, be it an authentication server or a file server – but data on the disk is not secured, so trusting the servers is a mandatory requirement. More recent distributed file systems – for example Lustre – are able to secure data on the disk. Thus, the effort shifts to designing efficient key management (sharing/distribution, revocation) schemes, access control (authorization) models with just the right granularity, and searching over encrypted data.
Maat [17] is a security protocol implemented in Ceph [3] that uses Merkle trees to define the authorized users for a group of files and proposes novel techniques to improve the performance of petascale file systems and HPC: extended capabilities, automatic revocation and secure delegation. Horus [18] is a proposed data encryption method that introduces a novel idea: generating different region-based encryption keys using Merkle trees, in a structure named the Keyed Hash Tree (KHT). The KHT allows the generation of keys for different ranges of blocks in a file shared by different clients, each with its own permissions over different regions of that file. The main advantage of this method is that it offers security at different granularities, but the KHT height and the region size at each KHT level must be pre-set (a scalability penalty).
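The key idea behind the KHT can be sketched as follows; this is a minimal illustration in the spirit of Horus, with a simplified tree shape and key-derivation rule of our own choosing, not the actual Horus implementation.

```python
# Minimal keyed-hash-tree sketch in the spirit of Horus/KHT: a child key is derived from
# its parent key, so handing out a subtree root key grants access to a contiguous block
# range only. Simplified illustration, not the actual Horus scheme.
import hashlib
import hmac

def child_key(parent_key: bytes, child_index: int) -> bytes:
    return hmac.new(parent_key, child_index.to_bytes(4, "big"), hashlib.sha256).digest()

def block_key(root_key: bytes, block: int, height: int, fanout: int = 2) -> bytes:
    """Walk down the tree from a (sub)root key to the leaf key protecting 'block'."""
    key = root_key
    span = fanout ** height          # number of blocks covered by this (sub)root
    for _ in range(height):
        span //= fanout
        key = child_key(key, block // span)
        block %= span
    return key

if __name__ == "__main__":
    root = b"\x00" * 32                # per-file master key (placeholder)
    range_key = child_key(root, 1)     # with fanout=2, height=3: right subtree = blocks 4..7
    # A client authorized only for blocks 4..7 derives leaf keys from range_key locally:
    print(block_key(root, 5, height=3).hex()[:16])
    print(block_key(range_key, 5 - 4, height=2).hex()[:16])  # same leaf key
```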
The basis for fully homomorphic encryption schemes has already been set [22]; such schemes allow computing arbitrary functions (e.g. search) over encrypted data. There are ongoing efforts to make FHE schemes practical in terms of performance [23].
VI. CONCLUSIONS
Based on the layers of structured storage systems, we showed how various distributed storages are constructed, and we proposed a simple yet comprehensive taxonomy to classify distributed storages regardless of their architectures. We have analyzed several of the most important distributed storage systems (AFS, GFS, GPFS, Lustre and Ceph) and identified their key technical characteristics, emphasizing the different advantages and disadvantages of each of them. In terms of low level layout, we have identified the similarities between distributed and centralized storages (DAS). We have shown that storage systems are moving towards the distribution of smaller units (objects), by adding new layers (e.g. AFS), to achieve better data high availability and fault tolerance. The management of the distribution address space is moving towards hash functions, to lower the pressure on dedicated servers. Metadata management has been distributed as well, to increase the system's I/O performance, but this also increases the architectural complexity.
As we showed, data center networks have two main topics: topology architectures and protocols. Researchers are continuously seeking better ways to connect the nodes of a cluster (new graphs), as well as combinations of existing ones, to better serve cluster needs. Protocols are more expensive for high performance storages like SAN-FS, but solutions have been found to lower the costs by using DCB protocols over Ethernet (network convergence).
Security plays a very important role in distributed storages due to the insecure network of loosely connected nodes. Based on six main security primitives, we have identified how storage security evolved along with Kerberos, PAGs and POSIX.1e ACLs. Ceph introduced two novel components: the Maat protocol and the Horus encryption method. A new trend in storage systems is storage as a service (STaaS), which imposes many challenges related to storage elasticity, service level agreements, scalability and security.
VII. FUTURE WORK
In the future we plan to perform performance measurements on open source storage systems and to investigate the penalties incurred by storage systems in converged networks using different types of topologies.
ACKNOWLEDGMENT
Part of the research presented in this paper has been funded by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398. This work is also supported by the CyberWater grant of the Romanian National Authority for Scientific Research, CNDI-UEFISCDI, project number 47/2012.
REFERENCES
[1] Christian Bandulet, "The Evolution of File Systems", Storage Networking Industry Association, 2012, http://www.sniaeurope.org/objects_store/Christian_Bandulet_SNIATutorial%20Basics_EvolutionFileSystems.pdf
[2] Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, "The Google File System", 19th ACM Symposium on Operating Systems Principles, pp. 29-43, December 2003
[3] Sage A. Weil, Scott A. Brandt, Ethan L. Miller and Darrell D. E. Long, "Ceph: A Scalable, High-Performance Distributed File System", 7th Conference on Operating Systems Design and Implementation, November 2006
[4] Sage A. Weil, Andrew W. Leung, Scott A. Brandt and Carlos Maltzahn, "RADOS: A Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters", Petascale Data Storage Workshop, November 2007
[5] Sage A. Weil, Scott A. Brandt, Ethan L. Miller and Carlos Maltzahn, "CRUSH: Controlled, Scalable, Decentralized Placement of Replicated Data", Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, 2006
[6] R. J. Honicky and E. L. Miller, "RUSH: Balanced, Decentralized Distribution for Replicated Data in Scalable Storage Clusters", Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pp. 146-156, 2003
[7] Frank Schmuck and Roger Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters", Proceedings of the Conference on File and Storage Technologies, pp. 231-244, January 2002
[8] Michael Factor, Kalman Meth, Dalit Naor, Julian Satran and Ohad Rodeh, "Object Storage: The Future Building Block for Storage Systems", Local to Global Data Interoperability – Challenges and Technologies, IEEE Computer Society, pp. 119-123, 2005
[9] Information Technology – SCSI Object Based Storage Device Commands-2 (OSD-2), Revision 4, 24 July 2008
[10] PanFS, Object RAID, https://www.panasas.com/products/panfs/objectraid
[11] Andrew S. Tanenbaum, "Distributed Operating Systems", Prentice Hall, 1994
[12] OpenAFS, open source implementation of the Andrew Distributed File System, http://www.openafs.org/
[13] Konstantin Shvachko, Hairong Kuang, Sanjay Radia and Robert Chansler, "The Hadoop Distributed File System", 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies, pp. 1-10
[14] J. R. Goodman and C. H. Séquin, "Hypertree: A Multiprocessor Interconnection Topology", IEEE Transactions on Computers, pp. 923-933, 1981
[15] Andy D. Hospodor and Ethan L. Miller, "Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems", 12th NASA Goddard Conference on Mass Storage Systems and Technologies, April 2004
[16] Vishal Kher and Yongdae Kim, "Securing Distributed Storage: Challenges, Techniques and Systems", Proceedings of the 2005 ACM Workshop on Storage Security and Survivability, pp. 9-25
[17] Andrew W. Leung, Ethan L. Miller and Stephanie Jones, "Scalable Security for Petascale Parallel File Systems", Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007
[18] Yan Li, Nakul Sanjay Dhotre, Yasuhiro Ohara, Thomas M. Kroeger, Ethan L. Miller and Darrell D. E. Long, "Horus: Fine-Grained Encryption-Based Security for Large-Scale Storage", Proceedings of the Sixth Workshop on Parallel Data Storage, pp. 19-24, 2013
[19] Erik Riedel, Mahesh Kallahalla and Ram Swaminathan, "A Framework for Evaluating Storage System Security", Proceedings of the 1st Conference on File and Storage Technologies, Volume 2, pp. 15-30, January 2002
[20] Mahadev Satyanarayanan, "Integrating Security in a Large Distributed System", ACM Transactions on Computer Systems (TOCS) 7(3), pp. 247-280, 1989
[21] Peter J. Braam, "The Lustre Storage Architecture", 2004, http://www.cs.ucsb.edu/~ckrintz/classes/f11/cs290/papers/lustre2004.pdf
[22] Craig Gentry, "Fully Homomorphic Encryption Using Ideal Lattices", Proceedings of the 41st Annual ACM Symposium on Theory of Computing (STOC '09), pp. 169-178, ACM, 2009, doi:10.1145/1536414.1536440
[23] V. Vaikuntanathan, "Computing Blindfolded: New Developments in Fully Homomorphic Encryption", 2011 IEEE 52nd Annual Symposium on Foundations of Computer Science (FOCS), pp. 5-16, doi:10.1109/FOCS.2011.98
[24] Catalin Negru, Florin Pop, Valentin Cristea, Nik Bessis and Jing Li, "Energy Efficient Cloud Storage Service: Key Issues and Challenges", Emerging Intelligent Data and Web Technologies (EIDWT), 2013 Fourth International Conference on, pp. 763-766, September 2013