ZettaDS: A Light-weight Distributed Storage System for Cluster

Likun Liu+, Yongwei Wu*, Guangwen Yang*, Weimin Zheng*
Department of Computer Science and Technology
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University
Beijing 100084, China
+llk04@mails.tsinghua.edu.cn, {wuyw, ygw, zwm-dcs}@tsinghua.edu.cn

Abstract

We have designed and implemented the Zetta Data Storage system (ZettaDS), a light-weight, scalable, distributed data storage system for clusters. While it shares many characteristics with modern distributed data storage systems, such as a single metadata server architecture and running on inexpensive commodity components, our system is very light-weight and aims to handle large numbers of small files efficiently. The emphases of our design are scalability of storage capacity and manageability; throughput and performance are considered secondary. Furthermore, ZettaDS is designed to minimize resource consumption, since it runs on a non-dedicated system. This paper describes the details and rationale of the design and implementation. We also evaluate the system with several experiments. The results demonstrate that our system uses storage space more efficiently and achieves better transfer performance when facing a large number of small files.

1. Introduction

We have designed and implemented the Zetta Data Storage system (ZettaDS), a light-weight, scalable, distributed data storage system for commodity clusters, to meet our own requirements of daily work. ZettaDS shares many goals with modern distributed storage systems (e.g., the Google File System (GFS)[10], Hadoop[1], etc.), such as reliability and availability. However, its design is driven by our specific requirements and the environment in which it will be used. The purpose of this light-weight distributed data storage system is to allow a commodity cluster to be used as a massive storage system. The primary goals are storage capacity, manageability, and ease of use; throughput and performance are considered secondary. This is mainly because ZettaDS was originally inspired by the demand for a scalable storage system to replace our own FTP service, which provides file storing and sharing for our laboratory. As both administrators and users of the FTP service, we have, during the past two years, suffered a lot both from adding new disks and from exhausting the storage space. Before our system came into service, all our FTP servers had almost run out of space, and new disks could not be added since no more slots were available.

Another requirement that drove us to design and implement ZettaDS is that we encounter some huge datasets beyond the capacity of any single server we currently have. One of our research projects, on draughts, has to store a huge amount of data as an endgame database. The database for the last 8 steps of the game has reached almost 0.8 terabytes, and the data needs to be stored together under a single I/O space for further processing. There are other circumstances with similar requirements, such as the trace files of our simulation experiments and the logs and archives of our real systems (our cluster logs and CGSV's archives). Besides, our laboratory has its own clusters: one consists of 16 nodes with 2x160 gigabytes of storage per node, while the other has 128 nodes with 2x36 gigabytes each. We expect to reuse these clusters for storing these data. This helps us increase the utilization of the clusters and save the cost of adding new devices. Furthermore, by storing the data on the cluster that will process it, we hope to improve the performance of data processing by redesigning our applications in the future.

Building ZettaDS was more an engineering effort to meet the needs mentioned above than a research project. The purpose of this paper is to describe what we did and why, rather than to advocate it. The paper is organized as follows: after this introduction, Section 2 describes ZettaDS' design issues, such as assumptions, architecture, interface, and consistency model; Section 3 gives the details of the implementation, including metadata storage, physical data organization, replica management, and manageability issues; Section 4 presents some experiments that evaluate ZettaDS; Section 5 describes related work; and Section 6 concludes the paper and proposes future work.

2. Design Overview

The most important lesson we have learned from our experience with system design and implementation is the value of simple design; hence, in designing and implementing ZettaDS, we try our best to keep it simple and keep it special-purpose.

2.1. Rationale

One might argue that we should use an existing distributed storage system such as NFS[5], GFS, Hadoop[1], or PVFS[8], rather than build our own. The existing systems might serve our circumstances very well and perhaps provide higher performance, since most of them have been tested under various conditions and are therefore mature. Indeed, we considered using mature software to build our system and tried Hadoop and NFS to solve our problems. Both of them work to some extent, but they are not suitable in the following respects.

Firstly, Hadoop, like GFS, is designed for large files and streaming access; it is not suited to our real applications, which have to handle many small files. Both systems maintain the metadata in the master's memory. This design makes master operations very fast and makes some operations feasible whose cost would be too high with disk storage (such as periodic scanning for garbage collection and re-replication). But it suffers from excessive memory consumption when facing too many small files. Even worse, in our circumstances the master is not dedicated and must still serve as a computing node of the cluster. The huge memory consumption makes it much slower than the other nodes and seriously degrades the performance of the whole system. Secondly, Hadoop uses three replicas to achieve high reliability, which is not space-efficient enough when capacity is the primary goal. Its Java implementation also consumes more resources, and its functionality designed to support MapReduce[9] is too complex for our current applications. What we need is simply a scalable storage system that stores data in a single I/O space. Finally, NFS also suffers from poor space utilization and transfer performance when handling a large number of small files. Besides, designing our own system makes upgrades easier when we want to modify it to support more applications. All these arguments led to a key decision: we chose to build our own system.

ZettaDS is intended to 1) provide massive storage capacity on a common cluster system while minimizing resource consumption, 2) provide a single I/O space for storing and accessing data, and 3) be highly manageable. We try our best to reduce the management work the administrator has to do and to make the system easier for users. Again, throughput and performance are considered secondary.

2.2. Assumptions

Some key reasons why we designed ZettaDS have been mentioned above; here we describe our assumptions:

• We have to deal with huge datasets. Storage capacity under a single I/O space is the primary goal. The system should be easily extended by adding more storage components. Compared with capacity, throughput and performance are not that critical and are therefore considered secondary.

• The sizes of files handled by our system range from several kilobytes to several gigabytes. We have to support the endgame database and simulation traces, which are often several gigabytes or more, but it is also important that we can deal with millions of small files, which are usually several kilobytes or even less. These files include word documents, images, web pages, source code, etc.

• Most of the time, a file in the system is written only once, when it is created. After that, only read access is allowed. Deletion is still required.

• The system should be very easy to manage and to use. More specifically, after a very simple installation, the system should bootstrap itself without extra manual intervention. Like GFS, it should support adding and removing storage components without stopping the system. The system should also be able to detect failed components and tolerate certain failures under a degraded service model, rather than suffering an entire system crash or service stop.

• The system will run on a non-dedicated cluster and should minimize resource consumption, especially bandwidth and memory. A large performance decline of the cluster due to running ZettaDS is unacceptable.

• Finally, the system should be kept as simple as possible, but also be easy to extend. Inspired by GFS, we also believe that in some circumstances (often, in fact) moving computation is much cheaper than moving data. We hope that, in the future, each cluster node can process the data stored on it. Although this is only a requirement for the future, it has affected our system design to a certain extent.

2.3. System architecture

Driven by the basic ideas and concepts of GFS, our system shares the same architecture: one MetaServer and many ChunkServers[10] (the same architecture is also used by Hadoop). But our system is much lighter weight. Figure 1 shows an overview of the system components. ZettaDS has three main components: the MetaServer (MS), the ChunkServers (CS), and a library (CL) that client applications link against.

Figure 1. ZettaDS overview. (Client applications link the ZettaDS library, which offers a file-system-like high-level API, a parallel slice transfer engine, a metadata cache, and a data slice cache, and communicates with the MetaServer and the ChunkServers of the cluster; an FTP server and a virtual network disk are provided as front ends.)

2.4. Interfaces

The interface provided by ZettaDS is rather low-level (primitive). Based on our assumption that a file is written only once, we designed different primitives for reading and writing. The writing process follows the routine create→write→commit. You can only write a file after you create it; once you commit it, you are never allowed to modify it again, except to delete it. The commit primitive is necessary to ensure that the file has been written correctly. The reading process uses the open→read→(close) primitives sequentially. A file can be re-opened with the open primitive, restricted to read-only access; closing it is optional, but a proactive close lets the MetaServer release resources more gracefully.

For convenience, our system also provides high-level APIs for programmers as well as some utilities for users. The high-level APIs do provide a WRITE operation, but it is implemented via a delete and a create primitive. All write operations to a file are cached on the client side before CLOSE, so you lose all of your modifications if the CLOSE API fails. It should be understood that modifying a very large file is a very expensive operation, although for small files it is still efficient. Besides, there is no guarantee that a file you are modifying cannot be deleted by another user.

The user utilities provided by ZettaDS include a special client, a modified FTP server, and a virtual disk (under development). You can still use the FTP protocol to stage files between your desktop and ZettaDS. You can also, by installing ZettaTools (the virtual disk client), use the data in ZettaDS like local data via a virtual network disk. The latter provides seamless compatibility with current applications.
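
To make the write-once primitive set concrete, the sketch below shows one way the client-library interface might be declared in C++. The class name, method signatures, and types are illustrative assumptions and are not taken from the ZettaDS sources.

// Hypothetical C++ client-library interface; names and signatures are
// illustrative only and do not come from the ZettaDS source.
#include <cstddef>
#include <cstdint>
#include <string>

class ZettaClient {
public:
    // Write path: a file is created, written, then committed exactly once.
    // After commit() the file becomes read-only; the only later mutation
    // allowed is remove().
    virtual int  create(const std::string& path) = 0;           // returns a file handle
    virtual long write(int fd, const void* buf, size_t len) = 0;
    virtual int  commit(int fd) = 0;                             // seals the file

    // Read path: open() is restricted to read-only access; close() is
    // optional but lets the MetaServer release resources earlier.
    virtual int  open(const std::string& path) = 0;
    virtual long read(int fd, void* buf, size_t len, uint64_t offset) = 0;
    virtual int  close(int fd) = 0;

    virtual int  remove(const std::string& path) = 0;            // delete primitive
    virtual ~ZettaClient() = default;
};

The explicit commit primitive is what allows the MetaServer to treat committed files as immutable and lets clients cache their metadata aggressively.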

2.5. Consistency model

ZettaDS has a very simple consistency model. Since no write operation is allowed after a file is committed, there is no data cache coherence problem. But there are still two issues to be addressed:

• ZettaDS provides no guarantee about synchronization. Synchronization and mutual exclusion must be performed by the application layer and are therefore advisory. This means it is valid for one client to delete a file while another client is reading it. In most cases this is not a problem, since our garbage collection mechanism leaves the physical data in place for a sufficient time before reclaiming it. In the worst case, it raises a read error in the client's API.

• Metadata cache incoherence exists in ZettaDS, as we allow clients both to cache metadata and to create or delete files independently. Both kinds of modification to metadata become visible to a client via an explicit metadata retrieval. Additionally, a read of a nonexistent file causes the associated cached metadata to be discarded and an implicit metadata retrieval to be performed, raising a read error if necessary, as sketched below.

Besides, file namespace mutations (e.g., file creation, deletion, and move) are atomic. They are handled exclusively by the MetaServer.
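
As a rough illustration of the second point, the client library's reaction to reading a file that turns out not to exist might look like the following sketch; the structure and helper names are hypothetical.

// Hypothetical sketch of the client-side reaction to reading a file that
// no longer exists; helper names are assumptions for illustration.
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct FileMeta { uint64_t size = 0; /* chunk list, timestamps, ... */ };

class MetaCache {
    std::unordered_map<std::string, FileMeta> cache_;

    // Stub standing in for an RPC to the MetaServer.
    std::optional<FileMeta> fetchFromMetaServer(const std::string&) {
        return std::nullopt;
    }

public:
    // Called when a read against cached metadata fails because the file is
    // gone: drop the stale entry, retrieve fresh metadata implicitly, and
    // tell the caller whether to retry the read or raise a read error.
    bool refreshAfterMissingFile(const std::string& path) {
        cache_.erase(path);                          // discard stale metadata
        if (auto fresh = fetchFromMetaServer(path)) {
            cache_[path] = *fresh;                   // file still exists: retry read
            return true;
        }
        return false;                                // surface a read error
    }
};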

3. Implementation

The core of ZettaDS is written in C++ and runs on the Linux platform. Both C++ and Java client libraries are provided. This section describes portions of the implementation in detail in order to highlight how the design serves our specific goals, without going deep into the source code.

3.1. Metadata management

Like GFS and Hadoop, our design uses a single MetaServer to maintain all the metadata, including the namespace, the mapping from files to chunks, and the mapping from chunks to ChunkServers. The biggest difference is that we store the metadata on disk rather than keeping it in memory. We chose this design for the following reasons:

• Our system has to handle very many small files.

• Resource consumption (especially memory consumption) must be minimized, since our MetaServer is not a dedicated machine.

• The workload of the MetaServer is not heavy. There is no need to keep all the metadata in memory to achieve high performance and short response times.

• The metadata cache on the client side greatly reduces the MetaServer's load.

Nevertheless, in order to reduce disk operations and improve performance under some special circumstances, we do provide a small metadata cache in memory, which consists of a hash table together with an LRU queue implemented as a doubly linked list. See Figure 2 for details.

Figure 2. Schematic diagram of the metadata cache (hash table plus doubly linked LRU list).

The size of the LRU cache is set to 1024 entries. A simple sample of our target datasets (see Table 1) shows that the average file-name length is about 20 bytes. Hence we have, on average, about 50 bytes of metadata per file, covering the file name, size, timestamp, uid, gid, and attributes, but not the chunk list. The whole metadata cache is therefore expected to stay below 64 kilobytes, a very small memory cost for a modern computer.

Table 1. Target dataset statistics

dataset   num of files   avg(size)   name-size   avg(name-length)
FTP            119262      3.62M       1.9M          16.39
user1          478482      1.65M       9.89M         20.68
user2           17138      2.98M       1.55M          9.02
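
The metadata cache described above (a hash table indexing a doubly linked LRU list of 1024 entries) can be realized compactly; the sketch below is a minimal illustration of that structure, not the ZettaDS code itself.

// Minimal LRU metadata cache sketch: hash table + doubly linked list,
// capacity 1024 as in the text. Illustrative only, not the ZettaDS source.
#include <cstdint>
#include <list>
#include <optional>
#include <string>
#include <unordered_map>

struct FileMeta { uint64_t size = 0; uint64_t mtime = 0; /* uid, gid, attrs */ };

class MetaLruCache {
    using Entry = std::pair<std::string, FileMeta>;   // filename -> metadata
    std::list<Entry> lru_;                            // front = most recently used
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;
    const size_t capacity_;

public:
    explicit MetaLruCache(size_t capacity = 1024) : capacity_(capacity) {}

    std::optional<FileMeta> get(const std::string& name) {
        auto it = index_.find(name);
        if (it == index_.end()) return std::nullopt;  // miss: caller reads the database
        lru_.splice(lru_.begin(), lru_, it->second);  // move entry to the front
        return it->second->second;
    }

    void put(const std::string& name, const FileMeta& meta) {
        auto it = index_.find(name);
        if (it != index_.end()) {                     // update an existing entry
            it->second->second = meta;
            lru_.splice(lru_.begin(), lru_, it->second);
            return;
        }
        if (lru_.size() == capacity_) {               // evict the least recently used
            index_.erase(lru_.back().first);
            lru_.pop_back();
        }
        lru_.emplace_front(name, meta);
        index_[name] = lru_.begin();
    }
};

At roughly 50 bytes per entry, 1024 entries keep the cache well under the 64-kilobyte budget estimated above.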

The persistent metadata for the namespace and the file-to-chunk mapping is stored in a Berkeley Database[14]. It provides not only highly efficient put/get operations on key-value pairs but also transaction support, which guarantees that metadata management operations are performed atomically. We do not keep the chunk-to-server mapping persistently, since this information can be rebuilt from the ChunkServers' reports. Again, we allow clients to cache metadata for a long period, since once a file is committed, no further write operations are allowed. When a file delete request arrives, the file is not deleted immediately but moved to a hidden namespace. A garbage collection thread scans the hidden namespace periodically and sends batched Chunk-Delete messages to the ChunkServers to reclaim disk space. An additional benefit of this mechanism is that it greatly reduces the I/O errors raised by reads of deleted files.
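
For reference, a minimal Berkeley DB usage pattern for the file-to-chunk mapping might look like the sketch below. The database file name, key and value encodings, and chunk ids are assumptions; transactional operation would additionally require opening a DB_ENV, which is omitted here for brevity.

// Minimal Berkeley DB sketch for the file-to-chunk-list mapping; the record
// layout, database file name, and chunk ids are illustrative assumptions.
#include <db.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <string>
#include <vector>

int main() {
    DB *db = nullptr;
    if (db_create(&db, nullptr, 0) != 0) return 1;
    if (db->open(db, nullptr, "zettads_meta.db", nullptr,
                 DB_BTREE, DB_CREATE, 0644) != 0) return 1;

    // Key: full path; value: serialized list of chunk ids (assumed encoding).
    std::string path = "/Dir0/Dir1/File1";
    std::vector<uint64_t> chunks = {1001, 1002, 1003};

    DBT key{}, val{};
    key.data = const_cast<char *>(path.c_str());
    key.size = static_cast<u_int32_t>(path.size());
    val.data = chunks.data();
    val.size = static_cast<u_int32_t>(chunks.size() * sizeof(uint64_t));
    db->put(db, nullptr, &key, &val, 0);         // insert or overwrite the mapping

    DBT out{};
    out.flags = DB_DBT_MALLOC;                   // let Berkeley DB allocate the buffer
    if (db->get(db, nullptr, &key, &out, 0) == 0) {
        std::printf("read back %u bytes of chunk list for %s\n",
                    out.size, path.c_str());
        std::free(out.data);
    }
    db->close(db, 0);
    return 0;
}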

3.2. Replica management

We use replicas in our system for two purposes: fault tolerance and performance. Since our system is built on low-cost commodity components, the replica mechanism can be employed to avoid data loss due to disk failure. To save disk space, replication is turned off by default; you must set the replica attribute explicitly if it is needed. Although performance is not critical for our system, higher data access performance is an added value. ZettaDS does not use striped storage to improve performance, because 1) bandwidth is an expensive resource under our assumptions, and 2) we want to keep the design simple. Besides, we want to improve parallel data processing performance in the future by exploiting data locality, which requires that associated data be placed on nodes as close together as possible and which also conflicts with striping. As an alternative to striping, ZettaDS uses replicas and parallel data slice transfer (see Section 3.3) to improve performance when needed.

The creation of a replica is initiated by the MetaServer when the replica attribute is modified. The MetaServer chooses the replica ChunkServer and creates the replica by sending a ChunkLoad request to that server. The new replica is reported to the MetaServer in the heartbeat following the ChunkLoad operation. The MetaServer also scans the chunk-to-server mapping periodically to detect chunks that have lost replicas and re-replicates them. This scan is expensive because our metadata is stored on disk, so it should be performed as rarely as possible and therefore has a long period.
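
The periodic re-replication scan can be summarized as in the following sketch; the bookkeeping structures, the placement policy, and the ChunkLoad call are hypothetical stand-ins for the MetaServer's actual implementation.

// Hypothetical sketch of the MetaServer's periodic re-replication scan;
// the bookkeeping structures, placement policy, and RPC are stand-ins.
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

struct ChunkInfo {
    int desiredReplicas = 1;             // replication is off (one copy) by default
    std::vector<std::string> servers;    // ChunkServers currently holding the chunk
};

// Stub for the RPC asking `target` to copy the chunk from `source`.
void sendChunkLoad(uint64_t /*chunkId*/, const std::string& /*source*/,
                   const std::string& /*target*/) {}

// Stub for the placement policy choosing a new replica ChunkServer.
std::string pickReplicaServer(const ChunkInfo& /*info*/) { return "cs-spare"; }

// Walk the chunk-to-server mapping (kept on disk, so this runs with a long
// period) and re-replicate any chunk with fewer copies than requested.
void rereplicationScan(std::unordered_map<uint64_t, ChunkInfo>& chunks) {
    for (auto& [chunkId, info] : chunks) {
        if (info.servers.empty()) continue;         // no surviving copy to load from
        while (static_cast<int>(info.servers.size()) < info.desiredReplicas) {
            std::string target = pickReplicaServer(info);
            sendChunkLoad(chunkId, info.servers.front(), target);
            info.servers.push_back(target);         // confirmed later via heartbeat
        }
    }
}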

3.3. Physical data storage

We manage the physical data at three different levels for different purposes: files for user manipulation, chunks for data storage, and slices for data transfer. See Figure 3 for details.

Figure 3. ZettaDS physical data organization. (A directory tree contains files; each file is split into chunks, and each chunk into slices.)

• The whole data space is logically organized as a directory tree. A file is the basic component that users can control and manipulate.

• The data of a file is split into chunks that are placed on individual ChunkServers. A chunk is the atomic unit of data storage. The chunk size in our system is fixed, except for the last chunk of a file. Currently we use 4 megabytes, chosen according to statistics on our target datasets; it may be adjusted as more tests and measurements are performed on our system. To avoid wasting space on many small chunks (the chunks of small files), all chunks are stored in datasets, which are large files of usually 128 MB.

• A chunk is split into slices, which are the basic unit of data transfer. A slice is a fixed-length segment of data, specifically 32 kilobytes, with 4 extra bytes at the head (a 2-byte offset within the chunk and the 2-byte length of the slice) and a 128-bit checksum at the end, which is used to check the integrity of the slice.

Different chunks of the same file are placed on different ChunkServers of the same rack. This enables us to process the same file in parallel if needed while keeping the associated chunks as close together as possible. For the same reason, files in the same directory are placed in the same rack if possible. This chunk placement strategy is also expected to let us exploit data locality and reduce the bandwidth consumed by data staging when performing parallel data processing.

Slice-based data transfer is employed for two reasons: performance and simple cache management. Since every slice is a self-describing entity, it can be sent and received independently and can arrive out of order. By sending and receiving slices in batches and in parallel, we can reuse the same TCP connection. This greatly reduces the overhead of the three-way handshake and also reduces the impact of TCP's slow start and congestion avoidance mechanisms. It greatly improves transfer performance, especially for many small files, a situation in which FTP suffers a lot. Another benefit of fixed-size slices is that they greatly simplify data cache management: the client library can simply use a hash set, keyed by chunk identity and slice offset, to manage the data cache efficiently. Besides, this design also reduces the overhead on the storage server, since some operations, such as checksum computation, can be performed on the client side.
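
A slice could be framed on the wire roughly as below. Two points are assumptions made only for this illustration: the 2-byte offset is interpreted as the slice index within its chunk (a byte offset would not fit into 2 bytes for a 4-megabyte chunk), and the 128-bit checksum is computed with a toy stand-in, since the paper does not name the checksum algorithm.

// Illustrative slice framing: 4-byte header (2-byte offset, 2-byte length),
// up to 32 KB payload, 16-byte (128-bit) trailing checksum. Field
// interpretation and checksum are assumptions, not from the ZettaDS source.
#include <array>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr size_t kSlicePayload = 32 * 1024;   // fixed slice size
constexpr size_t kChecksumBytes = 16;         // 128-bit checksum

struct Slice {
    uint16_t offset;   // assumed: slice index within its chunk
    uint16_t length;   // valid bytes in payload (the last slice may be shorter)
    std::array<uint8_t, kSlicePayload> payload;
};

// Toy 128-bit digest standing in for a real algorithm (e.g., MD5).
std::array<uint8_t, kChecksumBytes> toyDigest(const uint8_t* data, size_t len) {
    std::array<uint8_t, kChecksumBytes> d{};
    for (size_t i = 0; i < len; ++i) d[i % kChecksumBytes] ^= data[i];
    return d;
}

// Serialize a slice into a self-describing buffer that can be sent and
// received independently and out of order.
std::vector<uint8_t> packSlice(const Slice& s) {
    std::vector<uint8_t> buf(4 + s.length + kChecksumBytes);
    buf[0] = s.offset >> 8; buf[1] = s.offset & 0xff;   // big-endian header (assumed)
    buf[2] = s.length >> 8; buf[3] = s.length & 0xff;
    std::memcpy(buf.data() + 4, s.payload.data(), s.length);
    // Checksum over header + payload (coverage is an assumption).
    auto digest = toyDigest(buf.data(), 4 + s.length);
    std::memcpy(buf.data() + 4 + s.length, digest.data(), kChecksumBytes);
    return buf;
}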

3.4. Manageability

One of the most important goals of our system is manageability. Two design decisions serve this goal: ad hoc bootstrap and support for dynamically adding and removing ChunkServers. There are no assumptions about the order in which the components of ZettaDS are started. When a ZettaDS server starts, it sends a broadcast message to a well-known port, 3122. This message is used by a ChunkServer to discover the current MetaServer if one is not specified manually, and by the MetaServer to ensure that no other MetaServer already exists. When the MetaServer receives such a message, it sends back a response containing its IP address, port, and an identity of the current system. The MetaServer also broadcasts this response to port 3122 after it starts and has confirmed that no other MetaServer exists, which reduces the bandwidth spent detecting the MetaServer when the whole system starts. Each ChunkServer then adds itself to the system by sending a register message after a short random delay (used to avoid a sudden workload rush during a whole-system start).

The MetaServer keeps track of all ChunkServers through periodic heartbeats. The metadata of a single ChunkServer is about 60 bytes, so the metadata of all ChunkServers of even a thousand-node system is small enough to be kept in the MetaServer's memory. The MetaServer keeps an expiration time for each ChunkServer, and a ChunkServer is marked as dead if it has been silent for more than 3 heartbeat intervals. A dead server is actually removed only after it has been dead for a sufficiently long period (e.g., two or more weeks), or manually. This enables us to query the information of a dead server when needed. A dead server can re-join the system by performing the registration process again.

It is important to note that our system does not distinguish between normal and abnormal termination. It is the responsibility of each component to keep its own data consistent and intact. The same mechanism is adopted by GFS, but our implementation is different: our protocol writes the metadata only after the data has been written, which ensures that once the metadata is written, the data has been written. Orphan chunks produced by crashed write operations are detected during chunk information reports and handled automatically by garbage collection.
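
The liveness bookkeeping described above is simple enough to sketch. The code below tracks registrations and heartbeats and applies the three-missed-intervals rule; the structure names and the heartbeat interval value are assumptions, since the paper does not specify them.

// Hypothetical sketch of the MetaServer's ChunkServer liveness tracking:
// a server is marked dead after missing three heartbeat intervals.
#include <chrono>
#include <string>
#include <unordered_map>
#include <vector>

using Clock = std::chrono::steady_clock;

struct ChunkServerInfo {
    std::string address;            // part of the ~60 bytes kept per server
    Clock::time_point lastHeartbeat;
    bool dead = false;
};

class ServerTracker {
    std::unordered_map<std::string, ChunkServerInfo> servers_;
    std::chrono::seconds heartbeatInterval_{10};   // assumed value, not from the paper

public:
    // A (possibly dead) server re-joins by registering again.
    void onRegister(const std::string& id, const std::string& addr) {
        servers_[id] = ChunkServerInfo{addr, Clock::now(), false};
    }

    void onHeartbeat(const std::string& id) {
        auto it = servers_.find(id);
        if (it != servers_.end()) it->second.lastHeartbeat = Clock::now();
    }

    // Periodic check: mark servers dead if silent for 3 heartbeat intervals.
    // Dead entries are kept for later queries and removed much later.
    std::vector<std::string> markExpired() {
        std::vector<std::string> newlyDead;
        auto now = Clock::now();
        for (auto& [id, info] : servers_) {
            if (!info.dead && now - info.lastHeartbeat > 3 * heartbeatInterval_) {
                info.dead = true;
                newlyDead.push_back(id);
            }
        }
        return newlyDead;
    }
};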

4. Evaluation


Some experiments were performed to evaluate ZettaDS. The setup is as follows: six Intel Pentium machines (Pentium IV 2.4 GHz, 2 GB memory, RedHat AS 4) are used, one as the MetaServer, one as the client, and the others as ChunkServers. The client generates a one-gigabyte dataset whose file sizes follow a normal distribution with a specified mean, and stores the dataset into the system. The mean file size ranges from 1K to 1G, doubling at each step. Figures 4 and 5 show the disk space used by the system and the transfer time for datasets of different average file sizes. The results for NFS and FTP are also shown for comparison.
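
The workload generator is not described beyond this paragraph; the following is an illustrative reconstruction of the idea (not the authors' script), drawing file sizes from a normal distribution around the target mean until roughly one gigabyte has been produced. The standard deviation and random seed are assumptions.

// Illustrative reconstruction of the evaluation workload: draw file sizes
// from a normal distribution around a chosen mean until about 1 GB of data
// has been generated. Parameters are assumptions, not taken from the paper.
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

std::vector<uint64_t> makeDataset(double meanBytes, uint64_t totalBytes = 1ull << 30) {
    std::mt19937_64 rng(42);
    // The standard deviation is not stated in the paper; a quarter of the
    // mean is used here purely for illustration.
    std::normal_distribution<double> sizeDist(meanBytes, meanBytes / 4.0);

    std::vector<uint64_t> sizes;
    uint64_t generated = 0;
    while (generated < totalBytes) {
        uint64_t s = static_cast<uint64_t>(std::max(1.0, sizeDist(rng)));
        s = std::min(s, totalBytes - generated);   // trim the final file
        sizes.push_back(s);
        generated += s;
    }
    return sizes;                                  // one entry per file to create
}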

Figure 4. Storage space occupied for datasets of different average file sizes (log scale, 1M=1024K).

Figure 4 shows that ZettaDS stores small files more efficiently. The space used by NFS and FTP (which rely on the operating system's native file system) increases rapidly as the average file size drops below 64K. This is because the native file system uses an 8K allocation unit, that is, it allocates at least 8K of space no matter how small the file is, except for 0-byte files. ZettaDS still works well because small chunks are stored in datasets, large files of approximately 128M. The extra space used by ZettaDS as the file size decreases only stores the metadata of the datasets and their chunks.

Figure 5. Data transfer time for datasets of different average file sizes (log scale, unit = 10s).

Figure 5 shows that both FTP and ZettaDS are more efficient than NFS when transferring many small files (less than 32K), and ZettaDS in turn performs slightly better than FTP. This is because FTP uses an individual data connection for each file, wasting much time on TCP's three-way handshake, whereas ZettaDS avoids this problem by batched slice transfer, that is, by transferring as many slices as possible over the same TCP connection.

5. Related work

Many recent projects have tackled the problem of providing distributed data storage on clusters. The best known of these systems is GFS[10], a scalable distributed file system for large distributed data-intensive applications designed and implemented by Google. It provides fault tolerance while running on inexpensive commodity hardware. It would likely work for our situation, but unfortunately it is not available to us; besides, it requires a dedicated server as the master, which does not suit our environment. Hadoop[1] is another system that shares the same architecture and goals as GFS, but it failed under our tests for our scenario because of the large memory consumption of its NameNode. In addition, its Java implementation imposes a slight performance penalty. Nevertheless, it provided useful guidance for the design and implementation of our system. Both of the above systems are designed for large-scale clusters and parallel data processing using MapReduce, which is too complex for our situation.

There are also several traditional distributed storage systems available (such as NFS[5], AFS[12], xFS[5], Swift[7], etc.). Most of them provide a location-independent namespace and fault tolerance, but they are either too heavy and too complex for our simple scenario or require extra maintenance work. Moreover, they are not suitable for exploiting the locality of parallel data processing on the cluster in the future.

6. Conclusions and future work

We have described ZettaDS, a light-weight, scalable, distributed data storage system for clusters. While sharing many characteristics with modern distributed data storage systems, such as a single-master architecture and running on an inexpensive commodity cluster, our system is much lighter weight. The emphases of our design are scalability of storage capacity and manageability; throughput and performance are considered secondary. Moreover, ZettaDS is designed to minimize resource consumption, since it runs on a non-dedicated system. In this paper, we discussed many aspects of our design and implementation in detail. Furthermore, we evaluated our system with some experiments. The results demonstrate that our system uses storage space more efficiently and achieves better transfer performance when facing a large number of small files.

At present, ZettaDS is still at an early stage. To perfect it, extra work is required in several areas. Firstly, our current implementation has no security mechanisms and hence can only be used in a cooperative, trusted environment; adding permission support is one of the most important ongoing tasks. Another ongoing task is the development of the virtual network disk utilities, which will let us use the data in ZettaDS transparently and let our applications use ZettaDS seamlessly. Secondly, although performance is not critical, we still want to improve it, especially for importing and exporting large numbers of small files, since we have suffered a lot when transferring small files via FTP; moreover, the system still requires much more testing. Finally, we hope to exploit locality in parallel data processing.

7. Acknowledgments

This work is supported by the ChinaGrid project of the Ministry of Education of China, the Natural Science Foundation of China (60573110, 90612016, 60673152), the National Key Basic Research Project of China (2004CB318000, 2003CB317007), the National High Technology Development Program of China (2006AA01A108, 2006AA01A111, 2006AA01A101, 2004CB318000), the Nokia-Tsinghua Research Framework 2007-04, the EU IST programme, and the Asia Link programme.

References

[1] Hadoop project. http://hadoop.apache.org/.
[2] NFS: Network file system version 3 protocol specification. Technical report, SUN Microsystems, 1994.
[3] N. Agrawal, W. J. Bolosky, J. R. Douceur, and J. R. Lorch. A five-year study of file-system metadata. Trans. Storage, 3(3):9, 2007.
[4] W. Allcock, J. Bresnahan, R. Kettimuthu, and M. Link. The Globus striped GridFTP framework and server. In Proceedings of SC '05, page 54, 2005.
[5] T. E. Anderson, M. D. Dahlin, J. M. Neefe, D. A. Patterson, D. S. Roselli, and R. Y. Wang. Serverless network file systems. SIGOPS Oper. Syst. Rev., 29(5):109–126, 1995.
[6] J. Bresnahan, M. Link, R. Kettimuthu, D. Fraser, and I. Foster. GridFTP pipelining. In TeraGrid 2007 Conference, Madison, WI, 2007.
[7] L.-F. Cabrera and D. D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. ACM Trans. Comput. Syst., 4(4):405–436, 1991.
[8] P. H. Carns, W. B. Ligon III, R. B. Ross, and R. Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Atlanta, GA, 2000. USENIX Association.
[9] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107–113, 2008.
[10] S. Ghemawat, H. Gobioff, and S.-T. Leung. The Google file system. SIGOPS Oper. Syst. Rev., 37(5):29–43, 2003.
[11] J. Hartman and J. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems (TOCS), 13(3):274–310, 1995.
[12] J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West. Scale and performance in a distributed file system. ACM Trans. Comput. Syst., 6(1):51–81, 1988.
[13] F. Kon. Distributed file systems past, present and future: a distributed file system for 2006, 1996.
[14] M. A. Olson, K. Bostic, and M. Seltzer. Berkeley DB. In Proceedings of the FREENIX Track: 1999 USENIX Annual Technical Conference, June 6–11, 1999.
[15] S. A. Brandt, E. L. Miller, L. Xue, and D. D. E. Long. Efficient metadata management in large distributed file systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 290–298, April 2003.
[16] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the First Conference on File and Storage Technologies (FAST 02), January 2002.
[17] H. Tang and T. Yang. An efficient data location protocol for self-organizing storage clusters. In SC '03: Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, page 53, Washington, DC, USA, 2003. IEEE Computer Society.
[18] W. Vogels. File system usage in Windows NT 4.0. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP 99), pages 93–109, 1999.