Adaptive and Scalable Metadata Management to Support A Trillion Files

Jing Xing(1,3), Jin Xiong(1), Ninghui Sun(2), Jie Ma(1)
(1) National Research Center for Intelligent Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences
(2) Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences
(3) Graduate University of Chinese Academy of Sciences
{xingjing, xj, snh, majie}@ncic.ac.cn
ABSTRACT

Nowadays more and more applications require file systems to efficiently maintain millions of files or more. How to provide high access performance with such a huge number of files and such large directories is a big challenge for cluster file systems. Limited by their static directory structures, existing file systems will be prohibitively inefficient for this use. To address this problem, we present a scalable and adaptive metadata management system which aims to maintain a trillion files efficiently. Firstly, our system exploits an adaptive two-level directory partitioning based on extendible hashing to manage very large directories. Secondly, our system utilizes fine-grained parallel processing within a directory and greatly improves the performance of file creation and deletion. Thirdly, our system uses multiple-layered metadata cache management, which improves memory utilization on the servers. And finally, our system uses a dynamic load-balance mechanism based on consistent hashing, which enables our system to scale up and down easily. Our performance results on 32 metadata servers show that our user-level prototype implementation can create more than 74 thousand files per second and can retrieve the attributes of more than 270 thousand files per second in a single directory with 100 million files. Moreover, it delivers a peak throughput of more than 60 thousand file creates per second in a single directory with 1 billion files.
1. INTRODUCTION
In recent years, cluster storage has become the dominant architecture to meet the need of large-scale data storage. In this architecture, decoupling metadata processing from file data access is widely used. The metadata servers (MDS) maintain the file system namespace and file attributes, while the storage servers process the read and write requests. Cluster storage can provide hundreds of gigabytes per second of I/O throughput and tens of thousands of metadata operations per second,
ond’s metadata processing power, which makes them a suitable solution to address lots of applications’ I/O needs. However, with the popularization of internet applications and extensive applications of high-end scientific computing, cluster storage is now facing with the big challenge caused by many more files and data to be stored in the file system. The metadata management of cluster storage needs to deal with the following three major challenges. The first challenge is how to efficiently organize and maintain very large directories, each of which contains billions of files. Internet applications like facebook [22] and flickr [10] already have to manage tens of billions of photos. As there are millions of new files uploaded by users every day, the total number of files increases very rapidly and will soon be more than one trillion. Geographic information systems (GIS) are now managing billions of satellite photos, and they need to manage trillions of satellite photos or more for higher-precision maps in the near future. In the field of exploration sciences, like the large synoptic survey telescope (LSST)[1], there are more than nine quintillion files (1018 ). The directory scale will be billions or more of files when managing such a large number of files. However, existing cluster file systems generally aim at managing directories with millions of files. Their metadata organization is not scalable, as a result their metadata performance is very poor if there are tens of millions of files in a directory. A scalable directory organization method is required for cluster file systems which maintain billions or trillions of files. The second challenge is how to provide high metadata performance for a large-scale file system with billions or trillions of files. For facebook, there are 550,000 images served per second at the peak [22]. The flickr also need to support processing 38,000 images per second [10]. In the field of scientific computing, applications on petaflop machines literally open hundreds of thousands of files a time, and that number keeps growing. Supporting hundreds of thousands of ops/sec of metadata processing will be an important requirement for the future metadata management [5]. However, the synchronization mechanism of existing cluster file systems greatly restricts the parallelization of metadata modifications, leading to poor metadata performance. The improvement of concurrent processing needs to be considered to achieve high performance of metadata processing. The third challenge is how to provide high metadata performance for mixed workloads generated by a large number of concurrent users. The concurrent accesses from millions of users to large-scale cluster storage will cause two serious
problems: inefficient use of the metadata cache and imbalanced loads among metadata servers. Large numbers of concurrent accesses from different users result in a vast amount of random accesses from the perspective of each metadata server. The lack of locality in accesses to metadata servers causes frequent replacement in the metadata cache. Furthermore, random accesses cause uneven access loads among metadata servers: some servers are heavily loaded while others are lightly loaded. If large-scale metadata management has a way to utilize memory well and balance access loads effectively, metadata performance will be improved greatly.

To address the above challenges, inspired by GIGA+ [16], we propose a scalable and adaptive metadata management system which aims to maintain a trillion files efficiently. In our system, an adaptive two-level directory partitioning based on extendible hashing [7] is used to manage very large directories. In the first level, each directory is divided into multiple partitions that are distributed over multiple servers; this first-level partitioning controls the distribution of partitions among metadata servers. In the second level, each partition is divided into a certain number of metadata chunks; this second-level partitioning controls the size of each partition. As a result, any file can be located within two I/O accesses. This metadata organization is very scalable and can maintain a billion-scale file system efficiently. Our system uses small data structures such as partitions and chunks as the metadata control units for metadata modifications. Therefore, update operations such as file creations or deletions in the same directory can be processed concurrently. Our system also exploits the different importance of different kinds of metadata, and partitions the metadata cache into multiple layers with different replacement priorities according to the kind of metadata. This way, the most frequently accessed metadata, such as the directory information and the partition information which are shared by many directory entries, will be cached in memory for the longest time. Meanwhile, a dynamic load balance approach is used to rebalance the workload among metadata servers when new metadata servers are added to the system.

The rest of this paper is organized as follows. Section 2 discusses related work in metadata management. Section 3 presents the design of our proposed metadata management techniques. Section 4 introduces some critical implementation details. Section 5 provides the performance results based on microbenchmarks. Finally, Section 6 concludes the paper.
2. RELATED WORK
The organization of metadata is critical for distributed metadata management. It plays an important role in the scalability of metadata servers, concurrency of metadata processing and load balance among metadata servers. Metadata in the file system can be divided into two categories: the dentry which is used to maintain the namespace information and the inode which is used to manage the file’s information including attributes and block locations. Since an inode is information on one file, it is relatively independent of other inodes. However, namespace information has more complex relationships than file information. It needs to keep the relationship between each directory and the entries it contains and the relationship between each entry and its inode. Traditionally, each directory is a special file which
is a mapping table of entry-ino pairs, and the entries of a directory are sometimes organized into a B+ tree [21] or a hash table [18]. Previously, metadata management technologies were categorized by their partitioning method [6][23][24]. However, the partitioning method emphasizes implementation details too much and neglects the characteristics of metadata organization. We believe that classifying metadata management technologies by partition granularity is more helpful for understanding metadata management in depth, including the features of the metadata structures, scalability of the organization, concurrent processing, memory utilization and load balance. According to partition granularity, distributed metadata management can be classified into the following four categories.
2.1 No Partitioning

This type of metadata management does not partition the namespace. The whole namespace is held on one metadata server or duplicated on multiple metadata servers. The Google file system [9] is a representative of this type. This approach is suitable when the amount of metadata is small, especially when the file system contains no more than millions of files. As operations do not need to be interleaved between metadata servers, network overhead is low and metadata consistency is easy to maintain. However, the metadata management cannot scale, as all the metadata is maintained by a single server. Moreover, this approach does not support concurrent modification operations among metadata servers because all the modifications need to be processed by the master server. When there is a burst of modifications, the master server will replace its metadata cache frequently if it cannot hold all the metadata in memory. Read-only load can be balanced among read/shadow servers, but modification load cannot.
2.2 Subtree Partitioning

In subtree partitioning, the whole namespace is partitioned into several subtrees (the minimal subtree is a single directory), each of which is assigned to a specific metadata server. There are two types of subtree partitioning: static subtree partitioning and dynamic subtree partitioning. In static subtree partitioning, the namespace is partitioned at system configuration time, as in NFS [17] and Sprite [14]. This approach can achieve concurrent processing among different subtrees, but the granularity is very coarse, never smaller than a subtree. Apparently, the disadvantage of static subtree partitioning is its load imbalance. Dynamic subtree partitioning can compensate for the inadequacy of static subtree partitioning. Based on the workload, it transfers subtrees among servers, and its scalability and load balance are better than in the static case. However, when a subtree grows to a certain size, moving the subtree anywhere will cause another imbalance. Ceph [23] adopts a hashing approach that hashes the contents of a large directory and distributes them among multiple metadata servers. As the number of sub-directories handled this way is limited by the number of metadata servers, this method may not be sufficient to scale well when subdirectories grow too large.
2.3 Partitioning within a Single Directory

Since subtree partitioning cannot scale well for large directories, Lustre [3] splits a directory based on hashing. However, it uses a static policy to split a directory over a fixed number of servers. GPFS [20] utilizes extendible hashing to organize directory buckets. It stores directory buckets in disk blocks and builds a distributed implementation across all the nodes through a distributed lock manager and cache consistency protocols. In the case of concurrent creation, the GPFS lock manager needs to initiate token transfers between the clients before any of them can create files in the shared directory. The lock acquisition and release phases cause multiple disk I/Os, which limits the performance of concurrent creation. GIGA+ [16, 15] presented a metadata management scheme that partitions a single directory into fixed-sized parts, and a partition splits when it is too full to insert any new metadata. A large number of partitions are distributed among multiple metadata servers. In GIGA+, a partition is the lower-level processing unit for some update operations, which increases the concurrency of metadata updates within a directory. Since the number of partitions increases as the directory grows, GIGA+ can provide higher scalability and concurrent processing. However, this two-level metadata hierarchy will become a bottleneck when the directory contains billions of files or more. A directory with billions of files will be split into millions of partitions, which may lead to inefficient partition management. Although increasing the partition size does reduce the number of partitions, it also leads to inefficient memory utilization, as a big partition may increase internal fragmentation and make locating entries inside the partition inefficient.
2.4 File Partitioning
In file partitioning, a feature of each file, such as its name or identifier, is used to locate its metadata server. The most appealing characteristic of file partitioning is that it can distribute all the files in the file system equally among all metadata servers, which can achieve load balance if the probability of each file being accessed is the same. There are also two types of file partitioning: static and dynamic. Static file partitioning, like the one used by Vesta [4], Intermezzo [2], RAMA [13] and zFS [19], assigns files to metadata servers by hashing their name and/or some other unique identifier. This simple method has a number of advantages. Clients can locate and contact the responsible metadata server directly. For average workloads and well-behaved hash functions, requests are evenly distributed across the cluster. However, changing the configuration of metadata servers causes re-hashing of all of the files. Dynamic file partitioning, like HBA [24] and G-HBA [11], maintains additional information on each server to support changing the configuration of metadata servers. As a result, it can achieve better load balance, but it can hardly scale up to support a single large directory because the amount of maintained information increases linearly with the growth of the directory. Besides, it is difficult to achieve highly concurrent processing, as update operations need to be synchronized at the granularity of a directory.
2.5 Summary
As we discussed, the granularity of metadata partitioning has an important impact on many aspects of metadata processing, such as scalability, concurrent processing, memory utilization and load balance. Based on this analysis of related work, we have concluded that sub-directory granularity, by which a directory is divided into partitions, can achieve better scalability and concurrency than the other partition granularities. However, a static one-level partitioning achieves only limited scalability. We present a metadata management scheme with an adaptive two-level partitioning. It not only partitions a single directory into multiple partitions and distributes the partitions among metadata servers, but also divides each partition into chunks within a metadata server. According to the directory size, the granularity of each level can be adjusted to balance concurrency and memory utilization. For load balance, we adopt the strategy of separating metadata storage from metadata processing, and partitions can be migrated among the servers to achieve a relatively balanced state.
3. AN ADAPTIVE AND SCALABLE METADATA MANAGEMENT METHOD

Since more and more users are asking cluster file systems to maintain billions or trillions of files efficiently, we propose a novel adaptive and scalable metadata management method. The core of this method consists of four techniques: adaptive and scalable directory partitioning, fine-grained intra-directory parallel processing, multi-layered metadata cache management and consistent-hashing-based load balancing.
3.1 Adaptive and Scalable Directory Partitioning

Directories are used to organize the namespace inside the file system. Lookup operations and update operations such as file creation and deletion need to manipulate the entries in a directory, so they need to be synchronized by some sort of lock on the directory structure. Since the directory structure affects the efficiency of lookup operations and the concurrency of update operations, it determines the scalability of metadata processing. The concurrency control granularity is determined by this directory structure. When a directory grows to contain billions or more entries, searching for a file (name lookup) in such a large directory is costly, and concurrent updating of entries in the directory is restricted by having to obtain a lock on the directory structure.

GIGA+ divides each directory into fixed-sized directory partitions and distributes the partitions among the metadata servers to improve metadata processing performance and maintain much larger directories efficiently. When a partition reaches a certain size, it is split into two partitions automatically, and metadata servers only need to hold mapping information indicating the split status of each partition. Updates to a directory can be processed in parallel if they manipulate different directory partitions. Therefore, this method can achieve higher concurrency and scalability at the same time than other existing methods. However, this method may have scalability and concurrency limits. If each partition contains thousands of entries, a directory with a billion files needs to be split into a million partitions. Fast-growing directories will experience frequent splits, which cause frequent updates of the partition bitmap. Moreover, each partition is stored as a separate file so that modifications of different partitions can be processed in parallel; hence the performance of partition modification is limited by the local file system (such as ext3) on the metadata servers, which is inefficient at managing such a large number of files.
Figure 1: Directory Structure

If the size of directory partitions is increased to contain tens of thousands of files, the number of directory partition splits declines to an ideal level. However, memory utilization will be inefficient because a large number of unused entries may be loaded into the cache with each lookup operation. Since the size of a partition is related to the speed of locating metadata in the partition, a large partition could lead to inefficient lookup inside a partition. Efficient metadata management should trade off between the directory size and the partition size. The term "directory size" refers to the total number of entries in a directory. Similarly, the term "partition size" refers to the total number of entries in a partition. In our system, the partition size is limited by a maximum value, while the directory size is unlimited. On one hand, a directory is automatically split into multiple partitions in our method even if its size is small, so its partitions can be distributed across multiple metadata servers. This results in higher concurrent processing than a single partition without splitting. On the other hand, the size of a partition is adaptable to restrict the number of partitions of a very large directory to a certain range. Initially, each partition contains only one metadata chunk. When the size of a directory reaches a threshold value, its partitions automatically enlarge to contain more chunks: each time the size of a partition doubles, the partition doubles its number of metadata chunks. The directory structure is shown in Figure 1.
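To make the split-versus-enlarge behavior concrete, the following C sketch shows one way the decision could be expressed. It is illustrative only: the structure, the function name grow_directory, and the depth limits (which play the role of the α and β bounds introduced in Section 3.1.1) are our assumptions, not the actual skyFS implementation.

```c
/* Illustrative sketch only: MAX_PART_DEPTH and MAX_CHUNK_DEPTH stand for the
 * alpha and beta limits of Section 3.1.1; the values are assumptions.        */
#define MAX_PART_DEPTH  6    /* stop splitting after 2^6 partitions (assumed)  */
#define MAX_CHUNK_DEPTH 4    /* stop enlarging after 2^4 chunks per partition  */

struct dir_state {
    int partition_bits;      /* extendible-hashing depth of the partitions     */
    int chunk_bits;          /* extendible-hashing depth of the chunks         */
};

/* Called when an insert finds no room in the target partition.  A small
 * directory grows by splitting partitions so they spread over more servers;
 * once enough partitions exist, partitions are enlarged by doubling their
 * chunk count; when every partition is full-sized, splitting resumes.         */
static void grow_directory(struct dir_state *d)
{
    if (d->partition_bits < MAX_PART_DEPTH)
        d->partition_bits++;        /* split: double the partition count       */
    else if (d->chunk_bits < MAX_CHUNK_DEPTH)
        d->chunk_bits++;            /* enlarge: double chunks per partition    */
    else
        d->partition_bits++;        /* both limits hit: split again            */
}
```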
3.1.1 The Two-Level Directory Partitioning Algorithm
Figure 2: Two-Level Directory Partitioning

The adaptive and scalable directory partitioning algorithm is illustrated in Figure 2. When locating a file through the lookup operation, the algorithm first calculates a 32-bit hash value from the file name. The hash value is divided into partition bits and chunk bits, which are used to locate the partition and the chunk respectively. Each portion of the hash, the partition bits and the chunk bits, uses extendible hashing [7]. When a directory is small, it contains only one metadata chunk. The corresponding partition bits length and chunk bits length are both zero, which means the directory has only one partition and the partition has only one chunk. The partition bits length is increased first when one partition is not enough. When the partition bits length reaches α (α is related to the processing power of a single server and the number of metadata servers), the partition bits length stops
growing, and the chunk bits length begins to increase. When the chunk bits length grows to β, which means every partition contains 2^β chunks, the algorithm stops enlarging the partitions and continues to split more partitions. By increasing the partition bits length and the chunk bits length alternately, our method can achieve high performance and high scalability as a directory grows. When a directory is small, the algorithm tries to improve performance by distributing the entries of the directory across all metadata servers and manipulating them concurrently. When a directory is very large, the algorithm tries to improve scalability by making each metadata server maintain more partitions. For example, suppose we need to locate a file whose hash value is 0101101001. The hash value is divided into 010110 (the partition bits) and 1001 (the chunk bits) to locate its partition and chunk respectively. As shown in Figure 2, the partition bits length is 2 and the chunk bits length is 1, so we can deduce that the file belongs to partition 2, as the last 2 bits of the partition bits "010110" are "10". Similarly, we can deduce that the file belongs to chunk 1, because the last bit of the chunk bits "1001" is "1". Then we can locate the file at partition 2, chunk 1.
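The location rule of the example above can be written down directly. The following C sketch is a minimal illustration, assuming the partition portion occupies the high bits of the hash and the chunk portion the low bits, as in the 10-bit example; the portion widths and function names are ours.

```c
#include <stdint.h>
#include <stdio.h>

/* The name hash is split into a partition portion and a chunk portion, and
 * extendible hashing keeps only the lowest "depth" bits of each portion.    */
#define CHUNK_PORTION_BITS 4          /* illustrative widths matching the     */
#define PART_PORTION_BITS  6          /* 10-bit example hash 0101101001       */

static uint32_t partition_index(uint32_t hash, int partition_bits)
{
    uint32_t part_portion = (hash >> CHUNK_PORTION_BITS) &
                            ((1u << PART_PORTION_BITS) - 1);
    return part_portion & ((1u << partition_bits) - 1);   /* lowest bits only */
}

static uint32_t chunk_index(uint32_t hash, int chunk_bits)
{
    uint32_t chunk_portion = hash & ((1u << CHUNK_PORTION_BITS) - 1);
    return chunk_portion & ((1u << chunk_bits) - 1);       /* lowest bits only */
}

int main(void)
{
    uint32_t h = 0x169;  /* 0101101001b: partition portion 010110, chunk 1001 */
    /* With partition-bits length 2 and chunk-bits length 1, as in Figure 2:  */
    printf("partition %u, chunk %u\n",
           partition_index(h, 2), chunk_index(h, 1));      /* prints 2, 1     */
    return 0;
}
```

Running the sketch on the example hash with a partition bits length of 2 and a chunk bits length of 1 yields partition 2 and chunk 1, matching the text above.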
3.1.2 The Partition Bit-Map

Like GIGA+, as file access operations are issued on clients, each client maintains a partition bit-map indicating which partitions are available, in order to reduce redirections when locating a file. To support fast file location in a directory with a billion files, we use the lower 30 bits of the ino to locate the partition and the chunk, while the higher 32 bits of the ino are used to locate the directory. Among the lower 30 bits, 20 bits are used to locate the partition and 10 bits are used to locate the chunk. Since the number of directory partitions will be no more than 2^20, the bit-map for each directory will be less than 128 KB. With the help of the partition bit-map and the partition bits length, locating a partition is very efficient and only needs several memory accesses on the client. If the partition bit-map of a client is stale, the metadata servers will inform the client to refresh its bit-map. The partition bit-map is updated once a partition is split. As the update only affects the existing partition and the newly created partition, only the directory's primary metadata server and the two metadata servers maintaining the related partitions need to update their bit-maps. The directory's primary metadata server has the up-to-date partition bit-map. A server with an obsolete bit-map can ask the primary server for the freshest bit-map.
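As a rough illustration of the client-side bit-map, the following C sketch checks whether a candidate partition exists in the cached map. The bit ordering within the hash and the fallback rule for partitions that have not been created yet (dropping one bit at a time, in the spirit of extendible hashing) are our assumptions; the paper only specifies the 20-bit/10-bit split and the 128 KB bound.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_PARTITION_BITS 20                            /* at most 2^20 partitions */
#define BITMAP_BYTES (1u << (MAX_PARTITION_BITS - 3))    /* 2^20 bits = 128 KB      */

struct client_dir_view {
    uint8_t bitmap[BITMAP_BYTES];   /* which partitions currently exist             */
    int     partition_bits;         /* partition-bits length known to this client   */
};

static bool partition_exists(const struct client_dir_view *v, uint32_t idx)
{
    return (v->bitmap[idx >> 3] >> (idx & 7)) & 1;
}

/* Resolve a partition index for a 32-bit name hash using only the cached
 * bit-map.  The 20-bit partition portion is assumed to sit above the 10-bit
 * chunk portion.  If the candidate partition has not been created yet, fall
 * back to the partition it would have split from (drop one bit); a stale map
 * is corrected when a server redirects the client.                            */
static uint32_t resolve_partition(const struct client_dir_view *v, uint32_t hash)
{
    uint32_t portion = (hash >> 10) & ((1u << MAX_PARTITION_BITS) - 1);
    for (int bits = v->partition_bits; bits > 0; bits--) {
        uint32_t idx = portion & ((1u << bits) - 1);
        if (partition_exists(v, idx))
            return idx;
    }
    return 0;                        /* partition 0 always exists                    */
}
```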
GIGA+ uses a static hash function to distribute partitions among multiple metadata servers: for a partition i, the server that holds it is i mod the number of servers. This mapping scheme causes a large amount of overhead when the metadata server configuration changes. Unlike GIGA+, we use consistent hashing to distribute partitions among multiple metadata servers. Each metadata server is assigned a range of hash values, and a partition is assigned to a metadata server by the hash value of its ID. The range of hash values for each metadata server can be changed by our load-balance mechanism (which will be discussed in Section 3.4) to accommodate changes of access workloads and system configuration.

Figure 3: Cache Hierarchy: A partition cache and its chunk caches are in the same metadata server. The distribution of partition caches is based on the hash ranges of the metadata servers. A directory cache and its partition attribute caches are in the same metadata server. A directory has only one directory cache, which is in the same metadata server as its partition 0 cache.

3.2 Fine-Grained Parallel Metadata Processing

3.2.1 Parallel Processing Within a Directory

Metadata operations usually need to be synchronized because they access or modify shared data structures. This kind of synchronization is typically implemented by locks, which greatly restricts the concurrency of metadata processing. For example, in traditional file systems, any namespace update operation on a directory must first acquire the lock of this directory, which forces all updates to the same directory to be processed in sequence. A large number of updates competing for a single directory structure is one of the causes of poor metadata performance. Motivated by two common ways to improve concurrency, reducing the granularity of each lock and reducing the processing inside each lock, our system divides a directory into many smaller-sized partitions. Operations on the same directory, whether they are reads or modifications, can be processed concurrently on different servers if they operate on different partitions. Furthermore, we also benefit from the smaller granularity of the partition lock. Without directory partitioning, each update operation locks the whole directory, which not only prevents any other updates from being processed, but also prevents metadata cache replacement for this directory. With directory partitioning, each update operation locks only a partition, which is much smaller than the directory, and operations on other partitions can proceed.

In our metadata management system, the metadata structures can be classified into four types according to the amount of data they manage. Directories are on the highest level and cover the whole directory. Partitions are the second largest structures, and metadata chunks are the third largest. The metadata of each file is the smallest structure and is on the lowest level. To achieve high concurrency, different types of operations are processed in parallel at different levels. Both directory cache replacement and partition bit-map updates need to be synchronized by the directory lock: in directory cache replacement, all the content of the evicted directory, including all partitions and all chunks, must be written back to disk before the directory structure is written back; and in a partition bit-map update, multiple servers may need to modify the bit-map simultaneously. Although file or directory creation operations need to be synchronized by the partition lock, other operations which only read the partition information can be processed concurrently. As metadata chunks are the unit for transferring metadata between memory and disk, both modification of and access to a chunk must be blocked while it is being written back to or read from disk. And setattr operations need to be synchronized by the metadata lock.
Moreover, in order to trade off between metadata consistency and high processing concurrency, our system uses read-write locks to allow high concurrency for read operations. Specifically, a data structure is locked by a write lock only when an operation modifies it; otherwise, it is locked by a read lock. Compared with other systems, our system achieves file-level concurrency for file attribute updates and partition-level concurrency for creation operations, which is finer than the directory-level concurrency in traditional file systems. To achieve concurrency equal to GIGA+, our system splits partitions as GIGA+ does when the directory is small. After the number of partitions exceeds the number of CPU cores, our system enlarges partitions instead of splitting them when a partition has no space to store new metadata. Moreover, our system can achieve better concurrency because some operations achieve chunk-level concurrency. Through two-level directory partitioning, our system greatly reduces the granularity of locks and can achieve better concurrency in metadata processing than existing file systems.
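The lock hierarchy described above can be sketched with POSIX read-write locks. The structures and the exact acquisition order below are our assumptions (for instance, holding the directory lock in read mode during creation so that the directory cannot be evicted concurrently); they illustrate the granularity argument rather than reproduce the skyFS code.

```c
#include <pthread.h>

/* Illustrative lock hierarchy only; field names are ours, not skyFS's.       */
struct cmeta     { pthread_rwlock_t lock; /* per-file attributes + dentry */ };
struct chunk     { pthread_rwlock_t lock; struct cmeta *entries;          };
struct partition { pthread_rwlock_t lock; struct chunk **chunks;          };
struct directory { pthread_rwlock_t lock; struct partition **partitions;  };

/* File creation: read-lock the directory (so it cannot be evicted or have its
 * bit-map changed underneath us), write-lock only the target partition, and
 * leave all other partitions of the same directory free to be modified
 * concurrently by other threads or servers.                                   */
static void create_file(struct directory *d, struct partition *p)
{
    pthread_rwlock_rdlock(&d->lock);
    pthread_rwlock_wrlock(&p->lock);
    /* ... insert the new cmeta into a chunk of p ... */
    pthread_rwlock_unlock(&p->lock);
    pthread_rwlock_unlock(&d->lock);
}

/* setattr: only the cmeta itself is write-locked, giving file-level
 * concurrency for attribute updates.                                          */
static void set_attr(struct cmeta *m)
{
    pthread_rwlock_wrlock(&m->lock);
    /* ... update attributes ... */
    pthread_rwlock_unlock(&m->lock);
}
```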
3.3 Multiple-Layered Metadata Cache Management

In both data centers and large-scale application environments, thousands or more users issue a large number of concurrent file access and modification requests. As a result, the metadata accesses on each server exhibit weak locality and appear very random from the perspective of each server. Therefore, the metadata cache on the servers is replaced frequently, resulting in frequent disk accesses and underutilized memory. To improve memory utilization and reduce disk accesses, we present a multiple-layered cache management scheme for metadata servers, which keeps the most frequently accessed metadata in memory. As mentioned in the previous subsection, there are four types of metadata in our system, and they have different access frequencies. Among them, directories are the most frequently used data structure: many operations need to access directory information, and if a directory is evicted out of the cache, all its content,
including all its partitions and chunks, must also be cleared out of memory. Partitions are the second most frequently used data structure, as every operation needs to access partition information or acquire the partition lock first. Chunks are the least frequently accessed data structure. Moreover, the replacement cost of the different types of metadata is not uniform. The cost of replacing a chunk is the lowest: only the chunk itself needs to be written back to disk. The cost of replacing a partition is much higher, because all chunks in the partition need to be written back. The cost of replacing a directory is the highest, as all chunks in all partitions of the directory need to be written back. As shown in Figure 3, our system takes both the different access frequencies and the different replacement costs of the metadata types into account, and partitions the metadata cache on each server into three variable-sized parts: the directory cache, the partition cache and the chunk cache. An appropriate size for both the directory cache and the partition cache is important for good memory utilization, because if these caches are too large, the chunks being accessed will have little room in memory. Therefore, our system restricts the size of the directory cache and the partition cache to a percentage of the total size of the metadata cache. When the directory cache reaches its maximum size, loading a new directory forces the eviction of the cached directory that contains the fewest partitions. Similarly, when the partition cache reaches its maximum size, loading a new partition forces the eviction of the cached partition that contains the fewest chunks. When a directory is evicted, all its partitions are evicted together. To achieve this in a distributed setting, we use an agency mechanism to record partition cache information at the directory's primary metadata server. When a partition is cached, an agency of the partition is created and inserted into the directory's partition hash table. When a partition is released from the cache, the agency of the partition is removed from the directory's partition hash table. So when a directory is evicted, the processing thread can learn all the cached partitions of the directory from the directory's partition hash table, release all the partition agencies, and send requests to the related servers to release the cached partitions.
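The victim-selection rule can be illustrated with a small C sketch. The data structures and the percentage caps shown are placeholders of ours; the paper states only that the directory and partition caches are bounded by a percentage of the total cache and that the entry with the fewest cached children is evicted first.

```c
#include <stddef.h>

/* Hedged sketch of the victim-selection rule; structures are simplified and
 * the percentage caps (10% / 20%) are placeholders, not reported values.     */
struct dir_cache_entry {
    int cached_partitions;          /* partitions of this directory in memory */
    struct dir_cache_entry *next;
};

struct mds_cache {
    size_t total_bytes;
    size_t dir_cache_max;           /* e.g. total_bytes * 10 / 100            */
    size_t part_cache_max;          /* e.g. total_bytes * 20 / 100            */
    struct dir_cache_entry *dirs;   /* all cached directories                 */
};

/* When the directory cache is full, evict the cached directory that holds the
 * fewest cached partitions: it is the cheapest to write back, since evicting
 * a directory forces all of its partitions and chunks out of memory too.     */
static struct dir_cache_entry *pick_dir_victim(struct mds_cache *c)
{
    struct dir_cache_entry *victim = c->dirs;
    for (struct dir_cache_entry *e = c->dirs; e; e = e->next)
        if (e->cached_partitions < victim->cached_partitions)
            victim = e;
    return victim;                  /* caller writes back partitions + chunks */
}
```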
3.4 Dynamic Load Balance Mechanism
A large number of concurrent users of a large-scale storage system usually generate irregular and mixed workloads on the metadata servers. A load balance mechanism is needed to prevent individual servers from becoming overloaded. An overloaded server not only slows down the users accessing metadata on it, but also slows down the other servers interacting with it. In order to facilitate load migration, our system separates metadata processing from metadata storage. Metadata servers only process metadata access requests, while metadata storage servers provide shared metadata storage for all metadata servers. With this separation, access load migration does not cause metadata storage migration, which would be very costly. Therefore, our system can easily migrate access loads among metadata servers at low cost if the loads are imbalanced. For the sake of scalability and load balance, our system assigns partitions to metadata servers through consistent hashing. As described in Section 3.1.2, each server is assigned
a range of values according to the hash function value of its ID, and each partition is assigned to a server if its hash function value falls in the range assigned to that server. Therefore, the size of each range is one of the factors that determine the load on each server. Theoretically, each metadata server can be assigned an almost equal-sized range of hash values through consistent hashing when there are a large number of metadata servers. However, for cluster file systems, the number of metadata servers ranges from several to hundreds. For such numbers of metadata servers, the sizes of the ranges may differ considerably. Moreover, the access load on each server is also determined by the user workloads themselves, which are very dynamic and change with time. So some servers may be overloaded by a vast number of access requests on the metadata assigned to them even if every range is equal-sized. Since consistent hashing focuses on balancing metadata placement among a large number of servers and does not balance workload dynamically, our system also employs a dynamic load balance mechanism, in which some loads are migrated from overloaded servers to lightly loaded servers automatically when necessary. One of the metadata servers acts as a master that takes charge of load balancing. The master does not need to collect the resource usage of all the servers periodically. An overloaded server is detected by its resource usage, including CPU usage and memory usage. When the resource usage exceeds a threshold value, the server makes a rebalance request to the master. In order to filter out bursts of workload within a short period of time, the master collects the resource usage of all servers after a period of time. If the resource usage on the requesting server is still higher than the threshold, the master triggers the rebalance process immediately. Otherwise, the rebalance request is simply ignored. In the rebalance process, the master first chooses a target server to accept the loads moved out of the overloaded server. The range of the target server must not be larger than a threshold, and among such servers the one with the lowest resource usage is chosen. Then the master chooses which values on the overloaded server are to be assigned to the target server. This results in the migration of the partitions whose hash function values equal the selected values. Generally, no more than 10% of the partition processing range is chosen to migrate. The rebalance process is also triggered when a new metadata server is added to the system or a metadata server crashes. By using consistent hashing, it is easy for our system to determine which range is split into two ranges when adding a server, or which ranges need to be merged into one range when a server crashes, and the partition cache migrates accordingly. Therefore, our system is highly scalable, either scaling up or scaling down.
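A hedged sketch of the consistent-hashing placement and the range migration follows. The ring representation, the hash mixer, and the assumption that the migration target is adjacent on the ring are ours; the paper specifies only that each server owns a range of hash values and that no more than about 10% of a busy server's range is migrated at a time.

```c
#include <stdint.h>

/* Hedged sketch of consistent-hashing placement; the hash function and the
 * ring representation are ours, not the skyFS implementation.               */
struct mds_range {
    uint32_t start, end;             /* half-open slice [start, end) of the ring */
};

static uint32_t ring_hash(uint64_t id)           /* any well-mixed 32-bit hash   */
{
    id ^= id >> 33; id *= 0xff51afd7ed558ccdULL;
    id ^= id >> 33;
    return (uint32_t)id;
}

/* A partition lives on the server whose range covers the hash of its ID.      */
static int server_for_partition(uint64_t partition_id,
                                const struct mds_range *ranges, int nservers)
{
    uint32_t h = ring_hash(partition_id);
    for (int s = 0; s < nservers; s++)
        if (h >= ranges[s].start && h < ranges[s].end)
            return s;
    return 0;                        /* assume the wrap-around slice is server 0 */
}

/* Rebalance: hand roughly 10% of the overloaded server's range to the target;
 * partitions whose hashes fall in the moved slice migrate with it.            */
static void shed_load(struct mds_range *busy, struct mds_range *target)
{
    uint32_t slice = (busy->end - busy->start) / 10;
    busy->end     -= slice;          /* busy server keeps the lower part         */
    target->start  = busy->end;      /* adjacent target absorbs the moved slice  */
}
```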
4. IMPLEMENTATION

4.1 System Architecture

We have implemented a prototype cluster file system called skyFS. It contains four components: metadata servers, data storage servers, FUSE-based clients and an asynchronous message passing library (AMPL). To avoid the plague of kernel debugging, all of our components are implemented in user space. The metadata servers are the core of our system.
Figure 4: Ino structure (bit 63: type flag; bit 62: conflict flag; bits 61-32: parent directory ID; bits 31-0: hash / conflict ID / directory ID): If the type of the metadata is file, the lower 32 bits of the ino are a hash value of the file name, or a conflict ID when the hash value conflicts with others. If the type of the metadata is directory, the lower 32 bits of the ino are a directory ID.
They manage metadata, process requests from clients and decide when to split or enlarge a partition. To facilitate metadata processing scalability and load balance, metadata is stored on shared storage servers. Each metadata partition is stored as an object file on the storage servers. The distribution of metadata partitions to storage servers is similar to the distribution of partitions to metadata servers: each storage server is assigned a range of values according to the hash function value of its ID, and partitions are mapped to storage servers through the hash function values of their IDs. A partition split needs several disk I/Os; in order not to interfere with the metadata servers, partition splitting is implemented on the storage servers. The FUSE-based client is actually an interface between FUSE and our servers. After FUSE captures file system operations from the kernel, it delivers these requests to our client. The client packages the requests into messages and sends them to the servers through AMPL. AMPL is a package of TCP/IP operations that supports sending messages and data in synchronous and asynchronous modes. As it uses multiple threads to send and receive messages and data, servers and clients do not need additional threads to transfer requests or data.
4.2 Metadata Organization and Location
File attributes (such as the inode in Linux) and the file's entry in its directory (such as the dentry in Linux) are closely related. Under most conditions, a request to access a dentry will be followed by a request to access its attributes. However, they are stored separately on disk in traditional file systems, which results in many small random accesses for metadata and poor metadata performance. Some previous works [8][23] have studied storing file attributes together with the dentry and achieved great performance improvements. Motivated by these works, our system uses a single data structure called cmeta (short for combined metadata), which is the combination of the inode structure and the dentry structure. As the inode and the dentry are collocated in the same place, they should be locatable either by file name or by a unique ID (e.g., the ino). Since a cmeta's file name is used to identify the ID of the directory partition it belongs to, and then to locate its processing server by hashing the partition ID, we need to structure the ino accordingly to facilitate locating the cmeta by ino. To address this problem, the 64-bit ino of each file or directory is divided into three segments in our system, as shown in Figure 4. The first 2 bits are the type flag and the conflict flag respectively. The type flag indicates whether the object is a file or a directory, and the conflict flag indicates whether the hash value of the file conflicts with
other files. The following 30 bits of the ino are the parent directory ID of the file. The last 32 bits are assigned differently for files and directories: if the object is a file, the last 32 bits are the hash value of the file name, while if it is a directory, the last 32 bits are the unique directory ID generated by the metadata server when the directory is created. If the name hash conflicts with existing files, the third part is a conflict number instead. The conflict number is increased every time a name hash conflict occurs for this hash value, and an appendix field is used to hold the file's name hash. The appendix field is fetched together with the ino whenever an inode operation occurs. When locating a cmeta by its ino, our system checks the conflict flag first. If the flag indicates no conflict, our system uses the parent directory ID and the 32-bit hash value in the ino to locate it. If the flag indicates that a conflict occurred, our system uses the appendix field as the file's hash value to locate the metadata. As we use a directory ID to represent a directory, renaming a directory is supported easily: a cmeta's location relies on its hash value and directory ID, and since renaming a directory does not change the directory's ID, renaming a directory does not influence the location of the metadata it contains.
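The ino layout of Figure 4 can be packed and unpacked with simple bit operations, as in the following C sketch. The helper names and the polarity of the type flag (here, 1 for directories) are our assumptions; the field widths follow the figure and the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field layout follows Figure 4; helper names and flag polarity are ours.   */
#define INO_TYPE_DIR  (1ULL << 63)   /* set for directories (assumed polarity) */
#define INO_CONFLICT  (1ULL << 62)   /* set when the name hash collided        */

static uint64_t make_file_ino(uint32_t parent_dir_id, uint32_t name_hash,
                              bool conflict, uint32_t conflict_id)
{
    uint64_t ino = 0;                                    /* type bit 0 => file  */
    if (conflict)
        ino |= INO_CONFLICT;
    ino |= ((uint64_t)(parent_dir_id & 0x3FFFFFFF)) << 32;  /* 30-bit parent ID */
    ino |= conflict ? conflict_id : name_hash;               /* lower 32 bits   */
    return ino;
}

static uint64_t make_dir_ino(uint32_t parent_dir_id, uint32_t dir_id)
{
    return INO_TYPE_DIR
         | ((uint64_t)(parent_dir_id & 0x3FFFFFFF)) << 32
         | dir_id;                    /* unique directory ID assigned by the MDS */
}

static uint32_t ino_parent_dir(uint64_t ino)  { return (ino >> 32) & 0x3FFFFFFF; }
static uint32_t ino_low32(uint64_t ino)       { return (uint32_t)ino; }
static bool     ino_is_conflict(uint64_t ino) { return (ino & INO_CONFLICT) != 0; }
```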
4.3 File System Semantics
In traditional file systems, each update operation such as creation or removal requires the directory attributes to be updated simultaneously. However, this strict consistency requirement is harmful to metadata processing performance, especially with multiple metadata servers, because the coordination among them is costly. In fact, directory attributes only need to be consistent with the directory's content when content-related operations (like readdir and getattr) need to be processed. Therefore, our system delays the update of some directory attributes such as nlink, size, modification time, etc. until their precise values are needed. When the directory attributes need to be refreshed, the metadata server that maintains the directory sends messages to the other servers that maintain the partitions of the directory to collect the directory information and merge it into the attributes. In Linux, readdir is implemented by reading the directory blocks of the directory in sequence. As directory blocks do not exist in our system, we need to locate the start point of the requested directory block and read metadata from chunks to compose directory blocks. To improve the efficiency of readdir, we make all cmetas the same length. Because of this, we can compute the partition ID and chunk ID of the cmeta at a specified offset from the status information of the requested directory. Then we can locate the starting point of a readdir request without much overhead.
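The offset-to-location computation for readdir can be illustrated as follows. Treating the readdir offset as a byte offset into a stream of fixed-size cmetas, and the partition-major layout order, are our assumptions; the fixed cmeta size (368 bytes, reported in Section 5.2) and the default chunk capacity (128 cmetas) come from the paper.

```c
#include <stdint.h>

/* Hedged sketch: layout order and the byte-offset interpretation are ours.  */
#define CMETA_SIZE   368u   /* bytes per cmeta, as reported in Section 5.2   */
#define CHUNK_CMETAS 128u   /* cmetas per chunk (default configuration)      */

struct readdir_pos { uint32_t partition, chunk, slot; };

static struct readdir_pos locate_readdir(uint64_t offset,
                                         uint32_t chunks_per_partition)
{
    uint64_t index = offset / CMETA_SIZE;                  /* which cmeta      */
    uint64_t per_partition = (uint64_t)CHUNK_CMETAS * chunks_per_partition;
    struct readdir_pos pos;
    pos.partition = (uint32_t)(index / per_partition);
    pos.chunk     = (uint32_t)((index % per_partition) / CHUNK_CMETAS);
    pos.slot      = (uint32_t)(index % CHUNK_CMETAS);
    return pos;
}
```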
5. EVALUATION

Our experiment platform is Dawning5000, a very large-scale cluster-architecture supercomputer consisting of 1400 nodes. Each node is composed of 4-way Quad-Core AMD Opteron 8347 HE processors and 64GB of memory. The nodes are interconnected by InfiniBand DDR. Each node also has a 150GB 7200RPM SATA hard drive. Our experiments use only 232 nodes, among which 32 nodes are used as metadata servers and storage servers, and 200 nodes are configured as clients. For the experiments below, our prototype is configured with a chunk size of 128 cmetas per chunk by default,
the number chosen based on its best performance in the tests among various configurations.

Figure 5: Directory Adaptability. The plot shows the adaptive-partition file creation throughput (ops/sec) together with the number of partition split and partition enlarge operations as the directory size grows (in millions of files).

Figure 6: Comparison of different Directory Structures. File creation throughput (ops/sec) versus directory size: (a) adaptive partitioning compared with single-level partitioning of 128, 1024 and 8192 cmetas per partition; (b) adaptive partitioning compared with static partitions of 8 and 64 chunks per partition; (c) effect of different chunk sizes (64, 128 and 1024 cmetas per chunk).
5.1 Workload
To evaluate the performance of large directory processing, we use the mdtest [12] benchmark. Mdtest can produce an arbitrary number of concurrent metadata requests from multiple clients, and we believe it is a good benchmark for simulating multiple access streams to multiple metadata servers. Although each access stream produced by a client with mdtest is regular, it has the same effect on the metadata performance of our system as random access streams, because each file is mapped to a metadata server through the hash value of its name, so the accesses on each server are mixed and random.
5.2 Adaptability and Scalability
We evaluated metadata adaptability by building a very large directory from scratch and monitoring the file creation performance at each stage to show the effect of the adaptive splitting scheme. Since the adaptability can be shown in a small-scale system, only one server node and eight client nodes are used in this evaluation. As shown in Figure 5, although the performance fluctuates slightly while the directory or its partitions are being split, the performance of our metadata management is constant in general as the directory
gets larger, which validates the adaptability of our scheme. To evaluate the scalability of our metadata system, we tested the metadata performance of our system both under different scales of workloads and with different numbers of servers. As metadata performance is affected by the size of a partition and the depth of its splitting, the metadata performance is tested by changing these two values. In each test, mdtest was started on 200 clients with 5 test threads on each client. Each test thread created, stat'ed, or deleted 100 to 100,000 files in the same directory respectively. When partitions are distributed by consistent hashing, a fresh system needs an adjustment stage to achieve balanced metadata placement among all the servers, and the peak performance of metadata processing is achieved after the loads are balanced. Therefore, in these tests, partitions were evenly distributed among all the servers manually, which also resulted in equal loads among the servers. The partition structure with a single chunk in each partition is similar to GIGA+. Different from this single-level partitioning, our adaptive partitioning method allows multiple chunks in each partition. As shown in Figure 6(a), single-level partitioning with small partitions (specifically, 128 cmetas per partition here) has good performance at small directory sizes. However, as the directory grows, it soon reaches its peak performance and its performance declines rapidly. Single-level partitioning with large partitions (specifically, 8192 cmetas per partition) performs better at large directory sizes, but still does not perform as well as our adaptive partitioning method. The performance of the single-level partitioning method falls when the directory size grows beyond 1 million files because, as the directory scale increases, partitions split frequently, which hurts performance. On the contrary, our adaptive two-level partitioning method has higher performance than the single-level one when the directory size is beyond 10 million files, because the adaptive two-level partitioning method enlarges partitions to reduce split operations that would interfere with normal processing. The static two-level partitioning method can also reduce the number of partition splits. However, it performs poorly at small directory sizes, because a smaller number of partitions leads to low concurrency in processing. As shown in Figure 6(b), the configuration with 64 chunks per partition has the lowest performance at small directory sizes, and its performance increases as the number of partitions increases.
Figure 7: Directory Scalability: file creation throughput (ops/sec) versus directory size (10k to 1g files). "Create with split" means creating files while the directory is being partitioned over multiple servers; "create without split" means the directory has already been partitioned over multiple servers and no splitting happens while the files are created.

Figure 8: Metadata server scalability: (a) file creation throughput and (b) file stat throughput (ops/sec) versus the number of servers (4 to 32) for directories of 10k to 100m files. In file creation, multiple servers have enough work to achieve peak performance with large directories; with small directories, however, the configuration of 32 servers does not have enough work to perform well.
Since our adaptive partitioning method uses a single chunk for each partition when the directory is small and enlarges each partition to contain multiple chunks when the directory becomes very large, it can achieve high performance at any scale of directory. Larger chunks lead to fewer partition splits and reduce the overhead of file creation. However, large chunks also make searching within a chunk inefficient. As Figure 6(c) shows, the lookup performance of larger chunks is lower than that of smaller chunks. This means that the benefit of reducing the number of partitions can be counteracted by the cost of locating a cmeta within a chunk. However, small chunks with few cmetas are also inefficient, because they result in frequent partition splitting as well as inefficient search within a partition. This indicates that small chunks can impede scaling up. Therefore, a suitable chunk size is a trade-off between partition splitting and intra-chunk search. Our experimental results show that the preferable chunk size is 128 cmetas per chunk. The directory scalability test result is shown in Figure 7. We can see that with 32 metadata servers, the create performance is stable even for a directory with 1 billion
files. The create-without-split mode means that there were no partition split or enlarge operations during processing, as the partition structures and the chunk structures had already been constructed in the create-with-split mode. To show the processing scalability as the number of servers increases, we ran the same workload on different numbers of servers. Figures 8(a) and (b) show that both the file creation performance and the lookup performance (indicated by stat performance) increase with the number of metadata servers. Since creation operations are processed in parallel at the granularity of partitions, a small directory cannot fully use the power of multiple servers, as it contains few partitions. On the other hand, lookup operations are processed in parallel at the granularity of cmetas, so their performance increases linearly with the size of the directory. Since the partitions of a small directory can be cached in memory, the performance for small directories is much higher than that for large directories. We have evaluated our system at a directory size of 1 billion files. It needs about 1.2TB to hold all the metadata. Since the size of a cmeta is 368B, 1 billion cmetas occupy 368GB, so we use about three times as much metadata storage as needed. To support a file system with a trillion files, we need to reduce the holes in directory partitions to improve storage utilization.
5.3 Load Balance

Consistent hashing is used to assign partitions to servers in order to automatically balance load among the servers when the server configuration changes. However, this method may lead to imbalanced load as a result of an imbalanced distribution of partitions among a small number of servers. Furthermore, the accesses to the contents of each partition may also vary distinctly. To deal with these problems, our system exploits the dynamic load balancing method described in Section 3.4. In our dynamic load-balance method, the load threshold which triggers the rebalancing of loads and the amount of partitions that need to be migrated are two key factors for the actual load distribution. To investigate their influence on metadata performance, we used mdtest to produce workloads on 32 metadata servers from 200 clients.
Figure 9: Load Balance Effect of different Parameters: (a) average CPU time per server (minutes), (b) standard deviation of per-server CPU time, (c) file creation throughput (ops/sec) and (d) file stat throughput (ops/sec), each plotted against the balance threshold (CPU idle percentage) for migration percentages of 5%, 10% and 15%. The migration percentage is the amount of the hash range that migrates between a busy server and an idle server on each rebalance operation; the number of partitions to be migrated is determined by the migration percentage and the range of hash function values of the busy server.

In each test, mdtest created 10 million files in the same directory, stat'ed them afterwards, and finally deleted them; this process was repeated 3 times. The CPU idle percentage is used to measure the load threshold: a high CPU idle percentage means a low load threshold. Figures 9(a) and (b) show that aggressive load balancing with a lower load threshold is better for load balance. The reason is that each file is accessed the same number of times in the test workloads produced by mdtest, so the load imbalance among the servers is due to the imbalanced distribution of partitions among the servers. In this situation, a low load threshold results in frequent rebalancing, and the load among the servers can stay balanced most of the time. We suspect that for more dynamic and changing workloads, aggressive load balancing may not always be good. These two figures also show that a lower migration percentage is good for a low load threshold, while a higher migration percentage is good for a high load threshold. That is because frequent rebalancing needs a small balance granularity to avoid disturbing normal processing, while a large balance granularity is fine for less frequent rebalancing. Figures 9(c) and (d) show that migrating 10 percent of the partitions to a lightly loaded server is best in this situation, where each file is equally accessed. A smaller percentage results in a longer time to reach balance, while a larger percentage disturbs the normal processing of the current working set and decreases overall performance. As the amount of partitions to be migrated is a trade-off between the time needed to reach balance and the interference with current processing, we suspect that around 10 percent is preferable.
6. CONCLUSION

Directory organization and partitioning are the key factors that determine the metadata performance of large file systems and of directories containing billions or trillions of files. To efficiently manage trillions of files in a single cluster file system, as increasingly required by both Internet-service and high-end computing applications, we present an adaptive and scalable metadata management system that dynamically partitions each directory using an extendible-hashing-based method. It achieves high metadata performance by splitting a directory into partitions while the directory is small and by enlarging partitions with more chunks once the directory grows large; both levels of splitting are performed automatically, adapting to directories of different sizes and to their growth. Furthermore, our system exploits fine-grained parallel processing within a single directory, multiple-layered metadata cache management, and a dynamic load-balancing mechanism based on consistent hashing, which further improve metadata processing performance. The experimental results show that our adaptive and scalable directory partitioning method outperforms static partitioning methods. Performance tests on 32 metadata servers show that our system can create more than 74,000 files per second and look up more than 270,000 files per second in a single directory with 100 million files. The prototype implementation delivers a peak throughput of more than 60,000 file creates per second in a single directory with 1 billion files, which demonstrates that it can maintain billions of files efficiently. Moreover, the system scales well across multiple metadata servers, and the dynamic load-balancing mechanism quickly restores balance among all the servers.
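For readers unfamiliar with the underlying technique, the following is a compact Python sketch of classic extendible hashing [7], the building block on which the two-level directory partitioning summarized above is based. It shows only the textbook single-level algorithm, not our two-level partition/chunk scheme; Bucket, ExtendibleDirectory, and BUCKET_CAPACITY are illustrative names.

    import hashlib

    BUCKET_CAPACITY = 4  # deliberately small so splits are easy to observe

    def _hash(name):
        return int(hashlib.md5(name.encode()).hexdigest(), 16)

    class Bucket:
        def __init__(self, local_depth):
            self.local_depth = local_depth
            self.entries = {}  # filename -> metadata stub

    class ExtendibleDirectory:
        def __init__(self):
            self.global_depth = 0
            self.slots = [Bucket(0)]  # 2**global_depth pointers to buckets

        def _slot(self, name):
            return _hash(name) & ((1 << self.global_depth) - 1)

        def insert(self, name, meta=None):
            bucket = self.slots[self._slot(name)]
            bucket.entries[name] = meta
            if len(bucket.entries) > BUCKET_CAPACITY:
                self._split(bucket)

        def _split(self, bucket):
            if bucket.local_depth == self.global_depth:
                # Double the slot table; both halves initially share the old buckets.
                self.slots += self.slots
                self.global_depth += 1
            new = Bucket(bucket.local_depth + 1)
            bucket.local_depth += 1
            bit = 1 << (bucket.local_depth - 1)
            # Entries whose newly significant hash bit is 1 move to the new bucket.
            for name in [n for n in bucket.entries if _hash(n) & bit]:
                new.entries[name] = bucket.entries.pop(name)
            # Repoint the slots that now belong to the new bucket.
            for i, b in enumerate(self.slots):
                if b is bucket and i & bit:
                    self.slots[i] = new
            # If the redistribution left either bucket overfull, split again.
            for b in (bucket, new):
                if len(b.entries) > BUCKET_CAPACITY:
                    self._split(b)

    d = ExtendibleDirectory()
    for i in range(100):
        d.insert("file-%04d.dat" % i)
    unique_buckets = {id(b): b for b in d.slots}.values()
    print(d.global_depth, sum(len(b.entries) for b in unique_buckets))  # depth, 100

Because only the overfull bucket splits (doubling the slot table only when its local depth catches up with the global depth), the structure grows incrementally with the directory, which is the property the two-level scheme builds on.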
7. ACKNOWLEDGMENTS

This work is supported in part by the National High-Tech Research and Development Program of China under grant 2006AA01A102. The authors gratefully acknowledge the support of the K. C. Wong Education Foundation, Hong Kong. The authors thank Professor Weisong Shi of Wayne State University for his discussion and helpful advice. We also thank Panyong Zhang and Zhigang Huo for managing the Dawning5000 testbed. The final version has benefited greatly from the many detailed comments and suggestions of the anonymous reviewers and of our shepherd, Professor Garth Gibson, and his student Swapnil Patil from CMU.
8. REFERENCES

[1] Large Synoptic Survey Telescope. http://www.lsst.org/lsst, 2008.
[2] P. Braam, M. Callahan, and P. Schwan. The InterMezzo file system. In Proceedings of the 3rd Perl Conference, O'Reilly Open Source Convention, Monterey, 1999.
[3] Peter J. Braam. The Lustre storage architecture. 2004.
[4] Peter F. Corbett and Dror G. Feitelson. The Vesta parallel file system. ACM Trans. Comput. Syst., 14(3):225–264, 1996.
[5] DARPA/IPTO. Exascale computing study: Technology challenges in achieving exascale systems. 2008.
[6] John R. Douceur and Jon Howell. Distributed directory service in the Farsite file system. In OSDI '06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 321–334, Berkeley, CA, USA, 2006. USENIX Association.
[7] Ronald Fagin, Jurg Nievergelt, Nicholas Pippenger, and H. Raymond Strong. Extendible hashing—a fast access method for dynamic files. ACM Trans. Database Syst., 4(3):315–344, 1979.
[8] Gregory R. Ganger and M. Frans Kaashoek. Embedded inodes and explicit grouping: exploiting disk bandwidth for small files. In ATEC '97: Proceedings of the USENIX Annual Technical Conference, pages 1–17, Berkeley, CA, USA, 1997. USENIX Association.
[9] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. In SOSP '03: Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 29–43, New York, NY, USA, 2003. ACM.
[10] Todd Hoff. Flickr architecture. http://highscalability.com/flickr-architecture, 2007.
[11] Yu Hua, Yifeng Zhu, Hong Jiang, Dan Feng, and Lei Tian. Scalable and adaptive metadata management in ultra large-scale file systems. In ICDCS '08: Proceedings of the 28th International Conference on Distributed Computing Systems, pages 403–410, Washington, DC, USA, 2008. IEEE Computer Society.
[12] Lawrence Livermore National Laboratory. mdtest-1.7.4. http://sourceforge.net/projects/mdtest/, 2007.
[13] E. L. Miller and R. H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, pages 419–446, 1997.
[14] John K. Ousterhout, Andrew R. Cherenson, Frederick Douglis, Michael N. Nelson, and Brent B. Welch. The Sprite network operating system. Computer, 21(2):23–36, 1988.
[15] Swapnil V. Patil and Garth Gibson. GIGA+: scalable directories for shared file systems. http://highscalability.com/flickr-architecture, 2008.
[16] Swapnil V. Patil, Garth Gibson, Sam Lang, and Milo Polte. GIGA+: scalable directories for shared file systems. In PDSW '07: Proceedings of the 2nd International Workshop on Petascale Data Storage, pages 26–29, 2007.
[17] Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, and David Hitz. NFS version 3 design and implementation. In Proceedings of the Summer USENIX Conference, pages 137–152, 1994.
[18] Daniel Phillips. A directory index for ext2. In ALS '01: Proceedings of the 5th Annual Linux Showcase & Conference, pages 20–20, Berkeley, CA, USA, 2001. USENIX Association.
[19] O. Rodeh and A. Teperman. zFS—a scalable distributed file system using object disks. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 207–218, 2003.
[20] Frank Schmuck and Roger Haskin. GPFS: A shared-disk file system for large computing clusters. In FAST '02: Proceedings of the 1st USENIX Conference on File and Storage Technologies, pages 231–244, Berkeley, CA, USA, 2002. USENIX Association.
[21] Adam Sweeney, Doug Doucette, Wei Hu, Curtis Anderson, Mike Nishimoto, and Geoff Peck. Scalability in the XFS file system. In Proceedings of the USENIX 1996 Annual Technical Conference, 1996.
[22] Peter Vajgel. Needle in a haystack: efficient storage of billions of photos. http://www.facebook.com/note.php?note_id=76191543919, 2009.
[23] Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller. Dynamic metadata management for petabyte-scale file systems. In SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, page 4, Washington, DC, USA, 2004. IEEE Computer Society.
[24] Yifeng Zhu, Hong Jiang, Jun Wang, and Feng Xian. HBA: Distributed metadata management for large cluster-based storage systems. IEEE Trans. Parallel Distrib. Syst., 19(6):750–763, 2008.