2013 IEEE 27th International Symposium on Parallel & Distributed Processing Workshops and PhD Forum
tpNFS: Efficient Support of Small Files Processing over pNFS

Bo Wang, Jinlei Jiang, Guangwen Yang
Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology (TNLIST), Tsinghua University, Beijing 100084, China
Email: [email protected], {jjlei,ygw}@tsinghua.edu.cn

Abstract—Large-scale data-intensive applications that consume and produce terabytes or even petabytes of data place an ever-increasing demand on I/O bandwidth. To meet this demand, the NFSv4 architects designed parallel NFS (pNFS), an NFS extension that allows clients to read/write data from/to multiple data servers in parallel. Although pNFS handles large files efficiently, we found that it is deficient at processing small files. Unfortunately, small files dominate in a large number of applications in scientific computing environments. To deal with the problem, this paper presents tpNFS, an extension to pNFS that adds a transport driver to the pNFS metadata servers so that file data, whether from small files or large files, is striped more evenly across multiple data servers. Our experiments with booting DomU clients from tpNFS and manipulating a large number of files show that tpNFS outperforms pNFS for small files processing, especially when many clients read/write concurrently. For large files processing, tpNFS introduces nearly no overhead compared with pNFS.

Keywords—pNFS, parallel I/O, file system, load balance
I. INTRODUCTION

The explosive growth of data-intensive applications, such as biology, climate modeling, data mining and high-performance computing, has caused a dramatic increase in the volume of data that needs to be stored, classified and accessed, raising an ever-increasing demand on I/O bandwidth. Traditional single-headed NAS (Network Attached Storage) appliances cannot meet this need because of their capacity and I/O limits. As a result, clustered NAS solutions have been introduced, which offer the benefits of no single-node capacity limit, data and metadata distributed across cluster nodes, better aggregate throughput, and data redundancy for better fault tolerance. To access files stored on multiple nodes efficiently, distributed file systems (DFS) have appeared. However, distributed file systems are at a disadvantage when their view of storage goes through a single file system node: data must always travel through this intermediary node on its way into or out of storage. This extra layer of processing prevents distributed file systems from delivering the full potential of the underlying storage systems, even for a single client. As one of the most popular distributed file systems, the Network File System (NFS) was first developed by Sun Microsystems in 1984 [25] and then established as a
standard in 1986. The standard continues to evolve to meet the changing needs of an increasingly dynamic industry, and various versions of NFS have been released as a result. Until NFSv4, however, NFS adopted a "single server" architecture that binds one network endpoint to all files in a file system. Recently emerging distributed file systems such as GFS (Google File System) [20] and HDFS (Hadoop Distributed File System) [21] contain only one metadata server (called the Master in GFS and the NameNode in HDFS) to simplify management. To provide fast metadata operations, the metadata (namespaces and file-to-data mapping information) of both GFS and HDFS is kept in memory, so the number of files that can be supported is limited by the memory capacity of the metadata server. What is more, too many metadata requests per second can flood the metadata server's network, hindering the performance of the whole file system. In GFS and HDFS, data is pushed linearly along a chain of data servers (called ChunkServers in GFS and DataNodes in HDFS), and the chain is carefully selected to fully utilize each machine's network bandwidth. Obviously, even for a single client, the maximum throughput of this chain is the minimum I/O speed of the machines in it. When client throughput is much larger than that of the data servers, the linear chain becomes the bottleneck. Finally, GFS and HDFS do provide a file system interface, but it is not POSIX-compliant, so software developers must adapt their programs before running them on GFS or HDFS. To deal with these problems, Parallel NFS (pNFS) [1] has been introduced as an extension to NFS. pNFS decouples the data path of NFS from its control path, enabling concurrent I/O streams from many client processes to directly access multiple storage servers. By separating control and data flows, pNFS allows data to be transferred in parallel between storage endpoints and clients. To foster parallel I/O and utilize network bandwidth efficiently, pNFS stripes file data into fixed-size blocks when writing files and stores the blocks on multiple data servers; in practice, the block size can be configured by users. pNFS has many attractive features. 1) It supports a variety of layout drivers, which makes pNFS able to run on many existing I/O and storage protocols. 2) It provides a
straightforward way to migrate and upgrade existing NFS deployments to improve scalability and performance. 3) Many studies have shown that pNFS is a promising technology for wide-area storage access. 4) Among the many distributed file systems such as GPFS [24], PVFS [23] and GFS [20], pNFS is the only open standard developed and supported by many commercial and non-profit organizations. As a result, pNFS has been widely deployed on a large number of architectures and platforms, ranging from workstations to high-end server farms. Although pNFS supports large files efficiently, we found that it is deficient at processing small files. Figure 1 shows the aggregate read and write throughput for big and small files on pNFS. While pNFS achieves linear aggregate I/O throughput for big files as the number of clients increases, its throughput for small files remains unchanged. Unfortunately, a large number of applications in scientific computing environments deal with small files [17]. In the field of climatology, the Community Climate System Model involves 450,000 files with an average size of 61MB [18]. In biology, some applications can generate 30 million files with an average size of 190KB. Similarly, the image files in astronomy are smaller than 1MB, but their number may exceed 20 million [19]. Xen [22], one of the most widely used virtual machine monitors, can place its root file system on a pNFS server so that client systems can boot from it. This approach is of particular use for migrating Xen DomU guests from one host to another: since both the original host and the target host to which the guest will migrate need to access the DomU's root file system, pNFS provides an ideal mechanism to do this. An analysis of the files in such a root file system (Ubuntu Server 11.04 in our case) showed that most files are small. More precisely, 92.34% of the files are below 40KB, 94.61% below 80KB and 96.62% below 200KB.
In order to bypass the bottleneck of small files processing over pNFS, we present tpNFS, an extension to pNFS that adds a transport driver to the metadata servers so that file data, whether from small files or large files, is striped more evenly across multiple data servers. Our experimental results show that with tpNFS the file data is striped almost evenly over the data servers, whereas with pNFS the file data resides only on the first eight data servers and the remaining data servers are almost empty. Booting two DomU clients from pNFS takes 35.67s, whereas booting them from tpNFS takes 33.18s. Booting 32 DomU clients at the same time from pNFS takes 97.64s, but doing so with tpNFS takes only 45.86s, i.e., 46.97% of the pNFS time. The aggregate read and write throughput of pNFS is 527.68MB/s and 136.55MB/s respectively, whereas that of tpNFS is 1004.82MB/s and 281.27MB/s, an improvement of 90.42% and 105.98% respectively.

The rest of the paper is organized as follows. An overview of pNFS is provided in Section II. Section III then analyses the problem of small files processing over pNFS and describes how tpNFS is implemented. After that, Section IV evaluates the performance of tpNFS. Section V reviews related work. Finally, we conclude the paper in Section VI.

II. OVERVIEW OF pNFS

By separating control and data flows, pNFS allows data to be transferred in parallel between storage endpoints and clients, which removes the single-server bottleneck. As shown in Figure 2, the pNFS architecture adds a layout driver and an I/O driver to the standard NFSv4 architecture. The NFSv4.1 client can communicate with any parallel file system using the layout and I/O drivers. The NFSv4.1 server has multiple roles. It acts as the metadata server (MDS) for the parallel file system and sends the clients information on how to access the back-end cluster file system. This information, returned by the GETDEVICEINFO operation, usually contains an IP address and a port number that are stored by the client layout driver. The NFSv4.1 server may either communicate directly with the data servers, or it may communicate with a separate metadata server that is responsible for talking to and controlling the data servers in the parallel file system. To bypass the "single server" bottleneck, pNFS can also be configured with multiple metadata servers.
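To make this control/data split concrete, the sketch below shows, in simplified C, the kind of information a file-layout client ends up caching after LAYOUTGET and GETDEVICEINFO: a device ID that names an ordered list of data server addresses, plus a stripe unit size. This is an illustrative sketch only, not the actual NFSv4.1 wire format or the Linux kernel data structures; all names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical, simplified view of what a pNFS file-layout client caches.
 * The real NFSv4.1 structures are richer; this only illustrates the
 * separation of the metadata path (to the MDS) from the data path
 * (directly to the data servers). */

struct ds_addr {
    char     ip[46];      /* data server IP address (IPv4/IPv6 text form) */
    uint16_t port;        /* data server port */
};

struct device_info {
    uint64_t device_id;   /* identifier returned by GETDEVICEINFO */
    uint32_t ds_count;    /* number of data servers in the stripe */
    struct ds_addr *ds;   /* ordered list: stripe index -> data server */
};

struct file_layout {
    uint64_t device_id;   /* which device (i.e., which DS ordering) to use */
    uint32_t stripe_unit; /* block size, e.g. 40KB in this paper's setup */
    /* one filehandle per data server, in stripe order (omitted here) */
};
```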
Figure 1. Big and Small Files Aggregate RW Throughput on pNFS (aggregate throughput in MB/s versus number of clients, for small and big files, read and write)
A. pNFS Operations Flow
Let us walk through a real example of the operation flow between a pNFS client, the metadata server and the storage devices. When a pNFS client encounters a new FSID, it sends a GETATTR to the NFSv4.1 server for the FS_LAYOUT_TYPE attribute. If the attribute returns at least one layout type, and the layout types returned are among the set supported by the client, the client knows that pNFS is a possibility for this file system. Assuming the client supports the layout type returned by GETATTR and chooses to use pNFS for data access,
it then sends LAYOUTGET using the FILEHANDLE and STATEID returned by OPEN, specifying the range it wants to do I/O on. The response is a layout, which may be a subset of the range for which the client asked. It also includes device IDs and a description of how data is organized (or, in the case of writing, how data is to be organized) across the devices. When the client wants to issue an I/O, it determines which device ID it needs to send the I/O command to by examining the data description in the layout. It then sends a GETDEVICEINFO to find the device address of that device ID. The client then sends the I/O request to one of the device ID's device addresses, using the storage protocol defined for the layout type. If the I/O was a WRITE, then at some point the client may use LAYOUTCOMMIT to commit the modification time and the new size of the file to the metadata server and the modified data to the file system [1]. The flow chart of the operations is shown in Figure 3.
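The sequence above can be summarized as a client-side sketch. The following C outline is illustrative only; all types and function names (pnfs_layoutget, pnfs_getdeviceinfo, and so on) are hypothetical stand-ins for the corresponding NFSv4.1 operations, not a real client API.

```c
/* Illustrative outline of the pNFS client operation flow described above.
 * All types and functions are hypothetical placeholders for the real
 * NFSv4.1 operations, declared here only so the sketch is self-contained. */
struct pnfs_file;     /* an open file on the pNFS mount            */
struct pnfs_layout;   /* result of LAYOUTGET: device ID + striping */
struct pnfs_device;   /* result of GETDEVICEINFO: DS addresses     */

int                 pnfs_getattr_layout_types(struct pnfs_file *f);
struct pnfs_layout *pnfs_layoutget(struct pnfs_file *f);
struct pnfs_device *pnfs_getdeviceinfo(struct pnfs_layout *lo);
void                pnfs_do_io(struct pnfs_device *d, struct pnfs_layout *lo, int is_write);
void                pnfs_layoutcommit(struct pnfs_file *f, struct pnfs_layout *lo);

void pnfs_client_io(struct pnfs_file *f, int is_write)
{
    /* 1. On a new FSID: does the server advertise a layout type we support? */
    if (!pnfs_getattr_layout_types(f))             /* GETATTR(FS_LAYOUT_TYPE) */
        return;                                    /* fall back to plain NFSv4 I/O */

    /* 2. Request a layout for the byte range to be accessed. */
    struct pnfs_layout *lo = pnfs_layoutget(f);    /* LAYOUTGET */

    /* 3. Resolve the layout's device ID to data server addresses. */
    struct pnfs_device *dev = pnfs_getdeviceinfo(lo); /* GETDEVICEINFO */

    /* 4. Send READ/WRITE directly to the data servers named by the layout. */
    pnfs_do_io(dev, lo, is_write);

    /* 5. For writes, commit the new size and mtime back to the MDS. */
    if (is_write)
        pnfs_layoutcommit(f, lo);                  /* LAYOUTCOMMIT */
}
```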
Figure 2. pNFS Architecture (NFSv4.1 client with layout and I/O drivers, NFSv4.1 metadata server(s), and the storage/data servers)
B. pNFS Using a File Layout

As shown in Figure 2, the pNFS client uses the file layout and I/O drivers to communicate with the data servers. The layout driver translates READ and WRITE requests from the upper layer into the protocol used by the back-end parallel file system, i.e., an object-, block- or file-based layout. In this paper, we use the file-based model. The NFSv4.1 clients use a file layout driver to communicate with the NFSv4 servers, which act as the data servers. At the NFSv4.1 server (which acts as the metadata server), the spNFS daemon runs in user space and communicates with the metadata server in the kernel via the RPC PipeFS, which is essentially a wait queue in the kernel. The metadata server enqueues requests arriving from the clients over the control path and pushes them to the spNFS daemon via an upcall. The spNFS daemon then processes each of these requests and makes a downcall into the kernel with the appropriate data/response for the request. The metadata requests sent up to the spNFS daemon for processing include the layout-related procedures, namely LAYOUTGET, LAYOUTRETURN, LAYOUTCOMMIT and GETDEVICEINFO. To process these requests, the spNFS daemon mounts an NFSv3 directory from each of the data servers. For example, when the metadata server receives a file create request from a client, the spNFS daemon opens the file on the mount of each of the data servers in create mode. The resulting set of open file handles is returned to the metadata server as part of the response to the upcall. The metadata server then replies to the NFSv4.1 client with the file layout, with which the client communicates with the data servers.
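As an illustration of the MDS-side handling just described, the sketch below shows, in simplified C, how a user-space daemon might answer a file-create upcall by opening the file on an NFSv3 mount of every data server and handing the results back to the kernel. This is a hedged sketch of the idea, not the actual spNFS source; paths, helpers (ds_mount_path, send_downcall) and the use of plain file descriptors are all hypothetical simplifications.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_DS 16   /* number of data servers, as in the paper's testbed */

/* Hypothetical helpers: where data server i's NFSv3 export is mounted,
 * and how the daemon pushes a response back into the kernel (downcall). */
extern const char *ds_mount_path(int ds_index);          /* e.g. "/spnfs/ds3" */
extern void send_downcall(const char *file, int fds[], int n);

/* Handle a CREATE upcall for `file`: create it on every data server and
 * hand the per-DS handles back so the MDS can build the file layout. */
static void handle_create_upcall(const char *file)
{
    int  fds[NUM_DS];
    char path[4096];

    for (int i = 0; i < NUM_DS; i++) {
        snprintf(path, sizeof(path), "%s/%s", ds_mount_path(i), file);
        fds[i] = open(path, O_CREAT | O_RDWR, 0644);   /* create on DS i */
    }

    /* Downcall: the kernel-side MDS turns these into the filehandle list
     * that is returned to the NFSv4.1 client inside the file layout. */
    send_downcall(file, fds, NUM_DS);

    for (int i = 0; i < NUM_DS; i++)
        if (fds[i] >= 0)
            close(fds[i]);
}
```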
III. tpNFS: PROBLEM ANALYSIS AND IMPLEMENTATION

In this section, we examine how to bypass the bottleneck of small files processing over pNFS with the file layout driver.

A. Problem Analysis

When a client writes a file to the server, the data of the file is distributed over the data servers in blocks of a specific size, which can be configured by users. If the block size is too small, the file is divided into many blocks spread over the data servers, which can hurt I/O performance. However, if the block size is too big, in particular bigger than the file size, all the data of the file ends up on the first data server. In the evaluation (Section IV-B) we find that the best I/O performance is achieved with a block size of 40KB across different file sizes. For big files (e.g., megabytes or larger), the data is striped over the data servers and clients can read/write from/to the data servers in parallel. However, if the workload consists of small files, all the files are stored on a single server and the remaining data servers stay empty. In this case, the pNFS server effectively degenerates into an NFS server with only one data server: clients cannot read/write data in parallel, which leads to a highly skewed data distribution across the data servers (as shown in Figure 6) and consequently poor throughput (as shown in Figure 1). To bypass this skew bottleneck of small files processing, we have to distribute small files more evenly over all the data servers rather than on only one of them. We therefore present tpNFS, an extension to pNFS that adds a transport driver to the pNFS metadata servers, acting as a network load balancer.
Figure 3. Flow Chart of pNFS Operations (1: GETATTR/FS_LAYOUT_TYPE, 2: LAYOUTGET/layout info and 3: GETDEVICEINFO/device info between client and MDS; 4: READ/WRITE and I/O data between client and storage; 5: LAYOUTCOMMIT/layout info back to the MDS)
As described in Section II-A, clients write file data to the data servers according to the layout information obtained from the metadata server. To stay compatible with the conventional interfaces and require no change on the clients, tpNFS intercepts the return values of the MDS operations, modifies them, and then sends them to the clients. From the clients' point of view, tpNFS is transparent and indistinguishable from raw pNFS. In pNFS, file data is distributed according to the layout information. To balance the data distribution, all return information of the layout-related operations is intercepted and adapted in tpNFS. We mainly intercept the layout-related procedures, i.e., LAYOUTGET, LAYOUTRETURN, LAYOUTCOMMIT and GETDEVICEINFO.

B. Implementation

1) LAYOUTGET: The LAYOUTGET operation requests a layout from the metadata server for reading or writing the file given by the FILEHANDLE at the byte range specified by offset and length. The layouts returned by the operation describe how the data is distributed on the data servers. The information we are mainly concerned with is the Device ID and the data server FILEHANDLEs: the Device ID identifies the data servers, and the FILEHANDLEs identify the file data written on each data server. We modify these two pieces of information to redirect the data flow. A Device ID corresponds to an array of data servers whose order decides the data writing sequence. In raw pNFS, the order is always the same, i.e., DS1 is first, DS2 is second, DS3 is third, and so on. Consequently, in raw pNFS the first block of file data invariably lands on DS1; in other words, small files always sit on the first few data servers, leaving the rest of the data servers empty. As a result, small files on pNFS obtain poor aggregate I/O throughput when many clients read/write concurrently, as can be seen in Figure 1. In the transport driver, we use virtual Device IDs (discussed in Section III-B4) to identify different orderings of the data servers:

DeviceID_Locator(i) = (Hash(file_inode) + i) mod DataServers_Length    (1)

Hashing the inode randomizes each file's starting data server, ensuring that small files are distributed more evenly over the data servers. Adding the block index i outside the hash ensures that large files use all data servers uniformly. Using Hash(file_inode + i) instead would effectively select a data server independently at random for each block, producing a binomial rather than a uniform distribution of blocks over the data servers; as the number of data servers increases, the occupancy deviation between the least-filled and the most-filled servers would also increase [27].
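To make Equation (1) concrete, the following small C sketch computes, for a given file and block index, which data server holds that block under the tpNFS placement rule. The inode hash used here (a simple multiplicative hash) is only a stand-in for whatever hash function the transport driver actually uses.

```c
#include <stdint.h>
#include <stdio.h>

#define NUM_DS 16                /* DataServers_Length in Equation (1) */

/* Stand-in inode hash; the real transport driver may use a different one. */
static uint32_t hash_inode(uint64_t inode)
{
    return (uint32_t)(inode * 2654435761u);   /* Knuth multiplicative hash */
}

/* Equation (1): data server index of the i-th block of a file. */
static int block_to_ds(uint64_t inode, uint32_t block_index)
{
    return (hash_inode(inode) + block_index) % NUM_DS;
}

int main(void)
{
    /* A small file (one 40KB block) starts on a hashed data server
     * instead of always on DS1, so different small files spread out. */
    printf("file inode 1001, block 0 -> DS%d\n", block_to_ds(1001, 0) + 1);

    /* A large file walks round-robin from its hashed starting point. */
    for (uint32_t i = 0; i < 4; i++)
        printf("file inode 2002, block %u -> DS%d\n", i, block_to_ds(2002, i) + 1);
    return 0;
}
```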
Corresponding to the adjusted data server order, the order of the FILEHANDLEs is adjusted as well; each FILEHANDLE corresponds to the data server on which its data resides:

FILEHANDLES[i] = GETFH(file_path[(Hash(file_inode) + i) mod DataServers_Length])    (2)

2) LAYOUTRETURN: This operation informs the server that the obtained layout information is no longer required. Clients return a layout voluntarily or when they receive a server recall request. Because we adjust the data server order, we have to modify this operation to fit the actual data distribution. When the server receives layout information from a client, the order of the FILEHANDLEs corresponds to the adjusted data server order. To match the logical relation of raw pNFS, the FILEHANDLE order must be adjusted before it is stored:

FILEHANDLES[i] = RECEIVED_FHS[(Hash(file_inode) + i) mod DataServers_Length]    (3)

3) LAYOUTCOMMIT: The LAYOUTCOMMIT operation commits changes to the layout information. The client uses this operation to commit or discard provisionally allocated space, update the end of file, and fill in existing holes in the layout. We also have to modify this operation to make sure the files on each data server are updated correctly. When the MDS gets the committed changes from a client, the layout information must be modified to match the real data server order before it is stored:

FILEHANDLES[i] = CHANGED_FHS[(Hash(file_inode) + i) mod DataServers_Length]    (4)

4) GETDEVICEINFO: The GETDEVICEINFO operation returns pNFS storage device address information for the specified Device ID. The client uses this information to find which storage device to get/send file data from/to, and each FILEHANDLE must match the corresponding data server for the data to be read or written correctly. Since we adjust the order of the FILEHANDLEs to distribute file data uniformly, we also have to adjust the data server order to match. We create N virtual storage devices (N being the number of data servers), each of which contains all the data servers but in a different order. The Device IDs range from 1 to N, and each one corresponds to a sequence of data servers. For a given Device ID, the storage device address information is calculated as:

DSLIST[i] = GET_DSADDR(DataServers[(DeviceID + i) mod N])    (5)
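The sketch below illustrates the virtual device idea in Equation (5): N device IDs, each naming the same data servers but rotated to a different starting point, so that the FILEHANDLE order handed out by LAYOUTGET always lines up with the address list returned by GETDEVICEINFO. As before, this is an illustrative sketch with hypothetical names and addresses, not the actual tpNFS implementation.

```c
#include <stdio.h>

#define N 16   /* number of data servers = number of virtual devices */

/* Hypothetical data server addresses, indexed DS1..DS16. */
static const char *ds_addr[N] = {
    "10.0.0.1",  "10.0.0.2",  "10.0.0.3",  "10.0.0.4",
    "10.0.0.5",  "10.0.0.6",  "10.0.0.7",  "10.0.0.8",
    "10.0.0.9",  "10.0.0.10", "10.0.0.11", "10.0.0.12",
    "10.0.0.13", "10.0.0.14", "10.0.0.15", "10.0.0.16",
};

/* Equation (5): the i-th address in the DS list of virtual device
 * `device_id` (1..N).  Each virtual device is a rotation of the same
 * server list, so the N devices cover N different starting servers. */
static const char *dslist(int device_id, int i)
{
    return ds_addr[(device_id + i) % N];
}

int main(void)
{
    /* Print the first four data servers of virtual devices 1 and 5. */
    for (int d = 1; d <= 5; d += 4) {
        printf("virtual device %d:", d);
        for (int i = 0; i < 4; i++)
            printf(" %s", dslist(d, i));
        printf("\n");
    }
    return 0;
}
```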
IV. EVALUATION

In this section, we evaluate the performance of tpNFS for small files. First, we describe the experimental environment in Section IV-A. Then, Section IV-B looks at pNFS performance with different block sizes, and Section IV-C examines the performance advantages of tpNFS.
A. Experimental Environment
To evaluate the performance of tpNFS, we used a network of thirty-three identical nodes partitioned into sixteen clients, sixteen data servers and one metadata server. Each node has four 2 GHz quad-core processors with 8 GB of DDR RAM. Each node is equipped with one 500GB, 7200RPM Western Digital Caviar Blue SATA 6Gb/s disk, which has a sequential read speed of 136.72MB/s and a sequential write speed of 130.59MB/s. All the machines have a Gigabit Ethernet card and are connected via a gigabit Ethernet switch. The operating system kernel is Linux 2.6.37.
B. pNFS Performance with Various Block Sizes

Our first experiment investigates pNFS I/O performance with various data server block sizes. We read/write data of 512KB, 1MB, 4MB, 16MB, 64MB and 256MB from/to the pNFS server. As can be seen from Figure 4 and Figure 5, the read/write speed increases with the block size for all data sizes, reaching its maximum at a block size of 40960B (40KB). Therefore, in the subsequent experiments, we set the data server block size to 40KB.
C. tpNFS Versus pNFS

In many scientific environments, a large number of applications involve large quantities of small files, as described in [17], [18], [19]; e.g., in biology, some applications generate 30 million files with an average size of 190KB, and in astronomy the image files are smaller than 1MB but exceed 20 million in number [19]. Xen [22], one of the most widely used virtual machine monitors, can install its root file system on a pNFS server so that it can be accessed and booted from by a client system. This approach is of particular use when migrating Xen DomU guests from one host to another: since both the original host and the target host to which the guest is migrating need access to the DomU's root file system, pNFS provides an ideal mechanism by allowing the root file system located on the pNFS server to be mounted on both client hosts. We counted the system files of different sizes in Ubuntu Server 11.04: 92.34% of the files are below 40KB, 94.61% below 80KB and 96.62% below 200KB. A Xen guest DomU can boot from a pNFS-mounted root file system. After creating the root file system on the pNFS server, we checked the file data distribution on each data server; the result is shown in Figure 6.
Figure 4. pNFS Write Performance on Various Block Sizes (write speed in MB/s versus block size in bytes, for data sizes of 512KB to 256MB)

Figure 5. pNFS Read Performance on Various Block Sizes (read speed in MB/s versus block size in bytes, for data sizes of 512KB to 256MB)

Figure 6. Data Distribution (percentage of files seated on each data server DS1-DS16, pNFS versus tpNFS)
Using raw pNFS, all the system files have data on DS1, 7.65% of the files have data on DS2, and no more than 2.5% of the files have data on DS8 and the data servers behind it. This is because the data of each file is striped according to the following rule: the first 40KB block goes to DS1, the next 40KB block (if the file has one) goes to DS2, the third 40KB block goes to DS3, and so on. Using tpNFS, the file data is distributed evenly over the data servers, with each data server holding data of approximately 8.7% of the files. The first data block of each file is placed on a randomized data server (i.e., on DS_random) rather than always on DS1, the second data block on DS_random+1, the third block on DS_random+2, and so on. The added transport driver thus distributes file data evenly over every data server.
We measured the Xen DomU boot time with various numbers of DomU clients booting simultaneously from pNFS and tpNFS; the results are shown in Figure 7. Two DomU clients take 35.67s to boot from raw pNFS and 33.18s from tpNFS. As the number of DomU clients increases, the boot time on both servers increases, but it grows much faster on raw pNFS than on tpNFS. Starting 32 DomU clients at the same time on raw pNFS takes 97.64s, which is 2.7 times the time for 2 DomU clients. By contrast, starting 32 DomU clients on tpNFS takes only 45.86s, only 1.38 times the time for 2 DomU clients and only 46.97% of the raw pNFS time.
Figure 7. DomU Boot Time (boot time in seconds versus number of DomU clients, pNFS versus tpNFS)

Figure 8. Aggregate Write Throughput with Sixteen Clients and Separate Small Files (aggregate throughput in MB/s versus number of clients, pNFS versus tpNFS)

Figure 9. Aggregate Read Throughput with Sixteen Clients and Separate Small Files (aggregate throughput in MB/s versus number of clients, pNFS versus tpNFS)
We compared the I/O performance of tpNFS and pNFS by writing/reading small files (the system files of Ubuntu Server 11.04) with various numbers of clients (from 1 to 16) working concurrently. Each test is repeated 10 times and the mean result is reported. To avoid write conflicts, we create one folder per client so that different clients write/read to/from different paths. Figure 8 and Figure 9 show the write and read performance, respectively, with each client writing/reading separate small files. Raw pNFS achieves a maximum aggregate throughput of 136.55MB/s (write) and 527.68MB/s (read), whereas tpNFS reaches a maximum aggregate throughput of 281.27MB/s (write) and 1004.82MB/s (read). From Figure 6 we can see that with raw pNFS most file data sits on the first 6 or 7 data servers, which means clients only perform I/O against these 6 or 7 data servers. When 10 or 11 clients write/read concurrently, the throughput peaks, limited by the hard disk I/O bandwidth of those 6 or 7 data servers. By contrast, with tpNFS the file data is distributed over every data server and clients perform I/O against all the data servers, so the I/O throughput grows linearly as the number of clients increases. Figure 10 presents the write/read performance with 16/32 clients and big files (between 500MB and 1GB in size). When writing/reading big files, the performance of raw pNFS and
tpNFS is almost the same: tpNFS has very little effect on big file I/O performance, imposing only a minor penalty.

V. RELATED WORK

There is a long history of research on NFS. It has served the industry well since its introduction and establishment as a standard in 1986, and the standard has continued to evolve to meet the changing needs of an increasingly dynamic industry. The latest version, NFSv4.1, proposes a number of salient features, such as client delegation, compound operations and many others. As part of the NFSv4.1 standard, pNFS benefits from the robustness of the standards process as well as from considerable private industry expertise and experience and input from the open source community. A large number of very capable people and organizations have contributed technology and have been involved in the pNFS effort, including many with considerable experience building early proprietary parallel file systems. Hildebrand et al. first demonstrated a prototype of pNFS over PVFS2 [4] in 2005; it achieved high-throughput access to a high-performance file system, with aggregate throughput equal to that of its exported file system and far exceeding standard NFSv4 performance.
Figure 10. Aggregate Read/Write Throughput with 16/32 Clients and Big Files (aggregate throughput in MB/s for 16 clients read, 16 clients write, 32 clients read and 32 clients write; pNFS versus tpNFS)
W. Yu et al. designed and implemented lpNFS, a transparent pNFS on Lustre [2], providing fast data transfer paths through fast memory copying of small messages and zero-copy page sharing of large messages. However, they did not focus on the properties of small files and therefore also suffer from the small file bottleneck. In [26], D. Beaver et al. demonstrated a combining method that packs small photo files into a big file to bypass the small files bottleneck. However, this method is tailored to read-only files; for normal files (i.e., files that may be modified frequently) there is a large performance penalty for handling modifications. GlusterFS [28], one of the emerging distributed file systems, is designed without a metadata server to avoid a single point of failure and a performance bottleneck. Instead, GlusterFS uses special algorithms to locate files, and metadata is stored together with file data. In this way, all servers of the system can provide the locating service, enabling completely parallel data access and linear scalability. However, this method hurts file traversal and lacks global monitoring. The clients also have to take on more functions, i.e., file locating, namespace caching and logical volume maintenance, resulting in more CPU and memory consumption on the clients. As an emerging file system, GlusterFS still has certain bugs, is mostly tested by enthusiasts, and is not widely deployed in industrial environments.

VI. CONCLUSIONS

In this paper, we presented tpNFS, an extension to pNFS that bypasses the bottleneck of small files processing over pNFS by adding a transport driver to the metadata servers. The added transport driver redirects the layout information of pNFS so that file data, whether from small files or large files, is striped more evenly onto multiple data servers. In addition, our extension only intercepts the LAYOUTGET, LAYOUTRETURN, LAYOUTCOMMIT and GETDEVICEINFO procedures without changing the pNFS interfaces, so there is no need to modify existing programs to use tpNFS.
We conducted a performance evaluation of tpNFS using data distribution, the boot time of DomU clients and parallel I/O benchmarks, and compared it with that of pNFS. The results indicate that the performance of tpNFS is much better than that of pNFS for small files processing, especially when many clients read/write concurrently.
ACKNOWLEDGMENT
This work is co-supported by the National Basic Research (973) Program of China (2011CB302505), the Natural Science Foundation of China (61170210, 61073165), and the National High-Tech R&D (863) Program of China (2011AA01A203).
REFERENCES

[1] S. Shepler, M. Eisler, and D. Noveck, "Network File System (NFS) Version 4 Minor Version 1 Protocol", http://www.ietf.org/rfc/rfc5661.txt, January 2010.
[2] W. Yu, O. Drokin, and J. S. Vetter, "Design, Implementation, and Evaluation of Transparent pNFS on Lustre", Parallel & Distributed Processing, 2009.
[3] R. Noronha, X. Ouyang, and D. K. Panda, "Designing a High-Performance Clustered NAS: A Case Study with pNFS over RDMA on InfiniBand", Technical Report OSU-CISRC-5/08-TR28, Department of Computer Science and Engineering, The Ohio State University, 2008.
[4] D. Hildebrand and P. Honeyman, "Exporting Storage Systems in a Scalable Manner with pNFS", in Proceedings of the 22nd IEEE/13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005.
[5] G. Gibson and P. Corbett, "pNFS Problem Statement", Internet Draft, http://www.ietf.org/internet-drafts/draft-gibson-pnfs-reqs-00.txt, October 2004.
[6] D. Hildebrand, L. Ward, and P. Honeyman, "Large Files, Small Writes, and pNFS", in Proceedings of the 20th ACM International Conference on Supercomputing (ICS06), Cairns, Australia, June 2006.
[7] D. Hildebrand, M. Eshel, R. Haskin, P. Andrews, P. Kovatch, and J. White, "Deploying pNFS across the WAN: First Steps in HPC Grid Computing", in Proceedings of the 9th LCI International Conference on High-Performance Clustered Computing, Urbana, IL, April 2008.
[8] D. He, X. Zhang, D. H. Du, and G. Grider, "Coordinating Parallel Hierarchical Storage Management in Object-based Cluster File Systems", in Proceedings of the 23rd IEEE Conference on Mass Storage Systems and Technologies (MSST), May 2006.
[9] A. Batsakis, R. Burns, A. Kanevsky, J. Lentini, and T. Talpey, "An Adaptive Write Optimizations Layer", in FAST 2008: Proceedings of the 6th USENIX Conference on File and Storage Technologies, 2008.
[10] F. G. Carballeira, A. Calderon, J. Carretero, J. Fernandez, and J. M. Perez, "The Design of the Expand File System", International Journal of High Performance Computing Applications, vol. 17, pp. 21-37, 2003.
[11] E. J, P. Fuhrmann, Y. Kemp, T. Mkrtchyan, D. Ozerov, and H. Stadie, "LHC Data Analysis Using NFSv4.1 (pNFS): A Detailed Evaluation", Computing in High Energy and Nuclear Physics (CHEP), 2010.
[12] Sun Microsystems Inc., "NFS: Network File System Protocol Specification", RFC 1094, March 1989.
[13] Microsoft Corporation, "CIFS Protocol", http://msdn.microsoft.com/en-us/library/aa302188.aspx.
[14] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M. Eisler, and D. Noveck, "Network File System Version 4 Protocol Specification", http://www.ietf.org/rfc/rfc3530.txt, April 2003.
[15] D. T. Meyer and W. J. Bolosky, "A Study of Practical Deduplication", in Proceedings of the 9th USENIX Conference on File and Storage Technologies (FAST '11), Berkeley, CA, USA, 2011.
[16] D. Hildebrand and P. Honeyman, "Direct-pNFS: Scalable, Transparent, and Versatile Access to Parallel File Systems", in Proceedings of the 16th IEEE International Symposium on High Performance Distributed Computing, Monterey, CA, June 2007.
[17] P. Carns, S. Lang, R. Ross, M. Vilayannur, J. Kunkel, and T. Ludwig, "Small-File Access in Parallel File Systems", in Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium, April 2009.
[18] A. Chervenak, J. M. Schopf, L. Pearlman, M. Su, S. Bharathi, L. Cinquini, M. D'Arcy, N. Miller, and D. Bernholdt, "Monitoring the Earth System Grid with MDS4", International Conference on e-Science and Grid Computing, 2006.
[19] E. H. Neilsen Jr., "The Sloan Digital Sky Survey Data Archive Server", Computing in Science and Engineering, 10(1):13-17, 2008.
[20] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google File System", in 19th Symposium on Operating Systems Principles, Lake George, NY, 2003, pp. 29-43.
[21] K. Shvachko, H. Kuang, S. Radia, and R. Chansler, "The Hadoop Distributed File System", in Proceedings of the Symposium on Mass Storage Systems and Technologies, May 2010, pp. 1-10.
[22] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield, "Xen and the Art of Virtualization", in Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), Bolton Landing, Lake George, New York, October 2003.
[23] "The Parallel Virtual File System, Version 2", http://www.pvfs.org/pvfs2.
[24] F. Schmuck and R. Haskin, "GPFS: A Shared-Disk File System for Large Computing Clusters", in FAST '02, pp. 231-244, USENIX, January 2002.
[25] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, and B. Lyon, "Design and Implementation of the Sun Network Filesystem", in Proceedings of the USENIX Conference, June 1985, pp. 119-130.
[26] D. Beaver, S. Kumar, H. C. Li, J. Sobel, and P. Vajgel, "Finding a Needle in Haystack: Facebook's Photo Storage", in Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI), pp. 1-8, October 2010, Vancouver, BC, Canada.
[27] E. Nightingale, J. Elson, O. Hofmann, Y. Suzue, J. Fan, and J. Howell, "Flat Datacenter Storage", in Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, October 2012.
[28] The Gluster Home Page, http://www.gluster.org/.