CSAR: Cluster Storage with Adaptive Redundancy

Manoj Pillai, Mario Lauria
Department of Computer and Information Science
The Ohio State University
Columbus, OH 43210, USA
{pillai, lauria}@cis.ohio-state.edu

Proceedings of ICPP 2003, Kaohsiung, Taiwan, ROC, October 2003. Copyright © 2003 IEEE.



Abstract

Striped file systems such as the Parallel Virtual File System (PVFS) deliver high-bandwidth I/O to applications running on clusters. An open problem in the design of striped file systems is how to reduce their vulnerability to disk failures with the minimum performance penalty. In this paper we describe a novel data redundancy scheme designed specifically to address this performance issue. We demonstrate the new scheme within CSAR, a proof-of-concept implementation based on PVFS. By dynamically switching between RAID1 and RAID5 redundancy based on write size, CSAR consistently achieves the best of both worlds: RAID1 performance on small writes, and RAID5 efficiency on large writes. Using the popular parallel I/O benchmark BTIO, our scheme achieves 82% of the write bandwidth of unmodified PVFS. We describe the issues in implementing our new scheme in a popular striped file system such as PVFS on a Linux cluster.

1. Introduction

Input/Output has been identified as the weakest link for parallel applications, especially those running on clusters [13]. A number of cluster file systems have been developed in recent years to provide scalable storage in a cluster. The goal of the Parallel Virtual File System (PVFS) project [9] is to provide high-performance I/O in a cluster, and a platform for further research in this area. An implementation of the MPI-IO library over PVFS is available, and has made PVFS very popular for parallel computing on Linux clusters. In PVFS, as in most other cluster file systems, clients directly access multiple storage servers on data transfer operations, providing scalable performance and capacity. A major limitation of PVFS, however, is that it does not store any redundancy. As a result, a disk crash on any of the many storage servers will result in data loss. Because of this limitation, PVFS is mostly used as high-performance scratch space; important files have to be stored in a low-bandwidth, general-purpose file system.

The goal of the CSAR project is to study issues in redundant data storage in high-bandwidth cluster environments. We have extended PVFS so as to make it tolerant of single disk failures by adding support for redundant data storage. Adding redundancy inevitably reduces the performance seen by clients, because of the overhead of maintaining redundancy. Our primary concern is to achieve reliability with minimal degradation in performance. A number of previous projects have studied the issue of redundancy in disk-array controllers. However, the problem is significantly different in a cluster file system like PVFS, where there is no single controller through which all data passes.

We have implemented three redundancy schemes in CSAR. The first scheme is a striped, block-mirroring scheme which is a variation of the RAID1 and RAID10 schemes [3] used in disk controllers. In this scheme, the total number of bytes stored is always twice the amount stored by PVFS. The second scheme is a RAID5-like scheme [3], which uses parity-based partial redundancy to reduce the number of bytes needed for redundancy. In addition to adapting these well-known schemes to PVFS, we have designed a novel scheme (Hybrid) that uses RAID5-style (parity-based) writes for large write accesses, and mirroring for small writes. The goal of the Hybrid scheme is to provide the best of the other two schemes by adapting dynamically to the presented workload.

In this paper we focus on the performance aspects of RAID-type redundancy in a distributed environment. The main contribution of this work is to show the performance benefit of a design (the Hybrid scheme) that specifically addresses the shortcomings of traditional schemes in a distributed environment.

The rest of this paper is organized as follows. Section 2 describes the advantages and disadvantages of the RAID1 and RAID5 schemes, and provides the motivation for the Hybrid scheme. Section 3 describes related work. Section 4 gives an overview of the PVFS implementation, and the changes we made in order to implement each of our redundancy schemes. Section 5 describes experimental results. Section 6 provides our conclusions and outlines future work.

2. Motivation

Redundancy schemes have been studied extensively in the context of disk-array controllers. The most popular configurations used in disk-array controllers are RAID5 and RAID1. In the RAID5 configuration, storage is organized into stripes. Each stripe consists of one block on each disk. One of the blocks in the stripe stores the parity of all the other blocks. Thus, for a disk array with n disks, the storage overhead in RAID5 is just 1/(n-1) of the data stored. Since the number of bytes of redundancy that have to be written is small, the performance overhead for writes that span an integral number of stripes is also low. The major problem with the RAID5 configuration is its performance for a workload consisting of small writes that modify only a portion of a stripe. For a small write, RAID5 needs to do the following: (1) read the old version of the data being updated and the old parity for it; (2) compute the new parity; (3) write out the new data and the new parity. The latency of a small write is high in RAID5, and disk utilization is poor because of the extra reads.

In the RAID1 configuration, two copies of each block are stored. This configuration has a fixed storage overhead, independent of the number of disks in the array. The performance overhead is also fixed, since one byte of redundancy is written for each byte of data.

A number of variations have been proposed to the basic RAID5 scheme that attempt to solve the performance problem of RAID5 for small writes. The Hybrid scheme that we present in this paper is one such scheme. Our work differs from previous work in two ways:

1. Previous work has focused on retaining the low storage overhead of RAID5 compared to RAID1. Our starting point is a high-performance cluster file system intended primarily for parallel applications. Our emphasis is on performance rather than storage overhead.

2. Many of the previous solutions are intended for disk-array controllers. We are interested in a cluster environment where multiple clients access the same set of storage servers.

In a cluster environment, the RAID5 scheme presents an additional problem. In parallel applications, it is common for multiple clients to write disjoint portions of the same file. With the RAID5 scheme, care must be taken to ensure that two clients writing to disjoint portions of the same stripe do not leave the parity for the stripe in an inconsistent state. Hence, an implementation of RAID5 in a cluster file system would need additional synchronization for clients writing partial stripes. RAID1 does not suffer from this problem.

Our Hybrid scheme uses a combination of RAID5 and RAID1 writes to store data. Full-stripe writes use the RAID5 scheme. RAID1 is used to temporarily store data from partial stripe updates, since this access pattern results in poor performance in RAID5. Since partial stripe writes use the RAID1 scheme, we avoid the synchronization necessary in the RAID5 scheme for this access pattern.
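To make the small-write penalty concrete, the following sketch contrasts the read-modify-write that a RAID5 partial update requires with the simple duplication performed by a mirrored write. This is an illustration only, not code from CSAR or PVFS; the block size and the helper names are assumptions.

```python
# Illustrative sketch of the per-block work in a RAID5 small write
# versus a RAID1 (mirrored) write. Not CSAR code; block size and
# helper names are hypothetical.

BLOCK_SIZE = 64 * 1024  # assumed stripe block size

def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))

def raid5_small_write(read_block, write_block, data_idx, parity_idx, new_data):
    # (1) Read the old data and the old parity.
    old_data = read_block(data_idx)
    old_parity = read_block(parity_idx)
    # (2) Compute the new parity: P_new = P_old xor D_old xor D_new.
    new_parity = xor_blocks(xor_blocks(old_parity, old_data), new_data)
    # (3) Write out the new data and the new parity.
    write_block(data_idx, new_data)
    write_block(parity_idx, new_parity)

def raid1_small_write(write_block, data_idx, mirror_idx, new_data):
    # A mirrored write needs no reads: simply write both copies.
    write_block(data_idx, new_data)
    write_block(mirror_idx, new_data)
```

The two extra reads in the RAID5 path are the reason small writes suffer, and the reason the Hybrid scheme diverts them to mirrored storage.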

3. Related Work

A number of projects have addressed the performance problems of centralized file systems. Zebra [5], xFS [1] and Swarm [6] all use multiple storage servers, similar to PVFS, but store data using RAID5 redundancy. They use log-structured writes to solve the small-write problem of RAID5. As a result, they suffer from the garbage-collection overhead inherent in log-structured systems [11].

Swift/RAID [10] was an early distributed RAID implementation with RAID levels 0, 4 and 5. In their implementation, RAID5 obtained only about 50% of the RAID0 performance on writes; RAID4 was worse. Our implementation performs much better relative to PVFS. We experienced some of the characteristics reported in the Swift paper. For example, doing parity computation one word at a time, instead of one byte at a time, significantly improved the performance of the RAID5 and Hybrid schemes. The Swift/RAID project does not implement any variation similar to our Hybrid scheme.

Petal [8] provides a disk-level distributed storage system that uses a mirroring scheme for redundancy. The RAID-x architecture [7] is a distributed RAID scheme that also uses a mirroring technique. RAID-x achieves good performance by delaying the write of redundancy. Delaying the write improves the latency of the write operation, but it need not improve the throughput of the system. For applications that need high, sustained bandwidth, RAID-x suffers from the limitations of mirroring. Also, a scheme that delays the writing of redundancy does not provide the same level of fault tolerance as the schemes discussed here.

TickerTAIP [2] addresses the scalability problem of centralized RAID controllers by implementing a multi-controller parallel RAID5 system. They focus on synchronizing accesses from multiple controllers, but do not address the small-write problem of RAID5. HP AutoRAID [14], parity logging [12] and data logging [4] address the small-write problem of RAID5, but they provide solutions meant to be used in centralized storage controllers. These solutions cannot be used directly in a cluster storage system. For example, AutoRAID uses RAID1 for hot (write-active) data and RAID5 for cold data. It maintains metadata in the controller's non-volatile memory to keep track of the current location of blocks as they migrate between RAID1 and RAID5. In moving to a cluster, we have to re-assign the functions implemented in the controller to different components in the cluster in such a manner that there is minimal impact on performance.

4. Implementation

4.1. PVFS Overview

PVFS is designed as a client-server system with multiple I/O servers to handle storage of file data. There is also a manager process that maintains metadata for PVFS files and handles operations such as file creation. Each PVFS file is striped across the I/O servers. Applications can access PVFS files either using the PVFS library or by mounting the PVFS file system. When an application on a client opens a PVFS file, the client contacts the manager and obtains a description of the layout of the file on the I/O servers. To access file data, the client sends requests directly to the I/O servers storing the relevant portions of the file. Each I/O server stores its portion of a PVFS file as a file on its local file system. The name of this local file is based on the inode number assigned to the PVFS file on the manager.

[Figure 1. File Striping in PVFS]

Figure 1 shows the striping for a PVFS file using 3 I/O servers. Each I/O server has one data file corresponding to the PVFS file in which it stores its portion of the PVFS file. The blocks of the PVFS file are labeled D0, D1, etc., and the figure shows how these blocks are striped across the I/O servers. PVFS achieves good bandwidth on reads and writes because multiple servers can read/write and transmit portions of a file in parallel.
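As a minimal illustration of this round-robin striping (a sketch based on the description above, not PVFS source code; the block size, server count, and function name are assumptions), a logical file offset can be mapped to an I/O server and an offset within that server's local data file as follows:

```python
# Hypothetical sketch of round-robin striping as described above.
# Block size and server count are assumptions, not PVFS defaults.

def locate_block(file_offset: int, block_size: int, num_servers: int):
    """Map a logical file offset to (server index, offset in local data file)."""
    block = file_offset // block_size          # logical block number (D0, D1, ...)
    server = block % num_servers               # blocks are striped round-robin
    local_block = block // num_servers         # position within that server's file
    local_offset = local_block * block_size + (file_offset % block_size)
    return server, local_offset

# Example: with 3 servers and 64 KB blocks, block D4 lands on server 1.
print(locate_block(4 * 64 * 1024, 64 * 1024, 3))   # -> (1, 65536)
```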

4.2. RAID1 Implementation

In the RAID1 implementation in CSAR, each I/O server maintains two files per client file. One file is used to store the data, just like in PVFS. The other file is used to store redundancy. Figure 2 shows how the data and redundancy blocks are distributed in the RAID1 scheme to prevent data loss in the case of single disk crashes. The data blocks are labeled D0, D1, etc., and the corresponding redundancy blocks are labeled R0, R1, etc. The contents of a redundancy block are identical to the contents of the corresponding data block. As can be seen from the figure, the data file on an I/O server has the same contents as the redundancy file on the succeeding I/O server.

[Figure 2. The RAID1 scheme]

As is the case in PVFS, the RAID1 scheme in CSAR is able to take advantage of all the available I/O servers on a read operation. On a write, all the I/O servers may be used, but the RAID1 scheme writes out twice the number of bytes as PVFS.
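A possible way to express the shifted-mirror layout shown in Figure 2 (a sketch inferred from the figure and the text above, with a hypothetical function name; not CSAR code) is:

```python
# Sketch of the RAID1 placement described above: data block i goes to
# server (i mod n), and its mirror copy R_i goes to the redundancy file
# of the next server, so the data file on server s matches the
# redundancy file on server (s + 1) mod n.

def raid1_placement(block_index: int, num_servers: int):
    data_server = block_index % num_servers
    mirror_server = (data_server + 1) % num_servers
    return data_server, mirror_server

# Example with 3 servers: D0 on server 0, R0 on server 1;
# D2 on server 2, R2 wraps around to server 0.
assert raid1_placement(0, 3) == (0, 1)
assert raid1_placement(2, 3) == (2, 0)
```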

4.3. RAID5 Implementation

Like the RAID1 scheme, our RAID5 scheme also has a redundancy file on each I/O server in addition to the data file. However, in the RAID5 scheme these files contain parity for specific portions of the data files. The layout for our RAID5 scheme is shown in Figure 3. The first block of the redundancy file on I/O server 2 (P[0-1]) stores the parity of the first data block on I/O server 0 and the first data block on I/O server 1 (D0 and D1, respectively).

[Figure 3. The RAID5 scheme]

On a write operation, the client checks the offset and size to see if any stripes are about to be updated partially. There can be at most two partially updated stripes in a given write operation. The client reads the data in the partial stripes and also the corresponding parity region. It then computes the parity for the partial and full stripes, and writes out the new data and new parity.

The RAID5 scheme needs to ensure that the parity information is consistent in the presence of concurrent writes to the same stripe. When an I/O server receives a read request for a parity block, it knows that a partial stripe update is taking place. If there are no outstanding writes for the stripe, the server sets a lock to indicate that a partial stripe update is in progress for that stripe. It then returns the data requested by the read. Subsequent read requests for the same parity block are put on a queue associated with the lock. When the I/O server receives a write request for a parity block, it writes the data to the parity file, and then checks if there are any blocked read requests waiting on the block. If there are no blocked requests, it deletes the lock; otherwise it wakes up the first blocked request on the queue.

The client checks the offset and size of a write to determine the number of partial stripe writes to be performed (there can be at most two in a contiguous write). If there are two partial stripes involved, the client serializes the reads for the parity blocks, waiting for the read for the first stripe to complete before issuing the read for the last stripe. This ordering of reads avoids deadlocks in the locking protocol.

In both the RAID1 and the RAID5 schemes, the layout of the data blocks is identical to the PVFS layout. By designing our redundancy schemes in this manner, we were able to leave much of the original PVFS code intact and implement redundancy by adding new routines. In both schemes, the expected performance of reads is the same as in PVFS because redundancy is not read during normal operation.
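The following sketch illustrates the lock-and-queue handling of parity-block reads and writes described above. It is an illustration only, assuming a single dispatch loop per server; the class and function names are hypothetical and not taken from the CSAR sources.

```python
from collections import deque

# Hypothetical sketch of the per-stripe lock and queue used to serialize
# partial-stripe parity updates. Assumes requests are processed by a
# single dispatch loop on the I/O server.

class ParityLockTable:
    def __init__(self):
        self.locks = {}  # stripe id -> queue of blocked parity-read requests

    def on_parity_read(self, stripe, request, serve_read):
        if stripe not in self.locks:
            # No partial-stripe update in progress: set the lock and serve it.
            self.locks[stripe] = deque()
            serve_read(request)
        else:
            # Another partial-stripe update holds the lock: queue this read.
            self.locks[stripe].append(request)

    def on_parity_write(self, stripe, request, apply_write, serve_read):
        # Writing the new parity completes the in-progress update.
        apply_write(request)
        waiters = self.locks.get(stripe)
        if waiters:
            # Wake the first blocked reader; the lock now covers its update.
            serve_read(waiters.popleft())
        else:
            # No blocked requests: delete the lock for this stripe.
            self.locks.pop(stripe, None)
```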

4.4. The Hybrid Scheme

In the Hybrid scheme, every client write is broken down into three portions: (1) a partial stripe write at the start; (2) a portion that updates an integral number of full stripes; (3) a trailing partial stripe write. For the portion of the write that updates full stripes, we compute and write the parity, just like in the RAID5 case. For the portions involving partial stripe writes, we write the data and redundancy as in the RAID1 case, except that the updated blocks are written to an overflow region on the I/O servers. The blocks cannot be updated in place because the old blocks are needed to reconstruct the data in the stripe in the event of a crash. In addition to the files maintained for the RAID5 scheme, each I/O server in the Hybrid scheme maintains an additional file for storing RAID1 overflow blocks, and a persistent table listing the overflow regions for each PVFS file.
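A sketch of that three-way split follows. It is illustrative only; the stripe geometry and function name are assumptions, not the CSAR client code.

```python
# Hypothetical sketch of splitting a contiguous client write into a leading
# partial stripe, an integral number of full stripes, and a trailing partial
# stripe, as described above. stripe_size = block_size * number_of_servers.

def split_write(offset: int, size: int, stripe_size: int):
    end = offset + size
    first_full = ((offset + stripe_size - 1) // stripe_size) * stripe_size
    last_full = (end // stripe_size) * stripe_size
    if first_full >= last_full:
        # The write never covers a whole stripe: everything goes to RAID1 overflow.
        return (offset, size), (0, 0), (0, 0)
    head = (offset, first_full - offset)          # leading partial stripe -> RAID1 overflow
    full = (first_full, last_full - first_full)   # full stripes -> RAID5 parity writes
    tail = (last_full, end - last_full)           # trailing partial stripe -> RAID1 overflow
    return head, full, tail

# Example with a 192 KB stripe (3 servers x 64 KB blocks): a 400 KB write at
# offset 100 KB yields a 92 KB head, one full 192 KB stripe, and a 116 KB tail.
```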

When a client issues a full-stripe write, any data in the overflow region for that stripe is invalidated. When a file is read, the I/O servers consult their overflow table to determine if any of the data being accessed needs to be read from the RAID1 overflow region, and return the latest copy of the data. The actual storage required by the Hybrid scheme depends on the access pattern. For workloads with large accesses, the storage requirement will be close to that of the RAID5 scheme. If the workload consists mostly of partial stripe writes, a significant portion of the data will be in the overflow regions.

PVFS allows applications to specify striping characteristics such as the stripe block size and the number of I/O servers to be used. The same is true of the redundancy schemes that we have implemented. Even though redundancy is transparent to clients, the design of our redundancy schemes allows the same I/O servers used for storing data to be used for redundancy.
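For illustration, the read path on an I/O server might consult the overflow table roughly as follows. This is a sketch under assumed data structures; the table layout and the names used here are hypothetical.

```python
# Hypothetical sketch of the read path with a per-file overflow table, as
# described above. The table maps a block index to its offset in the RAID1
# overflow file when that block was last written by a partial-stripe update.

def read_block(block_index: int, overflow_table: dict,
               read_data_file, read_overflow_file) -> bytes:
    if block_index in overflow_table:
        # The latest copy lives in the RAID1 overflow region.
        return read_overflow_file(overflow_table[block_index])
    # Otherwise the data file (RAID5 layout) holds the current copy.
    return read_data_file(block_index)
```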

5. Performance Results

5.1. Cluster Description

We conducted experiments on two clusters. For the micro-benchmarks and the ROMIO/perf benchmark we used the Datacluster at the CIS department at the Ohio State University. This cluster has 8 nodes, each with two Pentium III processors and 1 GB of RAM. The nodes are interconnected using a 1.3 Gb/s Myrinet network and using Fast Ethernet. In our experiments, the traffic between the clients and the PVFS I/O servers used Myrinet. Each node in the cluster has two 60 GB disks connected using a 3Ware controller in RAID0 configuration.

For the FLASH I/O and BTIO benchmarks we used the Itanium 2 cluster at the Ohio Supercomputer Center (OSC). The cluster has 128 compute nodes for parallel jobs, each with 4 GB of RAM, two 900 MHz Intel Itanium 2 processors, an 80 GB ultra-wide SCSI hard drive, a Myrinet 2000 interface, a Gigabit Ethernet interface and a 100Base-T Ethernet interface. The nodes run the Linux operating system. We ran our experiments with each PVFS I/O server writing to /tmp, which is an ext2 file system. We used the Myrinet interface for our experiments.

5.2. Decoupling Network Receive from File Write

In the course of our experiments, we discovered a performance problem in PVFS. In PVFS, the I/O servers use a non-blocking receive to get available data from a socket when a file write is in progress. The data received is then written immediately to the server's local file. If the file is not in the cache when the write is being performed, this mode of writing the file can cause severe degradation in performance. The degradation results from blocks of the file being written partially, which causes the rest of the block to be read into the cache before the write is applied.

To fix the problem, we implemented a write buffering scheme. In this scheme, each write connection on the server is given a small write buffer whose size is a multiple of the local file system block size. Data received from the network is accumulated in this write buffer until the buffer is full or the write is complete. This allows data to be written to the file in full blocks if the client access is large enough. All the experiments described below were conducted with the write buffering scheme.
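A minimal sketch of such per-connection write buffering follows. It is our own illustration of the idea described above, not the CSAR server code; the buffer size and names are assumptions.

```python
import os

# Hypothetical sketch of accumulating received data into a block-aligned
# buffer so that the local file sees full-block writes, as described above.

class WriteBuffer:
    def __init__(self, fd: int, file_offset: int, fs_block_size: int = 4096,
                 blocks_per_buffer: int = 16):
        self.fd = fd
        self.offset = file_offset                      # where the next flush lands
        self.capacity = fs_block_size * blocks_per_buffer
        self.buf = bytearray()

    def receive(self, data: bytes):
        """Append data received from the socket; flush when the buffer fills."""
        self.buf.extend(data)
        while len(self.buf) >= self.capacity:
            self._flush(self.capacity)

    def finish(self):
        """Flush whatever remains when the client write completes."""
        if self.buf:
            self._flush(len(self.buf))

    def _flush(self, nbytes: int):
        os.pwrite(self.fd, bytes(self.buf[:nbytes]), self.offset)
        self.offset += nbytes
        del self.buf[:nbytes]
```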

5.3. Performance for Full Stripe Writes

We measured the performance of the redundancy schemes with a single client writing large chunks to a number of I/O servers. The write sizes were chosen to be an integral multiple of the stripe size. This workload represents the best case for a RAID5 scheme. For this workload, the Hybrid scheme has the same behavior as the RAID5 scheme. Figure 4 shows the median bandwidth seen by the single client. For this workload, RAID1 has the worst performance of all the schemes, with no significant increase in bandwidth beyond 4 I/O servers. This is because RAID1 writes out a larger number of bytes, and the client network link soon becomes a bottleneck. The RAID5-npc graph in Figure 4 shows the performance of RAID5 when we commented out the parity computation code. As can be seen, the overhead of parity computation on our system is negligible.

[Figure 4. Large Write Performance]

5.4. Performance for Partial Stripe Writes

To measure the performance for small writes, we used a benchmark where a single client creates a large file and then writes to it in one-block chunks. For this workload, RAID5 has to read the old data and parity for each block before it can compute the new parity. Both the RAID1 and the Hybrid schemes simply write out two copies of the block. Figure 5 shows the median bandwidth seen by the single client. The bandwidths observed for the RAID1 and the Hybrid schemes are identical, while the RAID5 bandwidth is lower. In this test, the old data and parity needed by RAID5 are found in the memory cache of the servers. As a result, the performance of RAID5 is much better than it would be if the reads had to go to disk.

[Figure 5. Small Write Performance]

5.5. ROMIO/perf Benchmark Performance

In this section, we compare the performance of the redundancy schemes using the perf benchmark included in the ROMIO distribution. perf is an MPI program in which clients write concurrently to a single file. Each client writes a large buffer to an offset in the file which is equal to the rank of the client times the size of the buffer. The write size is 4 MB by default. In this experiment we used 4 clients. The benchmark reports the read and write bandwidths, before and after the file is flushed to disk. Here we report only the results after the flush.

Figure 6 shows the read performance for the different schemes. All the schemes had similar performance for reads. For RAID1 and RAID5, the behavior on a read is exactly the same as in PVFS. In the Hybrid scheme, there is additional overhead due to the lookup of the overflow table. For the perf benchmark, the results show that this overhead is minimal. The write performance of the benchmark is shown in Figure 7; RAID5 and Hybrid perform better than RAID1 in this case because the benchmark consists of large writes.

[Figure 6. ROMIO/perf Read Performance]

[Figure 7. ROMIO/perf Write Performance]

5.6. FLASH I/O Benchmark

The FLASH I/O benchmark contains the I/O portion of the ASCI FLASH benchmark. It recreates the primary data structures in FLASH and writes a checkpoint file, a plotfile with centered data and a plotfile with corner data. The benchmark uses the HDF5 parallel library to write out the data in parallel. At the PVFS level, we see mostly small and medium size write requests ranging from a few kilobytes to a few hundred kilobytes. Figure 8 shows the total output time for the three files for the different redundancy schemes using 6 servers, as printed by the application. The clients generate fairly small requests to the I/O servers in this benchmark, and as a result the performance of this benchmark is not very good in PVFS. For the RAID1 and the Hybrid schemes, performance is very close to RAID0 performance; RAID5 performs slightly worse than the other two redundancy schemes.

[Figure 8. FLASH I/O Performance]

5.7. BTIO Benchmark

The BTIO benchmark is derived from the BT benchmark of the NAS parallel benchmark suite, developed at the NASA Ames Research Center. The BTIO benchmark performs periodic solution checkpointing in parallel for the BT benchmark. In our experiments we used BTIO-full-mpiio, the implementation of the benchmark that takes advantage of the collective I/O operations in the MPI-IO standard. We report results for the Class B and Class C versions of the benchmark. The Class B version of BTIO outputs a total of about 1600 MB to a single file; Class C outputs about 6600 MB.

The BTIO benchmark accesses PVFS through the ROMIO implementation of MPI-IO. ROMIO optimizes small, noncontiguous accesses by merging them into large requests when possible. As a result, for the BTIO benchmark, the PVFS layer sees large writes, most of which are about 4 MB in size. The starting offsets of the writes are not usually aligned with the start of a stripe, and each write from the benchmark usually results in one or two partial stripe writes.

The benchmark outputs the write bandwidth for each run. We recorded the write bandwidths for two cases: (1) when the file is created initially; (2) when the file is being overwritten. In the latter case we make sure the file has been flushed out of the system cache by writing a series of sufficiently large dummy files. Figure 9 shows the write performance for the Class B benchmark for the initial write; Figure 10 shows the write performance for Class B when the file already exists and is being overwritten. The BTIO benchmark requires the number of processes to be a perfect square.

The performance of the Hybrid scheme and RAID5 in Figure 9 is comparable for 4 and 9 processes, both being better than RAID1. For 16 processes, the performance of RAID5 drops slightly, and then for 25 processes it drops dramatically. By comparing the reported bandwidth to a version of RAID5 with no locking (meaning that the parity block could be inconsistent in the presence of concurrent writes), we were able to determine that most of the drop in RAID5 performance is due to the synchronization overhead of RAID5.

When the output file exists and is not cached in memory at the I/O servers, the write bandwidth for RAID5 drops much below the bandwidths for the other schemes, as seen in Figure 10. In this case, partial stripe writes result in the old data and parity being read from disk, causing the write bandwidth to drop. There is a slight drop in the write bandwidth for the other schemes. This drop results because the alignment of the writes in the benchmark causes some partial block updates at the servers. To verify this, we artificially padded all partial block writes at the I/O servers so that only full blocks were written. For RAID0, RAID1 and Hybrid, this change resulted in about the same bandwidth for the initial write and the overwrite cases. For RAID5, padding the partial block writes did not have any effect on the bandwidth. The reason is that the pre-read of the data and parity for partial stripe writes brings these portions of the file into the cache. As a result, when the partial block writes arrive, the affected portions are already in memory.

[Figure 9. BTIO/Class B, initial write]

[Figure 10. BTIO/Class B, overwrite]

The performance of the schemes for the BTIO Class C benchmark is shown in Figure 11 for the initial write case and in Figure 12 for the overwrite case. The effect of the locking overhead in RAID5 is less significant for this benchmark. The performance of RAID1 is much lower than that of the other two redundancy schemes. The reason is that the caches on the I/O servers start to overflow in the RAID1 scheme because of the large amount of data written; the Class C benchmark writes about 6600 MB of data, and the amount of data written to the I/O servers is twice that in the RAID1 scheme. For the case when the file is overwritten, the big drop in bandwidth for the RAID5 scheme is seen in this benchmark also. For the overwrite case, the Hybrid scheme performs much better than the other two redundancy schemes.

[Figure 11. BTIO/Class C, initial write]

[Figure 12. BTIO/Class C, overwrite]

6. Conclusions and Future Work

There are two potential sources of concern in a distributed RAID5 implementation: (1) the performance of small write accesses; (2) the synchronization overhead for concurrent writes. Our experiments using Myrinet showed only a small penalty for partial stripe writes compared to RAID1 when the file accessed is cached in memory. However, when the file is not cached, RAID5 incurs a big penalty. One benchmark showed a significant slowdown because of the synchronization overhead for concurrent writes. In our experiments, the Hybrid scheme solves both of the problems seen with RAID5. For an important parallel benchmark, the Hybrid scheme outperformed both RAID5 and RAID1 by a large margin.

The storage overhead of the Hybrid scheme depends on the access pattern of the application. In this paper, we have focused on the performance characterization, and not on the storage overhead of the schemes. We are implementing changes to the Hybrid scheme that will allow us to limit the size of the overflow regions, and study the impact on performance.

We have implemented the redundancy schemes as changes to the PVFS library. Our implementation allows us to run programs that use the PVFS library and also programs written using the MPI-IO interface. PVFS also provides a kernel module that allows the PVFS file system to be mounted like a normal Unix file system. We are in the process of modifying the kernel module to incorporate our redundancy schemes, and of characterizing its performance using unmodified applications.

One of the goals of the PVFS project was to provide a platform for further research in the area of cluster storage. We found it useful to have a real file system where we could test our ideas. PVFS has become very popular for high-performance computing on Linux clusters, and implementing our schemes in PVFS gave us access to representative applications.

7. Acknowledgments

This work was partially supported by the Ohio Supercomputer Center grant #PAS0036-1. We are grateful to Pete Wyckoff and Troy Baer of OSC for their help in setting up the experiments with the OSC clusters. We would like to thank Rob Ross of Argonne National Laboratory for clarifying many intricate details of the PVFS protocol and for making the PVFS source available to the research community.

References

[1] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Young. Serverless network file systems. ACM Transactions on Computer Systems, Feb. 1996.
[2] P. Cao, S. Lim, S. Venkataraman, and J. Wilkes. The TickerTAIP parallel RAID architecture. ACM Transactions on Computer Systems, Aug. 1994.
[3] P. Chen, E. Lee, G. Gibson, R. Katz, and D. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2):145-185, June 1994.
[4] E. Gabber and H. F. Korth. Data logging: A method for efficient data updates in constantly active RAIDs. In Proceedings of the Fourteenth International Conference on Data Engineering (ICDE), Feb. 1998.
[5] J. Hartman and J. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, Aug. 1995.
[6] J. H. Hartman, I. Murdock, and T. Spalink. The Swarm scalable storage system. In Proceedings of the 19th International Conference on Distributed Computing Systems, May 1999.
[7] K. Hwang, H. Jin, and R. Ho. RAID-x: A new distributed disk array for I/O-centric cluster computing. In Proceedings of the Ninth IEEE International Symposium on High Performance Distributed Computing, pages 279-287, Pittsburgh, PA, 2000. IEEE Computer Society Press.
[8] E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the Seventh International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84-92, Cambridge, MA, 1996.
[9] W. B. Ligon and R. B. Ross. An overview of the Parallel Virtual File System. In Proceedings of the 1999 Extreme Linux Workshop, June 1999.
[10] D. D. E. Long, B. Montague, and L.-F. Cabrera. Swift/RAID: A distributed RAID system. Computing Systems, 7(3), Summer 1994.
[11] M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1), Feb. 1992.
[12] D. Stodolsky, M. Holland, W. Courtright, and G. Gibson. Parity logging disk arrays. ACM Transactions on Computer Systems, 12(3), Aug. 1994.
[13] R. Thakur, E. Lusk, and W. Gropp. I/O in parallel applications: The weakest link. The International Journal of High Performance Computing Applications, 12(4):389-395, Winter 1998. In a special issue on I/O in parallel applications.
[14] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, pages 96-108, Copper Mountain, CO, 1995. ACM Press.
