FBD: a Fault-tolerant Buffering Disk System for Improving Write Performance of RAID5 Systems

Haruo Yokota and Masanori Goto
Department of Computer Science, Tokyo Institute of Technology
e-mail: [email protected]

Abstract

The parity calculation technique of the RAID5 provides high reliability, efficient disk space usage, and good read performance for parallel-disk-array configurations. However, it requires four disk accesses for each write request. The write performance of a RAID5 is therefore poor compared with its read performance. In this paper, we propose a buffering system to improve write performance while maintaining the reliability of the total system. The buffering system uses two to four disks clustered into two groups, primary and backup. Write performance is improved by sequential accesses of the primary disks without interruption and by reduction of irrelevant disk accesses for the RAID5 by packing. The backup disks are used to tolerate a disk failure and to accept read requests for data stored in the buffering system so as not to disturb sequential accesses in the primary disks. We developed an experimental system using an off-the-shelf personal computer and disks, and a commercial RAID5 system. The experiments indicate that the buffering system considerably improves both system throughput and average response time.
1. Introduction

The progress of computer architecture and semiconductor technology has radically improved the performance of processors and semiconductor memories. Data transfer rates from and to hard disk drives have also increased as a consequence of enhancements of recording density and spindle rotation speed of disks. Unlike semiconductor products, however, disk drives have intrinsic limitations on performance improvements, since they need physical movements during data accesses. The gap between the data transfer rate of a disk and the data manipulation performance of a processor is continually widening. Furthermore, the disk delay caused by the track seek time and rotation synchronization (or latency) is a serious bottleneck in applications using disks.

To balance the disk access performance with the data manipulation in a processor, many types of parallel disk arrays have been proposed. Data striping among parallel disks reduces the disk delay and provides high data transfer rates. However, straightforward parallel disk systems have problems with reliability, which is reduced because of physical movements in a disk system. Redundant disk arrays, known as RAIDs (Redundant Array of Inexpensive Disks), have been investigated to enhance performance and to achieve high reliability for systems with multiple disk drives [10, 3, 1]. A RAID system stores data and redundant information over a number of disk drives, and recovers the data using the redundant information following failure of a disk drive. RAID implementation may be at one of six levels corresponding to configurations of the redundant information [1]. The parity calculation technique used in level 5 RAID enables cost-effective construction with high reliability and good read performance. RAID5 stores parity codes for each data stripe into one of the disks in an array. For example, when D_{i,j} denotes a data block of the ith stripe in disk j (1 <= j <= n), the parity code for stripe i can be calculated as follows:

$$P_i = \bigoplus_{j=1,\; j \neq \mathrm{mod}(i-1,n)+1}^{n} D_{i,j}$$
When a data block D_{i,k} is damaged, it can be salvaged by the following calculation:

$$D_{i,k} = P_i \oplus \bigoplus_{j=1,\; j \neq \mathrm{mod}(i-1,n)+1,\; j \neq k}^{n} D_{i,j}$$
This means that the amount of redundant information is just 1/n of the total amount, but all data in a crashed disk can be rebuilt using that amount of redundant information. From the performance point of view, the throughput for RAID5 read operations is better by a factor of n than that of a single disk, since each disk can accept read requests in parallel, and read operations do not require the parity calculation.

On the other hand, a RAID5 system still has a problem with its performance for small write operations. To keep the parity information up-to-date in a RAID5, the following calculation is required when updating D_{i,j}:

$$\mathrm{New}\ P_i = \mathrm{Old}\ P_i \oplus \mathrm{Old}\ D_{i,j} \oplus \mathrm{New}\ D_{i,j}$$
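For concreteness, the following C sketch shows the stripe-wide parity calculation and the incremental parity update just described; the block size and the in-memory layout are our own assumptions, not details of any particular RAID controller.

```c
#include <stdio.h>
#include <string.h>

#define BLOCK_SIZE 4096                /* assumed block size in bytes */

/* P_i = XOR of all data blocks of stripe i (the parity block itself excluded). */
static void compute_parity(unsigned char data[][BLOCK_SIZE], int ndata,
                           unsigned char parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int j = 0; j < ndata; j++)
        for (int b = 0; b < BLOCK_SIZE; b++)
            parity[b] ^= data[j][b];
}

/* New P_i = Old P_i XOR Old D_{i,j} XOR New D_{i,j}. */
static void update_parity(unsigned char parity[BLOCK_SIZE],
                          const unsigned char old_data[BLOCK_SIZE],
                          const unsigned char new_data[BLOCK_SIZE])
{
    for (int b = 0; b < BLOCK_SIZE; b++)
        parity[b] ^= old_data[b] ^ new_data[b];
}

int main(void)
{
    unsigned char data[4][BLOCK_SIZE], parity[BLOCK_SIZE], check[BLOCK_SIZE];
    unsigned char new_block[BLOCK_SIZE];

    for (int j = 0; j < 4; j++)
        memset(data[j], j + 1, BLOCK_SIZE);    /* dummy stripe contents */
    compute_parity(data, 4, parity);

    memset(new_block, 0x5a, BLOCK_SIZE);
    update_parity(parity, data[2], new_block); /* incremental update */
    memcpy(data[2], new_block, BLOCK_SIZE);
    compute_parity(data, 4, check);            /* full recomputation */

    printf("incremental update matches recomputation: %s\n",
           memcmp(parity, check, BLOCK_SIZE) == 0 ? "yes" : "no");
    return 0;
}
```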
Thus, a write request requires four disk accesses: read the old data page, read the old parity page, write the new data page, and write the new parity page. The throughput for small writes therefore shrinks by a factor of four, i.e., to n/4 times that of a single disk. Within these operations, the two read operations can be omitted for data blocks larger than the stripe size, but they are required for small data elements being stored in separate locations. In summary, the performance of small write operations in a RAID5 is poor compared with its read performance [1].

To reduce this performance gap, semiconductor memories are commonly placed between a RAID5 and its host as a disk cache [7]. As high hit ratios are necessary to make a caching mechanism effective, it is better for the cache to be as large as possible. However, large semiconductor memories make the system significantly more expensive. Moreover, non-volatility is essential for the cache considering the applications of RAID systems, and this raises the cost of the system further. The DCD (Disk Caching Disk) architecture was proposed by Hu and Yang to reduce costs by using disks as disk caches instead of semiconductor memory [4]. They adopted the LFS (Log-structured File System) [11] method that converts small random writes into a large sequential write to improve write performance. DCD enables a cost-effective system configuration without modifying the RAID and its host, but it still has the following problems:
- The LFS approach needs garbage collection phases to reuse disk space. Garbage collection prohibits use of DCD systems for continuous service, such as twenty-four-hour OLTP applications.
- Read operations during the log-style write operations move the position of the disk heads from the tail of the log. They increase both track seek time and rotation latency.
- DCD requires a large amount of non-volatile memory. Information about page locations in DCD must be stored and cannot be lost to manage the LFS. If the location information is lost, the latest data cannot be reconstructed.
- Duplication of the DCD to tolerate a failure of the caching disk is impractical, because tight synchronization between them would be required to keep the states of these duplicated caches identical.

In this paper, we propose a buffering system using a small number of disks to improve the write performance of a RAID5. We have named the buffering system Fault-tolerant Buffering Disks, or FBD for short. It uses 2-4 disks clustered into two groups: primary and backup. Write performance is improved by sequential accesses for the primary disks without interruption and by packing to reduce irrelevant disk accesses for the RAID. Buffering, unlike LFS-style caching, does not require garbage collection. Thus, FBD can be used by continuous applications without interruptions. The backup disks are used to tolerate a disk failure in the FBD and to accept read requests for data stored in the buffering system, so as not to disturb sequential accesses in the primary disks. The FBD also uses semiconductor memories to balance data transfer speeds and remove redundant pages, but the amount of non-volatile memory is small, and most of the information for controlling FBD can be placed in volatile memories.

There are several other approaches to improving the write performance of RAID5. The parity logging [12] and virtual striping [9] methods were proposed to improve write performance by updating bulk parity. AutoRAID [13] and Hot Mirroring [8] divide disk space into two areas: RAID5 and mirroring (RAID1). The RAID1 area is used to store "hot" data being accessed frequently, while data that has not been accessed for a relatively long time is moved to the RAID5 area to save disk space. Thus, the RAID1 area is used as a caching area. These methods change the RAID system itself and the control mechanisms for them are rather complicated, while FBD allows the use of ordinary simple RAID5 systems and the control mechanism of FBD is extremely simple.

We implemented an experimental system using an off-the-shelf personal computer and disks, and a commercial RAID5 system. The evaluation results using the experimental system indicate that the buffering system considerably improves both system throughput and average response time.

The remainder of this paper is organized as follows. In Section 2, we describe the architecture of the FBD and its behavior under normal conditions and after a disk failure. Then, Section 3 presents a configuration of our experimental system and evaluation results. The reliability of the total system and the cost effectiveness of the FBD are discussed in Section 4. We summarize our work and describe possible future work in the final section.
2. The FBD Architecture

2.1. System Organization

A basic system organization of the FBD is shown in Figure 1. Three disks and two small non-volatile memories are placed between a RAID5 system and its host. We will describe other organizations using two or four disks later.

The three disks, Disk A, Disk B and Disk C, are clustered into two groups. Disk A and Disk B are included in a primary group to implement a double buffering mechanism: Disk A is assigned to read only and Disk B to write only for a while, and then they change places. This allows simultaneous data transfer from the host and to the RAID. Since the logical page addresses for accessing the primary disks are independent of those for accessing the RAID, those for the primary disks can be in sequential order to reduce track seek operations. Moreover, the rotational latency can also be hidden by writing empty data, even when no write requests are coming from the host. This also makes it easy to synchronize phase changes of the double buffering. We call this a continuous read/write operation. The minimum track seeks and no rotational latency of the continuous read/write operations allow us to obtain the maximum throughput of the disk.

Disk C performs two roles: it keeps data as a backup to tolerate a disk failure in the primary group, and it accepts read requests for data stored in the primary disks so as not to disturb their sequential accesses. All write requests for the RAID will come to disks in the FBD first, but read requests go directly to the RAID and are spread over its whole storage area. The RAID is capable of handling these read requests efficiently. The probability of the target page of a read request being in the buffer will be relatively low. Disk C can afford to accept these partial read requests, because, unlike the primary disks, the write accesses to Disk C do not contain empty pages for synchronizing rotation. The write requests to Disk C are also in order of logical page addresses, to reduce seek time.

To make the storage area management simple, all three disks are divided into two zones, A1/B1/C1 and A2/B2/C2 for Disks A, B and C, and the phases for accessing the areas are synchronized between the primary and backup. Phase switch strategies with and without a disk failure will be described in the following subsections. Instead of zoning these disks, one more disk could be used for backup to configure a four-disk buffering system. Alternatively, if we allow some sacrifice of performance, the primary can be constructed from one disk by disabling the double buffering, to configure a two-disk buffering system. These variations are tradeoffs between cost and performance.

Two non-volatile memories, NVRAM-W and NVRAM-P, are used to synchronize page transfers between the host and the buffering disks, and between the buffering disks and the RAID, respectively. Therefore, their sizes are independent of the RAID storage size, unlike a disk cache. This fact keeps the cost of the system low. Assuming only one disk failure, these non-volatile memories need not be duplicated, because they can accept both read and write requests without delay. If we need to consider failures of memories, we can duplicate them without any difficulty.

Figure 1. A basic system architecture for the FBD (host, controller with control table, NVRAM-W, primary Disks A and B, backup Disk C with packing device and NVRAM-P, and the RAID5)

Figure 2. A phase-switch strategy without a failure (Disks A, B and C alternate between zones A1/B1/C1 and A2/B2/C2 over phases 1-4: continuous write from the host, continuous read to the RAID5, and read/write to/from the host)
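To make the roles above concrete, the following C declarations are one possible way to model the FBD components; this is our own sketch based on the description here (names, field choices and the phase schedule in main are assumptions), not the data structures of the actual FBD software.

```c
#include <stdio.h>
#include <stdint.h>

/* The three buffering disks: A and B form the primary group, C is the backup. */
enum fbd_disk { DISK_A, DISK_B, DISK_C };

/* Each disk is split into two zones (A1/A2, B1/B2, C1/C2). */
enum fbd_zone { ZONE_1, ZONE_2 };

/* Role of a primary disk in the current phase: one primary accepts continuous
 * writes from the host while the other is continuously read and drained to the
 * RAID5, then they swap. */
enum fbd_phase { PHASE_WRITE, PHASE_READ };

/* Where the latest copy of a RAID logical page currently lives. */
enum page_location { LOC_RAID, LOC_PRIMARY, LOC_BACKUP, LOC_NVRAM_W, LOC_NVRAM_P };

struct buffer_disk {
    enum fbd_disk id;
    enum fbd_phase phase;        /* meaningful for Disk A and Disk B */
    enum fbd_zone active_zone;   /* zone being written or drained right now */
    uint64_t next_slot;          /* next sequential page slot in the active zone */
};

int main(void)
{
    /* Walk the repeating four-phase cycle of Figure 2: which primary accepts
     * host writes, and into which zone, while the other drains to the RAID. */
    for (int phase = 0; phase < 4; phase++) {
        enum fbd_disk writer = (phase % 2 == 0) ? DISK_A : DISK_B;
        enum fbd_zone zone   = (phase < 2) ? ZONE_1 : ZONE_2;
        printf("phase %d: Disk %c writes zone %d; the other primary drains to the RAID\n",
               phase + 1, writer == DISK_A ? 'A' : 'B', zone == ZONE_1 ? 1 : 2);
    }
    return 0;
}
```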
2.2. Behavior of the FBD

2.2.1. Under Normal Conditions

Write requests issued by the host are stored directly into NVRAM-W, and then transferred to both one of the primary disks and the backup disk. Figure 2 shows a phase-switch strategy for the three-disk configuration without a failure. A repeating pattern is constructed from four phases: write into A1, read from A1, write into A2, and read from A2 for Disk A. When Disk A is in a write phase, Disk B is in a read phase and vice versa. C1 and C2 in Disk C are also toggled to synchronize loosely with the phases of Disks A and B.

While one of the primary disks is accepting writes as above, pages read from the other disk are transferred to NVRAM-P, where these pages are sorted to make accesses for the RAID sequential. Furthermore, pages that were overwritten during a phase are removed in the NVRAM-P. These packing operations try to maximize the write throughput of the RAID, and reduce the amount of data transferred from the host to the RAID. Thus, they enable the FBD to improve the outward write performance of the RAID.

A read request issued by the host is dispatched to the RAID, the backup disk, or either NVRAM, depending on where the latest version of the target page of the request is located. A control table keeping the location information can be placed in volatile memory, because the current state can be reconstructed without data loss from the data in the RAID, primary disks, and NVRAMs, even if the location information is lost. To implement this, the logical page address in the RAID for each page is also stored in the primary disks. In contrast, it is impossible to place location information for an LFS in volatile memory, because garbage in the LFS cannot be identified without that information. Neither is it acceptable to store the location information on the disk, because the track seek times required to manage the information greatly reduce the performance of the LFS. Therefore, the DCD requires a large amount of non-volatile memory to manage storage space, while the FBD only requires a small amount of non-volatile memory to balance data transfer speeds between the host and buffering disks, and for packing to accelerate system throughput.
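As an illustration of the packing step (not the actual FBD code), the sketch below sorts the pages staged in NVRAM-P by their RAID logical addresses and keeps only the newest copy of each page; the struct layout and the sequence-number scheme are our own assumptions.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

struct staged_page {
    uint64_t raid_addr;   /* logical page address in the RAID */
    uint64_t seq;         /* arrival order within the phase; larger = newer */
    /* payload omitted in this sketch */
};

static int cmp_addr_then_seq(const void *a, const void *b)
{
    const struct staged_page *p = a, *q = b;
    if (p->raid_addr != q->raid_addr)
        return p->raid_addr < q->raid_addr ? -1 : 1;
    return p->seq < q->seq ? -1 : 1;
}

/* Sort by RAID address (so writes to the RAID become sequential) and drop
 * pages that were overwritten later in the same phase. Returns the packed
 * count; the packed pages occupy the front of the array. */
static size_t pack_pages(struct staged_page *pages, size_t n)
{
    if (n == 0)
        return 0;
    qsort(pages, n, sizeof pages[0], cmp_addr_then_seq);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        /* keep only the last (newest) entry for each address */
        if (i + 1 == n || pages[i + 1].raid_addr != pages[i].raid_addr)
            pages[out++] = pages[i];
    }
    return out;
}

int main(void)
{
    struct staged_page in[] = {
        { 42, 0 }, { 7, 1 }, { 42, 2 }, { 13, 3 }, { 7, 4 },
    };
    size_t n = pack_pages(in, sizeof in / sizeof in[0]);
    for (size_t i = 0; i < n; i++)
        printf("page %llu (seq %llu)\n",
               (unsigned long long)in[i].raid_addr,
               (unsigned long long)in[i].seq);
    return 0;
}
```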
Figure 3. A disk failure during continuous write operations

Figure 4. A disk failure during continuous read operations
2.2.2. After a Disk Failure

There are three places where a disk failure can occur: in the RAID, the primary, or the backup. Any disk failure in the RAID should be treated by the RAID itself, i.e., we need not consider such failures. Treatments of a disk failure in the primary are clustered into two types, depending on whether the failure occurred during read or write operations.

Figure 3 illustrates the treatment of a failure during write operations in Disk A. In this case, data written into the damaged primary disk cannot be accessed after the failure. Thus, the data cannot be transferred to the RAID in the next phase. However, the FBD stores the same data into the backup disk. Initially, both read operations from B1 and write operations to C1 are continued until the end of that phase, to transfer data from the previous phase to the RAID and to accept write requests from the host during the current phase. Then, double buffering is restarted using Disk B and Disk C. Note that after the failure, the disks doing the double buffering must also accept read requests from the host. Therefore, the FBD's performance is degraded until it is reconstructed with a new disk.

The phase switch for a disk failure during continuous read operations in Disk B is depicted in Figure 4. The failure during read inhibits the primary disk (Disk B) from completing data transfer to the RAID. Therefore, the data written into the backup disk in the previous phase (the data in C2) should be resent to the RAID to complete the data transfer. After completion of the phase, double buffering by the remaining two disks should be started, as for a failure during continuous write operations. In this case, however, access requests from the host cannot be accepted for one phase period. We can shorten the period by more precise control.

Even if the backup disk is damaged, the double buffering can be continued using the primary disks. However, they must accept read requests during write operations from the host (Figure 5). This means that performance is also degraded until a new disk is attached as the backup disk.
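The case analysis above can be condensed into a small decision routine. The sketch below is only a summary of the three recovery cases in our own words (the names are assumptions); it is not the actual controller logic and moves no data.

```c
#include <stdio.h>

enum fbd_disk { DISK_A, DISK_B, DISK_C };      /* C is the backup */
enum fbd_phase { PHASE_WRITE, PHASE_READ };    /* phase of the failed disk */

/* Decide the recovery action when one buffering disk fails. */
static const char *recovery_action(enum fbd_disk failed, enum fbd_phase phase)
{
    if (failed == DISK_C)
        return "continue double buffering on the primaries; they also serve "
               "reads until a new backup disk is attached";
    if (phase == PHASE_WRITE)
        return "finish the current phase with the surviving primary and the "
               "backup, then restart double buffering on the two remaining disks";
    /* failure during the read (transfer-to-RAID) phase */
    return "resend the previous phase's data from the backup zone to the RAID, "
           "then restart double buffering on the two remaining disks";
}

int main(void)
{
    printf("A fails while writing: %s\n", recovery_action(DISK_A, PHASE_WRITE));
    printf("B fails while reading: %s\n", recovery_action(DISK_B, PHASE_READ));
    printf("C (backup) fails:      %s\n", recovery_action(DISK_C, PHASE_WRITE));
    return 0;
}
```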
Figure 5. A failure in the backup disk
3. An Experimental System

We developed an experimental system using an off-the-shelf personal computer (a PC/AT compatible with a PentiumPro [200 MHz] processor and 128 MB DRAM) with three high performance disks (Seagate Cheetah 4LPs, ST34501W [4.6 GB]), and a commercial RAID5 system (Infortrend Technology's Arena, using six IBM DHEA-36481s [6.5 GB]), connected by an Ultra Wide SCSI interface (40 MB/s), to evaluate the performance of the FBD. Figure 6 illustrates the block diagram of the experimental system. Nowadays, we can construct such an FBD system very cheaply compared with RAID systems. Moreover, any of the parts can easily be replaced with up-to-date enhanced versions.

We adopted FreeBSD 2.2.8R [2] and C to implement the FBD control software. The page addresses of the RAID are managed with a control table in the main memory of the PC, using a hash function and linked lists for hash collisions. We used no non-volatile memory in the experimental system, because the main purpose of the implementation is performance evaluation. We allocated 6.4 MB of main memory to simulate each NVRAM. Of course, it is very easy to change the software to use non-volatile memory. In these experiments, we only used the outside zone in the zone bit recording of Cheetah to derive the maximum transfer rate of the buffering disks. Therefore, we only allocated 200 MB for each stage of double buffering.

Figure 6. Block diagram of the experimental system (PC/AT compatible with a PentiumPro 200 MHz CPU and 128 MB DRAM on a PCI bus [max 133 MB/s]; three Seagate Cheetah 4LP [ST34501W, 4.6 GB] buffering disks and the Infortrend Arena RAID5 with six IBM DHEA-36481 [6.5 GB] disks, each attached via an Ultra Wide SCSI interface [max 40 MB/s])

3.1. Pre-evaluation

Before evaluating the performance of the experimental FBD system, we measured the performance of the fundamental components: Cheetah 4LP (a disk for buffering) and Arena (the RAID5). Read and write throughput of Cheetah and Arena with varying access block sizes are plotted in Figure 7. The graph indicates that the throughputs of both Cheetah and Arena vary widely, and monotonically increase with block size. This supports our assumptions that large access sizes are suited for the RAID and that the sequential access throughput of a high performance disk is better than that of the RAID. The maximum throughput of sequential accesses to Cheetah is about 14 MB/s, which is the maximum transfer rate of Cheetah. Incidentally, the difference between read and write of Arena is not as large as we mentioned in Section 1. The reason for this is that Arena contains a 16 MB semiconductor disk cache. As we will describe later, this influences the FBD's performance when read operations are included.

Figure 7. Throughput of Cheetah and Arena (throughput [MB/s] versus block size [KB] for Cheetah sequential and random reads and writes, and Arena random reads and writes)
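The control table mentioned above (a hash function with linked lists for collisions) can be sketched as follows; the bucket count, hash function, and stored fields are our assumptions rather than details taken from the FBD software.

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define CT_BUCKETS 4096                /* assumed bucket count */

struct ct_entry {
    uint64_t raid_page;                /* logical page address in the RAID */
    uint64_t buf_offset;               /* where the latest copy sits in the buffer */
    struct ct_entry *next;             /* collision chain */
};

static struct ct_entry *table[CT_BUCKETS];

static size_t ct_hash(uint64_t page)
{
    return (size_t)(page * 2654435761u) % CT_BUCKETS;
}

/* Insert or update the buffered location of a RAID page. */
static void ct_put(uint64_t page, uint64_t offset)
{
    size_t h = ct_hash(page);
    for (struct ct_entry *e = table[h]; e; e = e->next)
        if (e->raid_page == page) { e->buf_offset = offset; return; }
    struct ct_entry *e = malloc(sizeof *e);
    if (!e) { perror("malloc"); exit(1); }
    e->raid_page = page;
    e->buf_offset = offset;
    e->next = table[h];
    table[h] = e;
}

/* Returns nonzero and fills *offset if the page is currently buffered. */
static int ct_get(uint64_t page, uint64_t *offset)
{
    for (struct ct_entry *e = table[ct_hash(page)]; e; e = e->next)
        if (e->raid_page == page) { *offset = e->buf_offset; return 1; }
    return 0;
}

int main(void)
{
    uint64_t off = 0;
    ct_put(12345, 7);
    ct_put(12345, 8);                  /* overwrite: newest location wins */
    if (ct_get(12345, &off))
        printf("page 12345 is buffered at offset %llu\n", (unsigned long long)off);
    return 0;
}
```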
3.2. System Throughput for Constant Load

The workloads we prepared followed a Zipf-like distribution [5], where the probability that logical page address x in the RAID is accessed is proportional to 1/x^θ. Each page is randomly accessed if θ = 0, and accesses become more skewed as the value of θ increases. In the experiments, we use three values, 0, 0.5, and 1, for θ. We calculated the system throughput from the execution times for writing or reading 1 GB of data for each workload. This data size sufficiently exceeds the size of the buffer, so it can be treated as a constant load.

Figure 8 and Figure 9 show throughput curves of the FBD and RAID5 for access requests including write-only and 50% reads, respectively. If the access requests contain no reads, all requests come to the FBD first and are packed by removing irrelevant accesses.
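For reference, a Zipf-like workload of this kind can be generated by inverse-transform sampling over the cumulative distribution of 1/x^θ, as in the sketch below; this is our own illustration, not the workload generator used in the experiments.

```c
/* Compile with -lm. */
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

/* Draw an address in 1..npages with P(x) proportional to 1 / x^theta
 * (theta = 0 gives uniform accesses; larger theta gives more skew). */
static long zipf_draw(long npages, double theta)
{
    static double *cdf = NULL;
    static long cached_n = 0;
    static double cached_theta = -1.0;

    if (cdf == NULL || cached_n != npages || cached_theta != theta) {
        free(cdf);                                 /* (re)build the CDF once */
        cdf = malloc((size_t)npages * sizeof *cdf);
        if (!cdf) { perror("malloc"); exit(1); }
        double sum = 0.0;
        for (long x = 1; x <= npages; x++)
            sum += 1.0 / pow((double)x, theta);
        double acc = 0.0;
        for (long x = 1; x <= npages; x++) {
            acc += 1.0 / pow((double)x, theta) / sum;
            cdf[x - 1] = acc;
        }
        cached_n = npages;
        cached_theta = theta;
    }

    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    long lo = 0, hi = npages - 1;                  /* binary search in the CDF */
    while (lo < hi) {
        long mid = (lo + hi) / 2;
        if (cdf[mid] < u) lo = mid + 1; else hi = mid;
    }
    return lo + 1;
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        printf("%ld ", zipf_draw(100000, 1.0));    /* skewed accesses (theta = 1) */
    printf("\n");
    return 0;
}
```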
Figure 8. Throughput of FBD and RAID5 for 100% writes of 1 GB data (throughput [MB/s] versus block size [KB] for θ = 0, 0.5, 1)

Figure 9. Throughput of FBD and RAID5 for 50% writes of 1 GB data (throughput [MB/s] versus block size [KB] for θ = 0, 0.5, 1)

Figure 10. Throughput comparison between FBD and RAID5 (transfer rate magnification, log(FBD/RAID5), versus block size [KB] for the 100% write and 50% write workloads)

Figure 11. Throughput of FBD and RAID5 for 100% writes of 100 MB data (throughput [MB/s] versus block size [KB] for θ = 0, 0.5, 1)
Therefore, the influence of skew is directly reflected in the throughput, as depicted in Figure 8; the FBD throughput for highly skewed data sets (θ = 1) is considerably higher than that for flat data sets (θ = 0). Indeed, the FBD throughput for θ = 1 is very close to the maximum throughput of Cheetah (see Figure 7), but that for θ = 0 is limited by write accesses from the FBD to the RAID5. On the other hand, if the access requests contain 50% reads, these read requests go directly to the RAID5 or the backup disk, as we mentioned in Section 2.2.1. The FBD throughput is then even better than that of Cheetah because of the synergistic effect of the FBD and RAID5. In this case, less skewed data sets provide higher throughput for large blocks, because skew causes a bottleneck at the backup disk, while the order is reversed for small blocks because of the effect of the semiconductor cache in Arena (we tried to make the caching in Arena unavailable, but could not).

To make the difference in throughput between the FBD and RAID5 distinct, we plot their performance ratio in Figure 10. It indicates that the FBD improves the throughput by about 30 times at most compared to the pure RAID5. The reason for the decrease in the ratio for blocks larger than 2 KB is the performance improvement of the RAID5 with block size, while the performance of the buffering disk is saturated. This means that the FBD greatly improves small write operations. The graph also indicates that read operations amplify the performance improvement rather than disturbing it. Notice that the FBD achieves very high absolute performance for large block sizes even though the relative performance improvement is small.
3.3. System Throughput for Burst Loads

In some real situations, the load will change drastically. For instance, the load for a business database will be low during the night, but may become high when the business starts up in the morning and settle down later. If the amount of data during the drastic load increase does not exceed the size of the buffer, the FBD absorbs the temporary burst load. To demonstrate this, we derived the system throughput for workloads with 100 MB data transfers.

Figure 11 and Figure 12 show throughput curves of the FBD and RAID5 for 100 MB access requests including write-only and 50% reads, respectively. They indicate that the throughput for data that can be stored in one stage of the buffer is much better than that for the constant load. The throughputs of the FBD for 100% write requests of any data-skew type are almost the maximum performance of the buffering disks. This indicates that double buffering without interruption works effectively for write requests. The throughputs for 50% reads are also much better than those for the constant load. The results demonstrate that the FBD is capable of absorbing burst loads.
Figure 12. Throughput of FBD and RAID5 for 50% writes of 100 MB data (throughput [MB/s] versus block size [KB] for θ = 0, 0.5, 1)

Figure 13. Average response time of FBD and RAID5 (response time [ms] versus transfer rate [KB/s] for θ = 0, 0.5, 1)
3.4. Response Time

We next compared the average response time of the FBD with that of the RAID5. We generated write access requests at intervals according to an exponential distribution for given loads. The block size was fixed at 2 KB and the distribution of logical page addresses in the requests was the same as for the workloads of the throughput evaluation, i.e., a Zipf-like distribution. The average response time for varying load (data transfer rate) is plotted in Figure 13. Although the semiconductor cache in Arena is effective before the amount of data exceeds the cache size, the access time of the disks in the RAID5 becomes evident for these workloads with their large amounts of data. The graph indicates that the average response time of the FBD is also better than that of the RAID5. If access is skewed (θ = 1), the packing mechanism enables the FBD to be used under heavier load.

4. Discussion

4.1. System Reliability

From the reliability point of view, a system with FBD and RAID5 is a series configuration [6]. Therefore, assuming only one disk failure, the mean time to failure of the system, MTTF_FBDsystem, is calculated from MTTF_FBD and MTTF_RAID5:

$$\mathrm{MTTF}_{\mathrm{FBDsystem}} = \frac{1}{\frac{1}{\mathrm{MTTF}_{\mathrm{FBD}}} + \frac{1}{\mathrm{MTTF}_{\mathrm{RAID5}}}}$$

We can use the formula in [10] to derive MTTF_RAID5:

$$\mathrm{MTTF}_{\mathrm{RAID5}} = \frac{(\mathrm{MTTF}_{\mathrm{diskR}})^2}{N_{\mathrm{RAID5}} \times (N_{\mathrm{RAID5}} - 1) \times \mathrm{MTTR}_{\mathrm{diskR}}}$$

where MTTF_diskR and MTTR_diskR are the mean time to failure and mean time to repair of a single disk used in the RAID5, respectively, and N_RAID5 is the total number of disks in the RAID5. The FBD tolerates a single disk failure in a group like the RAID5, so we can apply the previous formula again to derive MTTF_FBD:

$$\mathrm{MTTF}_{\mathrm{FBD}} = \frac{(\mathrm{MTTF}_{\mathrm{diskF}})^2}{N_{\mathrm{FBD}} \times (N_{\mathrm{FBD}} - 1) \times \mathrm{MTTR}_{\mathrm{diskF}}}$$

In our experimental system, N_RAID5 and N_FBD are 6 and 3, respectively. The MTTFs of both the IBM DHEA-36481 (MTTF_diskR) and the Seagate Cheetah 4LP (MTTF_diskF) are 10^6 hours, according to the data sheets of the manufacturers. If we assume that MTTR_diskR and MTTR_diskF are 1 hour, MTTF_RAID5 ≈ 3 × 10^10 hours and MTTF_FBD ≈ 2 × 10^11 hours. Then, MTTF_FBDsystem ≈ 3 × 10^10 hours, which is almost the same as MTTF_RAID5.

On the other hand, if we implement the DCD using the same components, the mean time to failure of the system is

$$\mathrm{MTTF}_{\mathrm{DCDsystem}} = \frac{1}{\frac{1}{\mathrm{MTTF}_{\mathrm{diskD}}} + \frac{1}{\mathrm{MTTF}_{\mathrm{RAID5}}}}$$

and MTTF_diskD is 10^6 hours. Therefore, MTTF_DCDsystem ≈ 10^6 hours. Consequently, the reliability of the system using the FBD is almost the same as that of the RAID5, while that of the DCD is the same as that of a single disk.
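The arithmetic above can be reproduced with a few lines of C using the same formulas and the parameters quoted in the text (10^6-hour disk MTTF, 1-hour MTTR, six RAID5 disks, three FBD disks):

```c
#include <stdio.h>

/* MTTF of an array that tolerates one disk failure (formula from [10]). */
static double mttf_group(double mttf_disk, double mttr_disk, int ndisks)
{
    return (mttf_disk * mttf_disk) / (ndisks * (ndisks - 1) * mttr_disk);
}

/* MTTF of two subsystems in series. */
static double mttf_series(double a, double b)
{
    return 1.0 / (1.0 / a + 1.0 / b);
}

int main(void)
{
    const double mttf_disk = 1e6;   /* hours, from the data sheets */
    const double mttr_disk = 1.0;   /* hours, assumed in the text   */

    double raid5 = mttf_group(mttf_disk, mttr_disk, 6);
    double fbd   = mttf_group(mttf_disk, mttr_disk, 3);
    printf("MTTF_RAID5     = %.2e hours\n", raid5);
    printf("MTTF_FBD       = %.2e hours\n", fbd);
    printf("MTTF_FBDsystem = %.2e hours\n", mttf_series(fbd, raid5));
    printf("MTTF_DCDsystem = %.2e hours\n", mttf_series(mttf_disk, raid5));
    return 0;
}
```

The printed values come out at roughly 3.3 × 10^10, 1.7 × 10^11, 2.8 × 10^10 and 10^6 hours, consistent with the approximations quoted above.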
4.2. Cost Effectiveness

There are currently two trends in prices of hard disk drives: expensive high performance disks, and cheap low-end disks. We can now buy a cheap disk for less than a hundred and fifty dollars, while a high-end disk may cost more than a thousand dollars, even from a discount shop. It should be attractive to construct a system using numerous inexpensive low-end disks for a RAID5 (matching the origin of the acronym Redundant Array of Inexpensive Disks), and to enhance its small write performance using a small number of relatively expensive high performance disks. Even if we construct a RAID5 from expensive disks, it is difficult to make its small write performance comparable to its sequential access performance.

5. Concluding Remarks

We proposed a disk buffering system using a small number of disks to improve the small write performance of RAID5 while maintaining the reliability of the total system. We named it the Fault-tolerant Buffering Disk (FBD) system. The FBD enables a cost-effective configuration by combining cheap low-end disks and expensive high performance disks. The main features of the FBD are:
- It requires no modification of either the host or the RAID5.
- It can be used for continuously-running applications, as there are no pauses for collecting garbage.
- Its control mechanism is very simple compared to LFS-based approaches.
- It utilizes the high transfer rate of sequential accesses by use of double buffering.
- The backup disk is used to enhance performance as well as to maintain reliability. Synchronization between the primary disks and the backup disk is rather loose.
- The amount of non-volatile memory required is constant and small, while the DCD requires a larger amount of non-volatile memory, proportional to the size of the RAID.

We developed an experimental system using an off-the-shelf personal computer and disks, and a commercial RAID5 system. Performance evaluation of the system demonstrates that the FBD improves both throughput and response time considerably. We accept that the large cache memory in the RAID5 system diminishes the confidence in the experimental results, because we could not inhibit the caching. We plan to use other RAID5 systems for further evaluation.

Although the maximum throughput is limited to the sequential access speed of the disks used for buffering, we can prepare multiple FBDs for RAIDs having a large number of disks to enhance the performance further. Consideration of the best ratio of disks in the RAIDs to those in the FBD is part of our proposed future work. We also plan to evaluate performance after a disk failure, and to compare FBD performance with that of the DCD.

Acknowledgments

This research was partially supported by the Ministry of Education, Science, Sports and Culture, Japan, by a grant for Scientific Research on Priority Areas (08244205/09230206/08244105), and also by SRC (Storage Research Consortium in Japan).

References

[1] P. M. Chen et al. RAID: High-Performance, Reliable Secondary Storage. ACM Computing Surveys, 26(2):145-185, June 1994.
[2] FreeBSD. http://www.freebsd.org/.
[3] G. A. Gibson. Redundant Disk Arrays. The MIT Press, 1992.
[4] Y. Hu and Q. Yang. DCD – Disk Caching Disk: A New Approach for Boosting I/O Performance. In Proc. of the Int. Symp. on Computer Architecture '96, 1996.
[5] D. E. Knuth. The Art of Computer Programming, Volume 3: Sorting and Searching. Addison-Wesley, 1973.
[6] F. P. Mathur. On reliability modeling and analysis of ultrareliable fault-tolerant digital systems. IEEE Transactions on Computers, C-20(11):1376-1382, 1971.
[7] J. Menon. Performance of RAID5 Disk Arrays with Read and Write Caching. Distributed and Parallel Databases, (2):261-293, 1994.
[8] K. Mogi and M. Kitsuregawa. Hot Mirroring: A Method of Hiding Parity Update Penalty and Degradation During Rebuilds for RAID5. In Proc. of SIGMOD Conf. '96, pages 183-194, 1996.
[9] K. Mogi and M. Kitsuregawa. Virtual Striping: A Storage Management Scheme with Dynamic Striping. IEICE Transactions on Information and Systems, E79-D(8):1086-1092, 1996.
[10] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proc. of the ACM SIGMOD Conference, pages 109-116, June 1988.
[11] M. Rosenblum and J. Ousterhout. The Design and Implementation of a Log-Structured File System. ACM Transactions on Computer Systems, 10(1):26-52, 1992.
[12] D. Stodolsky, M. Holland, W. V. Courtright II, and G. A. Gibson. Parity-Logging Disk Arrays. ACM Transactions on Computer Systems, 12(3):206-235, August 1994.
[13] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID Hierarchical Storage System. In Proc. of SIGOPS Conf. '95, pages 96-108, 1995.