Efficient Implementation of RAID-5 Using Disk Based Read Modify Writes

Sangyup Shim, Yuewei Wang, Jenwei Hsieh, Taisheng Chang, and David H.C. Du
Distributed Multimedia Research Center and Computer Science Department, University of Minnesota
Abstract

Redundant Array of Inexpensive Disks (RAID) is often used to provide a fault tolerance capability for disk failures in database systems. An efficient implementation of small writes is the most important issue in achieving high throughput, because most writes in databases are small. The traditional RAID implementation requires a RAID controller to construct parity blocks for writes (host based writes). A RAID implementation based on an exclusive-or (XOR) engine in each disk may minimize link traffic and improve the response times for small writes; this is called disk based writes. However, many implementation issues have to be resolved before the higher performance can be realized. We have observed that a straightforward implementation of disk based writes achieves only half of the potential throughput because of the interactions among commands. One of these issues is how to prevent a state of deadlock, which may occur when multiple disk based writes are executed simultaneously. The implementation issues addressed in this paper include how to prevent deadlock, how to enhance the cache replacement policy in disks, how to increase concurrency among disk based writes, and how to minimize the interactions of many commands. This paper investigates the effectiveness of disk based writes and compares the performance of disk based writes and host based writes in RAIDs. In a stress test, the aggregate throughput was increased and the average command latency was reduced.
The authors would like to thank Cort Fergusson, Mike Miller and Jim Coomes at Seagate Technology for providing us valuable information on FC-AL and a disk based Read Modify Write. The Distributed Multimedia Research Center (DMRC) is sponsored by US WEST Communications, Honeywell, IVI Publishing, Computing Devices International and Network Systems Corporation.
1 Introduction

A reliable and high performance database system imposes challenges on the design and implementation of a storage subsystem. First, a high performance database system requires high data bandwidth from an I/O subsystem. The storage subsystem of a traditional database usually consists of a hierarchy of two levels of storage components: magnetic disks, which store all the data needed by database transactions, and memory, which caches the data retrieved from disks and is used to compute query results. Since the I/O bandwidth of magnetic disks is low compared with that of memory, an efficient design and implementation of the storage subsystem is crucial to the performance of database systems. Second, fault tolerance is an important issue if a database system needs to provide non-disruptable services. A good design of a storage subsystem must be capable of handling disk failures. Redundant Arrays of Inexpensive Disks (RAID) [4] provides a fault tolerant solution that allows a large number of queries to access databases on a storage subsystem. Among all the RAID levels, RAID-5 [4] is widely used because of its high I/O performance in the fault-free case and its reasonable throughput under failure. For this reason, we only consider RAID-5 in this paper (for convenience, we use the terms RAID and RAID-5 interchangeably throughout the paper).

Traces of database transactions show that most query sizes are in the range of 4 KB and that the majority of the requests are writes. The latter is a consequence of extensive read caching and a write-through policy. Read caching is performed extensively because of the increasing size of memory in today's computer systems; the number of reads that reach the disks is reduced as more reads are served by memory cache hits. For writes, a write-through policy is employed to minimize fault vulnerability: if writes were cached in memory, the cached writes would be lost when the system crashes. As a result, the percentage of writes seen by the storage subsystem increases. Therefore, a storage subsystem needs to be optimized towards an efficient implementation of small writes.

The traditional RAID implementation requires a RAID controller to construct (or reconstruct) a parity block for each write. In RAID, data are divided into blocks, called striping units, and stored on several disks in a block-striping fashion. When a block is written to a disk, the old block and the corresponding parity block are read into buffers in a host or a RAID controller. The new parity block is constructed by computing the exclusive-or (XOR) of the new data block, the old data block, and the old parity block. Then, the new parity block is written back to the disk to replace the original parity block. Hence, a small write in traditional RAID-5 involves two reads and two writes. This process is usually called a read-modify-write (RMW). Since the parity computation is done in a host or in a RAID controller, the traditional RMW is called a host based RMW. A RAID controller is expensive because it often requires multiple channel interfaces for the participating disks, large buffer memory, intelligent mechanisms to conduct RMW operations, and special hardware to reconstruct parity blocks. If the parity block for a write can be computed in a disk cache, only the new data blocks need to be transferred from the host to the disks. This reduces the traffic between the host and the disks and improves the performance of small writes. However, it requires the presence of an XOR unit in each disk. A write based on the XOR unit in a disk is called a disk based write.
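For reference, the parity update behind the host based RMW just described can be written out explicitly (this is simply the XOR relation stated above, with D denoting the data block and P the parity block):

$$P_{\mathrm{new}} = D_{\mathrm{new}} \oplus D_{\mathrm{old}} \oplus P_{\mathrm{old}}$$

which is why a small write costs two reads (D_old and P_old) and two writes (D_new and P_new).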
An XOR unit on a disk adds relatively little cost to a high performance disk. (Recently the areal density of a hard disk has increased by roughly sixty percent per year [17]. The capacity of a high performance disk is approaching 20 GBytes, and rotation speeds have risen from 7,200 to 10,000 rpm; Cheetah disk drives [16] from Seagate, for example, rotate at 10,033 rpm and offer data transfer rates of 11.3 to 16.8 MB/s. As disks get larger and faster, the cost of a high performance disk increases while the cost per MB decreases.) When the XOR capability is built into a disk, a
RAID controller may no longer be needed if parity computations are performed in the disks. That is why the storage industry is quite interested in a disk based RMW. The new SCSI commands to implement a disk based RMW, called XDWRITE and XPWRITE, are defined in [8]. XDWRITE is used by a host to send new data blocks to disks. XPWRITE is used by a disk to send updated information to another disk for the computation of new parity blocks. For convenience, the disk addressed by an XDWRITE is referred to as the target disk, and the disk that contains the corresponding parity blocks is referred to as the parity disk. In RAID-5, data and parity blocks are striped over all participating disks, so any disk can be a target disk or a parity disk for a given write. In a disk based RMW, a host sends an XDWRITE with new data blocks to a target disk. The XDWRITE causes the target disk to construct an XOR block from the new block and the old block. The constructed XOR block is then sent to the corresponding parity disk. An XPWRITE makes the parity disk update the corresponding parity block based on the XOR block from the target disk and the old parity block. Since old data blocks are not transferred to the host in a disk based RMW, data transfers on the links are reduced by half compared to a host based one. We describe the disk based write in detail in Section 4.

The standard in [8] describes only the steps necessary to perform a disk based write. However, many problems arise when multiple disk based writes are intermixed with one another. Even though disk based writes may minimize link traffic and improve the response times for small writes, many implementation issues have to be resolved before the higher performance can be realized. One of these issues is how to prevent a state of deadlock, which may occur when multiple disk based writes are executed simultaneously. When a target disk receives an XDWRITE, it sends an XPWRITE to the corresponding parity disk. After the target disk constructs an XOR block, it waits for a ready signal from the parity disk. If two disks are each other's parity disk, they may wait for each other's ready signal indefinitely; the interactions between them result in a circular wait. This paper examines ways to prevent such deadlocks in disk based writes. Other implementation issues addressed in this paper include how to enhance the cache replacement policy in a disk, how to increase concurrency among disk based writes, how to improve the performance of large block writes, and how to minimize the interactions of concurrent commands.

A disk based write can be implemented over any storage interface. Among them, the SCSI bus is widely used to interconnect storage devices. However, it does not provide enough bandwidth to support high performance disks, as the fast data transfers from a small number of such disks can easily saturate a SCSI bus. For example, fast-wide SCSI provides a bandwidth of 20 MB/s, while a Cheetah disk drive offers data transfer rates up to 16.8 MB/s [16]. The emerging serial storage interfaces such as Fibre Channel - Arbitrated Loop (FC-AL) [3] and Serial Storage Architecture (SSA) [1, 2] provide alternatives for high-performance storage interfaces. FC-AL offers many advantages over SCSI, including higher bandwidth, fair access, fault tolerance, compact connectors, and more device attachments in a single loop. FC-AL provides 100 MB/s of link bandwidth, which makes it suitable for connecting high performance disks.
Since adding an XOR unit to a high performance disk is cost effective, in this paper we study the performance of disk based writes with FC-AL disk drives. The maximum numbers of attachments on a fast-wide SCSI bus and on an FC-AL loop are 15 and 126 devices, respectively.
Figure 1: A loop topology of an FC-AL loop (a single host and disks 1 through N attached through an FC-AL adaptor to a dual-loop Fibre Channel - Arbitrated Loop, with 100 MB/s per loop).
Multiple hosts can also be attached to FC-AL. A configuration with dual loops and multiple hosts in FC-AL provides a bandwidth of 200 MB/s and fault tolerance against link, host, and adapter failures. A host failure may be tolerated in this environment because all disks in a loop are shared among the hosts. FC-AL also supports a bypass circuit, which can be used to keep a loop operating when a device on the loop fails or is physically removed. Figure 1 shows a dual-loop configuration of FC-AL with a single host. Since a disk based RMW reduces link traffic for small writes, the performance gain is higher as more disks are attached to a loop. In a stress test, the number of disk attachments could be doubled when disk based RMWs were used; disk based RMWs reduced link traffic and contention significantly when compared with host based ones.

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 presents an overview of a host based write. Section 4 describes a disk based write. Section 5 examines the issues in implementing a disk based write and discusses how to prevent deadlock in disk based writes. Section 6 presents an analysis of the proposed schemes and the simulation model. Section 7 shows the simulation results for disk based and host based writes. Section 8 concludes the paper and outlines future work.
2 Related Work

Redundant Arrays of Inexpensive Disks (RAID) [4, 7, 10] provide high throughput and fault tolerance capabilities to a storage system. Whenever data are written to disks in a RAID, the corresponding parity blocks need to be updated; the parity blocks are used to recover from a disk failure. In RAID-5, data are striped block-wise over the disks, and the parity disk is selected in a circular fashion so that parity blocks are placed uniformly over all disks. Section 3 describes a host based write. Mirroring can also provide fault tolerance, at the cost of doubling the storage requirement. In [8], new SCSI commands, XDWRITE and XPWRITE, are defined to support a disk based write. However, many problems arise when multiple disk based writes are intermixed with each other. Even though disk based writes may minimize link traffic and improve the response times for small writes, many implementation issues have to be resolved before the higher performance can be realized. This paper examines the important implementation issues of disk based writes. To the best of our
knowledge, no study has been published on the implementation issues and the performance of disk based writes.
Figure 2: A loop topology of an SSA loop (a host initiator and seven disks connected in a loop by point-to-point 20 MB/s links).

Serial interface standards have been developed recently. They include the Serial Storage Architecture (SSA) standards [1, 2], the Fibre Channel - Arbitrated Loop (FC-AL) standard [3], and the IEEE P1394 serial bus standard [13]. FC-AL was designed as a high speed interconnect that provides low-cost storage attachments. It offers a data bandwidth of 100 MB/s and fault tolerance, with an optional fairness algorithm. Serial Storage Architecture (SSA) offers high performance interconnections for storage subsystems. A link operates at 20 MB/s in each direction, for a total bandwidth of 80 MB/s. SSA provides a spatial reuse feature which allows independent data transmissions to occur simultaneously on the same loop at the full bandwidth. A typical SSA configuration is shown in Figure 2, where the nodes are connected by point-to-point links in a loop. Both FC-AL and SSA support hot-pluggable devices on a backplane and provide redundant link paths for fault tolerance. A detailed tutorial on the FC-AL and SSA standards is given in [5].
3 Host Based Writes

In RAID-5, data are divided into blocks and striped block-wise across the disks. The data blocks at the same location on each disk and the corresponding parity block form a stripe. A write in RAID involves updating the corresponding parity block. The steps of a small write are shown in Figure 3. When new B0 is written to disk 0, old B0 and old P0 are read into the host. Old P0 is modified using the new block and the old block. As the final step, the new parity block is written back to the parity disk.

When a write involves data blocks from more than half of a stripe, the blocks in the stripe that are not being written are read into host buffers, because there are fewer non-writing blocks than updated blocks. This kind of write is called a reconstruct write [4]; it is a common technique to improve the performance of large writes in RAID-5. After the non-writing blocks are read, the new parity block is constructed and later written back to the parity disk. Figure 4 shows a reconstruct write: old B4 and B5 are read into the host buffers, where the new parity block, new P0, is constructed. New B0, B1, B2, B3, and new P0 are then written back to the disks. Therefore, in this paper a write that involves data blocks from more than half of a stripe is called a large write; otherwise it is a small write.
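To make the host based small-write path concrete, the following is a minimal sketch, not the paper's simulator; the disks dictionary and its read/write methods are assumed helpers, and blocks are modeled as byte strings.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        """Byte-wise exclusive-or of two equally sized blocks."""
        return bytes(x ^ y for x, y in zip(a, b))

    def host_based_small_write(disks, target, parity, lba, new_block):
        """Host based RMW: two reads and two writes cross the storage link."""
        old_block = disks[target].read(lba)      # read old data block into the host
        old_parity = disks[parity].read(lba)     # read old parity block into the host
        new_parity = xor_blocks(xor_blocks(new_block, old_block), old_parity)
        disks[target].write(lba, new_block)      # write new data block
        disks[parity].write(lba, new_parity)     # write new parity block

Four block transfers cross the link for every small write, which is the traffic that a disk based write tries to avoid.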
Figure 3: Host based read-modify-write for a small block (the host reads old B0 from Disk0 and old P0 from Disk6, computes new P0 by XOR, and writes new B0 and new P0 back; data blocks B0-B23 and parity blocks P0-P3 are striped across Disk0 through Disk6, with Disk0 as the target disk and Disk6 as the parity disk for this stripe).
4 Disk Based Writes

If parity computations are performed in a disk cache, block transfers from the disks to the host for the parity computation can be avoided. A disk based write uses the disk cache for the parity computation; it requires an XOR unit on each disk. Disk based writes can potentially support a larger number of disks by reducing link traffic, and they make small writes faster.

For a small write, the host writes the new blocks directly to a target disk using an XDWRITE command. In the target disk, the old block is read into the disk cache, and the XOR block is constructed from the old block and the new block. When the target disk receives the XDWRITE, it sends an XPWRITE to the corresponding parity disk so that the parity disk starts to read the old parity block into its disk cache. After the parity disk responds with a ready signal, the target disk sends the constructed XOR block to the parity disk. When the parity disk receives the XOR block from the target disk, it constructs the new parity block by XORing the old parity block with the XOR block. The new parity block is then written back to the parity disk as the final step. The target disk sends a complete status to the host when it completes the XDWRITE. Figure 5 shows an example of a disk based write: new B0 and old B0 are brought into the disk cache at disk 0, where XOR B0 is constructed. XOR B0 is transferred to the parity disk, disk 6, where new P0 is constructed and then written. A disk based write is more efficient for small writes than a host based one because the old data blocks and the old parity blocks do not need to be transferred to the host.

In a disk based large write, the XOR blocks are transferred one by one from the target disks to the corresponding parity disk. Because more than half of the disks in a stripe are writing to the parity disk, each target disk has to wait for its turn to access the parity disk. As a result, multiple XPWRITEs are serialized at the parity disk, and the parity disk may become a bottleneck.
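As an illustration of the XDWRITE/XPWRITE hand-off described above, here is a minimal sketch of the two disk roles. It is not an implementation of the SCSI XOR command set in [8]; the TargetDisk and ParityDisk classes, their method names, and the in-memory stores are assumptions made for the example.

    def xor_blocks(a: bytes, b: bytes) -> bytes:
        return bytes(x ^ y for x, y in zip(a, b))

    class ParityDisk:
        def __init__(self, store):
            self.store = store                       # lba -> parity block

        def xpwrite(self, lba, xor_block):
            """XPWRITE: fold the received XOR block into the old parity and write it back."""
            old_parity = self.store[lba]             # read old parity into the disk cache
            self.store[lba] = xor_blocks(old_parity, xor_block)

    class TargetDisk:
        def __init__(self, store, parity_disk):
            self.store = store                       # lba -> data block
            self.parity_disk = parity_disk

        def xdwrite(self, lba, new_block):
            """XDWRITE: compute the XOR block in the disk, forward it, then commit the data."""
            old_block = self.store[lba]              # read old data into the disk cache
            xor_block = xor_blocks(new_block, old_block)
            self.parity_disk.xpwrite(lba, xor_block) # only the XOR block crosses the link
            self.store[lba] = new_block              # write the new data block
            return "GOOD"                            # complete status returned to the host

Only the new data block travels from the host to the target disk, and only the XOR block travels between disks, which is the halving of link traffic noted above.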
Figure 4: Host based reconstruct write for a large block (the host writes new B0 through B3, reads old B4 and B5, computes new P0 in host memory, and writes it to the parity disk).

Hence, a disk based write for a large block may incur higher latency than a host based write (from the host's perspective a write becomes a read-modify-write (RMW) at the disk level in RAID-5), because all the target disks have to write sequentially to the parity disk, whereas a host based reconstruct write only reads in the non-writing blocks to compute the new parity block. The disk based large write is shown in Figure 6; the parity computations and updates are performed by the disks.
5 Implementation Issues of a Disk Based Write

5.1 Deadlock in Disk Based Writes

Many commercial disks today can only handle one SCSI command at a time. This may cause deadlocks in disk based RMWs. When a target disk receives an XDWRITE, it sends an XPWRITE to the corresponding parity disk. After the target disk receives the new data block, it constructs an XOR block from the new block and the old block. Before the target disk transmits the constructed XOR block, it waits for a ready signal from the parity disk; the ready signal indicates that the parity disk is ready to receive data.

Consider a scenario in which two XDWRITEs are sent to two different disks, and assume that those disks are each other's parity disk. Both disks send an XPWRITE to each other when they receive their XDWRITE command from the host. After the disks have constructed their XOR blocks, they wait for a ready signal from each other. Since neither disk knows that they are waiting for each other's response, they enter a state of deadlock. Figure 7 shows the deadlock scenario in which disk 0 and disk 6 wait for each other's response after both disks have sent an XPWRITE to each other. The host sends XDWRITEs to disk 0 and disk 6. When disk 0 receives the XDWRITE, it sends an XPWRITE to its parity disk (disk 6).
Figure 5: Disk based read-modify-write for a small block (the host sends an XDWRITE with new B0 to Disk0, which reads old B0, computes XOR B0, and sends it with an XPWRITE to the parity disk Disk6, where new P0 is computed and written).

Disk 0 receives new B0 from the host and reads old B0 into the disk cache. After that, it computes XOR B0 and waits for the ready signal from disk 6. Disk 6 performs the same operations as disk 0, since they are each other's parity disk. A state of deadlock now occurs, as disk 0 and disk 6 wait for each other's ready signal forever.
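The circular wait just described can be checked with a tiny wait-for graph, sketched below purely as an illustration of the condition (each single-command disk waits for the other's XPWRITE ready); it is not part of the authors' simulator.

    def has_deadlock(waits_for):
        """Detect a cycle in a wait-for graph mapping a disk id to the disk it waits on."""
        for start in waits_for:
            seen, node = set(), start
            while node in waits_for:
                if node in seen:
                    return True                  # circular wait: deadlock
                seen.add(node)
                node = waits_for[node]
        return False

    # Disk 0 and disk 6 are each other's parity disk: each holds its single
    # command slot with an XDWRITE and waits for the other's XPWRITE ready.
    print(has_deadlock({0: 6, 6: 0}))            # True -- the scenario of Figure 7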
5.2 Resolving Deadlock in Disk Based Writes

Deadlock can be resolved using a priority scheme in which a priority is assigned to each disk. In FC-AL, each disk is assigned a unique address, which can be used as the disk's priority. If two disks send an XPWRITE to each other, the lower priority disk must respond to the XPWRITE of the higher priority disk first; after the XPWRITE of the higher priority disk has completed, the lower priority disk continues with its own XPWRITE. This solution serializes the XDWRITE operations based on disk priority and hence prevents deadlock. However, it causes unfairness among disks, which may lead to poor performance: under heavy load, only high priority disks are able to complete commands, while low priority disks may keep accumulating commands. Commands sent to a low priority disk will therefore experience longer delays than commands sent to a high priority disk.

A better approach is to allow a disk to respond to an incoming XPWRITE (acting as a parity disk) while it is waiting for a ready signal from its own parity disk (acting as a target disk). However, the computed XOR block (which will be sent to the parity disk once the ready signal is received) must be kept safely in the disk cache. This scheme therefore allows a disk to lock a portion of its cache space for computed XOR blocks; we call it a cache locking scheme. In this scheme, the disk cache is divided into two parts, a locking section and a non-locking section. XPWRITEs use the non-locking section and XDWRITEs use the locking section of the disk cache. Deadlock is prevented because XPWRITEs can be processed without waiting for an outstanding XDWRITE to complete.
Figure 6: Disk based read-modify-write for a large block (each target disk computes its XOR block from the new and old data blocks and forwards it to the parity disk, where new P0 is accumulated and written).

Table 1: Stress test with eight disks (64KB block size)

Method                   Pure XDWRITE   50% READ and 50% XDWRITE
Priority based scheme    3.4 MB/s       5.4 MB/s
Cache locking scheme     3.6 MB/s       5.5 MB/s
Table 1 shows the results of a stress test with eight disks; the details of the stress tests are described in Section 7. The proposed priority based scheme and the cache locking scheme were simulated and compared. The results show that the cache locking scheme performs slightly better than the priority based scheme, and we also observed the unfairness problem with the priority based scheme. Therefore, the cache locking scheme is used in the rest of this paper.
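A minimal sketch of the cache locking scheme follows, assuming a hypothetical DiskCache bookkeeping class; the point is only that incoming XPWRITEs draw from the non-locking section and therefore never wait behind this disk's own locked XOR blocks.

    class DiskCache:
        """Disk cache split into locking segments (XOR blocks held by outstanding
        XDWRITEs) and non-locking segments (used by XPWRITE, READ, and WRITE)."""

        def __init__(self, total_segments, locking_segments):
            self.free_locking = locking_segments
            self.free_nonlocking = total_segments - locking_segments

        def lock_for_xdwrite(self):
            """Reserve a segment for a computed XOR block awaiting the parity disk's ready."""
            if self.free_locking == 0:
                return False                     # too many outstanding XDWRITEs; must wait
            self.free_locking -= 1
            return True

        def release_after_xor_sent(self):
            """Free the segment once the XOR block has reached the parity disk."""
            self.free_locking += 1

        def can_serve_xpwrite(self):
            """Incoming XPWRITEs use the non-locking section, so they can proceed even
            while this disk is itself waiting on its own parity disk."""
            return self.free_nonlocking > 0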
5.3 Performance Issues of Disk Based Writes

Since a disk based write uses the disk cache for XOR computations, the cache replacement policy of the disk cache has to be evaluated carefully. A disk cache is usually divided into a number of units called cache segments, and a disk may use a Least Recently Used (LRU) policy to select the cache segment to be replaced. However, when an XOR block has been computed, it has to be locked in the cache until it is transferred to the parity disk; once a cache segment is locked, it cannot be replaced. After the XOR block has been transferred to the parity disk, the segment occupied by the XOR block is no longer in use. Since the XOR block is a temporary result, it will never be read in the future, so it can be replaced without affecting the number of disk cache hits. Therefore, any unlocked XOR blocks in cache segments are replaced first, and LRU is used for the remaining cache segments. This enhancement of the cache replacement policy may improve the cache hit ratio in a disk; a sketch of the resulting victim selection is given below.

When a target disk receives an XDWRITE command, an XPWRITE has to be sent to the parity disk immediately, in order to reduce the response time of the ready signal from the parity disk.
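As a sketch of the enhanced replacement policy described above (victims are unlocked segments holding already-forwarded XOR blocks, then plain LRU), under the assumption of a simple Segment record with hypothetical fields:

    from dataclasses import dataclass

    @dataclass
    class Segment:
        lba: int
        last_used: int        # logical timestamp for LRU
        is_xor_block: bool    # holds a temporary XOR block
        locked: bool          # XOR block not yet sent to the parity disk

    def choose_victim(segments):
        """Evict unlocked XOR-block segments first; otherwise fall back to plain LRU."""
        replaceable = [s for s in segments if not s.locked]   # at least one is assumed free
        stale_xor = [s for s in replaceable if s.is_xor_block]
        if stale_xor:
            return stale_xor[0]                               # temporary data, never re-read
        return min(replaceable, key=lambda s: s.last_used)    # least recently used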
Figure 7: Deadlock in a disk based RMW (disk 0 and disk 6 are each other's parity disk; each has sent an XPWRITE to the other and waits for the other's ready signal).

We have observed that the waiting time for the XPWRITE ready from the parity disk was long enough to degrade the overall performance when many commands were outstanding. (A command is outstanding when it has been sent out by a host or a target disk but has not yet completed; an outstanding command may be being served by a disk or waiting in a command queue.) One way to reduce the delay of an XPWRITE is to put the XPWRITE command at the head of the command queue in the parity disk, which makes the parity disk respond faster. Since the target disk waits for the XPWRITE ready, a fast response from the parity disk is critical to reducing latency. However, sending an XPWRITE immediately and putting it at the head of the command queue may not be sufficient to minimize the waiting time. Under heavy load, many XPWRITE commands may be queued, or a large read may be running at the parity disk; in such cases the delay for an XDWRITE may be long. If we only allow a single outstanding XDWRITE at a time, the implementation of XDWRITE becomes inefficient. To remedy this, multiple XDWRITE commands can be outstanding at the same time.

In fact, an XDWRITE command can be separated into two distinct phases. The first phase includes sending the new data blocks from the host to the target disk and computing the XOR block; once the XOR block is computed, it is locked in the cache until it is transferred to the parity disk. The second phase transfers the XOR block to the parity disk, and it may take a while before this phase is executed, depending on the load at the parity disk. After the first phase has finished, the target disk is able to execute other commands, because the second phase is not affected by other commands as long as the XOR block is locked safely in the disk cache. With multiple outstanding commands, the cache locking scheme mentioned before can be used to prevent deadlock. In this scheme, the disk cache is divided into two sections, as shown in Figure 8. One section, called the locking cache, is used by commands that need to lock a part of the disk cache; XDWRITE is a locking command. The other section, called the non-locking cache, is used by commands that do not need to lock the disk cache. The number K of locking cache segments is predetermined; when the total number of cache segments is N, the remaining N - K segments form the non-locking cache.
Figure 8: The disk cache with N segments is divided into K locking segments and N - K non-locking segments.

This division of the cache segments prevents deadlock, because there are always enough disk cache segments available to execute non-locking commands such as XPWRITE, READ, and WRITE while locking commands (XDWRITEs) are waiting for responses from a parity disk. The number of outstanding XDWRITE commands is limited by the number of locking cache segments, K.

Table 2 shows the results of a stress test with eight disks; the details of the stress tests are described in Section 7. An implementation with the above improvements is called the improved implementation. For comparison, a simple implementation and the improved implementation were simulated. The throughput was doubled when the aforementioned improvements were implemented.

Table 2: Results from a stress test with eight disks (64KB)

Method                    Pure XDWRITE   50% READ and 50% XDWRITE
Simple implementation     3.4 MB/s       5.41 MB/s
Improved implementation   7.2 MB/s       10.3 MB/s
6 Analysis and Simulation Model

6.1 Analysis

This section analyzes the response times of both disk based writes and host based writes. A disk based write is more efficient for small writes than a host based one; however, a host based write performs better for a large write. A hybrid write takes advantage of both schemes: if a write involves fewer than half of the disks, a hybrid write works the same way as a disk based one; otherwise it works the same way as a host based write. A hybrid write reduces the traffic on the links when the request size is small, and when the request size is large it reads only the non-writing blocks into the host, so the command latency may also be minimized. Even though a hybrid write provides the advantages of both schemes, it has to construct XOR blocks in either the host or the disks depending on the request size; hence a hybrid write requires XOR capability in both the host and the disks.
Table 3: Symbols and definitions

Symbol       Definition                                            Units
N            Number of blocks requested                            blocks
N_stride     Number of disks in the RAID                           disks
T_read       Time to read a single block                           sec
T_write      Time to write a single block                          sec
T_transfer   Time to transfer a single block                       sec
T_xor        Time to perform an XOR operation on a single block    sec
When a write involves fewer than half of the striped disks, the old blocks and the parity block are read into a host or a RAID controller; after the new parity block is computed, it is written back to the parity disk. When N < N_stride/2, Equation 1 gives the total time required to complete a host based write:

    (N+1) T_read + (N+1) T_transfer + N T_xor + (N+1) T_transfer + (N+1) T_write        (1)

Equation 2 gives the total time required to complete a disk based write:

    (N+1) T_read + N T_transfer + N T_xor + N T_transfer + (N+1) T_write                (2)

The total time for a hybrid write is the same as for a disk based write when N < N_stride/2. As an example, when N = 1 and N_stride = 7, a host based write takes 2 T_read + 4 T_transfer + T_xor + 2 T_write, while a disk based write takes 2 T_read + 2 T_transfer + T_xor + 2 T_write.

When a write involves at least half of the striped disks, the non-writing blocks of the stripe are read into host buffers, because it is more efficient to read the non-writing blocks. When N >= N_stride/2, Equation 3 gives the total time required to complete a host based write:

    (N_stride - N) T_read + (N_stride - N) T_transfer + N T_xor + (N+1) T_transfer + (N+1) T_write    (3)

Equation 4 gives the total time required to complete a disk based write:

    (N+1) T_read + N T_transfer + N T_xor + N T_transfer + (N+1) T_write                (4)

The total time required for a hybrid write is the same as for a host based write when N >= N_stride/2. As an example, when N = 5 and N_stride = 7, a host based write takes 2 T_read + 8 T_transfer + 5 T_xor + 6 T_write, while a disk based write takes 6 T_read + 10 T_transfer + 5 T_xor + 6 T_write.
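A small helper, sketched below, evaluates Equations 1-4 and the hybrid rule numerically; the per-block times passed in are made-up placeholders, not measurements from the paper.

    def host_based_time(n, n_stride, t_read, t_write, t_xfer, t_xor):
        """Equations 1 and 3: host based write time for N blocks in a stripe of N_stride disks."""
        if n < n_stride / 2:                                   # read-modify-write
            return ((n + 1) * t_read + (n + 1) * t_xfer + n * t_xor
                    + (n + 1) * t_xfer + (n + 1) * t_write)
        return ((n_stride - n) * t_read + (n_stride - n) * t_xfer + n * t_xor
                + (n + 1) * t_xfer + (n + 1) * t_write)        # reconstruct write

    def disk_based_time(n, n_stride, t_read, t_write, t_xfer, t_xor):
        """Equations 2 and 4 (identical): disk based write time."""
        return (n + 1) * t_read + n * t_xfer + n * t_xor + n * t_xfer + (n + 1) * t_write

    def hybrid_time(n, n_stride, **t):
        """Hybrid write: disk based for small writes, host based for large writes."""
        return disk_based_time(n, n_stride, **t) if n < n_stride / 2 else host_based_time(n, n_stride, **t)

    # Illustrative, made-up per-block times in milliseconds:
    t = dict(t_read=8.0, t_write=8.0, t_xfer=0.6, t_xor=0.1)
    print(host_based_time(1, 7, **t), disk_based_time(1, 7, **t))   # small write: disk based is faster
    print(host_based_time(5, 7, **t), disk_based_time(5, 7, **t))   # large write: host based is faster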
6.2 Simulation Models

A host is connected to eight target disks via a single FC-AL loop. A loop port is attached to the loop for each disk. A Loop Port State Machine (LPSM) [5] is used to define the behavior of the loop ports, and FCP simulates a protocol mapping layer that uses the services provided by the LPSM. Our simulation models focus on the storage interface and the attached disk drives. On the hardware side, the models cover the host adaptors, the storage interface, the disk cache, and the disk modules; they do not consider the overhead caused by the I/O buses and the memory system in the host computer. On the software side, the models include the mapping of SCSI commands onto FC-AL protocols, the detailed behavior of the FC-AL transport protocols, and its fairness algorithm; they do not consider the overheads caused by device drivers and operating systems.

Commands are multiplexed, as each command is executed in several phases. Several I/O operations can be intermixed, and multiple disks can be busy at the same time. We call an I/O operation which has been issued by the initiator but has not been completed an outstanding command. Intuitively, more outstanding commands can keep more disks busy serving I/O requests, so the maximum number of outstanding commands is an important simulation parameter that may affect the performance.

The disk model includes a disk buffer cache and a physical disk. In the disk model, data caching is enabled, and fetching data from the disk and transferring data out occur concurrently. In order to reduce the arbitration overhead, FCP waits until a segment of data or the total request size, whichever is smaller, has been fetched into the buffers in a disk; after a segment of data has been read into the buffers, FCP starts to arbitrate.

Table 4: Model parameters for the IBM Ultrastar XP 4.51 GB.

Capacity                   4.51 GB
Rotational speed           7202.7 RPM
Average rotation latency   4.17 ms
Seek times                 0.5 - 16.5 ms
Transfer rate              5.53 - 7.48 MB/sec
Table 4 summarizes the disk parameters used in the simulation. The disk model is based on an IBM Ultrastar XP 4.51 GB disk; the detailed disk parameters are given in [9]. This IBM disk employs zone bit recording [12], so the disk is divided into different zones. Table 5 shows the parameters of the disk cache used in the simulation. Blocks in the disk and in the disk cache are of the same size (i.e., 512 bytes).

Table 5: Parameters of the disk cache for the IBM Ultrastar XP 4.51 GB.

Total size                        512 KB
Number of segments                128
Each cache segment size           4 KB
Number of blocks in one segment   128

Figure 9: Data fetches from a disk (disk seek, latency, and per-segment data fetches overlap with FC-AL arbitration and channel data transfers, one segment at a time).

In FC-AL, channel accesses are granted through arbitration. If data are transferred block by block on an FC-AL loop, the overhead is higher because the interface has to perform an arbitration for each block of data transferred. In order to reduce arbitration overhead, after winning an arbitration an interface finishes transmitting either a segment of data or the total request size, whichever is smaller. For a request of several segments, after a segment of data has been read into the disk buffer, the interface starts to arbitrate on the loop. The overlap of disk accesses and data transfers on an FC-AL loop is shown in Figure 9. Writes are performed in a similar fashion as reads: as soon as a transfer unit of data is in the disk cache, it is written to the disk platter.
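The per-arbitration transfer rule above (transmit a segment of data or the remaining request, whichever is smaller, after each arbitration win) can be sketched as follows; arbitrate and send are hypothetical helpers standing in for the loop port.

    SEGMENT_SIZE = 4 * 1024   # one cache segment, as in Table 5

    def transfer_request(total_bytes, arbitrate, send):
        """Transfer a request over the loop one segment (or less) per arbitration win."""
        remaining = total_bytes
        while remaining > 0:
            chunk = min(SEGMENT_SIZE, remaining)   # a segment or what is left, whichever is smaller
            arbitrate()                            # win loop arbitration before transmitting
            send(chunk)                            # transmit and release the loop
            remaining -= chunk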
7 Simulation Results

Table 6: Latency of a command when disk seek and rotational latency are disabled.

Size                                  4KB      8KB      16KB     32KB      64KB
WRITE latency (ms)                    1.559    3.026    5.959    11.826    23.559
XDWRITE latency (ms)                  1.380    2.722    5.405    10.772    21.505
WRITE traffic size in links (KB)      16       32       64       128       256
XDWRITE traffic size in links (KB)    8        16       32       64        128
Figure 10: Simulation results comparing pure WRITEs and XDWRITEs: (a) throughput and (b) average latency versus the number of disks.

Under light load, one write command was sent to one disk at a time, and the latency of a command was measured from the RAID controller. Since only one command was outstanding, there was no link contention on the loop. Disk seek and rotational latency vary with the data blocks accessed because they depend on the current disk head position, and were more than 10 ms in some cases. For comparison purposes, disk seek and rotational latency were disabled only in the light load simulation, so that the results were not affected by the current disk head position. The disk access time only
included the data transfer time in the disks. The logical block address of a command was chosen randomly from a uniform distribution. In a host based write, once all the old blocks had been collected in the host buffers, the XOR block was computed at once. In a disk based write, the XOR block was computed in a disk once the old block from the disk and the new block from the host had been transferred into the disk cache. The rate of the XOR operation was set to 40 MB/s in both cases. The stripe unit size was 64 KB.

Table 6 shows the command latencies for WRITEs (host based writes) and XDWRITEs (disk based writes) in RAID-5. The command latency includes the time for data transfers and the protocol overheads in FC-AL, such as loop arbitration. Since a host based RMW requires more data transfers on the links than a disk based RMW, the host based RMW took longer to complete when the command size was small. For example, the command latencies for 4 KB blocks were 1.559 ms for a host based RMW and 1.380 ms for a disk based RMW. It is also important to notice that the amount of traffic generated was larger for a host based RMW, which implies that link contention is higher when many disks are attached to a loop.
Figure 11: Simulation results with 50% reads and 50% writes: (a) throughput and (b) average latency versus the number of disks.

We also investigated the performance of host based and disk based RMWs under heavy load. As more disks were added, the number of outstanding commands was increased: commands were generated until the number of outstanding commands reached a predetermined value. The maximum number of outstanding commands was set to the number of disks multiplied by eight, which corresponds to eight outstanding commands per disk on average; this level was maintained by generating a new command as soon as a command completed. A group of 8 disks (7+1) formed a RAID group, and parity was updated among these 8 disks; when N disks were used, there were N/8 groups. In order to saturate a link under heavy load, a block size of 64 KB was used, which was also the stripe unit size. Two cases were simulated: pure writes, and mixed reads and writes with a read-write ratio of 1:1.

Figure 10 (a) shows the throughput comparison between a host based write and a disk based write in RAID-5. With the request size of 64 KB, a disk based write performed better than a
host based one. Using host based writes, the loop saturated at 19.5 MB/s with 32 disks; since a host based RMW generates two reads and two writes, 19.5 MB/s of throughput translates into roughly 78 MB/s at the link level. With disk based RMWs, the loop saturated at 38.2 MB/s with 48 disks, so 76.4 MB/s of aggregate throughput at the link level was achieved out of the theoretical data bandwidth of 100 MB/s. The results are consistent with those found in [6]. With host based RMWs, the aggregate throughput decreased when more than 32 disks were attached to the loop because of link saturation, whereas with disk based RMWs the links did not saturate until more than 48 disks were attached. It is interesting to note that the average command latency remained roughly the same as more disks were added with disk based RMWs, while it steadily increased with host based RMWs. The reason is the lower link contention of disk based RMWs: the disk based model allowed more disks to be attached without saturating the link.

The simulation results of mixed reads and RMWs are shown in Figure 11. They are similar to the pure-write results. The achievable throughput was higher because more commands were completed in the same time frame. The peak throughputs of host based RMWs and disk based RMWs were 31.4 MB/s and 51.9 MB/s, respectively. With pure writes, the peak throughput of disk based RMWs was roughly twice that of host based RMWs, while with mixed reads and writes the peak throughput of disk based RMWs was approximately 65 percent higher than that of host based RMWs. This is because the number of writes is reduced as more reads are performed in the mixed case.
Figure 12: Simulation results with trace data: (a) throughput and (b) average latency versus the number of disks.

Trace data were collected from a database application. They consisted of 75 percent writes and 25 percent reads. Roughly 95 percent of the commands were less than or equal to 4 KB in size; the average sizes of reads and writes were 4 KB and 6 KB, respectively. The write commands exhibited access locality, with a cache hit ratio of approximately 39 percent. Figure 12 (a) shows the simulation results with the trace data; the simulation was run in the same manner as the previous ones. Since an XOR block was computed in the disk cache when an
XDWRITE was executed, an existing cache segment was overwritten and locked whenever an XOR block was constructed. Compared with host based RMWs, disk based RMWs therefore needed one more cache segment, which caused a different number of cache hits. As shown in Figure 12, with eight disks host based RMWs performed slightly better than disk based RMWs because of the better cache hit ratio in the disks. However, the small throughput advantage of host based RMWs quickly disappeared as more disks were added, because host based RMWs generated more data traffic, which translated into higher link contention. The throughput for host based RMWs peaked at 48 disks with 8.1 MB/s, which is roughly 32.4 MB/s at the link level; hence the loop was saturated not by the amount of data traffic but by the arbitration overhead.

Table 7: Loop waiting time with trace data.

Number of disks     8         16        24        32        48        56
Use WRITE (ms)      0.048     0.065     0.134     0.177     0.994     27.33
Use XDWRITE (ms)    0.04      0.065     0.08      0.102     0.294     1.87
In FC-AL, the device that accesses the loop is determined by an arbitration process; the details of the arbitration process are described in [3, 5]. The high arbitration overhead on the links for host based RMWs caused longer loop waiting times as more disks contended for the link simultaneously. The loop waiting time is defined as the period from when a disk starts to arbitrate for loop access to when it wins the arbitration. Table 7 shows the loop waiting time after a disk starts to participate in the loop arbitration. The loop waiting time steadily increased as more disks were attached to the loop. Using host based RMWs with 56 disks on a loop, the average loop waiting time was 27 ms, because many disks were contending for the loop and the loop was saturated. When the loop waiting time was greater than the disk access time, most of the command latency was spent waiting for loop access; hence the link was completely saturated with 56 disks and host based RMWs. A disk based RMW was effective in reducing not only the amount of traffic but also link contention, and thus allowed more disks to be attached to a single loop.
8 Conclusion and Future Work

A simple implementation of disk based writes achieves only half of the potential throughput because of interactions among commands, and a state of deadlock may occur when multiple disk based writes are executed simultaneously. We examined several implementation issues, including how to prevent deadlock, how to enhance the cache replacement policy, and how to increase command concurrency while decreasing the side effects of concurrent commands, and we proposed efficient solutions. After the implementation issues are resolved, the throughput is doubled. We investigated the performance of disk based writes under different workloads, including actual trace data from a database application, and observed that disk based writes provide higher performance than host based writes. Disk based writes reduce link traffic and contention significantly when compared with host based ones. In a stress test, the aggregate throughput was increased and the average latency was reduced.

The simulation results also showed that more disks could be attached before the link was saturated when XDWRITE commands were used. This advantage is important for the emerging serial storage
interfaces such as FC-AL and SSA, where up to 126 devices can be attached in a single loop. SSA provides a spatial reuse feature which allows independent data transmissions to occur simultaneously on the same loop at the full bandwidth. In SSA, multiple XDWRITE commands can therefore be performed simultaneously at the full bandwidth using the spatial reuse feature, and the performance of disk based writes can be improved further.
References

[1] ANSI X3T10.1/0989D revision 10, "Information Technology - Serial Storage Architecture - Transport Layer 1 (SSA-TL1)" (Draft Proposed American National Standard), American National Standards Institute, Inc., April 1996.
[2] ANSI X3T10.1/1121D revision 7, "Information Technology - Serial Storage Architecture - SCSI-2 Protocol (SSA-S2P)" (Draft Proposed American National Standard), American National Standards Institute, Inc., April 1996.
[3] ANSI X3.272-199x, "Fibre Channel - Arbitrated Loop (FC-AL), Revision 4.5", American National Standards Institute, Inc., June 1, 1995.
[4] P. Chen, E. Lee, G. Gibson, R. Katz, and D. Patterson, "RAID: High-Performance, Reliable Secondary Storage", ACM Computing Surveys, June 1994, pp. 145-185.
[5] D.H.C. Du, T. Chang, J. Hsieh, S. Shim, and Y. Wang, "Emerging Serial Storage Interfaces: Serial Storage Architecture (SSA) and Fibre Channel - Arbitrated Loop (FC-AL)", Technical Report TR 96-073, Computer Science Department, University of Minnesota, 1996.
[6] D.H.C. Du, J. Hsieh, T. Chang, Y. Wang, and S. Shim, "Performance Study of Serial Storage Architecture (SSA) and Fibre Channel - Arbitrated Loop (FC-AL)", Technical Report TR 96-074, Computer Science Department, University of Minnesota, 1996.
[7] G. Ganger, B. Worthington, R. Hou, and Y. Patt, "Disk Arrays: High-Performance, High-Reliability Storage Subsystems", IEEE Computer, March 1994, pp. 17-28.
[8] G. Houlder, J. Elrod, and M. Miller, "XOR Commands on SCSI Disk Drives", ANSI X3T10-1994-111r9.
[9] IBM Corporation, "Functional Specification, Ultrastar XP Models", 1995.
[10] R. Muntz and J. Lui, "Performance Analysis of Disk Arrays Under Failure", Proceedings of the 16th VLDB, Brisbane, Australia, August 1990, pp. 162-173.
[11] C. Ruemmler and J. Wilkes, "UNIX Disk Access Patterns", USENIX Winter Technical Conference, January 1993, pp. 313-323.
[12] C. Ruemmler and J. Wilkes, "An Introduction to Disk Drive Modeling", IEEE Computer, March 1994, pp. 17-28.
[13] A. Kunzman and A. Wetzel, "1394 High Performance Serial Bus: The Digital Interface for ATV", IEEE Transactions on Consumer Electronics, August 1995, Vol. 41, No. 3, pp. 893-900.
[14] D. Patterson, G. Gibson, and R. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Proceedings of SIGMOD, Chicago, IL, June 1988.
[15] SSA Industry Association, "Serial Storage Architecture: A Technology Overview", Version 3.0, 1995.
[16] Seagate Technology, Inc., "Cheetah Family Specifications", http://www.seagate.com/disc/cheetah/cheetah.html.
[17] T. Sutton and D. Webb, "Fibre Channel: The Digital Highway Made Practical", Seagate technology paper, October 1994.