DR-nets: Data-Reconstruction Networks for Highly Reliable Parallel-Disk Systems

Haruo Yokota
School of Information Science
Japan Advanced Institute of Science and Technology, Hokuriku
E-mail: [email protected]

Abstract

We propose DR-nets, data-reconstruction networks, to construct massively parallel disk systems with large capacity, wide bandwidth, and high reliability. Each node of a DR-net has disks and is connected by links to form an interconnection network. To realize high reliability, the nodes in a sub-network of the interconnection network organize a parity-calculation group of the kind proposed for RAIDs. Inter-node communication for calculating parity keeps data transfer local in DR-nets, which prevents bottlenecks from arising even when the network becomes very large. Two types of parity groups overlapped on the network enable the system to handle multiple disk-drive failures. A 5 × 5 torus DR-net recovers data in 100% of cases where any two disk drives are damaged, in 95% of cases with four damaged drives, and can recover from as many as nine damaged drives.

1. Introduction

The progress of computer architecture and semiconductor technology has radically improved the performance of processors and semiconductor memories. Moreover, we can obtain greater computing power from massively parallel systems using multiple processors. Secondary non-volatile storage, i.e. disks, is also highly important for actual computer applications, especially in data-intensive fields such as databases and on-line transaction processing. Unlike semiconductor products, disk systems have intrinsic limits on performance improvement because they require physical movement during data access. Consequently, multiple disk drives should be accessed in parallel in order to balance disk-access speed with the data-manipulation speed of processors. Data can be striped across multiple disk drives and accessed by interleaving in order to achieve a level of performance corresponding to the number of drives.

There is another application for parallel disk systems. Multimedia services have been investigated eagerly in recent years. For example, digital video data is stored in a disk system and retrieved quickly for video-on-demand (VOD) services [Gibbs et al. 93]. This requires great capacity, wide bandwidth, and multiple streams from the disk system, and a massively parallel disk system is exactly suited to these requirements [Kajitani 94].

Parallel disk systems, however, have a problem with reliability. Compared with parallel systems constructed only from semiconductor parts, the reliability of parallel disk systems is reduced by the physical movements inside a disk drive, even though fault-tolerant design is one of the most significant requirements of their applications.

Redundant disk arrays, known as redundant arrays of inexpensive disks or RAIDs, have been investigated to enhance the performance of secondary storage through parallel disk accesses and to improve the reliability of a system with multiple disks [Patterson et al. 88] [Gibson et al. 89] [Gibson 92]. A RAID stores data and redundant information on a number of disks, and recovers the data using the redundant information when a disk drive fails. RAIDs have five levels corresponding to the configuration of the redundant information. Level 1, also called mirroring, keeps duplicate data on different disks. Level 2 keeps data with Hamming codes for error correction in a group of disks. Levels 3, 4, and 5 use parity codes instead of Hamming codes, since a disk controller can detect which disks are damaged. Data stored on a damaged disk can be recovered using only the parity codes and the other data of the disk group, provided the damaged disks are marked by the disk controllers. Data and parity codes are stored with byte-interleaved striping in level 3, and with block-interleaved striping in levels 4 and 5. Level 4 dedicates a disk to parity codes, while level 5 spreads the parity codes over all disks of the group. In this paper, we focus on RAID levels 4 and 5 because of their storage-capacity efficiency and independent parallel access.

Figure 1 shows a typical level 4 RAID construction. A row in the figure is called a parity group, and the disk storing the parity codes of the group is called the parity disk. In RAID levels 4 and 5, it would be wasteful to access all disks in a group to calculate a parity code when updating data stored on one disk of the group. The update can be made more efficient by calculating the parity from the old data, the old parity, and the new data, as follows:


new parity = new data ⊕ old data ⊕ old parity    — (i)

Using expression (i), only two disks of a group are accessed while updating the data: the disk keeping the old data and the disk storing the parity code. This parity-calculation technique offers a cost-effective system construction.
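As a concrete illustration, here is a minimal sketch in Python of the small-write update in expression (i); the helper name and block contents are ours, not part of the original design:

```python
def update_parity(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Small-write update, expression (i): new parity = new data XOR old data XOR old parity."""
    assert len(old_data) == len(new_data) == len(old_parity)
    return bytes(n ^ o ^ p for n, o, p in zip(new_data, old_data, old_parity))

# Example: a group with three data disks and one parity disk.
d = [bytes([1, 2, 3]), bytes([4, 5, 6]), bytes([7, 8, 9])]
parity = bytes(a ^ b ^ c for a, b, c in zip(*d))      # initial parity over the group

new_d1 = bytes([40, 50, 60])
parity = update_parity(d[1], new_d1, parity)          # only d[1] and the parity disk are touched
d[1] = new_d1

# The updated parity still equals the XOR of all data blocks in the group.
assert parity == bytes(a ^ b ^ c for a, b, c in zip(*d))
```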


Figure 1. An ordinary RAID construction

RAIDs provide sufficient performance and reliability for systems with a small number of disk drives. Several problems arise, however, when a system contains a large number of drives, e.g. when it is used for VOD services. A VOD station of practical size will require thousands of disk drives to store many video programs, and wide bandwidth to connect to a great number of video terminals. The bus of a RAID system becomes a bottleneck when a large number of disk drives are connected to it. If the system contains a thousand disk drives whose transfer rate is 10 MB/s, the bandwidth of the bus would have to be 10 GB/s. This is an extreme example, and the required bus bandwidth is lower when we consider how many drives are actually active at once, but in any case the bandwidth of current I/O buses is insufficient for connecting a large number of drives. The bus is used heavily when reconstructing the data stored on damaged disks, so the access speed of a RAID is reduced particularly in that case. Clustered RAID has been proposed to lessen the performance degradation during data reconstruction [Muntz and Lui 90] [Merchant and Yu 92], but it does not eliminate the bottleneck entirely.

Reliability is another problem for a RAID with a large number of disks. A RAID offers sufficient reliability for systems with a small number of disks, given rapid disk repairs [Patterson et al. 88]. However, data is lost when at least two disks in the same group are damaged simultaneously, since a single fault is assumed within a parity group of a RAID. The single-fault assumption is not sufficient when the system contains a great number of disks. The Mean Time To Failure (MTTF) of disk drives tends to be skewed depending on their production lot; if a RAID uses many disks from the same lot, the probability of simultaneous faults may become high in the initial stages. If a system can tolerate multiple faults, its applicability becomes broader, and it can be used in a wide variety of fields requiring high reliability.

In this paper, we propose applying the parity-calculation technique of RAIDs to interconnection networks to solve these problems. We call the resulting systems data-reconstruction networks, or DR-nets for short. In a DR-net, disks are connected to each node of an interconnection network, and a parity group is constructed within a sub-network of the interconnection network. Since communication in a parity group is kept local to the sub-network, the performance degradation caused by reconstructing data becomes very small. DR-nets realize high reliability, capable of handling multiple faults, by using two types of parity groups overlapped on the interconnection network; data transfer remains local, resulting in low communication overhead for reconstructing data.

Gibson has proposed codes for n-dimensional parity groups to realize highly reliable RAIDs [Gibson et al. 89] [Gibson 92]; for example, a column in Figure 1 is used as a second-dimension parity group. However, access methods for these parity groups are not made clear, and neither the bus bottleneck nor the locality of communication among disks is considered. Lee has proposed a system in which multiple bus-based RAID systems are connected by a network [Lee et al. 92]. It, however, differs from applying parity groups to interconnection networks directly, and it assumes a single fault per parity group.

We are developing a small experimental system, a 5 × 5 torus DR-net, using Transputers. A Transputer (T805) corresponds to each node of the DR-net and is connected to a small disk via a SCSI controller. The DataMesh research project [Wilkes 91], similar to the DR-net prototype, used Transputers as the controllers of a parallel disk system to increase I/O performance, but it did not initially support high reliability. The same researchers next proposed the TickerTAIP architecture, which adopts the RAID parity technique to tolerate a single fault [Cao et al. 93]; it, however, considers neither the network configuration nor multiple faults.

We first give a simple construction of parity groups in an interconnection network and describe the behavior of each node of the network in Section 2. Although this idea is applicable to many topological types of interconnection networks, we focus on two-dimensional torus networks, and especially on a 5 × 5 construction, as the first step. Section 3 presents the second-parity-group construction for improving the reliability of the system. Since network nodes dedicated to storing parity codes cause an imbalance of storage volume and access frequency, as in level 4 RAID, we consider treatments for the imbalance in Section 4. In Section 5, we briefly report on an experimental system that we are now developing using Transputers.

2. Applying RAID to Interconnection Networks

Figure 2 indicates the construction of a system using multiple processing elements (a processor paired with memory, indicated as PE in the figure) and multiple disk systems (disk drives and a disk controller; the controller is shown as DC in the figure). A network between the processing elements and the disk systems removes the communication bottleneck between them and makes database and on-line transaction processing faster. In on-line transaction processing, for example, a request accepted by a processing element from an outside terminal is transferred to a disk system via the network. If the network supports independent access paths, we can easily increase the number of processing elements and disk systems without creating bottlenecks.

Another type of communication, among the disk systems themselves, is required to calculate parity codes using expression (i).


Figure 2. Image of connecting multiple PEs and disk systems


Figure 3. Two types of networks for the PEs and the disk systems

We can use the network of Figure 2 for calculating parity codes, but another type of network, shown in Figure 3, is better suited to realizing highly reliable parallel disk systems, since it is independent of the communication with the processing elements. In this paper, we consider the construction of such disk networks in order to apply the concept of RAID. It is not difficult, however, to unify the two networks of Figure 3 into a single network as in Figure 2.

2.1 Constructing Parity Groups in a Network

To apply RAID to networks, it is important to keep inter-disk-system communication local. We put a parity node, a node storing parity codes, at the center of a sub-network and construct a parity group from the parity node and the nodes directly linked to it. Communication between neighboring nodes is well suited to the parity-code calculation of expression (i) because there are no collisions in the sub-network even while a data-reconstruction process is executing.

For example, if we adopt a two-dimensional torus as the network topology, a parity group is composed of five nodes in the shape of a cross. Figure 4 shows the parity-group construction on a 5 × 5 torus network; nodes connected by bold lines form parity groups, and there are five parity groups in the 5 × 5 torus.

We can apply this concept to other network topologies. Examples for a three-dimensional torus and a four-dimensional hypercube are illustrated in Figure 5. In the three-dimensional torus, seven nodes, a parity node and its six neighbors, construct a parity group. In some cases, such as the hypercube connection, a node may belong to multiple parity groups. This does not appear to be a serious problem, and such nodes can be used as spare nodes. We, however, consider the 5 × 5 torus network as the first step, because each of its nodes belongs to exactly one parity group. The following characteristics extend easily to 5n × 5m tori (n and m integers).

The locations of the parity nodes in a torus network form a minimal dominating set* of the graph whose vertices are the nodes. We can name one node of the network (i=0, j=0); the right-hand neighbor of a node has the same j but i increased by 1 modulo 5, and the downward neighbor has the same i but j increased by 1 modulo 5. The elements of the minimal dominating set, called dominating nodes, each take a unique value of i, of j, of (i+j) mod 5, and of (i-j) mod 5. This means that dominating nodes cannot be aligned in rows, columns, or either diagonal direction. Moreover, any pair of dominating nodes, (i, j) and (i', j'), satisfies one of the following two characteristics.

Figure 4. Construction of parity groups on a 5 × 5 torus network


Case I: if i' = (i + 1) mod 5, then (i' + j') mod 5 = (i + j - 1) mod 5 (or, equivalently, if j' = (j + 1) mod 5, then (i' - j') mod 5 = (i - j + 1) mod 5).

Case II: if j' = (j + 1) mod 5, then (i' + j') mod 5 = (i + j - 1) mod 5 (or, equivalently, if i' = (i + 1) mod 5, then (j' - i') mod 5 = (j - i + 1) mod 5).

We can derive the two types of minimal dominating set by maintaining the above characteristics.

* A set of vertices such that every vertex not in the set is adjacent to an element of the set is called a dominating set. A dominating set of which no proper subset is itself a dominating set is called a minimal dominating set.

Figure 5. Construction of parity groups on other types of networks: (a) three-dimensional torus connection; (b) four-dimensional hypercube connection

These two cases correspond to knight's tours and are mirror reflections of each other. The dominating set shown in Figure 4, consisting of the shaded nodes {(0, 3), (1, 1), (2, 4), (3, 2), (4, 0)}, is an example of Case I. When node(i, j) is a dominating node of the DR-net, its parity group is constructed from node((i-1) mod 5, j), node(i, (j-1) mod 5), node((i+1) mod 5, j), node(i, (j+1) mod 5), and node(i, j) itself. This implements the function of a level 4 RAID, with the dominating node keeping the parity codes for the other four nodes.
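The Case I dominating set and the resulting parity groups can be checked mechanically. The sketch below is our own illustration (the phase offset 3 is chosen to match the shaded nodes of Figure 4); it verifies the uniqueness properties and that every node of the 5 × 5 torus falls into exactly one parity group:

```python
N = 5  # 5 x 5 torus

# Case I dominating set of Figure 4: {(0,3), (1,1), (2,4), (3,2), (4,0)},
# i.e. the nodes with (2*i + j) mod 5 == 3 (the offset 3 is our choice of phase).
dominating = [(i, (3 - 2 * i) % N) for i in range(N)]

# i, j, (i+j) mod 5 and (i-j) mod 5 each take five distinct values,
# so dominating nodes never share a row, a column or a diagonal.
for f in (lambda i, j: i, lambda i, j: j,
          lambda i, j: (i + j) % N, lambda i, j: (i - j) % N):
    assert len({f(i, j) for i, j in dominating}) == N

def fpg(i, j):
    """First parity group: a dominating node plus its four orthogonal neighbours."""
    return {(i, j), ((i - 1) % N, j), ((i + 1) % N, j),
            (i, (j - 1) % N), (i, (j + 1) % N)}

groups = [fpg(i, j) for i, j in dominating]
all_nodes = {(i, j) for i in range(N) for j in range(N)}
assert set().union(*groups) == all_nodes               # the five crosses cover the torus
assert sum(len(g) for g in groups) == len(all_nodes)   # ... and do not overlap
```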

2.2 Node Behaviors

There are four types of operations at each node, corresponding to read/write disk accesses and the presence or absence of a disk failure. Adding a data-reconstruction operation that replaces a damaged disk with a new disk, we have to consider the following five types of operations. Figure 6 illustrates these operations at each node and the data movements among the nodes.

a. Read without disk failure
The disk storing the object data is accessed directly, and the data is transferred directly from the disk to the node. No links are used during the access.

b. Write without disk failure
Data is stored on the object disk, and parity codes are stored on the parity disk. The parity-code calculation uses expression (i) of Section 1. The write operation involves four disk accesses and uses a link between the object node, which holds the object disk, and the parity node of the parity group to which the object node belongs. We can divide the operation into six stages:

Stage 1: The object node accepts a write request.
Stage 2: The object node reads the old data from its disk.
Stage 3: The object node writes the new data to the disk.
Stage 4: The object node sends the old and new data to the parity node.
Stage 5: The parity node reads the old parity code from the parity disk and calculates the new parity code using expression (i).
Stage 6: The parity node writes the new parity to the parity disk.


The two data items of Stage 4 can be combined by an exclusive-or at the object node and sent to the parity node as a single message.

c. Read under disk failure
When the object disk is in good shape but another disk in the parity group is damaged, the behavior of the object node is the same as in (a). When the object disk is damaged, the object node issues read-access requests to all other nodes in the same parity group. The parity node reads the parity code, collects the data from the other three nodes, calculates the exclusive-or of these data, and sends the result to the object node. Four links in the parity group are used during the operation. Since read requests are never issued directly to a parity node, read operations are unaffected when only a parity disk is damaged.

Figure 6. Node behaviors and data movements in a DR-net: (a) read; (b) write; (c) read under failure; (d) write under failure; (e) reconstruction

d. Write under disk failure
When the object disk and the parity disk are in good shape but another disk in the parity group is damaged, the behavior of the object node is the same as in (b). When the parity disk is damaged, no parity code is calculated or stored, and the data is simply stored on the object disk. When the object disk is damaged, the parity node issues read-access requests to all other nodes in the same parity group, calculates the exclusive-or of the data collected from the other nodes and the data to be stored, and stores the result on the parity disk. Four links in the parity group are used in this operation as well.


e. Data reconstruction for a newly replaced disk
When the damaged disk is replaced with a new disk, the original data can be reconstructed from the data stored on the other nodes of the parity group. The operation is similar to that described in (c); instead of returning the result of the calculation as in (c), the object node stores it on the new disk.
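Operations (c) and (e) reduce to the same computation: the missing block is the exclusive-or of the four surviving blocks of its group. A minimal sketch, with block contents and helper names of our own choosing:

```python
from functools import reduce

def reconstruct(surviving_blocks):
    """Rebuild the block of the damaged node in a parity group from the four
    surviving blocks (three data blocks plus the parity block, or the four
    data blocks when the parity disk itself is being rebuilt)."""
    return bytes(reduce(lambda a, b: a ^ b, column)
                 for column in zip(*surviving_blocks))

# Example: parity = d0 ^ d1 ^ d2 ^ d3, hence d2 = d0 ^ d1 ^ d3 ^ parity.
d = [bytes([i, i + 1]) for i in (10, 20, 30, 40)]
parity = reconstruct(d)                                # computing parity is the same XOR
assert reconstruct([d[0], d[1], d[3], parity]) == d[2]
```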


3. Improving Reliability


We assumed a single fault per parity group in Section 2. This means that the systems of Figure 1 or Figure 4, which have 25 disks, may lose no data when five faults occur in different parity groups, but they will lose data when at least two faults occur in a single parity group.

Here, we derive the probability p_i that the i-th damaged disk does not belong to any parity group already containing a faulty disk. Since no parity group contains a faulty disk before the first fault, p_1 = 1. If the second faulty disk is among the N - G disks of the remaining N - 1 disks that lie outside the affected group, the system degrades successfully; here N and G are the number of disks in the system and in a parity group, respectively. Hence,

p_2 = (N - G) / (N - 1).

Generalizing, the probability for the i-th faulty disk (1 ≤ i ≤ M) is

p_i = (N - (i - 1) × G) / (N - i + 1)    —— (ii)

Applying the values for the 5 × 5 torus (N = 25, G = 5), we obtain p_1 = 1, p_2 = 0.83, p_3 = 0.65, p_4 = 0.45, p_5 = 0.24, and p_6 = 0. From these probabilities we can derive the ratio of recoverable patterns to all patterns as the running product of the p_k: r_1 = 1, r_2 = 0.83, r_3 = 0.54, r_4 = 0.25, r_5 = 0.06, and r_6 = 0. This indicates that data is lost in 17% of second disk faults and 46% of third ones.
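These values follow directly from expression (ii); the short script below (ours) reproduces them, taking each r_i as the running product of p_1 ... p_i:

```python
N, G = 25, 5          # disks in the system / disks per parity group

p, r = [1.0], [1.0]   # p[i-1] = p_i, r[i-1] = r_i
for i in range(2, 7):
    p_i = (N - (i - 1) * G) / (N - i + 1)   # expression (ii)
    p.append(p_i)
    r.append(r[-1] * p_i)                   # recoverable ratio: running product of the p_k

print(["%.2f" % x for x in p])   # 1.00, 0.83, 0.65, 0.45, 0.24, 0.00
print(["%.2f" % x for x in r])   # 1.00, 0.83, 0.54, 0.25, 0.06, 0.00
```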

We now consider methods for improving the reliability of the system so that multiple faults within a parity group of a DR-net can be tolerated. We propose constructing second parity groups (SPGs) on the same interconnection network as the first parity groups (FPGs) described in the previous sections. An SPG is organized from node(i, j) and the nodes ((i ± 1) mod 5, (j ± 1) mod 5). There are two types of data-transfer paths between these nodes, swastikas and reverse swastikas; one of them generates link collisions among groups, while the other does not. Which pattern is collision-free depends on the location of the dominating set, and the reverse swastika is collision-free for Case I described in Section 2.1. Figure 7 shows an example of the collision-free communication pattern. Communication within the SPGs remains local during operations.

Figure 7. Construction of SPGs

Figure 8 illustrates examples of recovering faults with FPGs and SPGs. Since there are three faults within one parity group in (a), the data cannot be reconstructed by the first parity group alone, but it can be reconstructed in combination with the second parity groups: first, SPG2 and SPG3 reconstruct the data in node(2,2) and node(3,3) respectively, and then FPG1 reconstructs the data in node(3,1). The combination of FPGs and SPGs can recover as many as nine faults in the 5 × 5 torus network, as shown in (b). The diagonal arrows in the figure represent data transfers that omit the two right-angled hops of the reverse swastika. The number at the upper left of a node indicates one possible recovery sequence; of course, other sequences or parallel recovery are also possible.
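Continuing the sketch notation used for the FPGs (the dominating-set phase is again our choice), the SPG of a dominating node is the node plus its four diagonal neighbours, and the five SPGs also partition the 5 × 5 torus, so every node belongs to exactly one FPG and one SPG:

```python
N = 5
dominating = [(i, (3 - 2 * i) % N) for i in range(N)]   # Case I, as in Figure 4

def spg(i, j):
    """Second parity group: a dominating node plus its four diagonal neighbours."""
    return {(i, j)} | {((i + di) % N, (j + dj) % N)
                       for di in (-1, 1) for dj in (-1, 1)}

spgs = [spg(i, j) for i, j in dominating]
assert set().union(*spgs) == {(i, j) for i in range(N) for j in range(N)}
assert sum(len(g) for g in spgs) == N * N   # disjoint: each node is in exactly one SPG
```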


Figure 8. Recovering multiple faults with FPGs and SPGs


Figure 9. An unrecoverable pattern

Table 1. The number of unrecoverable patterns

Faults   All Patterns     Unrecoverable Patterns
  1      25               0
  2      600              0
  3      13800            120
  4      303600           13920
  5      6375600          871200
  6      127512000        37944000
  7      2422728000       1250930000
  8      43609104000      32539400000
  9      7.41355E+11      6.84E+11
 10      1.18617E+13      1.18617E+13

An FPG or SPG that has a recoverable node contains, besides that node, only nodes that are undamaged or that carry a lower recovery number than the recoverable node. Figure 9 shows an example of an unrecoverable pattern of four damaged disks. The dependencies among the damaged nodes and the parity groups can be written as follows:

node(3,1) ← FPG1 ∨ SPG4
node(3,3) ← FPG1 ∨ SPG3
node(1,0) ← FPG2 ∨ SPG3
node(0,1) ← FPG2 ∨ SPG4
FPG1 ← node(3,1) ∨ node(3,3)
FPG2 ← node(1,0) ∨ node(0,1)
SPG3 ← node(3,3) ∨ node(1,0)
SPG4 ← node(3,1) ∨ node(0,1)

These dependencies form a transitive closure. There are not many patterns of four faults for which data cannot be recovered: of the 303,600 four-fault patterns in the 5 × 5 torus, 2,400 (0.8%) form transitive closures. There are also impossible patterns of three faults, namely when an object disk, the parity disk of its FPG, and the parity disk of its SPG are all damaged. The ratio of unrecoverable patterns increases with the number of faults in the system. We counted the number of all patterns and of unrecoverable damaged-disk location patterns for the 5 × 5 torus, as shown in Table 1.
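The recovery rule above can be checked mechanically. In the following sketch (our own modelling, not code from the paper), a dominating node is treated as holding two independent parity blocks, one per group, and a missing block is rebuilt whenever it is the only missing block of one of its groups; iterating to a fixed point decides recoverability. The two test patterns are those of Figure 8(a) and Figure 9.

```python
N = 5
dominating = [(i, (3 - 2 * i) % N) for i in range(N)]       # Case I phase, as in Figure 4

def fpg(i, j):   # dominating node plus orthogonal neighbours
    return [(i, j), ((i - 1) % N, j), ((i + 1) % N, j), (i, (j - 1) % N), (i, (j + 1) % N)]

def spg(i, j):   # dominating node plus diagonal neighbours
    return [(i, j)] + [((i + di) % N, (j + dj) % N) for di in (-1, 1) for dj in (-1, 1)]

# Blocks: every non-dominating node holds one data block; a dominating node holds
# an FPG parity block and an SPG parity block. A group consists of its four data
# blocks plus its own parity block, and any single missing block of a group can
# be rebuilt by XOR-ing the other four.
groups = []
for dom in dominating:
    groups.append([('P', 'F', dom)] + [('D', n) for n in fpg(*dom)[1:]])
    groups.append([('P', 'S', dom)] + [('D', n) for n in spg(*dom)[1:]])

def blocks_on(node):
    return [('P', 'F', node), ('P', 'S', node)] if node in dominating else [('D', node)]

def recoverable(damaged_nodes):
    missing = {b for n in damaged_nodes for b in blocks_on(n)}
    progress = True
    while progress and missing:
        progress = False
        for g in groups:
            broken = [b for b in g if b in missing]
            if len(broken) == 1:            # rebuild the single missing block of this group
                missing.discard(broken[0])
                progress = True
    return not missing

assert recoverable({(2, 2), (3, 1), (3, 3)})                # Figure 8(a): three faults in one FPG
assert not recoverable({(3, 1), (3, 3), (1, 0), (0, 1)})    # Figure 9: transitive closure
```

Running the checker over all damaged-disk sets of a given size should reproduce the recoverable ratios of Figure 10, since recoverability depends only on which disks are damaged, not on the order in which they fail.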


Figure 10. Recoverable ratio vs. the number of damaged drives, with FPGs only and with SPGs added

From the table, we can derive the ratios of recoverable patterns for the combination of FPGs and SPGs: r_1 = 1, r_2 = 1, r_3 = 0.99, r_4 = 0.95, r_5 = 0.86, r_6 = 0.70, r_7 = 0.48, r_8 = 0.25, r_9 = 0.08, r_10 = 0. Figure 10 plots these ratios against the number of faults, both for FPGs alone, calculated with expression (ii), and for the combination of FPGs and SPGs. It shows that on the 5 × 5 torus network with SPGs, data is recovered in 100% of cases when any two disks are damaged, in 95% of cases with four damaged disks, and recovery remains possible with up to nine damaged disks.

4. Dealing with Imbalance

Straightforward implementations of DR-nets have two problems: imbalance of storage volume and imbalance of access frequency among the network nodes. Since a dominating node must store parity codes for both types of parity group when SPGs are used, its storage volume is twice that of the other nodes. The access frequency of a dominating node is also higher than that of the other nodes with FPGs, as in level 4 RAID, since a write operation accesses both the object node and the parity node of the parity group to which the object node belongs.

The frequency is increased further by SPGs. This makes it hard to configure a symmetrical system and creates bottlenecks. In this section, we consider two methods for treating these problems: moving the parity groups within the network, and moving the parity node within a parity group.

4.1 Move Parity Groups within a Network

We first consider the method of moving parity groups (MPG) within the network to handle the storage and access imbalances. If we can prepare a number of parity-group allocation patterns on the network, these patterns can be overlapped and switched in phase with disk blocks such as pages, sectors, or tracks. For a 5 × 5 torus network, we can derive five patterns for each case of dominating-node allocation mentioned in Section 2.1 by varying i or j; Figure 11 shows the five patterns for Case I.

Figure 11. Patterns of parity-group allocation

Since all nodes become parity nodes in the MPG method, storage volume and access frequency are balanced across the network even when SPGs are applied, and the exclusive-or operations are performed in parallel. However, the MPG method degrades system reliability. When the parity groups are fixed in the network, only four other nodes are related to a given disk failure; when the parity groups are moved with phase switches, a disk failure affects five parity groups, i.e., twelve other nodes (Figure 12). With SPGs, all patterns of two faults are still recoverable in the MPG method, but patterns of three or more faults that form no transitive closure without phase switching may form transitive closures in the MPG method.
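One possible parameterization of the MPG idea (our assumption: the five patterns are obtained by shifting the Case I dominating set of Figure 4) is sketched below; it also checks the balancing property that every node serves as a parity node in exactly one of the five phases:

```python
N = 5

def dominating(phase):
    """Dominating (parity-node) set of one allocation pattern; phase 0 matches Figure 4.
    Shifting j by the phase is our assumed parameterization."""
    return {(i, (3 - 2 * i + phase) % N) for i in range(N)}

def parity_node(i, j, block):
    """Parity node responsible for data node (i, j) in the pattern used for this block."""
    for d in dominating(block % N):
        di, dj = (i - d[0]) % N, (j - d[1]) % N
        if (di, dj) in {(0, 0), (0, 1), (0, N - 1), (1, 0), (N - 1, 0)}:
            return d
    raise AssertionError("every node is dominated in every pattern")

print(parity_node(2, 2, block=7))   # block 7 uses phase 2 -> (2, 1)

# Across the five phases, each node acts as a parity node in exactly one phase.
counts = {(i, j): 0 for i in range(N) for j in range(N)}
for phase in range(N):
    for d in dominating(phase):
        counts[d] += 1
assert all(c == 1 for c in counts.values())
```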

Figure 12. Related nodes of a disk-drive failure with phase switches

4.2 Move a Parity Node within a Parity Group

We now consider the method of moving the parity node (MPN) within a fixed parity group instead of moving the parity group itself. The location of the parity node within a parity group is switched in phase with the disk blocks, which means that parity codes are stored not only on the dominating nodes but on the other nodes as well, similar to level 5 RAID. We have to consider the communication cost of this method, since data is transferred over at most four hops in a write operation. Here we compare the operations and data transfers with those described in Section 2.2. Figure 13 illustrates the behaviors of the nodes and the data movements when parity codes are stored on all nodes of a parity group.
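A minimal sketch of the MPN placement; the paper specifies only that the parity location switches with the block phase inside a fixed group, so the round-robin rotation order used here is our assumption:

```python
N = 5
dominating = [(i, (3 - 2 * i) % N) for i in range(N)]

def fpg_members(i, j):
    """Fixed first parity group of dominating node (i, j), in a fixed member order."""
    return [(i, j), ((i - 1) % N, j), ((i + 1) % N, j), (i, (j - 1) % N), (i, (j + 1) % N)]

def parity_holder(group, block):
    """MPN: rotate the parity location through the group members, block by block."""
    return group[block % len(group)]

# Over any five consecutive blocks each member holds parity exactly once, so parity
# storage and parity-update traffic are spread over the whole group.
g = fpg_members(*dominating[0])
assert sorted(parity_holder(g, b) for b in range(5)) == sorted(g)
```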

a. Read without disk failure
There is no difference from the operation described in Section 2.2, except that data stored on a dominating node may also be read.

b. Write without disk failure
The operations are unchanged, but the distance between the object node and the parity node changes from one hop to two.

c. Read under disk failure
Since the four nodes with undamaged disks perform the read operations in a parity group, it makes no difference whether the parity codes are stored on the dominating node or on another node.


d. Write under disk failure
We can keep the network traffic the same as in Section 2.2 by performing the exclusive-or operation on the data from the three nodes other than the parity node, sending the result to the parity node, and then performing the next exclusive-or operation on that result and the old parity data.


Figure 13. Node behaviors for the moving-parity-node method: (a) read; (b) write; (c) read under failure; (d) write under failure; (e) reconstruction


Figure 14. Data transfer in an FPG and an SPG during a write operation; diagrams (a)-(e) distinguish the dominating node, the object node, the parity node of the FPG, and the parity node of the SPG

e. Data reconstruction for a newly replaced disk
The operation and data transfer are identical to the case in which the parity codes are kept on the dominating node, as in (c).

There is thus no difference in read operations between the MPN method and the method of keeping the parity codes on the dominating node, with or without disk failure; only the number of communication hops in a write operation increases.

We now consider the behavior of the nodes with SPGs. As with FPGs, read operations for SPGs are unaffected. In a write operation, a naive implementation that passes all data through the dominating node increases the number of communication hops from two to four, but we can reduce the number of hops by taking shortcuts. The five diagrams in Figure 14 illustrate the data transfer in an FPG and an SPG containing the same object node; other patterns can be obtained by rotating or shifting these diagrams. Pattern (a) is the case in which the dominating node keeps the parity codes; the number of communication hops in a write operation is then one for the FPG and two for the SPG. The other patterns of the MPN method are illustrated in diagrams (b) to (e), where the parity node of the SPG is located across the object node from the parity node of the FPG in each phase. The number of communication hops becomes two by putting the SPG parity node to the lower left of the dominating node when the FPG parity node is to its right, as shown in (b), or by putting the SPG parity node to the upper right when the FPG parity node is to the left of the dominating node, as shown in (c).

When the FPG parity node is below the dominating node, no shortcut can be made; the number of communication hops in the SPG then becomes four, but we can still choose data paths that do not pass through the dominating node, as shown in (d). Neither the data paths of (b) nor those of (c) contain the dominating node, which helps keep the dominating node from becoming a bottleneck. Pattern (e) is the case in which the data stored on the dominating node itself is accessed; in this phase the parity codes are stored above the dominating node for the FPG and to its lower right for the SPG.

Table 2 summarizes the number of communication hops for an FPG and an SPG in each pattern. The worst case is four hops, but the average is two. Since the average is 3/2 when the parity codes are kept only on the dominating node, the communication cost of a write operation becomes 4/3 times that of Section 2.2. The reliability, however, is retained by this method. We can derive the recoverable-ratio transition for the MPG and MPN methods by testing patterns as we did in Section 3; Figure 15 illustrates the results. The FPG/FIX and FPG/MPN curves are identical.

5. An Experimental System

We are now developing a small experimental system, a 5 × 5 torus DR-net, using 26 Transputers (T805) and 25 small disks.

Table 2. Communication hops for an FPG and an SPG during a write operation

        (a)   (b)   (c)   (d)   (e)
FPG      1     2     2     2     1
SPG      2     2     2     4     2
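The averages discussed in the text above follow directly from these figures; a quick check (ours):

```python
# Communication hops per write for the five MPN phases (a)-(e), taken from Table 2.
fpg_hops = {'a': 1, 'b': 2, 'c': 2, 'd': 2, 'e': 1}
spg_hops = {'a': 2, 'b': 2, 'c': 2, 'd': 4, 'e': 2}

mpn_average = (sum(fpg_hops.values()) + sum(spg_hops.values())) / 10   # = 2.0
fixed_average = (1 + 2) / 2   # parity kept only at the dominating node: 1 hop (FPG) + 2 hops (SPG)

print(mpn_average, fixed_average, mpn_average / fixed_average)   # 2.0 1.5 1.333...
```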


Figure 15. Recoverable ratio transition with the MPG and MPN method the system with the outside since all four links of each Transputer are used to construct the torus connection. Two links of the extra Transputer are used for inserting it into the torus connection, and one of remaining links is used for connecting it to an Ethernet communication box. Each Transputer has a SCSI controller with a 2.5 inch disk connected. Since the storage capacity of each disk is 120 MB, the whole system has 3 GB disk space. The communication speed between Transputers is 1.7 MB/s. Figure 16 illustrates the hardware configuration of the experimental system. We are now developing a control program of each node in OCCAM. Since the control software is very simple, we will eventually be able to put the function of each node into an intelligent disk con-

The Transputer may be rather expensive for merely controlling disks, but it does not seem difficult to develop such an intelligent disk controller at reasonable cost by extracting only the necessary functions. If the system is used for actual applications, more interface nodes must be added for communicating with the outside in order to widen the bandwidth. The interface nodes can be located at the places indicated by the broken lines in Figure 7; those links are not used for communication in either the SPGs or the FPGs. We must also use wider-bandwidth lines, e.g. ATM lines, to connect them to the outside, and enlarge the network to 5n × 5m for practical use. The experimental system, however, can be used to obtain fundamental data for evaluating node usage, disk accesses, and link collisions for the FPGs and SPGs, and for comparing the MPG and MPN methods.


Figure 16. Configuration of the experimental system

6. Concluding Remarks

We have proposed DR-nets for implementing highly reliable parallel disk systems. A DR-net can connect a great number of disk drives, since the bottlenecks of data reconstruction are eliminated by the local communication of parity groups constructed in sub-networks of an interconnection network. DR-nets can also handle multiple faults by constructing SPGs on the same interconnection network. In a 5 × 5 torus network, the data stored on any two damaged disks can be fully recovered, and data can be recovered in 95% of cases with four faults, whereas a RAID of the same size loses data with only two damaged disks. The parity calculations in the FPGs and SPGs can be executed in parallel, and the overhead introduced by the SPGs is slight.

We considered two methods for dealing with the imbalance of storage volume and access frequency. One moves the parity groups within the network; the other moves the parity node within a parity group. The former eliminates both types of imbalance but reduces the reliability of the system; the latter maintains the reliability but increases the communication cost of write operations. The communication latency is relatively small compared with disk-access latency, so the rise in communication cost is not excessive, and a great number of disks can still be connected because the method retains the locality of communication.

Several open issues remain. Although we assumed faults only in the disks, we must also consider faults in a node itself and in the links between nodes. We should also consider other network configurations, such as three-dimensional tori and hypercube connection networks. The treatment of spare disks is another issue. Since one of the most suitable applications of a DR-net is a fault-tolerant digital video server, we will consider the interface to the outside using ATM lines. We will also consider parallel database operations with fault-tolerant functions on the DR-net. The concept of the DR-net applies not only to networks of disks but also to other networks in which faults can be detected at each node, and we will investigate such further applications of DR-nets.

Acknowledgments

The author would like to thank Magnus M. Halldorsson for his helpful advice on refining this paper.

References

[Cao et al. 93] P. Cao, S. B. Lim, S. Venkataraman, and J. Wilkes. The TickerTAIP Parallel RAID Architecture. In Proc. of the 20th ISCA, pp. 52-63, 1993.

[Gibbs et al. 93] S. Gibbs, C. Breiteneder, and D. Tsichritzis. Audio/Video Databases: An Object-Oriented Approach. In Proc. of the 9th ICDE, pp. 381-390, 1993.

[Gibson et al. 89] G. A. Gibson, L. Hellerstein, R. M. Karp, R. H. Katz, and D. A. Patterson. Coding Techniques for Handling Failures in Large Disk Arrays. In Proc. of the 3rd Int'l Conf. on ASPLOS, 1989.

[Gibson 92] G. A. Gibson. Redundant Disk Arrays. The MIT Press, 1992.

[Kajitani 94] K. Kajitani. Analysis of Disk Array Management for Digital Video Server. In The Transactions of the IEICE, Vol. J77-D-I, No. 1, pp. 66-76, 1994. (in Japanese)

[Lee et al. 92] E. K. Lee, P. M. Chen, J. H. Hartman, A. L. Chervenak Drapeau, E. L. Miller, R. H. Katz, G. A. Gibson, and D. A. Patterson. RAID-II: A Scalable Storage Architecture for High-Bandwidth Network File Server. Technical Report UCB/CSD 92/672, UC Berkeley (EECS), 1992.

[Muntz and Lui 90] R. R. Muntz and J. C. S. Lui. Performance Analysis of Disk Arrays Under Failure. In Proc. of the 16th VLDB, pp. 162-173, 1990.

[Merchant and Yu 92] A. Merchant and P. S. Yu. Design and Modeling of Clustered RAID. In Proc. of the 22nd FTCS, pp. 140-149, 1992.

[Patterson et al. 88] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proc. of SIGMOD, pp. 109-116, 1988.

[Wilkes 91] J. Wilkes. The DataMesh Research Project. In Proc. of Transputing '91, pp. 547-553, 1991.