© 2000, HCS Research Lab. All Rights Reserved.
Reliability Modeling of SCI Ring-Based Topologies
M. Sarwar¹, A. George², and D. Collins¹
¹ High-performance Computing and Simulation (HCS) Research Laboratory, Department of Electrical and Computer Engineering, FAMU-FSU College of Engineering
² Department of Electrical and Computer Engineering, University of Florida
Abstract
Performance evaluation and reliability prediction are two important factors in the study of multiprocessor and cluster interconnects. One such interconnect is the Scalable Coherent Interface (SCI). SCI is a point-to-point, ring-based interconnect that can be configured in various switched-ring topologies such as counter-rotating rings and tori. While performance analyses of SCI-based interconnects have been discussed in the literature, reliability evaluation has not received much attention. In addition, the reliability of SCI interconnects configured in many of today’s popular topologies cannot be deduced from earlier work on network reliability, as link failures within an SCI interconnect are not independent of one another. A single link failure within the topology results in the failure of the entire ringlet to which the link belongs. This paper presents the results of a reliability study on 1D and 2D k-ary n-cube switching fabrics for the Scalable Coherent Interface based on ring elimination rather than link elimination. The study is conducted using reliability models created in UltraSAN, a tool based on Stochastic Activity Networks. The models are verified using both combinatorial and Markov modeling. The results demonstrate that the inherent reliability characteristics of a single-ring system can be greatly enhanced by the addition of a second redundant ring. By contrast, the results show that the reliability of a torus does not increase significantly with the addition of redundant rings. Hence, the cost of adding redundant rings to certain topologies may or may not be justified, depending upon the degree of reliability sought.
1. Introduction
The inevitable shift towards parallel computing systems has accentuated the need for more reliable networks. As the number of components in the network increases, so too does the failure rate of the system. Classic fault-tolerant systems use redundant interconnects to provide a measure of fault tolerance.
However, other systems often provide fault tolerance by using the inherent redundancy within an interconnect rather than using completely redundant interconnect components. Hypercubes, meshes, tori and other networks that are topologically isomorphic to the family of k-ary n-cube networks are some examples of such topologies, where n is the dimension of the cube, k is the radix (e.g. k = 2 for a hypercube), and the number of nodes in the network is k^n. These topologies provide fault tolerance in the form of multiple paths from any source node to any destination node. In this paper we target the inherent redundancy in such networks in order to determine the reliability of SCI [1] topologies. The k-ary n-cube family was chosen since SCI is a ring-based network, and hence k-ary n-cubes provide an ideal framework for scalable SCI systems.
Reliability analysis of redundant fault-tolerant systems is typically performed using one of two methods, combinatorial modeling and Markov modeling [2]. Combinatorial modeling is applicable to cases where the system can be broken down into series and parallel combinations of components. For more complex systems that cannot be represented using distinct series-parallel combinations, Markov modeling is employed. A higher-level modeling approach based on Petri nets can also be used, providing a graphical view of the operation of a system and the interaction between failures. Petri nets are used to determine the reliability of systems by generating the Markov state space for the model and solving the associated Chapman-Kolmogorov equations. Some of the more popular Petri net packages include DSPNexpress [3], GreatSPN [4], HARP [5], SPNP [6], and SURF-2 [7]. For this paper, the reliability of SCI-based 1D and 2D k-ary n-cubes is studied using UltraSAN [8], a tool based on Stochastic Activity Networks developed by Sanders et al. at the University of Illinois at Urbana-Champaign.
UltraSAN was chosen since it permits easy modeling of replicable systems and provides a broad range of analytic solvers.
The remainder of this paper is organized as follows. Section 2 presents related research in the areas of fault-tolerant network reliability modeling, with an emphasis on ring-based architectures and the use of Petri net models. Section 3 describes the characteristics of the SCI model implemented in UltraSAN with a description of the assumptions made in creating the models. Analytical verification of the UltraSAN models of the studied k-ary n-cube systems is detailed in Section 4, using both combinatorial and Markov models. Section 5 examines the reliability results obtained from the UltraSAN model. The reliability of the unidirectional and bi-directional tori and
the single and counter-rotating ring configurations are analyzed and compared, allowing informed decisions to be made regarding network organization and level of redundancy for SCI-based applications with reliability requirements. Finally, the conclusions and possible directions for future research are presented in Section 6.
2. Related research
Fault tolerance of interconnect topologies can be measured in terms of the terminal reliability or network reliability. Terminal reliability is the probability that there exists at least one path from a given node to a destination node. It is most commonly used to assess the reliability of multi-stage interconnection networks (MINs) as is done by Colbourn et al. [9] and Varma and Raghavendra [10]. This paper concentrates on network reliability, or the probability that there exists at least one path from every node to all other nodes.
Network reliability can be assessed using either combinatorial or Markov modeling. Combinatorial modeling of networks requires decomposing a network into subnets and determining the reliability of the entire system as a combination of the subnets. Cheng and Ibe [11] and Menezes and Bakhru [12] use such a method to evaluate the reliability of shuffle-exchange networks. Markov modeling can also be used as an alternative to combinatorial modeling or in conjunction with it. Blake and Trivedi [13] use continuous time Markov chains to determine the reliability of shuffle-exchange networks. In [14], Blake and Trivedi use Markov modeling in conjunction with combinatorial modeling by dividing the studied MINs into a two-level model, obtaining the reliability of each subsystem using Markov modeling and the system reliability using a series system composed of the Markov components. In [15], Balakrishnan and Reibman use Markov modeling to determine the reliability of private networks where the minimal operational path is dictated by the application.
The Balakrishnan-Reibman models present an example where combinatorial analysis is no longer feasible since the reliability models are dependent upon the communication paths. Since the communication paths can take any form, they cannot be accurately represented as series-parallel combinations.
This paper concentrates on the reliability of SCI networks configured as k-ary n-cubes. The reliability of ring-based architectures has been studied in Smith and Trivedi [16]. The topology targeted in that paper is the forward loop backward hop (FLBH) network. The FLBH class of 1D ring topologies, which includes daisy chain loop, forward loop parity hop networks, and chordal rings, has also been studied by Raghavendra and Silvester [17]. Work has also been conducted on 2D architectures such as the Manhattan street network (MSN) and the torus. In [18], Chung et al. present the terminal reliability analysis of an MSN and a 2D torus. Chen and Berger [19] present a reliability analysis of Manhattan street networks showing the complexity of the Markov model for the MSN. Complexity arises from the interdependence between link failures as shown in their paper. In the model presented in this paper, the requirements of the SCI protocol help to simplify the link interdependence, as entire rings containing link failures are eliminated. The ring elimination cuts down dramatically on the state space of the corresponding Markov model. In addition, by using the ring as the basic building block of the k-ary n-cube models, Petri nets can be used to determine the reliabilities of the systems.
Network reliability analysis using Petri nets has not been carried out extensively. Most reliability models based on Petri nets deal with small redundant systems with a fixed number of components. The reason for this limited usage of Petri nets is that only systems that employ replicable building blocks can be easily modeled.
Benitez and Fortez [20] demonstrate the use of a Petri net model for determining the reliability of fault-tolerant processor arrays. In their paper, the processor array is considered to be a replicable system with a single row of processors as the building block of the model. Since SCI uses a fabric of switched rings, and each ring is eliminated if it contains a single faulty link, the k-ary n-cubes being studied can be easily created using replicable components, making ring-based systems with distributed switching ideal for modeling with Petri nets. The primary benefit of modeling a system with Petri nets is the ease with which the model can be created and replicated to create larger models. In addition, a Petri net provides an easy-to-understand, visual representation of the system and the interaction between faults.
3. SCI UltraSAN model
The simplest configuration of an SCI interconnect is a ring traversing all nodes. The ring is based on the architecture of an SCI interface illustrated in Fig. 1. Incoming packets to the interface pass through an address decoder. If the packet is destined for the local node, the decoder places it into the request- or response-input queue. If the packet is destined for another downstream node, it is forwarded to the bypass FIFO. To output a packet, the SCI node must have sufficient free space in its bypass FIFO to hold all incoming symbols. When there are no packets waiting in the output queue or there is insufficient free space in the bypass FIFO for the output queue data to be sent, data from the bypass FIFO is transmitted on the node’s output link. If the bypass queue is empty, then idle symbols are transmitted. Idle symbols also carry flow-control information and at least one must precede any send or
echo packet. This flow-control information is used to inhibit upstream nodes from sending data when the bypass FIFO must be emptied to allow the output queue to be emptied.
Larger SCI networks are based on multiple ringlets connected together to create more complex topologies through the use of agents. An agent is essentially an SCI-to-SCI bridge used to interconnect two or more rings. The topologies studied in this paper include the unidirectional and bi-directional forms of the topologies illustrated in Fig. 2. In [21] we presented performance analysis of these SCI-based topologies using the node model shown in Fig. 3. Each switch in a distributed switching fabric provides the ability to interface up to four SCI ringlets and also acts as an interface to the processing unit at that node. The UltraSAN models presented in this paper are based on this switch model.
Fig. 1. The SCI interface (blocks shown: SCI in, address decoder, request and response queues on the input and output sides, bypass FIFO, save idle, encoder/MUX, SCI out, connection to the processing node)
Fig. 2. (a) Dual-ring k-ary 1-cube, (b) bi-directional k-ary 2-cube (nodes attach to the rings through SCI interfaces)
Fig. 3. 4-port switch model (blocks shown: SCI inputs and outputs, SCI interfaces, switch queues, routers, arbitration, crossbar, client input queues, client output queue, processing node)
In the UltraSAN model, it is assumed that each node consists of the processing node and a crossbar. The SCI interfaces are used to connect the node into ring-based topologies. The number of SCI interfaces can be increased from one unit to four units, permitting the node to access up to four rings, hence providing it with four input and four output ports.
Due to the inherent characteristics of a register-insertion ring and their key role in SCI, all links in an SCI ring must be operational in order for that ring to operate correctly. A single link failure eliminates the entire ring from a multi-ring system, requiring the system to reconfigure and use the remaining links to continue normal operation. The UltraSAN model uses this fact as the basis of determining a failed state for the network. If a node is disconnected from the remainder of the network either due to ring failures or the failure of the crossbar at that node, then the entire system is considered to be in a failed state. If a set of ring failures causes the network to be disjoint, then that condition will also result in a failed state.
To create the dual-ring topology, the single-ring model is simply duplicated. The torus model is created using one or more single-ring models representing the rows in a torus with each row sharing the vertical rings. Each row monitors ring failures within itself and the vertical rings connected to it. By doing so, any one of the duplicated row subnets representing the rows can detect and signal a network failure.
4. Model Verification
The models are verified using combinatorial modeling for the single and counter-rotating ring systems, and Markov modeling for the torus model. The single and counter-rotating ring systems can be represented as series-parallel combinations of components. The tori systems cannot and hence require the use of Markov modeling to determine their reliability.
4.1 1D topologies (k-ary 1-cubes)
In the single ring model, a link failure results in a system failure. In addition, it is assumed that a system failure occurs if a single node is unable to communicate with the rest of the network. Such a case can occur due to a crossbar failure resulting in the isolation of the attached node from the network.
The reliability of single ring systems of n nodes can therefore be verified analytically by assuming a series system of links and crossbars. The system reliability is expressed as:
$$R_{\mathrm{system}}(t) = R_{\mathrm{link}}^{n}(t)\, R_{\mathrm{crossbar}}^{n}(t) \tag{1}$$
In Equation 1, based on the Exponential Failure Law (EFL), $R_{\mathrm{link}}(t) = e^{-\lambda_{\mathrm{link}} t}$ and $R_{\mathrm{crossbar}}(t) = e^{-\lambda_{\mathrm{crossbar}} t}$. The failure rates of the links and the crossbars were estimated using the Handbook for Reliability Prediction of Electronic Equipment (MIL-HDBK-217F) [23]. The estimated failure rates are $\lambda_{\mathrm{link}} = 3.509 \times 10^{-6}$ failures per hour and $\lambda_{\mathrm{crossbar}} = 1 \times 10^{-6}$ failures per hour, respectively. In the series model expressed by Equation 1, all links and crossbars must operate correctly for the system to remain in an operational state.
The analytical model for the counter-rotating ring systems is a combined series-parallel system in which at least one of the two rings must be operational and all crossbars must be operational for the system to remain in an operational state. The reliability of the counter-rotating ring systems is expressed as:
$$R_{\mathrm{system}}(t) = \left\{1 - \left[1 - R_{\mathrm{link}}^{n}(t)\right]^{2}\right\} R_{\mathrm{crossbar}}^{n}(t) \tag{2}$$
The expression $1 - [1 - R_{\mathrm{link}}^{n}(t)]^{2}$ represents the reliability of having at least one of the two rings operational. This value is then multiplied by the reliabilities of the crossbars, $R_{\mathrm{crossbar}}^{n}(t)$, to account for the n crossbars that must remain operational. Fig. 4 shows the UltraSAN and analytical model reliability results for the single ring (SR) and counter-rotating ring (CRR) systems. In both cases the reliabilities obtained from the UltraSAN models are identical to the values obtained analytically.
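As a numerical illustration of Equations 1 and 2, the following Python sketch evaluates both expressions with the failure rates quoted in the text. The function names and the 10000-hour sample time are illustrative assumptions; the sketch is not part of the UltraSAN models.

```python
import math

# Failure rates quoted in the text (failures per hour).
LAMBDA_LINK = 3.509e-6
LAMBDA_CROSSBAR = 1.0e-6


def r_exp(rate, t):
    """Exponential Failure Law: R(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def r_single_ring(n, t):
    """Equation 1: series system of n links and n crossbars."""
    return r_exp(LAMBDA_LINK, t) ** n * r_exp(LAMBDA_CROSSBAR, t) ** n


def r_counter_rotating(n, t):
    """Equation 2: two n-link rings in parallel, in series with n crossbars."""
    ring = r_exp(LAMBDA_LINK, t) ** n  # reliability of one complete ring
    return (1.0 - (1.0 - ring) ** 2) * r_exp(LAMBDA_CROSSBAR, t) ** n


# Illustrative sample point (assumed): a 10000-hour mission.
for n in (9, 16, 25, 36):
    print(n, r_single_ring(n, 10000.0), r_counter_rotating(n, 10000.0))
```

Both functions return 1 at t = 0, and the counter-rotating value never falls below the single-ring value, since the second ring can only add operational states.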
4.2 2D topologies (k-ary 2-cubes)
Due to the inter-ring dependencies, the 2D topologies cannot be represented using a distinct series-parallel model, hence Markov modeling must be employed. In addition, the state spaces of the Markov models for the 2D systems increase rapidly with each increment in k. For this reason, the smaller Markov model of a 9-node torus is used to verify the reliability results obtained from UltraSAN. Fig. 5 depicts the Markov model for a 9-node unidirectional torus. Fig. 6 shows the determination of the critical rings after a given ring failure. Failure of any critical ring will then result in a network failure. The larger tori models use a simple extension of the 9-node torus
model, and thus the verification below adds credence to their accuracy as well.
Fig. 4. UltraSAN and analytical reliabilities for single ring and counter-rotating ring systems (two panels, SR and CRR; x-axis: mission time, 0 to 20000 hours; y-axis: reliability, 0.0 to 1.0; curves: analytical and UltraSAN results for the 9-, 16-, 25-, and 36-node systems)
Equation 3 gives the reliability of a 9-node unidirectional torus as a function of mission time, as derived from the Markov model:
$$R_{9\text{-node}}(t) = e^{-6\lambda_{\mathrm{ring}} t} - 6e^{-5\lambda_{\mathrm{ring}} t} + 6e^{-4\lambda_{\mathrm{ring}} t} \tag{3}$$
In Equation 3, $\lambda_{\mathrm{ring}}$ is the failure rate of each ringlet. For the 9-node torus, setting $\lambda_{\mathrm{ring}} = 3\lambda_{\mathrm{link}}$ accounts for the three links making up each ringlet. To determine the system reliability, Equation 3 is multiplied by $R_{\mathrm{crossbar}}^{9}(t)$ to account for the 9 crossbars within the system. For a detailed description of the approach used to derive the reliability expression above, the reader is directed to [2]. Fig. 7 illustrates the accuracy of the UltraSAN model, where the analytical and UltraSAN results for the 9-node torus are nearly identical.
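The closed form of Equation 3 can be cross-checked against a direct numerical solution of the Markov model. The sketch below is a minimal Python illustration under stated assumptions: the transition rates (0 to 1 at 6λ, 1 to 2 at 2λ, 1 to F at 3λ, 2 to F at 4λ) are read from the labels of the Markov model in Fig. 5, the fixed-step fourth-order Runge-Kutta integrator is our choice rather than the solver used by UltraSAN, and all function names are hypothetical.

```python
import math

LAMBDA_LINK = 3.509e-6          # failures per hour, from the text
LAMBDA_CROSSBAR = 1.0e-6        # failures per hour, from the text
LAMBDA_RING = 3 * LAMBDA_LINK   # three links per ringlet in the 9-node torus


def r_torus9_closed_form(t, lam=LAMBDA_RING):
    """Equation 3: ring-only reliability of the 9-node unidirectional torus."""
    return (math.exp(-6 * lam * t)
            - 6 * math.exp(-5 * lam * t)
            + 6 * math.exp(-4 * lam * t))


def r_torus9_markov(t, lam=LAMBDA_RING, steps=2000):
    """Integrate the operational states (0, 1, 2) of the Markov model with RK4.

    Assumed rates: 0 -> 1 at 6*lam, 1 -> 2 at 2*lam,
    1 -> F at 3*lam, 2 -> F at 4*lam.
    """
    def deriv(p):
        p0, p1, p2 = p
        return (-6 * lam * p0,
                6 * lam * p0 - 5 * lam * p1,
                2 * lam * p1 - 4 * lam * p2)

    p = (1.0, 0.0, 0.0)  # all six rings operational at t = 0
    h = t / steps
    for _ in range(steps):
        k1 = deriv(p)
        k2 = deriv(tuple(x + 0.5 * h * k for x, k in zip(p, k1)))
        k3 = deriv(tuple(x + 0.5 * h * k for x, k in zip(p, k2)))
        k4 = deriv(tuple(x + h * k for x, k in zip(p, k3)))
        p = tuple(x + h * (a + 2 * b + 2 * c + d) / 6
                  for x, a, b, c, d in zip(p, k1, k2, k3, k4))
    return sum(p)  # probability of not having reached the failed state F


def r_torus9_system(t):
    """Equation 3 multiplied by the reliability of the nine crossbars."""
    return r_torus9_closed_form(t) * math.exp(-LAMBDA_CROSSBAR * t) ** 9
```

Under these assumed rates, the integrated state probabilities sum to the same value as the closed form, which is the kind of agreement the verification in this section relies on.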
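Fig. 5. Markov model for a 9-node unidirectional torus (states 0, 1, 2, and F; transitions: 0 to 1 with 6λ∆t, 1 to 2 with 2λ∆t, 1 to F with 3λ∆t, 2 to F with 4λ∆t; self-loops: 1-6λ∆t at state 0, 1-5λ∆t at state 1, 1-4λ∆t at state 2, 1.0 at state F)
Fig. 6. 9-node torus with critical and non-critical rings after one ring failure (legend: failed ring, critical rings, non-critical rings)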
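The notion of a critical ring can be illustrated by brute force. The following Python sketch is a simplified stand-in for the failure detection in the UltraSAN model, not a reproduction of it: it assumes that every node on a working ringlet can reach every other node on that ringlet, that a node's crossbar switches freely between its row and column ringlets, and that crossbars do not fail. Ring indices 0 to k-1 denote rows and k to 2k-1 denote columns; this numbering and all function names are hypothetical.

```python
from itertools import product


def torus_rings(k):
    """Return the 2k ringlets of a k-ary 2-cube as lists of (row, col) nodes."""
    rows = [[(r, c) for c in range(k)] for r in range(k)]
    cols = [[(r, c) for r in range(k)] for c in range(k)]
    return rows + cols


def is_connected(k, failed):
    """True if every node can still reach every other node over working rings."""
    parent = {node: node for node in product(range(k), repeat=2)}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for idx, ring in enumerate(torus_rings(k)):
        if idx in failed:
            continue  # an eliminated ring carries no traffic
        root = find(ring[0])
        for node in ring[1:]:
            parent[find(node)] = root
    return len({find(node) for node in parent}) == 1


def critical_rings(k, failed):
    """Working rings whose failure would leave the network disjoint."""
    failed = set(failed)
    working = set(range(2 * k)) - failed
    return sorted(r for r in working if not is_connected(k, failed | {r}))
```

For the 9-node torus, `critical_rings(3, {0})` identifies three critical rings after one row ring has failed, and `critical_rings(3, {0, 1})` identifies four after a second, non-critical row ring has failed, matching the 3λ and 4λ transitions into the failed state of the Markov model.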
Fig. 7. UltraSAN and analytical reliability results for the 9-node torus (x-axis: mission time, 0 to 20000 hours; y-axis: reliability, 0.60 to 1.00)
5. Reliability Results
In this section, the reliability results obtained for the k-ary n-cubes listed in Table 1 are presented. Afterwards, in the next section, three case studies are presented to demonstrate the application of the reliability results to SCI systems requiring varying levels of fault tolerance.
Table 1. 1D and 2D model descriptions
1D Systems                                        | 2D Systems
Single Ring             | Counter-Rotating Ring   | Unidirectional Torus   | Bi-directional Torus
9-ary 1-cube (9-node)   | 9-ary 1-cube (9-node)   | 3-ary 2-cube (9-node)  | 3-ary 2-cube (9-node)
16-ary 1-cube (16-node) | 16-ary 1-cube (16-node) | 4-ary 2-cube (16-node) | 4-ary 2-cube (16-node)
25-ary 1-cube (25-node) | 25-ary 1-cube (25-node) | 5-ary 2-cube (25-node) | 5-ary 2-cube (25-node)
36-ary 1-cube (36-node) | 36-ary 1-cube (36-node) | 6-ary 2-cube (36-node) | 6-ary 2-cube (36-node)
The reliabilities of the systems presented in this section are based on estimated component reliabilities. Even though the individual component reliabilities may not be precise for any particular implementation, holding the values constant throughout the evaluation process permits a relatively fair comparison of the systems.
5.1 1D Systems
The 1D SCI systems consist of single and counter-rotating ring topologies where each ring traverses all nodes within the system. The reliabilities determined from the UltraSAN models for the 1D single ring (SR) and counter-rotating ring (CRR) systems are shown in Fig. 8. The trend shows a decrease in reliability with each increment in ring size. The ratio of the analytical reliabilities, from Equations 1 and 2, for the counter-rotating ring and single ring systems is given by
$$\frac{R_{CRR}(t)}{R_{SR}(t)} = 2 - R_{\mathrm{link}}^{n}(t) \tag{4}$$
This equation shows that the reliability of each counter-rotating ring system ranges from 1 to 2 times that of the comparable single ring system. $R_{\mathrm{link}}^{n}(t)$ is a function of both time and the number of nodes in the system. It is observed that as time tends to infinity, the reliability of each counter-rotating ring system approaches twice the
reliability of the comparable single ring system of equal size. However, at this point, the reliabilities are almost insignificant. Examining the effect of the number of nodes n at a constant time, it is seen that as the number of nodes increases, the ratio of the reliabilities of the counter-rotating ring systems to the single ring systems increases. Hence, it can be concluded that the addition of a second ring to the single ring systems improves the reliability of the larger systems more significantly than the smaller ones.
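The behavior of Equation 4 with respect to mission time and system size can be checked with a few lines of Python. This is a sketch only: the function name and the sample mission times are illustrative assumptions, and the link failure rate is the value quoted in Section 4.1.

```python
import math

LAMBDA_LINK = 3.509e-6  # failures per hour, from the text


def crr_to_sr_ratio(n, t, lam_link=LAMBDA_LINK):
    """Equation 4: R_CRR(t) / R_SR(t) = 2 - R_link(t)**n."""
    return 2.0 - math.exp(-lam_link * t) ** n


# Assumed sample points: the ratio starts at 1 and grows toward 2.
for n in (9, 16, 25, 36):
    print(n, [crr_to_sr_ratio(n, t) for t in (0.0, 5000.0, 20000.0)])
```

At a fixed mission time the ratio grows with n, which is the basis for the observation that the second ring benefits the larger systems more than the smaller ones.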
Fig. 8. Reliability curves of the single ring and counter-rotating ring systems (two panels, one spanning 0 to 20000 hours and one zoomed to 0 to 5000 hours; x-axis: mission time; y-axis: reliability; curves: 9-, 16-, 25-, and 36-node SR and CRR)
5.2 2D Systems
An increase in the number of dimensions might be expected to provide an increase in reliability. Hence a move from the 1D counter-rotating ring to a 2D unidirectional torus should provide an increase in reliability. In both topologies, each node shares two ringlets, supplying two input/output ports per node. Since the reliabilities of the 1D systems are dependent upon the number of nodes per ringlet, a reduced number of nodes per ringlet and an increase in the number of ringlets comprising the system might be expected to provide a higher overall system reliability. However, this expectation was found not to be the case. From both the analytical and UltraSAN models, it is found that the reliabilities of the counter-rotating ring and unidirectional tori systems were identical. A closer examination of the topologies shows that after a single link failure, an n/(2n-1) probability exists that a second failure will result in a node disconnection. This probability holds true for both the counter-rotating ring and tori topologies of n nodes.
The benefit of using a torus over a counter-rotating ring is the ability of the torus to degrade, permitting communication between some nodes. The counter-rotating ring system is incapable of such graceful degradation. For example, in the case of a 9-node counter-rotating ring system, two ring failures cause the entire system to fail. For a 9-node torus, the failure of a single ring creates three critical rings. A failure of any one of the three critical rings will create a disjoint system in which 8 of the nodes still have the ability to communicate. By adding a counter-rotating ring alongside each ring in the unidirectional torus, an added degree of fault tolerance can be achieved. The reliabilities of both the unidirectional and bi-directional tori are shown in Fig. 9.
Fig. 9. Reliability curves of the unidirectional and bi-directional tori systems (two panels, (a) and (b), one spanning 0 to 20000 hours and one zoomed to 0 to 7000 hours; x-axis: mission time; y-axis: reliability; curves: 9-, 16-, 25-, and 36-node UT and BT)
A comparison of the unidirectional tori (UT) and bi-directional tori (BT) curves shows a different trend from that seen in the 1D systems. For the 1D systems, the reliability curves showed an expected trend wherein the counter-rotating rings demonstrated a consistently higher reliability than the single ring systems. For the 2D systems, the trend for small systems is reversed. For example, in Fig. 9b, the reliability of the 9-node unidirectional torus is higher than that of the comparable bi-directional torus. As the number of nodes is increased to 16, an intersection point is seen at a mission time of 3000 hours. For mission times less than 3000 hours, the reliability of the 16-node unidirectional torus surpasses the reliability of the bi-directional torus. Beyond 3000 hours, the 16-node bi-directional torus provides a higher reliability than its unidirectional counterpart. This trend also occurs for the 9-node and 25-node systems at mission times of 13000 and 1500 hours, respectively.
5.3 Topology Comparisons
In order to compare the reliabilities of the k-ary n-cubes, the increase in reliability of one topology can be calculated relative to another. For example, the reliability increase of the counter-rotating ring relative to the single ring topology is expressed as
$$\text{reliability increase} = \frac{R_{CRR}(t) - R_{SR}(t)}{R_{SR}(t)} \tag{5}$$
where $R_{CRR}(t)$ is the reliability of the counter-rotating ring system at time t and $R_{SR}(t)$ is the reliability of the single ring system at the same time t. Fig. 10 shows the increase in reliability of the CRR systems relative to comparable SR systems for CRR reliabilities ranging from 0.7 to 0.99. As the reliability demands on the CRR system increase, the improvement in reliability of the CRR over the SR systems decreases. Fig. 11 shows the reliability increase for BT over UT, for BT reliabilities ranging from 0.7 to 0.99, superimposed over the CRR-to-SR comparison of Fig. 10.
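Equation 5 can be evaluated at a target CRR reliability by first solving for the mission time at which the CRR system reaches that target. The Python sketch below does this with the analytical models of Equations 1 and 2 and a simple bisection search. It is an assumption-laden illustration: the bisection bounds, tolerance, and function names are ours, and because it uses the analytical expressions rather than the UltraSAN output, its values need not coincide digit for digit with the plotted results.

```python
import math

LAMBDA_LINK = 3.509e-6      # failures per hour, from the text
LAMBDA_CROSSBAR = 1.0e-6    # failures per hour, from the text


def r_sr(n, t):
    """Equation 1: single ring of n nodes."""
    return math.exp(-n * (LAMBDA_LINK + LAMBDA_CROSSBAR) * t)


def r_crr(n, t):
    """Equation 2: counter-rotating ring of n nodes."""
    ring = math.exp(-n * LAMBDA_LINK * t)
    return (1.0 - (1.0 - ring) ** 2) * math.exp(-n * LAMBDA_CROSSBAR * t)


def reliability_increase(r_new, r_base):
    """Equation 5: relative increase of r_new over r_base."""
    return (r_new - r_base) / r_base


def time_for_reliability(r_func, n, target, t_hi=1.0e6, tol=1.0e-6):
    """Bisection for the time at which the decreasing r_func(n, t) hits target."""
    t_lo = 0.0
    while t_hi - t_lo > tol:
        mid = 0.5 * (t_lo + t_hi)
        if r_func(n, mid) > target:
            t_lo = mid
        else:
            t_hi = mid
    return 0.5 * (t_lo + t_hi)


def crr_increase_at(n, target):
    """Reliability increase of CRR over SR when the CRR reliability equals target."""
    t = time_for_reliability(r_crr, n, target)
    return reliability_increase(r_crr(n, t), r_sr(n, t))
```

With these analytical models, `crr_increase_at(9, 0.95)` and `crr_increase_at(36, 0.95)` agree to within the bisection tolerance: the increase reduces to $1 - R_{\mathrm{link}}^{n}(t)$, which is fixed once the CRR reliability is fixed. This is consistent with the CRR-to-SR comparison appearing as a single curve rather than one curve per system size.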
Fig. 10. Increase in reliabilities of the CRR systems relative to comparable SR systems (x-axis: CRR reliability, 0.70 to 0.99; y-axis: reliability increase, 0% to 50%)
Fig. 11. Increase in reliabilities of the BT systems relative to comparable UT systems (x-axis: BT reliability, 0.70 to 0.99; y-axis: reliability increase, -10% to 90%; curves: 9-, 16-, 25-, and 36-node BT-to-UT, plus CRR-to-SR)
Unlike the CRR-to-SR reliability comparison, the reliability increase of the bi-directional tori systems is a function of the system size. For BT systems with a reliability of 0.95, the reliability increase relative to comparable UT systems falls within the range -1.86% to 3.50% for the 9, 16, 25, and 36-node systems. This range narrows to 0.04% to 0.45% for a higher BT reliability of 0.99. In contrast, for the CRR-to-SR comparison, the increases in reliability were 11.88% and 3.11% at CRR reliabilities of 0.95 and 0.99, respectively. Hence, the shift from unidirectional tori to bi-directional tori does not provide a significantly large reliability increase for practical reliabilities of 0.95 and higher. At these higher reliabilities, the CRR systems show a more significant increase in reliability relative to the single ring topologies. Also, for the 9-node and 16-node bi-directional tori with reliabilities greater than 0.8 and 0.9 respectively, the reliability increase relative to comparable unidirectional tori is negative. These two examples demonstrate that redundancy could in fact reduce the overall system reliability. This reduction in reliability is the result of an increase in the number of components that could fail.
6. Conclusions
In this paper, the reliabilities of 1D and 2D k-ary n-cubes were evaluated using UltraSAN, a tool based on Stochastic Activity Networks. The accuracy of the models was verified using both combinatorial and Markov techniques. The feasibility of modeling SCI ring-based networks using ring elimination rather than link elimination was also demonstrated. This contribution shows the feasibility of modeling relatively large networks using this technique when the basic components are replicable. Furthermore, this research develops a framework regarding the inherent reliability of SCI-based networks, laying a portion of the groundwork for the use of SCI-based networks for mission-critical applications.
A comparison of the single and dual ring systems showed that the reliabilities of the 9-node, 16-node, and 25-node single-ring systems exceed the reliabilities of the larger 16-node, 25-node, and 36-node dual-ring systems for mission times exceeding 14000, 13000, and 12000 hours respectively. These examples represent several such trends that can be found for single- and dual-ring systems of various sizes where a tradeoff exists between the number of nodes required by an application versus the reliability desired for the application. Similarly, it was shown that the 9-node and 16-node unidirectional tori provided a higher reliability than the 16-node and 25-node bi-directional tori for mission times less than 13000 hours and 4000 hours respectively. This insight is an important one for those considering the organization of SCI-based networks for mission-critical applications, for which the penalty of a network failure may be very expensive.
Comparing the reliability results of the dual-ring systems and unidirectional tori, it was shown that reliabilities of both topologies were identical for equal-sized networks. For target reliabilities of 0.8, 0.85, 0.9, and 0.95, the percentage improvements in reliability of the dual-ring systems over the single-ring systems were 32.49%, 27.10%, 20.11%, and 11.88%, respectively. The same comparison conducted using 2D tori yielded percentage improvements in reliability of the bi-directional tori over the unidirectional tori of 36.61%, 21.74%, 10.38%, and 2.82%, for target reliabilities of 0.8, 0.85, 0.9, and 0.95, respectively. These results are particularly significant, as they demonstrate the improvements in reliability that can be achieved by doubling the number of links in the network without a radical reorganization of network topology.
Of course, it is noteworthy that moving to a dual-ring or bi-directional implementation of the same topology also generally results in an improvement in the effective bisection bandwidth and latency of these networks.
The unidirectional tori and dual-ring systems provide the same percentage improvement in reliability over the single-ring systems. The bi-directional torus provides a higher percentage improvement in reliability over the single-ring systems for target reliability values less than 0.9. The percentage improvements in reliability of the bi-directional tori over the single-ring systems were 105.14%, 70%, 24.71%, and 9.98% for target reliabilities of 0.8, 0.85, 0.9, and 0.95, respectively. These results demonstrate that the bi-directional torus is not as effective as the dual-ring systems and the unidirectional tori in providing high levels of reliability when the target reliabilities exceed 0.9. The lower improvement in reliability is the result of an increased number of components in the bi-directional torus topology. For target reliability values of 0.9 or less, the bi-directional tori demonstrate a larger percentage improvement than the dual-ring systems and unidirectional tori. This observation points to the suitability of such a network organization for applications with very long mission times and no provision for repair.
There exist several possible directions for future research. One such direction is the reliability analysis of larger networks of the same topology using the technique of ring elimination. Another possible direction is the reliability analysis of SCI-based network topologies in other popular configurations, such as meshes. Yet another possible area for future research is the application of the techniques described in this paper to other emerging high-performance networks.