Journal of Systems Architecture. Vol.54, No.1-2, 2008, pp.70-80.
RISC: A Resilient Interconnection Network for Scalable Cluster Storage Systems
YUHUI DENG
Center for Grid Computing, Cranfield University Campus, Bedfordshire MK43 0AL, United Kingdom
Email:
[email protected]
Abstract: The explosive growth of data generated by information digitization has been identified as the key driver of escalating storage requirements. It is becoming a big challenge to design a resilient and scalable interconnection network which consolidates hundreds or even thousands of storage nodes to satisfy both the bandwidth and storage capacity requirements. This paper proposes a Resilient Interconnection network for Storage Cluster systems (RISC). The RISC divides storage nodes into multiple partitions to facilitate data access locality. Multiple spare links between any two storage nodes offer strong resilience and reduce the impact of failures of links, switches, and storage nodes. Scalability is guaranteed by plugging in additional switches and storage nodes without reconfiguring the overall system. Another salient feature is that the RISC achieves dynamic scalability of resilience by expanding the partition size incrementally with additional storage nodes, each with its two associated network interfaces, which increases the resilience degree and balances the workload proportionally. A metric named resilience coefficient is proposed to measure the resilience of the interconnection network. A mathematical model and a corresponding case study are employed to illustrate the practicability and efficiency of the RISC.

Keywords: Resilience, Interconnection Network, Cluster Storage, Scalability

1. Introduction

The explosive growth of data generated by information digitization has been identified as the key driver of escalating storage requirements. There are two major technologies which impact the evolution of storage systems. The first is parallel processing, such as Redundant Arrays of Inexpensive Disks (RAID) [1]. The second is the influence of network technology on storage system architecture. Network based storage systems such as Network Attached Storage (NAS) and Storage Area Network (SAN) [2, 3] offer a robust and easy method to control and access large amounts of storage resources. However, most modern high performance computing systems demand petabytes or even exabytes of storage capacity and aggregate bandwidth over 100GB/s (e.g., the data produced by protein folding, global earth system models, high energy physics, etc.). NAS and SAN cannot meet these requirements. Storage systems must make the transition from relatively few high performance storage engines to thousands of networked commodity-type storage devices [4]. With the ever increasing storage demand, it is a big challenge to design a scalable storage system which consolidates hundreds or even
thousands of storage nodes to satisfy both the bandwidth and storage capacity requirements.

A cluster is a group of loosely coupled nodes that work together so that in many respects they can be viewed as a single powerful node. The evolution of high performance processors and high speed networks is rapidly driving forward the development of clusters [5,6,7,8]. Owing to Commodity-Off-The-Shelf (COTS) hardware components, clusters are becoming an appealing platform for parallel computing and supercomputing compared with traditional Symmetric Multi-Processor (SMP) and Massive Parallel Processing (MPP) systems. The components of a cluster are commonly connected to each other through fast local area networks. Parallel algorithms in cluster computing have to communicate frequently, and the communication delay between sender and receiver is a system-imposed latency. The key to an effective and scalable cluster computing system lies in minimizing the delays imposed by the system. The computing nodes normally exchange small and low latency messages. As a result, cluster computing requires a custom network to alleviate or eliminate the communication delay.

The research community has been very active in improving various aspects of the communication performance of clusters during the past decade. Virtual Interface Architecture (VIA) [9, 10] is a user-level memory-mapped communication architecture designed to achieve low latency and high bandwidth across a cluster. Fast Messages (FM) [11, 12] is a low-level message layer designed to deliver the hardware performance of the underlying network to applications; it is also designed to enable a high performance layer for the APIs and protocols on top of it. Active Messages [13, 14] is intended to serve as a substrate for building libraries that provide higher-level communication abstractions and for generating communication code from a parallel language compiler, rather than for direct use by programmers. Active Messages exposes the full hardware performance to higher layers. GM [15, 16] is a message based light-weight communication layer for Myrinet. The design objectives of GM include low CPU overhead, portability, low latency, and high bandwidth. One of the significant advances in cluster networks over the past several years is that it is now practical to connect up to tens of thousands of nodes with networks that have enormous scalable capacity, and in which the communication from one node to any other node has the same cost [17].

Cluster networks offer useful references for building storage interconnection networks, because they are capable of delivering high performance and mass storage capacity, and of scaling to very large sizes. However, storage systems place different requirements on the interconnection network compared with cluster computing. Generally, cluster storage systems distribute data across multiple storage nodes and employ a parallel file system to boost the aggregate bandwidth by spreading read and write operations across the nodes [18]. The nodes in cluster computing typically exchange small and low latency messages with each other. Unlike the computing nodes, the storage nodes in a cluster storage system are loosely coupled with each other, and there is very little communication between the storage nodes except for data migration, data reorganization, or data backup. Therefore, a cluster storage system requires high bandwidth rather than low latency.
Storage interconnection networks, because of their tolerance for higher latency, can exploit commodity technologies such as gigabit Ethernet instead of custom network hardware used by the cluster computing networks [4].
A cluster storage system must provide resilience to guarantee high reliability and availability at a reasonable latency, in addition to high bandwidth, because the data it holds is valuable and may be impossible to regenerate or reproduce [19]. Thus, resilience is a very important feature when designing a large-scale cluster storage system. However, few research efforts have been devoted to the resilience of cluster networks. The Beowulf cluster [20] employs a simple single subnet and a two layer network topology. Computational nodes are grouped by the physical cabinet in which they are mounted. Each cabinet has its own Ethernet switch, and the cabinets and servers are connected through a server switch. Lustre [18], which builds a cluster file system for 1,000 nodes, provides support for heterogeneous networks. It is possible to connect some clients over an Ethernet to the Metadata Servers (MDS) and Object Storage Target (OST) servers, and others over a QSW network, all in a single installation. LVS [21] consists of two layers: an LVS router layer (one active and one backup) that balances the workloads on the real servers, and a pool of real servers that provides the critical services. Some projects employed a failover metadata server or data redundancy mechanisms to provide availability [21, 22], but these projects did not consider the failure of network components.

Recent efforts have made important strides in designing interconnection networks for large-scale storage systems [4, 18]. Andy D. Hospodor and Ethan L. Miller [4] proposed to integrate a 1Gb/s network and small (4-12 port) switching elements into object based disk drives to construct a petabyte-scale high performance storage system. They also discussed how to construct the system with different interconnection architectures. Their research results indicate that the hypercube and the 4-D and 5-D tori appear to be reasonable design choices for a 4096 node storage system capable of delivering 100GB/s from 1 petabyte. However, these methods are not very cost-effective due to the tailored disk drives. Qin Xin et al. [19] found that the hypercube is more robust than a multi-stage butterfly network and a 2D mesh structure, because the fault tolerance of an N dimensional hypercube network is N-1. The hypercube was originally proposed as an efficient interconnection for Massively Parallel Processors (MPP), and a large amount of research has been done on solving parallel problems and routing mechanisms on hypercubes [23, 24, 25, 26]. However, the hypercube trades increased cost and a complicated routing mechanism for high fault tolerance, because the number of connections at each node grows as the system gets larger. For instance, a three-dimensional hypercube consists of eight nodes with each node connected to three other nodes. A four-dimensional hypercube contains two three-dimensional hypercubes arranged so that each node of one sub-cube is connected to the corresponding node of the other sub-cube; there are now four connections emanating from each node and a total of 16 nodes. The network can be generalized to higher dimensions, but the large number of connections emanating from each node causes engineering problems which limit the network scalability [27].

In this paper, based on COTS components, we propose a Resilient Interconnection network for Storage Cluster systems (RISC), drawing on previous experience from the cluster computing community.
Since large-scale data intensive applications frequently involve a high degree of data access locality, the RISC divides storage nodes into multiple partitions to facilitate the locality and simplify intra-partition communications. Multiple spare links between any two storage nodes are employed to offer a strong
resilience to reduce the impact of failures of links, switches, and storage nodes. The scalability is guaranteed by plugging in additional switches and storage nodes rather than replacing the switches with more expensive switches which have more ports available. Another salient feature is that the RISC achieves dynamic scalability of resilience by expanding the partition size incrementally. A resilience model of the RISC has been constructed to explore the features of the interconnection network, and a case study shows the practicability and efficiency of the RISC.

The remainder of the paper is organized as follows. Traditional storage interconnection networks and the RISC are introduced in section 2. Section 3 discusses several different failure scenarios and the corresponding solutions of the RISC. A resilience model of the RISC is constructed in section 4. Section 5 illustrates a case study of the RISC based on the model presented in section 4. Section 6 concludes the paper with remarks on the main contributions.

2. System Overview

2.1. Traditional Storage Interconnection Networks
Fig.1. Highly Available Star Interconnection Network

Traditionally, most cluster storage systems employ a star interconnection network where all storage nodes are connected to a single central switch, and the nodes communicate across the network by passing traffic through the switch [21, 28]. However, the star architecture cannot provide constant availability, because there is only one link between any two storage nodes in the system, and a failure of the switch will bring down the overall system. One way to solve this problem is to use two switches (the second one for failover) in a star interconnection network (see Fig.1). We refer to this as the Highly Available Star (HAS) in the following discussion. Each storage node is connected to both switches, so there are two possible links between any two nodes to guarantee availability. However, in this architecture, one switch becomes overloaded when the other fails, because all the data traffic goes through the remaining switch, and failure of both switches will destroy the overall system. Another problem is scalability. With ever increasing storage demands, the star and HAS architectures cannot be expanded to a large scale by simply replacing the switches with ones that have more available ports; note that, at present, gigabit Ethernet switches are available only up to 48 ports. Most existing storage systems are scaled by replacing the entire storage system in a "forklift upgrade". This method is unacceptable in a system containing petabytes of data because the system is too large [4].

A hierarchical interconnection network is a collection of star networks arranged in a hierarchy. The hierarchical architecture is normally adopted to extend a cluster to a larger scale. The
scalability is easily guaranteed by plugging in additional switches and storage nodes. However, the hierarchical interconnection network has the inherent drawback of a single point of failure: a group of storage nodes could be isolated from the network by a single-point failure of a switch.

2.2. The RISC Interconnection Network

Data locality is a measure of how well data can be selected, retrieved, compactly stored, and reused for subsequent accesses [29]. In general, there are two basic types of data locality: temporal and spatial. Temporal locality denotes that data accessed at one point in time is likely to be accessed again in the near future. Temporal locality depends on the access patterns of different applications and can therefore change dynamically. Spatial locality denotes that the probability of accessing a piece of data is higher if data near it was just accessed (e.g., by prefetching). Unlike temporal locality, spatial locality is inherent in the data managed by a storage system. Spatial locality is relatively more stable and does not depend on applications, but rather on the data organization, which is closely related to the system architecture. Data locality is therefore a property of both the access patterns of applications and the data organization of system architectures. Reshaping access patterns can be employed to improve temporal locality [30], while data reorganization is normally adopted to improve spatial locality.
Fig.2. A Storage Cluster with RISC Consisting of 9 Storage Nodes and 6 Switches

Since large-scale data intensive applications frequently involve a high degree of data access locality [29], many research efforts have been invested in exploring the impact of the access patterns and data organization of applications on data locality [29,30,31]. Communication latency has long been a challenging problem for the cluster community. A well designed interconnection network for scalable cluster storage systems should be able to enhance spatial locality to reduce the communication latency. Because different interconnection networks of a cluster storage system can have different impacts on the overall system performance and application environment, the RISC takes advantage of its interconnection network to divide the involved storage nodes into multiple partitions to facilitate spatial locality. Each partition is composed of a number of storage nodes and one local switch. Multiple partitions communicate with each other through multiple inter-partition switches. Each node in a partition is connected to two networks: the intra-partition network and the inter-partition network. The RISC enables the cluster storage
system to limit the impact of local data access to one partition rather than the overall system, both during regular operations and under fault conditions. Fig.2 illustrates a RISC consisting of 9 storage nodes and 6 switches (in the dashed rectangle), where MS is short for Metadata Server, N denotes a storage node, and S denotes a switch. The major role of metadata is to describe how the data is distributed across the storage nodes and how the data can be accessed in the system. The MS manages all the metadata of the system, which allows applications or data users to search, access, and evaluate the data through standardized descriptions of the stored data [18]. A storage node in the RISC is a commercially available PC consisting of processor, memory, motherboard, etc. Each node integrates a RAID subsystem to aggregate storage capacity, I/O performance, and reliability based on data striping and distribution. Unlike the hypercube architecture discussed in section 1, each storage node in the RISC has two network interfaces which connect the node to the intra-partition network and the inter-partition network, respectively. Compared with the hypercube architecture, this approach significantly decreases the complexity of the routing strategy when the system grows to a large scale. It is very easy to extend the network interfaces of a storage node from one to two by plugging an additional Network Interface Card (NIC) into a PCI slot of the motherboard.
Fig.3. Mapping Scheme Between the Switches and Storage Nodes of RISC

A storage cluster mainly consists of an MS, clients, storage nodes, and an interconnection network [18]. Fig.2 illustrates a typical storage cluster system using the RISC. Please note that this paper mainly focuses on the interconnection network of cluster storage systems; the availability of the MS is beyond the scope of this paper. The nine storage nodes in Fig.2 are divided into three partitions in terms of data locality. The partitioning depends on the number of hops of network communication, where a hop is defined as a link that an I/O request has to traverse. For instance, the three partitions in Fig.2 could be division1 {P11 = (N10, N11, N12), P12 = (N20, N21, N22), P13 = (N30, N31, N32)} or division2 {P21 = (N10, N20, N30), P22 = (N11, N21, N31), P23 = (N12, N22, N32)}. Both divisions are possible because they involve the same number of hops. For division1, the distance from the storage nodes of partitions P11, P12, P13 to the corresponding switches S1, S2, S3 is one hop. The same applies to division2, where the distance from the storage nodes of partitions P21, P22, P23 to the corresponding switches S4, S5, S6 is also one hop. The basic idea of the RISC comes from the set associative cache scheme, which groups cache slots into sets. Drawing inspiration from this cache mapping scheme, we depict the mapping scheme between the nine storage
nodes and six switches in Fig.3. The nine storage nodes are clustered into three sets in terms of the switch port number. The connections between the sets of storage nodes and the switches S1, S2, and S3 use direct mapping, and each set of storage nodes has one connection to the switches S4, S5, and S6, respectively. By adding more storage nodes and switches without having to reconfigure the overall interconnection network, the RISC guarantees dynamic scalability as the bandwidth and capacity requirements increase, and continues to deliver improved reliability, performance, and connectivity. For instance, we can add an additional storage node into a specific partition to extend the partition size, or a new partition consisting of multiple storage nodes can be added to the RISC to expand the scale of the overall system. The high resilience of the RISC is offered by multiple spare links, which will be detailed in section 4. The same factors employed to increase the resilience also result in a performance improvement: the absence of a centralized switch with many ports removes a potential performance bottleneck, and applying load balancing techniques to the multiple spare links provides consistent performance for the overall system. Compared with the HAS architecture, the RISC does not require more expensive switches with more ports for the spare links. It is easy to calculate that the interconnection network illustrated in Fig.2 requires 18 switch ports and 18 physical links, which are the same numbers as those of the HAS, but the RISC requires six cheaper 3-port switches instead of two more expensive 9-port switches.

Table 1. Characteristics of Different Interconnection Networks of Cluster Storage Systems
Interconnection Network            Star    HAS     Hierarchy                 Hypercube       RISC
Resilience                         No      Middle  Single Point of Failure   Very High       High
Scalability                        Low     Low     High                      Low             High
Spatial locality                   High    High    Could be High             Could be High   High
Complexity of Routing Mechanism    Low     Low     Low                       Very High       Middle
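To make the mapping scheme of Fig.3 concrete, the following short Python sketch enumerates the RISC wiring for an arbitrary number of partitions and partition size. It is only an illustration of the direct-mapped intra-partition wiring and the set-associative inter-partition wiring described above; the function name risc_wiring and the (partition, index) node labels are our own assumptions, not part of the paper.

    def risc_wiring(num_partitions, partition_size):
        # Node (p, k) is the k-th storage node of partition p. Each node has two
        # NICs: one to its intra-partition switch (direct mapping, one switch per
        # partition) and one to inter-partition switch k (set-associative mapping,
        # one switch per node index within a partition).
        intra = {p: [(p, k) for k in range(partition_size)]
                 for p in range(num_partitions)}
        inter = {k: [(p, k) for p in range(num_partitions)]
                 for k in range(partition_size)}
        return intra, inter

    # The 3 x 3 RISC of Fig.2: intra switches S1-S3, inter switches S4-S6.
    intra, inter = risc_wiring(3, 3)
    print(intra[0])   # S1 connects the three nodes of the first partition (N10, N11, N12)
    print(inter[0])   # S4 connects the first node of every partition (N10, N20, N30)

Each intra-partition switch needs partition_size ports and each inter-partition switch needs num_partitions ports, which matches the six 3-port switches of Fig.2.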
The RISC is devised for scalable cluster storage systems which can be expanded to a large scale while still providing high resilience, high scalability, and enhanced data access locality. Due to the virtual two-tiered interconnection, two-tiered parallelism and scalability of the intra-partition and inter-partition networks are achieved. Another feature is that the RISC is based on COTS components, so it is much more cost-effective than a custom cluster network. Table 1 summarizes the characteristics of the different interconnection networks of cluster storage systems described in this section. The routing mechanism is discussed further in the next section.

3. Failure Analysis and Routing Mechanism

3.1. Failure Scenarios

The RISC is designed for scalable cluster storage systems. Such a system could be expanded to thousands of storage nodes to satisfy both the bandwidth and storage capacity requirements imposed by data intensive applications. A storage system with multiple petabytes of data typically consists of over 4,000 storage devices, thousands of
connection nodes, and tens of thousands of network links [19]. Although a single component is fairly reliable, with a large number of components including storage nodes, switches, links, etc., the aggregate rate of component failure can be very high. R. Zimmermann et al. [32] illustrated that if the Mean Time To Failure (MTTF) of a single disk drive is of the order of 1,000,000 hours, the MTTF of some disk drive in a large storage system consisting of 1000 disk drives is of the order of 1000 hours. Because the stored data is invaluable, a failure in a large-scale cluster storage system can be fatal, and data loss can have a significant economic impact. Designing a scalable cluster storage system with high resilience has long been a challenging problem.

To address this problem, we have to investigate the types of failure in a large-scale cluster storage system. In reality, there are many possible failures, such as failure of the network control system, power failure, software failure, etc. This paper mainly concentrates on the failure of the components which impact the interconnection network. There are three such types of failure scenarios: the link failure, the connection node failure (i.e., the switches in Fig.2), and the storage node failure. A link failure indicates that the connection between a pair of nodes (including storage nodes and connection nodes) gets interrupted, which causes all traffic between the pair of nodes to be totally disconnected. A resilient interconnection network must be able to provide multiple spare links between any pair of nodes to guarantee data availability and load balance. A connection node failure means the failure of a switch in Fig.2. Such a failure can be caused by a power failure, fire, etc. Compared with a link failure, the failure of a connection node is more serious since a number of attached links are broken simultaneously, and the failure can totally isolate the storage nodes that are connected through that connection node. Therefore, for a resilient interconnection network, it is very important to maintain high availability of the connection nodes. Because a storage node failure can directly incur data loss, many data redundancy mechanisms have been devised to protect against data loss in cluster storage systems [22]. The redundancy mechanisms are normally managed by metadata servers.

In our example, there are nine storage nodes labelled from N10 to N32 and six switches labelled from S1 to S6 (see Fig.2). Applications can access the RISC through any of the six switches. Assume that a data intensive application accesses the data residing in the system through the switch S4 from the MS, and that the specified I/O stream is sent from the MS to the storage node N30. We track the path of this I/O stream in various scenarios. In the initial state without any failure, the I/O request can simply be transferred through S4 to the target storage node N30. In the following analysis, we employ the division2 discussed in section 2.2 to divide the RISC into three partitions. We analyze the three failure scenarios in two cases, the intra-partition failure and the inter-partition failure, and introduce the rerouting mechanism applied when a specific failure occurs.

Intra-partition failure. In terms of division2, the storage nodes (N10, N20, N30) and the switch S4 are within one partition.
If the link between the switch S4 and the target storage node N30 is broken, the I/O stream can take the path (S5→N31→S3→N30), the path (S6→N32→S3→N30), or the path (S5→N11→S1→N12→S6→N32→S3→N30), etc. There are many possible alternate paths, but normally the shortest one is preferred. The choice of a new path is determined by the rerouting mechanism and the system situation at that moment. In the RISC, each storage node plays the role of a router which can
forward packets when necessary. If a failure happens at the switch S4, the I/O stream will travel the same paths determined by the rerouting mechanism as when the link between the switch S4 and the node N30 is broken. If the target storage node N30 fails, the I/O stream will detour to the node where the redundant data is stored, according to the employed redundancy mechanism.

Inter-partition failure. Some inter-partition failures of switches and links will not cause any problems if the intra-partition components work well. Assume that one I/O stream goes from the MS to N30. Even if the switch S3 and the link between the target storage node N30 and the switch S3 both fail, the overall system keeps running. But if any intra-partition component fails as well (e.g., the switch S4 or the link between S4 and N30), the storage node N30 will be isolated from the overall system, which results in the loss of the I/O stream. This problem can be solved by redirecting the I/O stream to other storage nodes where the redundant data resides. If the three paths (N30→S3), (N10→S1), (N20→S2) are broken simultaneously (including links and switches), although this is an unlikely event, the partition (N10, N20, N30) will be isolated from the cluster storage system. Fortunately, the RISC dynamically increases the number of spare paths which connect one partition to another as the partition size expands.

In the event of one component failure or multiple component failures in the RISC, there are many possible alternate paths to connect any two nodes. This makes it easy to replace the failed components with minimal impact on the overall system availability or downtime. The resilience degree keeps pace with the continuous growth of the partition size, due to the increased spare links associated with the added storage nodes. The RISC is thus more resilient and less susceptible to failures than the traditional interconnection networks summarized in Table 1.

3.2. Data Access and the Routing Mechanism

The interaction between the applications and the storage nodes of the RISC is mediated through an MS. When an application requests data stored in the RISC, it first contacts the MS to obtain metadata such as the address of the required data. It then uses the address information to access the corresponding storage nodes and retrieve the data. Andy D. Hospodor and Ethan L. Miller [4] connected external clients through routers placed at regular locations within cube-style networks (meshes, tori, and hypercubes). That arrangement resulted in very poor performance due to the congestion near the router nodes. To avoid a centralized routing mechanism, we decentralize the routing functions across all storage nodes in the RISC. Routing is the process of building up the forwarding tables that are used to choose paths for delivering packets. Each storage node in the RISC plays the role of a router which can forward I/O traffic when necessary. Under normal circumstances, however, the I/O traffic does not go through other storage nodes, because each node in the system has a direct connection to the MS through the inter-partition network. In the event of a failure of the link which connects a storage node to the inter-partition network, the traffic emanating from that node has to be forwarded by the remaining nodes in the same partition which still have links to the inter-partition network. For instance, if a
data set is stored across the three storage nodes N30, N31, and N32 in partition P13, the application first contacts the MS to retrieve the metadata. According to the metadata, the application establishes three direct connections with the three storage nodes N30, N31, and N32 through the three inter-partition switches S4, S5, and S6 to transfer data, respectively. This procedure does not involve any node acting as a router. But if S4 or the link between S4 and N30 is broken, the I/O traffic emanating from N30 has to be forwarded by N31 and N32 to S5 and S6, respectively. More complicated failure scenarios have been discussed in section 3.1.

The implementation of the routing tables in the storage nodes is straightforward. Resilient Overlay Network (RON) [26] is an architecture that allows distributed Internet applications to detect and recover from path outages. RON's design supports routing through multiple intermediate nodes. However, D. Andersen et al. [26] discovered that using at most one intermediate RON node is sufficient most of the time. Therefore, it is feasible for each storage node to maintain a local routing table with a limited number of alternate paths, even when the RISC grows to a large scale.

The key point of a resilient routing mechanism in the RISC is how to identify a failure and update the routing tables. A lot of research effort has been invested in designing fault-tolerant routing algorithms [19, 24, 25, 26]. In contrast to these works, our method is based on the features of a typical cluster storage system. Because the overall information of a cluster storage system is normally managed by metadata servers, and the size of the system is relatively small compared with the Internet scale, the metadata servers are good candidates to monitor the failures of the whole system. Each storage node in the RISC sends short messages to the MS at regular intervals. If a message is not acknowledged by the MS within a particular period, the path between the storage node and the MS is assumed to have failed, and the node will choose an alternate path from its local routing table to deliver the I/O traffic. This periodic heartbeat diffusion between the MS and the storage nodes is adopted to monitor the failures in the RISC. A reasonably low heartbeat frequency is preferable, because excessively frequent failure detection would significantly increase the maintenance cost without being necessary. By tuning the diffusion period, the time to recover from a failure can be balanced against the bandwidth overhead of the periodic message transmissions.
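The following Python sketch illustrates one possible shape of the heartbeat-and-fallback logic described above. It is a minimal illustration only: the function names, the interval value, the string-based path labels, and the send_heartbeat stub are our own assumptions rather than the paper's implementation.

    import time

    HEARTBEAT_INTERVAL = 1.0   # seconds between heartbeats (illustrative value)

    def heartbeat_loop(send_heartbeat, routing_table, rounds=3):
        # send_heartbeat(path) -> bool stands in for the real network call and
        # returns True only if the MS acknowledges within the timeout.
        # routing_table is the node's ordered list of alternate paths to the MS.
        current = routing_table[0]
        for _ in range(rounds):
            if not send_heartbeat(current):
                # The path to the MS is assumed to have failed: fall back to the
                # next spare path kept in the local routing table.
                spares = [p for p in routing_table if p != current]
                if spares:
                    current = spares[0]
            time.sleep(HEARTBEAT_INTERVAL)
        return current

    # Example for node N30: primary path via S4; spares are forwarded by N31 and
    # N32 over the intra-partition switch S3 (see section 3.1).
    table = ["N30-S4", "N30-S3-N31-S5", "N30-S3-N32-S6"]
    failed = {"N30-S4"}   # pretend the link to S4 is down
    print(heartbeat_loop(lambda path: path not in failed, table, rounds=1))  # -> N30-S3-N31-S5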
4. Resilience Model of the RISC

The most important aspect of a resilient interconnection network for cluster storage systems is that an I/O stream can be delivered to its target storage nodes successfully in any failure scenario. We define a resilience coefficient as a metric to measure the resilience.

Definition: The resilience coefficient is the ratio of the maximum number of data paths to the minimum number of data paths between any two storage nodes.

For simplicity, we assume that there is a cluster storage system consisting of m partitions, where partition k has N_k (1 ≤ k ≤ m) storage nodes. The numbers of storage nodes can be expressed by the following vector:

N = (N_1, N_2, \ldots, N_m)    (1)

It is required to have at least one path between any two storage nodes to keep the RISC based cluster storage system running. Thus, there should be at least one link between any two partitions. It is easy to calculate the minimum number of data paths as follows:

Path_{minimum} = \frac{1}{2} \sum_{i=1}^{m} \Big( N_i \times \sum_{j=1, j \neq i}^{m} N_j \Big)    (2)

We assume that the RISC employs a full-duplex mode, which refers to the transmission of data in two directions simultaneously. Because we are counting paths, the coefficient \frac{1}{2} indicates that a full-duplex transmission is counted as one path.

The RISC adopts multiple spare links between any two partitions to guarantee a high resilience. The numbers of links between partitions form an m × m connection matrix C as follows:

C = \begin{pmatrix} c_{11} & c_{12} & \cdots & c_{1m} \\ c_{21} & c_{22} & \cdots & c_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \cdots & c_{mm} \end{pmatrix}    (3)

where c_{ij} (1 ≤ i ≤ m, 1 ≤ j ≤ m) denotes the number of direct links between the i-th and the j-th partitions; c_{ij} = 0 indicates that there is no direct link between the i-th and the j-th partitions. Since there are m partitions, the nodes in a source partition can communicate with the nodes in a target partition through zero up to m − 2 intermediate partitions. The different scenarios are discussed below.

First, assume that the i-th and the j-th partitions communicate with each other directly (i.e., with no intermediate partitions). The number of data paths is:

Path_{0\text{-}ij} = N_i \times c_{ij} \times N_j    (4)

In this scenario, the maximum number of data paths of the overall system is expressed as follows:

Path_0 = \frac{1}{2} \sum_{i=1}^{m} \Big( N_i \times \sum_{j=1, j \neq i}^{m} c_{ij} N_j \Big)    (5)

Second, if the data transmission between the i-th partition and the j-th partition is mediated through one partition labelled l, because the storage nodes in the l-th partition can be configured to route data packets, the number of data paths can be calculated using the following formula:

Path_{1\text{-}ij} = N_i \times c_{il} \times N_l \times c_{lj} \times N_j    (6)

The maximum number of data paths of the overall system is calculated as follows:

Path_1 = \frac{1}{2} \sum_{i=1}^{m} N_i \sum_{l=1, l \neq i}^{m} (c_{il} N_l) \sum_{j=1, j \neq i, j \neq l}^{m} (c_{lj} N_j)    (7)

By analogy, the data transmission between the i-th partition and the j-th partition can be mediated through 2, 3, \ldots, m − 2 partitions, respectively. We have:
Path_{m-2} = \frac{1}{2} P_m^m \Big( \prod_{i=1}^{m-1} ( N_i \times c_{i(i+1)} ) \Big) \times N_m    (8)

where P_m^m is the number of full permutations of the m partitions. The maximum number of data paths of the whole system can be expressed as follows:

Path_{maximum} = \sum_{i=0}^{m-2} Path_i    (9)

According to the definition, the resilience coefficient R is calculated using the following formula:

R = \frac{Path_{maximum}}{Path_{minimum}}    (10)
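As a numerical cross-check of Eqs. (2)-(10), the following Python sketch computes the resilience coefficient from the partition sizes N and the connection matrix C. The function names are our own, and the enumeration of all orderings of intermediate partitions is an assumed generalization of Eqs. (5), (7), and (8); it is a sketch of the model only, not of the RISC routing implementation.

    import itertools

    def minimum_paths(N):
        # Eq. (2): with one link between every pair of partitions, each ordered
        # pair (i, j), i != j, contributes N_i * N_j; full duplex halves the sum.
        m = len(N)
        total = sum(N[i] * N[j] for i in range(m) for j in range(m) if i != j)
        return total // 2

    def maximum_paths(N, C):
        # Generalization of Eqs. (5), (7), and (8): enumerate every ordered
        # sequence (source, intermediates ..., target) of distinct partitions and
        # multiply the node counts and link counts along the sequence. Assumes a
        # symmetric connection matrix C (full-duplex links), so the total is even.
        m = len(N)
        total = 0
        for length in range(2, m + 1):          # 0 .. m-2 intermediate partitions
            for seq in itertools.permutations(range(m), length):
                paths = N[seq[0]]
                for a, b in zip(seq, seq[1:]):
                    paths *= C[a][b] * N[b]
                total += paths
        return total // 2

    def resilience_coefficient(N, C):
        # Eq. (10): ratio of the maximum to the minimum number of data paths.
        return maximum_paths(N, C) / minimum_paths(N)

The case study in section 5 can be reproduced by calling resilience_coefficient with the corresponding N and C.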
5. Case Study

A case study is a research method which offers a systematic way of investigating events, collecting data, analyzing information, reporting the results, validating hypotheses, etc. As a case study, we adopt the resilience coefficient as a metric to examine the resilience degree, the failure scenarios, and the dynamic scalability of the 3 × 3 RISC illustrated in Fig.2.
5.1. Resilience Degree

Consider the simple RISC consisting of 9 storage nodes illustrated in Fig.2. According to equations (1), (2), and (3), we can easily calculate the storage node numbers N = (3, 3, 3), the minimum number of data paths Path_minimum = 3×3 + 3×3 + 3×3 = 27, and the connection matrix

C = \begin{pmatrix} 1 & 3 & 3 \\ 3 & 1 & 3 \\ 3 & 3 & 1 \end{pmatrix},

respectively. The maximum number of data paths between any two storage nodes is shown in Table 2, in terms of equations (5) and (7).

Table 2. The Maximum Number of Data Paths of a RISC with 9 Storage Nodes Divided into Three Partitions (see Fig.2)

                                  Path             Path Number                 Total Path Number
Direct communication              N1 → N2          3 × 3 × 3 = 27              81
                                  N1 → N3          3 × 3 × 3 = 27
                                  N2 → N3          3 × 3 × 3 = 27
Mediated through one partition    N1 → N3 → N2     3 × 3 × 3 × 3 × 3 = 243     729
                                  N1 → N2 → N3     3 × 3 × 3 × 3 × 3 = 243
                                  N2 → N1 → N3     3 × 3 × 3 × 3 × 3 = 243
Therefore, in light of equation (10), the resilience coefficient R is (81+729)/27 = 30. The resilience coefficient R = 30 shows that the RISC is much more resilient than the HAS architecture, which has a resilience coefficient of 2, while requiring the same number of switch ports and physical links.
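The model sketch given after Eq. (10) reproduces this value when fed the partition sizes and the connection matrix of Fig.2 (again, resilience_coefficient is our illustrative helper, not part of the paper):

    N = [3, 3, 3]
    C = [[1, 3, 3],
         [3, 1, 3],
         [3, 3, 1]]
    print(resilience_coefficient(N, C))   # (81 + 729) / 27 = 30.0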
5.2. Failure Scenarios
The storage nodes (N12, N22, N32) and the switch S6 are within one partition according to the division2 discussed in section 2.2. We assume that the switch S6 fails, or that the links {(N12→S6), (N22→S6), (N32→S6)} fail simultaneously. The connection matrix becomes

C = \begin{pmatrix} 1 & 2 & 2 \\ 2 & 1 & 2 \\ 2 & 2 & 1 \end{pmatrix}.

According to equations (5) and (7), the numbers of data paths between any two storage nodes are Path_0 = (3×2×3)×3 = 54 and Path_1 = (3×2×3×2×3)×3 = 324, respectively. We thus have the resilience coefficient R = (54+324)/27 = 14.

In the most serious scenario, we assume that the switches S5 and S6 crash simultaneously, or that the links {(N12→S6), (N22→S6), (N32→S6)} and {(N11→S5), (N21→S5), (N31→S5)} fail at the same time. The connection matrix is computed as

C = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}.

We have Path_0 = (3×1×3)×3 = 27 and Path_1 = (3×1×3×1×3)×3 = 81. The resilience coefficient R is (27+81)/27 = 4. This indicates that the RISC running in failure mode is still more resilient than the traditional HAS architecture.
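Both failure-mode coefficients are reproduced by the sketch given after Eq. (10) simply by changing the connection matrix:

    # Switch S6 (or its three links) failed: off-diagonal entries drop from 3 to 2.
    print(resilience_coefficient([3, 3, 3], [[1, 2, 2], [2, 1, 2], [2, 2, 1]]))  # 14.0
    # Switches S5 and S6 failed: one spare link left between any two partitions.
    print(resilience_coefficient([3, 3, 3], [[1, 1, 1], [1, 1, 1], [1, 1, 1]]))  # 4.0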
5.3. Dynamic Scalability

A distinct feature of the RISC is that it achieves dynamic scalability of resilience by expanding the partition size incrementally with additional storage nodes, each with its two associated network interfaces. Suppose we keep the three partitions depicted in Fig.2 but increase the partition size from 3 nodes to 4 nodes, constructing the RISC with three 4-port switches for the intra-partition network and four 3-port switches for the inter-partition network. Based on the same calculation, we have N = (4, 4, 4), the minimum number of data paths Path_minimum = 4×4 + 4×4 + 4×4 = 48, and the connection matrix

C = \begin{pmatrix} 1 & 4 & 4 \\ 4 & 1 & 4 \\ 4 & 4 & 1 \end{pmatrix},

respectively. It is easy to compute Path_0 = (4×4×4)×3 = 192 and Path_1 = (4×4×4×4×4)×3 = 3072 in terms of equations (5) and (7). Therefore, the resilience coefficient R is (192+3072)/48 = 68. The resilience coefficient of this configuration is increased by a factor of 2.27 compared with that of the configuration which has three partitions each consisting of three storage nodes. This indicates that increasing the partition size yields a significant improvement in resilience.

Another important feature of the RISC is dynamic load balancing. Because each storage node in the RISC plays the role of a router which forwards I/O packets when necessary, a specific failure does increase the workload of the storage nodes which forward the I/O traffic. However, this workload is shared by multiple nodes, and the I/O traffic taken over by each node decreases as additional storage nodes are added to the partition. For instance, suppose that the required data resides on node N30 (see Fig.2). If the switch S4 or the link between the
node N30 and S4 fails, the I/O traffic emanating from the node N30 has to go through the nodes N31 and N32, which increases the workloads of N31 and N32, but the two nodes share the traffic. If we put one more storage node, say N33, into the partition, three nodes will take over the I/O traffic. This feature strikes a good balance between load balancing and the dynamic scalability of resilience.
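For completeness, the scaled-up coefficient of this subsection can also be reproduced with the sketch given after Eq. (10):

    # Three partitions of four nodes, four spare links between any two partitions.
    print(resilience_coefficient([4, 4, 4], [[1, 4, 4], [4, 1, 4], [4, 4, 1]]))  # (192 + 3072) / 48 = 68.0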
6. Conclusions

In this paper, we proposed a resilient interconnection network for cluster storage systems named RISC, which takes a significant step towards a resilient and scalable cluster storage system by dividing the storage nodes into multiple partitions and providing multiple spare links between any pair of storage nodes. The RISC enhances data access locality by partitioning correlated storage nodes together, which can reduce the communication latency by keeping most of the messages within a local partition. To expand the cluster storage system, we can add more nodes and more small switches without having to reconfigure the whole architecture. Even though multiple spare links are employed to provide high resilience, the RISC requires the same number of switch ports and links as a HAS architecture. The resilience coefficient is proposed as a metric to measure the system's resilience degree, and the case study provides useful insights into the behaviour of the RISC.
Acknowledgments

The author would like to thank the anonymous reviewers for their useful comments and feedback, which helped us refine our thoughts about the RISC. Their suggestions are very helpful for our future work.
References

1. D. Patterson, G. Gibson, R. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proc. of ACM Conf. on Management of Data, 1988, pp.109-116.
2. Garth A. Gibson, Rodney Van Meter. Network Attached Storage Architecture. Communications of the ACM, Vol.43, No.11, 2000, pp.37-45.
3. M. Mesnier, G. R. Ganger, E. Riedel. Object-based Storage. IEEE Communications Magazine, Vol.41, No.8, 2003, pp.84-90.
4. A. Hospodor, E. L. Miller. Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems. In Proc. of the 21st IEEE / 12th NASA Goddard Conference on Mass Storage Systems and Technologies, 2004, pp.273-281.
5. N. J. Boden, D. Cohen, R. E. Felderman, A. E. Kulawik, C. L. Seitz, J. N. Seizovic, W. Su. Myrinet: A Gigabit-per-Second Local Area Network. IEEE Micro, Vol.15, No.1, 1995, pp.29-36.
6. D. Garcia, W. Watson. ServerNet II. In Proc. of Parallel Computer Routing and Communication (PCRCW'97), 1997, pp.119-136.
7. D. B. Gustavson, Qiang Li. The Scalable Coherent Interface (SCI). IEEE Communications Magazine, Vol.34, No.8, 1996, pp.52-63.
8. S. Oral, A. D. George. A User-Level Multicast Performance Comparison of Scalable Coherent Interface and Myrinet Interconnects. In Proc. of the 28th Annual IEEE International Conference on Local Computer Networks, 2003, pp.518-527.
9. Virtual Interface Architecture Specification 1.0. 1997. http://rimonbarr.com/repository/cs614/san_10.pdf
10. G. Amerson, A. Apon. Implementation and Design Analysis of a Network Messaging Module Using Virtual Interface Architecture. In Proc. of the 2004 IEEE International Conference on Cluster Computing, 2004, pp.255-265.
11. S. Pakin, V. Karamcheti, A. Chien. Fast Messages: Efficient Portable Communication for Workstation Clusters and Massively Parallel Processors. IEEE Concurrency, Vol.5, No.2, 1997, pp.60-73.
12. M. Lauria, S. Pakin, A. A. Chien. Efficient Layering for High Speed Communication: Fast Messages 2.x. In Proc. of the 7th International Symposium on High Performance Distributed Computing, 1998, pp.10-20.
13. T. von Eicken, D. E. Culler, S. C. Goldstein, K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. In Proc. of the 19th ISCA, 1992, pp.256-266.
14. Thorsten von Eicken, David E. Culler, Klaus Erik Schauser, Seth Copen Goldstein. Retrospective: Active Messages: A Mechanism for Integrating Computation and Communication. In Proc. of 25 Years of the International Symposia on Computer Architecture, 1998, pp.83-84.
15. Lie-Quan Lee, A. Lumsdaine. The Generic Message Passing Framework. In Proc. of the International Parallel and Distributed Processing Symposium, 2003, pp.0-10.
16. Generic Messages Documentation. http://www.myri.com/GM/doc/gm_toc.html
17. C. L. Seitz. Recent Advances in Cluster Networks. In Proc. of the 2001 IEEE International Conference on Cluster Computing, 2001, p.365.
18. Lustre. http://www.lustre.org/
19. Qin Xin, Ethan L. Miller, Thomas J. E. Schwarz, S.J., Darrell D. E. Long. Impact of Failure on Interconnection Networks for Large Storage Systems. In Proc. of the 22nd IEEE / 13th NASA Goddard Conference on Mass Storage Systems and Technologies, 2005, pp.189-196.
20. Beowulf. http://beowulf.cheme.cmu.edu/hardware/network.html
21. Linux Virtual Server. http://www.linuxvirtualserver.org/
22. Kai Hwang, Hai Jin, Roy S. C. Ho. Orthogonal Striping and Mirroring in Distributed RAID for I/O-Centric Cluster Computing. IEEE Transactions on Parallel and Distributed Systems, Vol.13, No.1, 2002, pp.26-44.
23. Youcef Saad, Martin H. Schultz. Topological Properties of Hypercubes. IEEE Transactions on Computers, Vol.37, No.7, 1988, pp.867-872.
24. M. A. Sridhar, C. S. Raghavendra. Fault-Tolerant Networks Based on the de Bruijn Graph. IEEE Transactions on Computers, Vol.40, No.10, 1991, pp.1167-1174.
25. P. T. Gaughan, S. Yalamanchili. A Family of Fault-Tolerant Routing Protocols for Direct Multiprocessor Networks. IEEE Transactions on Parallel and Distributed Systems, Vol.5, No.6, 1995, pp.482-487.
26. D. Andersen, H. Balakrishnan, F. Kaashoek, R. Morris. Resilient Overlay Networks. In Proc. of the 18th ACM SOSP, 2001, pp.131-145.
27. A. J. G. Hey. High Performance Computing: Past, Present and Future. Computing & Control Engineering Journal, Vol.8, No.1, 1997, pp.33-42.
28. Wataru Katsurashima, Satoshi Yamakawa, et al. NAS Switch: A Novel CIFS Server Virtualization. In Proc. of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (MSS'03), 2003, pp.82-86.
29. Frederik W. Jansen, Erik Reinhard. Data Locality in Parallel Rendering. In Proc. of the 2nd Eurographics Workshop on Parallel Graphics and Visualisation, 1998, pp.1-15.
30. Aart J. C. Bik. Reshaping Access Patterns for Improving Data Locality. In Proc. of the 6th Workshop on Compilers for Parallel Computers, 1996, pp.229-310.
31. María E. Gómez, Vicente Santonja. Characterizing Temporal Locality in I/O Workload. In Proc. of the 2002 International Symposium on Performance Evaluation of Computer and Telecommunication Systems, 2002.
32. R. Zimmermann, S. Ghandeharizadeh. Highly Available and Heterogeneous Continuous Media Storage Systems. IEEE Transactions on Multimedia, Vol.6, No.6, 2004, pp.886-896.
Dr. Yuhui Deng received his PhD degree in computer architecture from Huazhong University of Science and Technology in July 2004. He was involved in several National Nature Science Foundation projects in the National Key Laboratory of Data Storage System of P. R. China when he was a PhD candidate. He joined Cranfield University in February 2005 and is now a research officer at the Centre for Grid Computing, Cranfield University. Presently he is working on a project named Grid Oriented Storage (GOS) funded by EPSRC/DTI. His research interests cover computer architecture, network storage, cluster storage, parallel and distributed computing, etc.