Fault-Tolerant Distributed Mass Storage for LHC Computing

Arne Wiebalck†, Peter T. Breuer‡, Volker Lindenstruth†, Timm M. Steinbeck†

† Chair of Computer Science and Computer Engineering, Kirchhoff Institute for Physics, University of Heidelberg, Germany
e-Mail: {wiebalck,ti,timm.steinbeck}@kip.uni-heidelberg.de

‡ Department of Telematic Engineering, University Carlos III of Madrid, Spain
e-Mail: [email protected]

Abstract— In this paper we present the concept and first prototyping results of a fault-tolerant distributed mass storage architecture for Linux PC clusters, such as those to be deployed by the upcoming particle physics experiments at CERN. The key concept of the ClusterRAID architecture is local RAID over remote disks, enabled by the device masquerading technique of the Enhanced Network Block Device.

Index Terms—Distributed RAID, Fault-tolerance, HA Computing, Linux Cluster, LHC Computing

I. INTRODUCTION

Next generation particle physics experiments, such as the experiments ALICE [1], ATLAS [2], CMS [3], or LHCb [4] at the particle physics laboratory CERN [5], will deploy commodity-off-the-shelf (COTS) PC clusters with O(1000) nodes for each experiment. These clusters will be used as part of the detector readout chain, for the offline analysis of the data, or both. The requirements of all of these experiments concerning mass storage capacity and aggregate access rates are of the order of 1 PB/a and 0.1 to 1 GB/s, respectively [6]. The most important feature of this data is that it consists of typically more than 10^9 individual data sets, events, which are entirely independent and can therefore be analyzed trivially in parallel. The processing here is of the multiprogramming type rather than a parallel processing application. The areal density of magnetic disk devices is increasing at a rate of about 50% per year [7] — recording densities of 100 Gb/in² using the giant magnetoresistance technology [8] and the feasibility of 1 Tb/in² using the extraordinary magnetoresistance technology [9] indicate that this trend will continue over the next years — so the evolution of hard disk capacities is likely to push available commodity hard drives from today's capacities of 200 GB to capacities beyond 1 TB during the next five years. Thus, the disks in the LHC experiments' PC farms will provide around 1 PB of online disk capacity.
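As a rough cross-check of these figures, the capacity extrapolation can be written out explicitly (a back-of-the-envelope sketch; the 50% annual growth, today's 200 GB drives and the O(1000) node count are the numbers quoted above, and one drive per node is an assumption):

    # Rough capacity extrapolation based on the figures quoted above:
    # ~50% areal density growth per year, 200 GB commodity drives today,
    # O(1000) nodes per experiment with one drive each (assumption).
    growth_per_year = 1.5
    today_capacity_gb = 200
    years = 5

    future_capacity_gb = today_capacity_gb * growth_per_year ** years
    print(f"Per-drive capacity in {years} years: ~{future_capacity_gb:.0f} GB")
    # -> ~1519 GB, i.e. beyond 1 TB

    nodes = 1000
    farm_capacity_pb = nodes * future_capacity_gb / 1e6
    print(f"Farm capacity: ~{farm_capacity_pb:.1f} PB")
    # -> on the order of 1 PB of online disk capacity per experiment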

In order to avoid the well-known bottleneck and single-point-of-failure drawbacks of centralized storage architectures [10], and since the storage capacity is distributed over the cluster anyway, a distributed mass storage system is the natural choice to make the clusters' disk capacity available to applications, that is, the analysis software. However, fault-tolerance has to be provided in the case of distributed storage, as any failure or unavailability of a node would result in the loss of its data for the entire cluster or other nodes relying on it. A possible solution is to extend the fault-tolerant encoding, typically implemented at the disk level as RAID [11], to the cluster, i.e. a distributed RAID as first discussed in [12]. The architecture we propose is based on the mechanism of device masquerading [13]. Remote resources are imported to nodes to appear as local devices. As these devices have the same interface as “real” hard drives, a software RAID can be installed on top, resulting in a reliable distributed mass storage device, a local RAID over remote disks. In this paper we discuss the building blocks of the architecture – mainly the network block device – and examine first prototype performance results of the ClusterRAID architecture. All of this should be regarded as work in progress rather than a complete solution. The paper is organized as follows. Section II explains the functional details of the Enhanced Network Block Device (ENBD) and discusses issues like latency and bandwidth restrictions imposed by a network layer in the data path. It also provides performance measurements for this basic building block. Section III deals with RAID over ENBD and discusses how the network block device supports a RAID on top. In section IV the reliability (in terms of stability under permanent reading and writing as well as in terms of error handling) and the performance of RAID over ENBD are examined. Section V discusses the advantages and weaknesses of the ClusterRAID architecture. Before the paper is closed by a discussion of the status of the project and the objectives of future research in section VII, related architectures and research projects are discussed in section VI.

Fig. 1. NBD's main components and method of working. See the text for details. (The figure shows the client side with the application and file system, the ENBD driver in kernel space and the client daemons in user space, and the server side with the server daemons in user space accessing its file system and hardware; both sides are connected via the network.)

II. NETWORK BLOCK DEVICE

In order to distribute data over the disks attached to the nodes in a cluster, the mechanism of a network block device is exploited. A network block device masquerades remote files, partitions or whole hard disks as local block devices with their standard interface and behaviour. Network block devices differ from classical networked file access solutions like NFS or AFS in that access is at a lower level than the file system layer, so any file system at all may be supported on the device once it has been set up. The latency imposed by a network layer in the data path is negligible, since the total latency is dominated by the hard drives. The latency of today's hard disks, seek time plus rotational delay, is of the order of 5 milliseconds. Standard network technologies like Ethernet have data transmission latencies of at most several hundred microseconds. Thus, the expected latency increase is about 10%. The same holds true for bandwidth restrictions. If a Gigabit type network technology is used, the component limiting the throughput is the hard disk, as throughput in network technology is increasing, and will continue to increase, much faster than for hard disks.
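A quick worked estimate of the latency argument (a sketch using the 5 ms disk latency and the few hundred microseconds of network latency quoted above):

    # Latency overhead estimate for a block access over the network,
    # using the figures quoted in the text.
    disk_latency_ms = 5.0    # seek time + rotational delay of today's disks
    net_latency_ms = 0.5     # "up to several hundred microseconds"

    total_ms = disk_latency_ms + net_latency_ms
    increase = net_latency_ms / disk_latency_ms
    print(f"Remote access latency: {total_ms:.1f} ms "
          f"(+{increase:.0%} over a local disk)")
    # -> about 10% latency increase, dominated by the hard drive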

Figure 1 illustrates the NBD's main components as well as its principal method of working. The three active components of the NBD are a Linux kernel driver, a client daemon and a server daemon. These are associated with a locally visible block device on the client and a remote disk resource on the server as the passive components of the system. Userspace accesses to the device are intercepted by the kernel, passed to the client daemon and transmitted across the network. The server daemon on the remote server translates the incoming read and write requests into accesses to one of its local resources and responds back across the net. The responses are captured by the helper client daemon and translated back into responses to the kernel. The end result is a proxy device: a local simulacrum of the remote resource. In addition to the basic architecture and protocol of an NBD, the Enhanced NBD provides failure-mode and redundancy features. The ENBD will automatically reauthenticate and reconnect after a network failure, for example, and runs over several communication channels at once, possibly through different routes, which adds redundancy. The channels are demand driven, so a dead channel results in an automatic fail-over to the remaining live channels. A pair of auxiliary daemons, one on the server and one on the client side, help maintain connectivity across reboots by maintaining persistent information about the desired state of the connection and initiating new connections when necessary. The multiple connections maintained by ENBD work asynchronously, and this gives rise to a pipelining effect that greatly speeds up the protocol, especially on platforms with multiple CPUs. ENBD itself has no size limitations apart from those imposed at the client end by the kernel on any block device, and by the file system layer at the server end. This means that between 1 and 8 TB is currently available per device on 32-bit platforms, depending on the precise kernel version. The standard ENBD uses TCP/IP over Ethernet as its communication protocol and network. Since TCP/IP has been designed to make reliable communication over unreliable connections possible, the protocol stack is complex and computing intensive. For that reason ENBD has been adapted to also use the low overhead SCI (Scalable Coherent Interface) technology [14] as its communication network.

The ENBD has been extensively tested regarding its functionality, stability and performance. All these tests have been done on a PC cluster consisting of dual Pentium III nodes with a CPU speed of 800 MHz, 512 MB RAM and 40 GB disk space.¹ All testing has been done exporting a 1GB file. Figure 2 compares the ENBD read data throughput of different network technologies when reading the whole 1GB ENBD device using dd(1). As expected, the achieved throughput for FastEthernet of about 10 MB/s is limited by the network technology itself. The transfer rates of GbE and SCI, however, are limited by the speed of the underlying hard disk, which has a sustained read transfer rate of about 28.5 MB/s as measured with the Linux standard I/O benchmark Bonnie++ [15]. These measurements indicate that ENBD imposes no significant limitation on the data path's read bandwidth. The average CPU load for the Gigabit type networks SCI and GbE is about 18 and 25%, respectively. Under Fast Ethernet the load is also about 18%.

¹ IBM DTLA 307045 and IBM Deskstar 60 GXP hard disks, Tyan Thunder HEsl motherboards equipped with the Serverworks HEsl chipset, connected by FE (on-board Intel EtherPro 100), GbE (3Com 3C996B-T), and SCI (LC3 link chip and PSB66 PCI link chip by Dolphin); Linux kernel 2.4.16, ENBD version 2.4.28.
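To make the proxy-device mechanism described at the beginning of this section more concrete, the following is a deliberately simplified, purely illustrative sketch (not ENBD's actual protocol or code): a server process translates (offset, length) read requests received over a socket into accesses to a local resource, and a client-side helper forwards requests and returns the data, so the remote resource can be read as if it were local. In the real system the client side is a kernel driver plus user-space daemons and the protocol also handles writes, multiple channels and reconnection; the file name and port used here are hypothetical.

    # Illustrative proxy-device sketch (NOT the ENBD implementation).
    import socket
    import struct
    import threading
    import time

    BLOCK_RESOURCE = "/tmp/exported.img"   # hypothetical exported resource
    HEADER = struct.Struct("!QI")          # offset (8 bytes), length (4 bytes)

    def serve(host="127.0.0.1", port=5555):
        """Server daemon: translate read requests into accesses to a local resource."""
        with socket.socket() as srv:
            srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            srv.bind((host, port))
            srv.listen(1)
            conn, _ = srv.accept()
            with conn, open(BLOCK_RESOURCE, "rb") as resource:
                while True:
                    hdr = conn.recv(HEADER.size)
                    if len(hdr) < HEADER.size:
                        break                          # client closed the connection
                    offset, length = HEADER.unpack(hdr)
                    resource.seek(offset)              # local access on behalf of the client
                    conn.sendall(resource.read(length))

    def remote_read(host, port, offset, length):
        """Client-side helper: fetch one block from the remote resource."""
        with socket.create_connection((host, port)) as sock:
            sock.sendall(HEADER.pack(offset, length))
            data = b""
            while len(data) < length:
                chunk = sock.recv(length - len(data))
                if not chunk:
                    break
                data += chunk
        return data

    if __name__ == "__main__":
        # create a small dummy resource so the sketch is self-contained
        with open(BLOCK_RESOURCE, "wb") as f:
            f.write(b"\0" * (1 << 20))
        threading.Thread(target=serve, daemon=True).start()
        time.sleep(0.5)
        print(len(remote_read("127.0.0.1", 5555, 0, 4096)), "bytes read remotely")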

Fig. 2. Comparison of the ENBD client-side read throughput for a 1GB transfer over point-to-point ENBD using different network technologies (axes: MB/s resp. CPU load in % versus time in s; curves for FE, GbE, SCI, the CPU load and the disk limit). In the case of GbE and SCI the throughput is limited by the bandwidth of the server-side hard drive, for FE the network itself is the limiting factor.

However, in order to compare these numbers, they have to be combined with the actual transfer rate of the corresponding network. Applying the ratio of CPU load to transfer rate as a simple metric, GbE gains a factor of two and SCI a factor of three compared with FE. It is expected that the CPU overhead of the SCI adaptation of ENBD can be further reduced, since the original version of ENBD was designed and implemented for the stream interface of TCP/IP and is not yet optimized for the shared memory architecture of SCI.

The Fast Ethernet write performance of ENBD, without exploiting the MD5 feature, is shown in Figure 3. The average transfer rate is about 6 to 7 MB/s. One can see a periodic stop-and-go behaviour after 45 seconds of operation. At regular intervals the network does not transfer any packets and the CPU on the client side is idle as well. This behaviour is due to the buffer management of the virtual memory subsystem on the server side. By adjusting the parameters controlling the ageing and flushing behaviour of dirty buffers on the server side, this oscillation can be damped to a constant throughput rate (not shown in the figure).
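The CPU-efficiency comparison above can be made explicit with a small calculation (a sketch; the load figures are those quoted above, the transfer rates are the approximate FE and disk-limited values from Figure 2):

    # CPU load per MB/s of ENBD read throughput (figures from the text / Fig. 2).
    nets = {
        "FE":  {"rate_mbs": 10.0, "cpu_load": 0.18},
        "GbE": {"rate_mbs": 28.5, "cpu_load": 0.25},
        "SCI": {"rate_mbs": 28.5, "cpu_load": 0.18},
    }

    cost = {name: v["cpu_load"] / v["rate_mbs"] for name, v in nets.items()}
    for name in ("FE", "GbE", "SCI"):
        gain = cost["FE"] / cost[name]
        print(f"{name}: {cost[name] * 100:.2f} %-CPU per MB/s, "
              f"gain over FE: {gain:.1f}x")
    # -> GbE is roughly a factor of two, SCI roughly a factor of three
    #    more CPU-efficient than FE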

Fig. 3. Client-side ENBD write throughput and CPU load for a 1GB transfer (plotted as MB/s versus time in s, 0–200 s).

As a metric for the latency increase caused by ENBD, the Bonnie++ measurements for random seeks are shown in Table I. As expected, the latency for data access is mainly limited by the seek time of the server-side hard disk. The increase caused by ENBD is of the order of 10%, varying slightly with the network technology used. The stability of ENBD has been tested by continuous reading and writing using the Bonnie++ benchmark. In 24-hour tests no error could be observed.

TABLE I
BONNIE++ SEEK TIME RESULTS. THE LATENCY INCREASE IS CALCULATED RELATIVE TO THE SEEK TIME OF THE HARD DISK.

               | Random Seeks [1/s] | Avg. Seek Time [ms] | Latency Increase [%]
 Hard Disk     | 200                | 5                   | -
 ENBD on FE    | 175                | 5.7                 | 14
 ENBD on GbE   | 179                | 5.6                 | 12
 ENBD on SCI   | 185                | 5.4                 | 8
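The numbers in Table I are consistent with each other, as a short check shows (the average seek time is the inverse of the random-seek rate, and the increase is taken relative to the bare disk):

    # Average seek time = 1 / (random seeks per second); latency increase
    # relative to the bare hard disk (numbers from Table I).
    seeks_per_s = {"Hard Disk": 200, "ENBD on FE": 175,
                   "ENBD on GbE": 179, "ENBD on SCI": 185}

    base = 1000.0 / seeks_per_s["Hard Disk"]          # 5 ms
    for name, rate in seeks_per_s.items():
        seek_ms = 1000.0 / rate
        increase = (seek_ms - base) / base * 100
        print(f"{name}: {seek_ms:.1f} ms, +{increase:.0f}%")
    # -> 5.7 ms/+14%, 5.6 ms/+12%, 5.4 ms/+8%, matching Table I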

III. RAID OVER NBD

As the Enhanced Network Block Device masquerades remote resources as local block devices, any layer built upon it will not notice that access to the block device is directed to a remote node.


Fig. 4. Possible setup for software RAID over network block devices. The central client node imports four resources from the surrounding four server nodes and builds a software RAID on top. In a cluster environment this setup may be expanded symmetrically, i.e. every node acts both as a client importing resources from the remaining nodes and as a server exporting resources to the four other nodes.

This feature makes it possible to build a redundant storage device which distributes its data to several remote nodes using the standard software RAID module as shipped with the Linux kernel. In such a configuration the software RAID will work on several devices imported via ENBD (and possibly additional local devices), where data can be stored redundantly. In order to operate RAID on top of the network block device layer efficiently, the ENBD has some special features to support RAID. Normally, ENBD hides errors from any upper layers in the kernel, which results in accesses being blocked until connectivity is reestablished. In the case of RAID over ENBD this is not necessarily the desired behaviour, because the RAID device then cannot decide whether to fault the ENBD component offline or not. ENBD therefore allows the error behaviour to be selected: either block or show errors, as desired. Further, when a RAID mirror device reintegrates a previously failed device, it brings the component up to date by recreating it based on the remaining components. This can take a great deal of time – although it does not interfere with the normal operation of the device, it is a period in which redundancy is lost, the load on the affected nodes is increased and the available network bandwidth is reduced.

Fig. 5. Network traffic over time during a network error. Note: for reasons of clarity, the traffic of the faulty server (red) is scaled by a factor of 0.9. Only the client, the faulty server and one of the three remaining servers are shown. The application using the RAID device in this scenario is the dd(1) command.

To speed up the reintegration, the ENBD samples at regular intervals to see if it can increase speed by shifting into an alternate mode in which it skips block writes after pre-reading both sides and finding the MD5 sums to be the same. When most of the writes are rewrites of the same material, as in the case of a RAID mirror resync, a speedup occurs and the ENBD will switch into the alternate mode. In this mode ENBD can exceed the maximum transfer speed of the transport medium (an observed 20 MB/s on a 100BT network is not uncommon).

A possible scenario of RAID over network block devices is depicted in Figure 4. Here, a client node imports resources from four server nodes. The imported block devices are arranged in a RAID5 array on the client. This setup has been used to investigate the functionality and performance of software RAID over ENBD.
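The resync short-cut described above can be illustrated with a toy sketch (a conceptual illustration only, not ENBD's implementation): both sides pre-read each block and compare MD5 sums, and only blocks whose sums differ are actually transferred.

    # Conceptual sketch of skipping identical rewrites during a resync.
    import hashlib

    BLOCK_SIZE = 4096

    def resync(source_blocks, target_blocks):
        """source/target: lists of equally sized byte blocks (the two mirror halves)."""
        transferred = 0
        for i, src in enumerate(source_blocks):
            # both ends pre-read their copy and only the checksums are compared
            if hashlib.md5(src).digest() != hashlib.md5(target_blocks[i]).digest():
                target_blocks[i] = src       # only differing blocks are transferred
                transferred += 1
        return transferred

    if __name__ == "__main__":
        src = [bytes([i % 256]) * BLOCK_SIZE for i in range(1000)]
        dst = list(src)
        dst[17] = b"\0" * BLOCK_SIZE         # a single stale block on the target
        print("blocks transferred:", resync(src, dst))   # -> 1 instead of 1000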

IV. FUNCTIONALITY AND PERFORMANCE

The network traffic during an error scenario is shown in Figure 5. After about 30 seconds of operation the network cable of the “faulty” server is unplugged. About ten seconds later the ENBD reports the error to the upper layer, in this case the RAID module. The RAID module tags the corresponding device as faulty, and the application can continue using the RAID device after about 30 seconds, the point in time at which the RAID starts to use the remaining three devices. As soon as the ENBD device is operational again, which happens automatically after the network cable is reconnected, it is also possible to remove the device from the RAID and add it back. The RAID will then resynchronize it during operation (not shown in the figure) without affecting the application that uses the device. Thus, RAID over ENBD can cope not only with the failure of a hard disk, but also with the failure of an entire node, without losing any data and with the data available to applications on the working nodes at all times. The blockwise read and write performance of this setup as measured with Bonnie++ is 41 MB/s and 11 MB/s, respectively. With RAID0, which does simple striping with no redundancy, the measured throughput for four disks is 55 MB/s for large reads and 30 MB/s for large writes. This indicates the parity overhead imposed on the system by the RAID5 algorithm. First tests have been started using a RAID configuration symmetrically on all 5 nodes in the setup of Figure 4: all nodes import from and export to all other nodes in a group, a so-called RAID cell. Regarding the intended use of a distributed RAID in a cluster environment, the cluster may be divided into such independent cells. The setup has undergone a 24-hour test of permanent reading. The system showed no errors. The aggregate throughput is 38 MB/s for FE and only slightly higher for GbE. This is limited by each disk serving four different applications, leading to a reduced effective disk bandwidth due to the increased time spent seeking. The CPU load was about 35% for FE and 30% for GbE, for both CPUs. So far the write tests were only successful for throttled write speeds.
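How the array can keep serving data with one ENBD component faulted offline is illustrated by the following generic sketch of RAID5-style parity reconstruction (a textbook XOR scheme, not the Linux md implementation): the missing block is recovered as the XOR of the surviving data blocks and the parity block.

    # RAID5-style block reconstruction: with parity P = D0 ^ D1 ^ D2, any single
    # lost block can be recomputed from the surviving blocks, which is why the
    # array keeps serving data after one ENBD component is marked faulty.

    def xor_blocks(*blocks):
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, byte in enumerate(block):
                out[i] ^= byte
        return bytes(out)

    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # data blocks on three server nodes
    parity = xor_blocks(d0, d1, d2)          # parity block on a fourth node

    # the node holding d1 becomes unavailable; reconstruct its block from the rest
    recovered = xor_blocks(d0, d2, parity)
    assert recovered == d1
    print("reconstructed block:", recovered)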

V. ADVANTAGES AND SHORTCOMINGS

In contrast with locally attached disks and a local hardware or software RAID, local RAID over remote disks increases reliability, as the failure or unavailability of a complete node is more probable than the failure of a single hard drive. In the case of data stored completely locally, the failure or unavailability of a node would result in the (at least temporary) loss of data, whereas the ClusterRAID provides access to the data in cases of hard disk, network or node failures. Most of the distributed architectures provide the availability of data in case of a hard disk failure, but only a few can tolerate the loss of a complete node [16][17]. Block level distributed RAID systems, such as [18], do networking at kernel level. As discussed in section II, the ENBD does networking in user space. Though the transfer of requests involves a transition between kernel and user space, the user space networking enables easy adaptation to other protocols or network technologies. As the ClusterRAID architecture provides a block level interface rather than a file system interface, its usage requires no special user libraries or special function calls by user applications. Some special applications, e.g. databases, like to do the block administration of the storage devices on their own, which is only feasible when using a block level interface.

Additionally, when using a file system on the ClusterRAID device, no changes to the file system are necessary. The modularity of the architecture, i.e. separating block device, redundancy (RAID), file system, and possibly a cluster file system, enables the easy exchange of components and thus provides maximum flexibility in contrast with integrated systems. If combined with the standard RAID5, the architecture inherits the characteristics of the RAID5 data distribution scheme. The space overhead is 1 − (N−1)/N = 1/N, where N is the number of available disks in the system. This is small when compared with architectures implementing mirroring schemes [17][18]. Of course, the inherent and well-known drawbacks of RAID5 are transferred to our architecture. They manifest themselves in a lack of write performance, especially for small write requests. This is due to the parity update overhead of RAID5, since every small write implies a read-modify-write cycle of the parity block. This is a shortcoming of the architecture, and how heavily this drawback weighs depends on the application accessing the device. For instance, the LHC data analysis software discussed in the introduction consists of applications that write once, but read often. Thus, a lack of write performance is not a major drawback in this case. The possibility of adding nodes or disks to the system during runtime depends on the capability of the redundancy layer to do so. Nonetheless, as for standard RAID, it is possible to define spare ENBD devices, which are attached in the case of failure. Currently, there is no single I/O space (SIOS) for the pooled storage resources. This is not regarded as part of the block device layer, but rather of a distributed file system to be built on top. To sum up, the advantage of the architecture lies in its read performance and the possibility of coping with complete node failures. The write performance is, due to the inherent characteristics of RAID5, a shortcoming.
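The small-write penalty and the space overhead discussed above can be made concrete with a toy model (a generic sketch of the RAID5 read-modify-write parity update, not the kernel's md code): one logical block update costs two reads and two writes, and one block per stripe is consumed by parity.

    # RAID5 small-write cycle: updating a single data block costs two reads and
    # two writes (read old data, read old parity, write new data, write new parity).

    def xor(a, b):
        return bytes(x ^ y for x, y in zip(a, b))

    class Raid5Stripe:
        """One stripe: N-1 data blocks plus one parity block (toy in-memory model)."""
        def __init__(self, data_blocks):
            self.data = list(data_blocks)
            self.parity = data_blocks[0]
            for block in data_blocks[1:]:
                self.parity = xor(self.parity, block)
            self.ios = 0                       # count block I/Os

        def small_write(self, idx, new_data):
            old_data = self.data[idx]          # read old data block   (I/O 1)
            old_parity = self.parity           # read old parity block (I/O 2)
            self.parity = xor(xor(old_parity, old_data), new_data)
            self.data[idx] = new_data          # write new data block  (I/O 3)
            self.ios += 4                      # plus the parity write (I/O 4)

    stripe = Raid5Stripe([b"AAAA", b"BBBB", b"CCCC", b"DDDD"])   # N = 5 incl. parity
    stripe.small_write(2, b"XXXX")
    print("block I/Os for one logical write:", stripe.ios)       # -> 4

    # Space overhead 1 - (N-1)/N = 1/N of the raw capacity:
    for n in (4, 5, 8):
        print(f"N = {n}: parity overhead = {1 / n:.0%}")         # e.g. N=5 -> 20%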

VI. RELATED WORK

This section examines the interface abstraction and degree of fault-tolerance of some of the projects dealing with distributed data storage architectures similar to the ClusterRAID. TickerTAIP [16] is a parallel architecture for disk arrays with distributed controller functionality. It is one of the few systems that is able to cope with node failures. Data consistency is guaranteed by sequencing user requests. Swift/RAID [19] is a UDP-based distributed RAID system. Through a user library it provides an application interface with the conventional Unix file operations. For each user request an execution plan is assembled and transmitted to the server nodes. This plan determines the distribution of the data over the physical devices. Distribution schemes according to RAID levels 0, 4 and 5 have been implemented. Petal [17] is a distributed block level storage system providing the abstraction of shared virtual disks. It can tolerate disk, network and node failures. Chained Declustering [21] is used to distribute the data over multiple storage servers. For the use of the distributed virtual disks the distributed file system Frangipani [22] has been developed.

The NASD [23] architecture embeds the disk management functionality into the device electronics. The aggregation of storage devices is managed by central servers, whereas the data is transferred directly between device and client. Instead of a block-type interface, an object-store interface is provided. Swarm [24] is a network storage system designed to run on a cluster of network-attached storage devices. It uses a striped log abstraction to simplify the distributed control of the system and to ensure the availability of the data in case of a server failure. RAID-x/OSM [18] is a block level distributed RAID architecture for Linux clusters. It uses orthogonal mirroring and striping to ensure both reliability and overall system performance. The architecture provides a single I/O space for the pooled storage using concurrency control at the block level.

VII. SUMMARY AND OUTLOOK

Upcoming particle physics experiments will deploy COTS clusters with online disk storage capacities of around 1 PB. To make this storage available to applications in a resilient and efficient way is the goal of the ClusterRAID architecture, with the Enhanced Network Block Device as its central building block. We have shown that the prototype system built is insensitive, in terms of data availability, to the failure of a complete node. The read performance of the system is mainly limited by the speed of the hard disk and network components used, not by the system itself. The improvement of the write performance is the subject of further investigation by the authors, although it is their belief that the write throughput is mainly limited by the inherent characteristics of the distribution scheme and the parity overhead of the RAID5 algorithm. Bandwidth and latency are only slightly affected by the additional network layer in the RAID system. The CPU overhead, however, may be of more concern. Since a network transaction is involved in any access to the data, the additional computational overhead has to be considered carefully when designing such a system. Although caching in the block cache of the virtual memory system of the kernel or the MD5 checksums inside the ENBD may help to reduce the network traffic in certain circumstances, the main focus must be on the network itself and on the protocol used. Network cards with TCP/IP offload engines that are able to process part of the protocol stack and relieve the host CPU, protocol alternatives like the low overhead protocol STP [25], or low overhead network technologies like SCI or Myrinet may help to save CPU cycles for the application. Reducing the computational overhead by using specialized low overhead protocols will be a central issue of our future research.

REFERENCES

[1] A Large Ion Collider Experiment at CERN LHC, http://alice.web.cern.ch/Alice/
[2] A Toroidal LHC Apparatus, http://atlas.web.cern.ch/Atlas/
[3] The Compact Muon Solenoid, http://cmsinfo.cern.ch/
[4] The Large Hadron Collider beauty Experiment, http://lhcb.web.cern.ch/lhcb/
[5] European Organization for Nuclear Research, http://www.cern.ch
[6] S. Bethke et al., Report of the Steering Group of the LHC Computing Review, CERN/RRB-D 2001-3.
[7] E. Grochowski, IBM Magnetic Hard Disk Drive Technology: Areal Density Perspective, http://www.almaden.ibm.com/sst/
[8] P. M. Levy, Giant magnetoresistance in magnetic layered and granular materials, Solid State Physics, Vol. 47, p. 367-462, 1994.

[9] S. A. Solin et al., Nonmagnetic semiconductors as read-head sensors for ultra-high density magnetic recording, Applied Physics Letters, Vol. 80, No. 21, p. 4012-4014, 27 May 2002.
[10] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang, Serverless Network Filesystems, ACM Transactions on Computer Systems, p. 41-79, January 1996.
[11] D. Patterson, G. Gibson, and R. Katz, A Case For Redundant Arrays Of Inexpensive Disks (RAID), International Conference on Management of Data, p. 109-116, June 1988.
[12] M. Stonebraker and G. A. Schloss, Distributed RAID – A New Multiple Copy Algorithm, Proceedings of the International Conference on Data Engineering, p. 430-437, February 1990.
[13] R. Ho, K. Hwang, and H. Jin, Design and Analysis of Clusters with Single I/O Space, Proceedings of the 20th International Conference on Distributed Computing Systems, June 1999.
[14] IEEE Standard 1596-1992, IEEE Standard for Scalable Coherent Interface (SCI), The Institute of Electrical and Electronics Engineers, Inc., 1993.
[15] Bonnie++ Homepage, http://www.coker.com.au/bonnie++/
[16] P. Cao et al., The TickerTAIP Parallel RAID Architecture, ACM Transactions on Computer Systems, Vol. 12, No. 3, p. 236-269, August 1994.
[17] E. K. Lee et al., Petal: Distributed Virtual Disks, Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, p. 84-92, October 1996.
[18] K. Hwang et al., Orthogonal Striping and Mirroring in Distributed RAID for I/O-Centric Cluster Computing, IEEE Transactions on Parallel and Distributed Systems, Vol. 13, No. 1, p. 26-44, January 2002.
[19] L. Cabrera and D. Long, Using distributed disk striping to provide high I/O rates, ACM Computing Systems, 4:405-436, 1991.
[20] G. Gibson et al., A Cost-Effective, High-Bandwidth Storage Architecture, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, October 1998.
[21] H. Hsiao and D. DeWitt, Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines, Proceedings of the 6th International Conference on Data Engineering, p. 456-465, 1990.
[22] C. Thekkath, T. Mann, and E. K. Lee, Frangipani: A Scalable Distributed File System, Proceedings of the ACM Symposium on Operating System Principles, p. 224-237, October 1997.
[23] G. Gibson et al., A Cost-Effective, High-Bandwidth Storage Architecture, Proceedings of the 8th Conference on Architectural Support for Programming Languages and Operating Systems, 1998.
[24] J. Hartman, I. Murdock, and T. Spalink, The Swarm Scalable Storage System, Proceedings of the 19th IEEE Conference on Distributed Computer Systems, June 1999.
[25] STP Project Homepage, http://oss.sgi.com/projects/stp/
