ISSN 0918-2802 Technical Report
Automatic Reconfiguration of an Autonomous Disk Cluster
Daisuke Ito and Haruo Yokota
TR01-0007, June 2001
Department of Computer Science, Tokyo Institute of Technology
Ookayama 2-12-1, Meguro, Tokyo 152-8552, Japan
http://www.cs.titech.ac.jp/
© The author(s) of this report reserve all the rights.
Automatic Reconfiguration of an Autonomous Disk Cluster

Daisuke Ito
Department of Computer Science, Tokyo Institute of Technology
[email protected]

Haruo Yokota
Global Scientific Information and Computing Center, Tokyo Institute of Technology
[email protected]

June 2001
Titech-CS Technical Report: TR01-0007
Abstract

Recently, storage-centric configurations, such as network attached storage (NAS) and storage area network (SAN) architectures, have attracted considerable attention in advanced data processing. For these configurations, high scalability, flexibility, and availability are key features, and central control is unsuitable for realizing them. We propose autonomous disks to enable distributed control in storage-centric configurations. Autonomous disks form a cluster in a network and implement the above key features using Event-Condition-Action (ECA) rules with a distributed directory. The user can change the control strategy by modifying the rules. The combination of rules and the distributed directory also manages load distribution automatically and provides the capability to reconfigure the cluster automatically. In this paper, we focus on the cluster-reconfiguration function of the autonomous disks, used to handle disk failures or to modify management strategies. To add disks to, or detach disks from, a cluster without a special server, autonomous disks use dynamic coordination and a multi-phase synchronization protocol similar to that used in distributed transaction commitment. The protocol allows usual operations to be accepted during reconfiguration. We are now implementing an experimental system using Java on PCs to demonstrate the effectiveness of the approach. The results of preliminary experiments indicate that the synchronization cost is acceptably small.
Keywords: autonomous disks, network disks, NAS, SAN, cluster, reconfiguration
1 Introduction

From the point of view of system configuration for advanced data processing, storage is becoming the center of the entire system; such systems are called storage-centric configurations. To implement scalable storage-centric systems, network-attached storage (NAS) and storage area network (SAN) architectures have recently attracted considerable attention. A NAS device is connected directly to an IP network and shared by multiple hosts on a local area network (LAN), while a SAN consists of a dedicated network, separate from the LAN, that uses serial connections to storage devices, e.g., Fibre Channel.

Disks are currently assumed to be passive devices in these configurations. That is, all their behavior is controlled by commands from their hosts via the network. Communication between disks and their hosts is therefore very frequent, and this limits I/O performance. Moreover, to make the system efficient, the placement of data, including replicas and spare-space control, is very important. Managing data location usually requires a dedicated central server. Because this server must also control all accesses to the system, it becomes a bottleneck as the system grows.

To make a storage-centric system scalable, by removing the performance bottlenecks, and reliable, by excluding complicated central control, distributed autonomous control of the storage nodes is essential. Technological progress, such as compact high-performance microprocessors for disk controllers and large semiconductor memories for disk caches, makes this capability feasible. We have proposed the concept of autonomous disks [1], which utilize disk-resident processing capability to manage the data stored on them. This approach increases flexibility, availability, and scalability while reducing network communication and the load on the hosts. In this paper, we propose a method for reconfiguring a cluster of autonomous disks.

The remainder of this paper is organized as follows. Section 2 briefly introduces the concept of the autonomous disk. In Section 3, we consider data placement in a cluster, causes for its reconfiguration, and the configuration information kept in each autonomous disk. We then describe cluster reconfiguration algorithms in Section 4, our experimental system in Section 5, and related work in Section 6. We summarize our work and describe possible future work in the final section.
2 Autonomous Disks

A set of autonomous disks is configured as a cluster in a network for either a NAS or a SAN. Data is distributed within the cluster so that it can be accessed uniformly. The disks accept simultaneous accesses from multiple hosts via the network. Disk-resident processors handle data distribution and load skew to enable efficient data processing. They are also capable of tolerating failures in disks and their controllers, and of reconfiguring the cluster to detach damaged disks or to modify management strategies. Data distribution, skew handling, and fault tolerance are completely transparent to the hosts; the hosts are not involved in the inter-disk communication that realizes these functions. This provides high scalability, as there is no central controller.

There may be many approaches to achieving these capabilities using disk-resident processing power. In [1], we proposed the use of Event-Condition-Action (ECA) rules with command layers and a stream interface.
[Figure 1: Data flow in autonomous disks. Hosts 0..m access the disks via the network; the disks hold controllers, rules, distributed directories, and primary and backup fragments, and Disk 3 keeps the log.]
ECA rules have been adopted in active databases [2, 3]. The stream interface provides a logical treatment of data objects, such as the OBSD proposed by NSIC [4]. The combination of the rules and the stream interface gives the user sufficient flexibility. Moreover, to provide transparency to the hosts, our approach adopts the Fat-Btree [5] as a distributed directory. Combined with the rules, it enables each disk to accept requests for any data in the cluster and balances the load within the cluster.

Figure 1 illustrates an example of data flow for an insert operation in an autonomous disk cluster. In this example, Host i submits a request to insert a stream to Disk 0 via the network. However, the stream should be stored on Disk 1 for proper data distribution. A rule for handling the insert is triggered; it derives the destination disk, Disk 1, using the distributed directory, and transmits the request to Disk 1. On Disk 1, the same rule is triggered again, but because the distributed directory indicates that the current disk is the appropriate one, it executes the actual insert operation instead of transmitting the request. During the insertion, Disk 1 sends log information to the log disk (Disk 3) indicated by the mapping information, according to the write-ahead log (WAL) protocol [6], and then executes the insert operation locally. Finally, it returns the value true to Host i.

Update logs accumulate gradually on Disk 3, and each put-log operation triggers a check of the log size. When the log size exceeds a threshold, catch-up operations are invoked asynchronously by another rule. In these operations, the accumulated logs are transmitted to and applied at the corresponding backup disk (Disk 2).

We can change management strategies by modifying rule descriptions. For example, we can easily change the number of logs, and the triggers for backup and catch-up operations, by rewriting rules. More detailed descriptions of the properties and behavior of autonomous disks, and examples of actual rules, are given in [1].
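To make the rule-driven control concrete, the following is a minimal Java sketch of how an insert-routing rule of this kind might be structured. The interfaces Directory and DiskServices and the class InsertRoutingRule are hypothetical names introduced only for illustration; the actual rule language and interfaces of the prototype are described in [1].

```java
// Hypothetical sketch of an ECA-style insert rule; names are illustrative only.
interface Directory {
    int lookupDestination(String key);   // derive the destination disk from the distributed directory
}

interface DiskServices {
    int localDiskId();
    void forward(int destDiskId, String key, byte[] stream);  // re-trigger the rule on the destination
    void writeLog(int logDiskId, String key, byte[] stream);   // WAL: log before the local update
    void insertLocally(String key, byte[] stream);
    int logDiskFor(int diskId);                                 // from the configuration information
}

final class InsertRoutingRule {
    private final Directory directory;
    private final DiskServices disk;

    InsertRoutingRule(Directory directory, DiskServices disk) {
        this.directory = directory;
        this.disk = disk;
    }

    // Event: an insert request arrives.  Condition: is this disk the right destination?
    boolean onInsert(String key, byte[] stream) {
        int dest = directory.lookupDestination(key);
        if (dest != disk.localDiskId()) {
            disk.forward(dest, key, stream);   // Action: transmit the request; the same rule fires there
            return true;
        }
        disk.writeLog(disk.logDiskFor(dest), key, stream);  // Action: send the log to the log disk first (WAL)
        disk.insertLocally(key, stream);                    // then perform the actual insert
        return true;                                        // finally report success to the host
    }
}
```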
3 Cluster Reconfiguration on Autonomous Disks

In this paper, we focus on the cluster reconfiguration function of the autonomous disks. To add disks to, or detach disks from, a cluster with no central controller, synchronization among all members of the cluster is required. We propose a synchronization protocol that reconfigures an autonomous disk cluster while still accepting usual operations. The protocol resembles the two-phase commit protocol for distributed transactions [6].
3.1 Data Placement in a Disk Cluster

To make a system reliable, redundancy is essential. The parity-calculation technique used in RAID5 systems [7, 8] can be applied to autonomous disks, and it reduces the amount of redundant data required. Here, however, we consider a simple primary-backup replication approach. Parity calculation requires frequent communication between disks, which suits tightly coupled configurations. Moreover, storage cost has recently ceased to be a serious constraint: large-capacity disks are available at reasonable prices because of progress in disk manufacturing. Thus, as shown in the previous example, we adopt a primary-backup approach as the first step in making an autonomous-disk cluster reliable.

To place primary and backup copies, the area of a disk is divided into multiple fragments. A backup fragment keeps data logically identical to that of a primary fragment. These fragment pairs must be placed on physically different disk devices to tolerate a disk failure. If high reliability, tolerating multiple disk failures, is required, each primary fragment needs multiple backup fragments. When data in a primary fragment is updated, the data in the corresponding backup fragments must also be updated to maintain consistency. However, updating a primary and its backups simultaneously increases the response time of the update. Therefore, we take an approach of asynchronous update using logs. Sequential accesses for writing logs to a disk yield the best disk performance by eliminating head seeks and rotational latency.

There are two types of backups: physical and logical. 'Physical' means that disk pages of the primary and backups contain physically identical data, whereas 'logical' means that backup fragments keep streams logically identical to those of the primary fragment. We adopt logical backup because of its flexibility: logical backups allow independent directories for the backups. Therefore, we prepare independent distributed directories for both the primary and the backups.

In the example of Figure 1, Disk i has a primary fragment (Primary i) and a backup fragment (Backup j) for Disk j's primary, where j = (i − 1) mod n and n is the number of data disks. We can extend this approach to multiple backups; Disk i also has a secondary backup fragment for Disk k, where k = (i − 2) mod n. This type of configuration is called staggered allocation [6]. Although the log could be distributed among the data disks, in the example it is gathered on one disk so that log-disk accesses stay sequential, without interruption by accesses to other fragments on the same disk. If the log disk becomes a performance bottleneck, we can prepare multiple log disks.

Other types of allocation are possible. For instance, if a cluster contains several types of disks with different properties, such as storage capacity and access performance, it is better to house the combination of a primary and its backups within a group of disks of the same type. The size of each fragment may also vary according to the disk properties; higher-performance disks keep a larger amount of data than lower-performance ones. This type of configuration is called group allocation.
Another situation is a cluster configured over distant locations, where the backups should be kept at another site to tolerate power failures or disasters. This type of configuration is called location-dependent allocation. It is important to deal with such varied situations flexibly. Our motivation is to manage these types of allocation using ECA rules, and to preserve the allocation across reconfigurations. Each disk also keeps an identical set of ECA rules describing the disk management strategies. Although the contents of the rules could in principle vary from disk to disk, we assume identical rule sets.
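As a small worked illustration of the staggered allocation described above, the following Java fragment computes the placement indices. The class and method names are ours, not the prototype's, and the sketch assumes the simple case where the n data disks are numbered 0 to n − 1.

```java
// Staggered allocation (cf. the Figure 1 example): Disk i keeps the b-th backup fragment of
// Disk j's primary, where j = (i - b) mod n and n is the number of data disks.
final class StaggeredAllocation {
    // Whose primary does Disk i back up with its b-th backup fragment?
    // Java's % can return negative values, so normalize the result into 0..n-1.
    static int primaryBackedUpBy(int i, int b, int n) {
        return ((i - b) % n + n) % n;
    }

    // Inverse direction: which disk holds the b-th backup of Disk j's primary?
    static int backupHolder(int j, int b, int n) {
        return (j + b) % n;
    }
}
// Example with n = 3 data disks: primaryBackedUpBy(2, 1, 3) == 1, i.e., Disk 2 keeps the
// first backup of Disk 1's primary, matching the data flow example of Figure 1.
```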
3.2 Causes for Reconfiguration

There are basically two causes requiring reconfiguration of a cluster of autonomous disks: disk failures and modification of management strategies.

First, we consider disk failures. In the above model, there are two cases: failures of data disks and failures of log disks. Failures can be detected through communication between disks, and the failed disks should be detached from the cluster. A detachment decreases the number of disks in the cluster.

We can categorize modifications of the management strategies for a cluster of autonomous disks as follows.

• Adding new data disks to increase the total storage capacity, and/or to improve performance by data redistribution.

• Adding new log disks to improve system reliability, and/or to enhance performance by removing access congestion.

• Adding new backup fragments in all data disks to improve system reliability, and/or to enhance read performance by replicating hot spots.

Redundant data disks, log disks, and backup fragments may also be detached from the cluster for the opposite reasons. These can be seen as examples of guidelines for strategy modification related to cluster reconfiguration. This type of reconfiguration is invoked intentionally. That is, we distinguish two types of cluster reconfiguration:

• voluntary reconfiguration by strategy modification, and

• involuntary reconfiguration because of failures.

Voluntary reconfigurations can increase or decrease the number of disks in the cluster, while involuntary reconfiguration only decreases it. Note that a reconfiguration does not always change the number of disks in the cluster; adding or removing backup fragments leaves the cluster size unchanged. On the other hand, we do not treat the replacement of a disk with a new one as a single operation, but as two operations: detaching and adding. Hot-spare-disk treatment can also be included in the strategy; here, however, we assume a cold spare to keep the discussion simple. Figures 2, 3, and 4 show examples of adding and detaching a data disk, adding and removing a log disk, and adding and removing backup fragments, respectively.
[Figure 2: Adding or detaching a data disk (Disk 4)]
[Figure 3: Adding or removing log disks without changing the number of disks]
[Figure 4: Adding or removing backup fragments in data disks]
3.3 Configuration Information

As described above, we assume no special network node in a cluster that monitors the network to manage the cluster configuration. Any disk in a cluster can invoke reconfiguration, either by accepting a command from a user or by detecting a failure in the cluster. This requires that all members of the cluster hold identical information about the cluster configuration. Table 1 lists the main items of this information.

Table 1: Configuration Information in Each Disk
  The number of data disks:                     DataDiskNum
  An array of IDs of data disks:                DataDiskID[DataDiskNum]
  The number of log disks:                      LogDiskNum
  A table mapping data disks to log disks:      LogMap[ID][LogDiskNum]
  The number of backup fragments per data disk: BackDiskNum
  A table mapping data disks to their backups:  BackMap[ID][BackDiskNum]
Because every cluster member holds the configuration information, usual operations, such as sending logs to the log disk or catching up the contents of a backup, need no extra communication to decide the destination of their messages. However, when part of the information is updated to reconfigure the cluster, synchronizing communication is required to keep the information consistent. The configuration information is updated far less frequently than the usual operations are executed. In Section 4 we propose algorithms to synchronize the configuration information; they have more than one phase, similar to the two-phase commit protocol for distributed transactions.
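To make the discussion concrete, here is a minimal Java sketch of how the configuration information of Table 1 could be held on each disk. The class name ClusterConfig, the use of lists and maps keyed by disk ID, and the helper method are our assumptions, not the prototype's actual data structures.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical container for the per-disk configuration information of Table 1.
final class ClusterConfig {
    int dataDiskNum;                                     // DataDiskNum
    final List<Integer> dataDiskId = new ArrayList<>();  // DataDiskID[DataDiskNum]
    int logDiskNum;                                      // LogDiskNum
    final Map<Integer, int[]> logMap = new HashMap<>();  // LogMap[ID][LogDiskNum]
    int backDiskNum;                                     // BackDiskNum
    final Map<Integer, int[]> backMap = new HashMap<>(); // BackMap[ID][BackDiskNum]

    // Usual operations consult only the local copy; for example, the l-th log disk
    // used by a data disk is found without any extra communication.
    int logDiskFor(int dataDisk, int l) {
        return logMap.get(dataDisk)[l];
    }
}
```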
3.4 Modification of the Configuration Information

Before describing the synchronization algorithms, we consider how the configuration information is modified to change the cluster. Here, we assume that the number of data disks, DataDiskNum, is n, and assume the staggered allocation described in 3.1. To add a new data disk, the following steps modify the configuration information (a code sketch of these steps is given below).

Step 1: increase DataDiskNum by one,

Step 2: assign the new disk identifier to DataDiskID[DataDiskNum-1],

Step 3: assign the identifier of each log disk to LogMap[DataDiskID[DataDiskNum-1]][L], where L runs from 0 to LogDiskNum-1, and

Step 4: assign the identifier of each disk keeping a corresponding backup fragment to BackMap[DataDiskID[DataDiskNum-1]][B], where B runs from 0 to BackDiskNum-1, modulo DataDiskNum - LogDiskNum.

To detach a data disk, decrease DataDiskNum by one and check the restrictions on BackMap. There may be cases where consistency cannot be kept, such as in Figure 5, where two backup fragments end up on one disk. In such cases, the administrator must be involved in the reconfiguration.
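Continuing the hypothetical ClusterConfig sketch from 3.3, Steps 1 to 4 could be applied to one member's local copy roughly as follows. The index bookkeeping is simplified and follows our reading of the staggered allocation; adjusting the BackMap entries of neighbouring disks, and the data migration itself, are handled separately (see Section 4).

```java
// Sketch of Steps 1-4 applied to one member's local copy when data disk `newDiskId` joins.
final class AddDataDiskUpdate {
    static void apply(ClusterConfig c, int newDiskId, int[] logDiskIds) {
        c.dataDiskNum++;                               // Step 1: DataDiskNum += 1
        c.dataDiskId.add(newDiskId);                   // Step 2: DataDiskID[DataDiskNum-1] = new ID
        c.logMap.put(newDiskId, logDiskIds.clone());   // Step 3: the new disk writes to every log disk

        // Step 4: under staggered allocation, backups of the new primary are kept by the
        // logically next data disks, wrapping around modulo the number of data disks.
        int n = c.dataDiskNum;
        int pos = n - 1;                               // logical position of the new disk
        int[] backups = new int[c.backDiskNum];
        for (int b = 0; b < c.backDiskNum; b++) {
            backups[b] = c.dataDiskId.get((pos + 1 + b) % n);
        }
        c.backMap.put(newDiskId, backups);
    }
}
```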
[Figure 5: An example of not satisfying the backup restriction]

To add a log disk without changing the total number of disks, first detach a data disk from the cluster, then increase LogDiskNum by one and assign the identifier of the new log disk L to LogMap[L][n]. To remove a log disk without changing the total number of disks, decrease LogDiskNum by one and add a data disk as described above.

To add backup fragments, BackDiskNum is increased by one, and the identifier of each disk keeping a corresponding backup fragment, B, is assigned to BackMap[DataDiskID[B]][n] modulo n. To remove backup fragments, BackDiskNum is simply decreased by one.

To treat a failed data disk,
detach the failed data disk according to the steps described above.
To treat a failed log disk, detach the failed log disk according to the steps described above.
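The BackMap restriction mentioned above can be checked mechanically before a voluntary detachment is committed. The following is a rough sketch over the hypothetical ClusterConfig from 3.3; it flags both a backup placed on its own primary's disk and two backup fragments of one primary collapsing onto a single disk, as in the Figure 5 example.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the backup-placement restriction check applied to a tentative configuration.
// If it fails, the administrator must be involved in the reconfiguration.
final class BackupRestrictionCheck {
    static boolean satisfied(ClusterConfig c) {
        for (int primaryDisk : c.dataDiskId) {
            Set<Integer> holders = new HashSet<>();
            for (int holder : c.backMap.get(primaryDisk)) {
                if (holder == primaryDisk) return false;   // backup on the same physical disk as its primary
                if (!holders.add(holder)) return false;    // two backup fragments of one primary on one disk
            }
        }
        return true;
    }
}
```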
4 Cluster Reconfiguration Algorithms

In this section, we consider cluster reconfiguration algorithms corresponding to the causes described in 3.2. From the point of view of scalability, flexibility, and availability, a special server is not suited to autonomous disks [1]. Therefore, we assume no central server managing cluster reconfiguration. However, all members of an autonomous disk cluster must keep the common configuration information described above, which means that updates of the configuration information must be synchronized. The synchronization requires a coordinator, but a static coordinator would contradict the assumption of having no central server. Therefore, we adopt a dynamic coordinator, as in the two-phase commit protocol for distributed transactions; any member can become the coordinator. We also adopt the multiple phases of distributed transaction commit protocols to synchronize the reconfiguration.
4.1 An Algorithm for Adding a New Data Disk

The following is an outline of the algorithm for adding a data disk Da to the cluster.

Add-single-disk Algorithm

Step 1: Da chooses a disk Dc in the cluster as the coordinator of the reconfiguration. Any disk in the cluster can be the coordinator.

Step 2: If Dc is not involved in another reconfiguration process, Dc returns an accept message to Da, and the algorithm goes to Step 3. Otherwise, Dc returns a reject message to Da, Da sleeps for an adequate period, and the algorithm goes back to Step 1.

Step 3: Dc sends prepare-reconfiguration messages containing the ID of Da to all members of the cluster.

Step 4: If a member disk is not involved in another reconfiguration process and has no problem being a member of this reconfiguration, it returns a ready-reconfiguration message to Dc. Otherwise, it returns a reject-reconfiguration message.

Step 5: If Dc receives ready-reconfiguration messages from all members, it sends the new configuration information to all members including Da, and goes to Step 6. If it receives at least one reject message or times out, it sends rollback messages to all members and goes back to Step 3.

Step 6: When a member disk receives the new configuration from Dc, it returns a configuration-received message to Dc.

Step 7: When Dc has received configuration-received messages from all members, it sends change-configuration messages to all members including Da.

The above algorithm requires many communications between member disks, similar to the two-phase commit protocol, which makes its cost rather high. However, we assume that cluster reconfiguration is infrequent, so the synchronization cost of reconfiguration is not detrimental to the performance of the total system. After the reconfiguration, the data itself migrates automatically through the mechanism of the Fat-Btree used as the distributed directory: the Fat-Btree detects the amount of data skew and flattens it by migrating data [5]. This migration is executed independently, after the reconfiguration algorithm finishes.
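To make the message flow concrete, here is a much-simplified Java sketch of the coordinator side of the add-single-disk algorithm (Steps 3 to 7). The Messenger abstraction and all class names are our own illustration, not the prototype's code; in the experimental system these messages would travel over the socket connections between disks.

```java
import java.util.List;

// Simplified coordinator-side sketch of the add-single-disk algorithm (Steps 3-7).
interface Messenger {
    // Sends `type` (with payload) to every member and returns their replies; an empty list stands for a timeout.
    List<String> sendToAll(String type, String payload);
}

final class AddSingleDiskCoordinator {
    private final Messenger messenger;

    AddSingleDiskCoordinator(Messenger messenger) {
        this.messenger = messenger;
    }

    boolean run(int newDiskId, String newConfiguration) {
        while (true) {   // retries until the reconfiguration succeeds; a real system would bound this
            // Step 3: ask every member to prepare for the reconfiguration.
            List<String> votes = messenger.sendToAll("prepare-reconfiguration", Integer.toString(newDiskId));
            // Steps 4/5: any reject (or a timeout) rolls the attempt back and retries from Step 3.
            if (votes.isEmpty() || votes.contains("reject-reconfiguration")) {
                messenger.sendToAll("rollback", "");
                continue;
            }
            // Steps 5/6: distribute the new configuration and wait for acknowledgements.
            List<String> acks = messenger.sendToAll("new-configuration", newConfiguration);
            if (!acks.stream().allMatch("configuration-received"::equals)) {
                messenger.sendToAll("rollback", "");
                continue;
            }
            // Step 7: everybody acknowledged, so tell all members (including the new disk) to switch.
            messenger.sendToAll("change-configuration", "");
            return true;
        }
    }
}
```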
4.2 An Algorithm for Adding Multiple New Data Disks

It may be necessary to add multiple disks to a cluster simultaneously. The previous algorithm for adding a new data disk can be used in this case by applying it multiple times. However, if the add-single-disk algorithm is applied repeatedly, a separate synchronization phase is required for each application, and we must also consider the overlap of data migration caused by the earlier additions. We can instead prepare a dedicated add-multiple-disk algorithm, which reduces the time for reconfiguration. Because the difference between the two algorithms is small and obvious, we omit the details of the add-multiple-disk algorithm here. Similar multiplexing approaches are available for the following reconfiguration algorithms; we omit their details as well.
4.3 An Algorithm for Detaching a Data Disk

We first consider an algorithm for detaching a data disk voluntarily; an algorithm for detaching a disk involuntarily, to handle disk failures, follows in a later subsection. The major difference between the two cases is the migration of data. Because the target disk of a voluntary detachment is alive during the algorithm and able to participate in it, the primary data on the disk can be migrated to other disks. In an involuntary detachment, on the other hand, the primary data on the target disk cannot be accessed, because the disk is out of order during the algorithm. To allow data migration during reconfiguration, this algorithm has three phases, whereas the algorithm for adding a data disk has two.

Detach-single-disk Algorithm

Step 1: The disk Dr to be detached becomes the coordinator.

Step 2: Dr sends prepare-reconfiguration messages to all disks in the cluster to initiate the detach process.

Step 3: If a member disk is not involved in another reconfiguration process and has no problem being a member of this process, it returns a ready-reconfiguration message to Dr. Otherwise, it returns a reject-reconfiguration message.

Step 4: If Dr receives ready-reconfiguration messages from all members, the algorithm goes to Step 5. If it receives at least one reject message or times out, it sends rollback messages to all members and goes back to Step 1.

Step 5: Dr migrates its own primary data to the disk Dq that is logically next to Dr. During the migration, automatic skew handling using the Fat-Btree is suspended.

Step 6: After the migration finishes, Dq becomes the new coordinator and sends the new configuration information to all members.

Step 7: When a member disk receives the new configuration from Dq, it returns a configuration-received message to Dq.

Step 8: When Dq has received configuration-received messages from all members, it sends change-configuration messages to all members.

Step 9: Each member disk switches its configuration and sends a finish-reconfiguration message to Dq.

Step 10: Dq informs Dr that it can be detached.
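The member (participant) side is essentially the same for the algorithms in this section: a member votes ready only when it is not already involved in another reconfiguration, holds the proposed configuration, and switches or rolls back on the coordinator's decision. A hedged sketch, reusing the hypothetical ClusterConfig from 3.3 and hypothetical handler names, might look as follows.

```java
// Hypothetical participant-side handler shared by the reconfiguration algorithms.
final class ReconfigurationParticipant {
    private boolean inReconfiguration = false;
    private ClusterConfig currentConfig;
    private ClusterConfig pendingConfig;

    ReconfigurationParticipant(ClusterConfig initial) {
        this.currentConfig = initial;
    }

    synchronized String onPrepare() {
        if (inReconfiguration) {
            return "reject-reconfiguration";   // already serving another coordinator
        }
        inReconfiguration = true;
        return "ready-reconfiguration";
    }

    synchronized String onNewConfiguration(ClusterConfig proposed) {
        pendingConfig = proposed;              // hold the proposal until the coordinator commits
        return "configuration-received";
    }

    synchronized void onChangeConfiguration() {
        currentConfig = pendingConfig;         // switch; usual operations keep running throughout
        pendingConfig = null;
        inReconfiguration = false;
    }

    synchronized void onRollback() {
        pendingConfig = null;                  // discard the proposal and stay on the old configuration
        inReconfiguration = false;
    }

    synchronized ClusterConfig config() {
        return currentConfig;
    }
}
```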
4.4 Algorithms for Adding and Removing Log Disks

To add a log disk without changing the number of disks in the cluster, a data disk must first be detached from the cluster, and then the configuration information is simply updated so that the detached data disk can be treated as a log disk, as described in 3.4. To remove a log disk while keeping the same number of disks, the configuration information must first be modified to detach the log disk from the cluster. Then the disk is added as a new data disk to the cluster using the algorithm for adding a data disk.
4.5 Algorithm for Adding and Removing Backup Fragments

The following is an outline of the algorithm for adding and removing backup fragments.

Change-backup-number Algorithm

Step 1: A host chooses a disk Dc in the cluster as the coordinator.

Step 2: If Dc is not involved in another reconfiguration process, Dc returns an accept message to the host, and the algorithm goes to Step 3. Otherwise, Dc returns a reject message to the host, the host sleeps for an adequate period, and the algorithm goes back to Step 1.

Step 3: Dc sends prepare-reconfiguration messages containing the number of backups to all members of the cluster.

Step 4: If a member disk is not involved in another reconfiguration process and has no problem being a member of this reconfiguration, it returns a ready-reconfiguration message to Dc. Otherwise, it returns a reject-reconfiguration message.

Step 5: If Dc receives ready-reconfiguration messages from all members, it sends change-configuration messages to all members, and goes to Step 6. If it receives at least one reject message or times out, it sends rollback messages to all members and goes back to Step 3.

Step 6: Each member disk changes its configuration and returns a finish-configuration message to Dc. When Dc has received finish-configuration messages from all members, it sends a finish-configuration message to the host.
4.6 Algorithms for Handling Failures

We do not assume special hardware in autonomous disks to detect disk failures; the disks can detect failures during communication with each other. In our experimental system, the disks are connected by sockets, so a failure can be treated as an exception on a socket connection. In other implementations, disk failures can be detected by software if some delay is allowed; the log-based backup mechanism with the WAL protocol tolerates this type of delay. After detecting a failure, the following algorithm is invoked to reconfigure the cluster.
[Figure 6: Treatment of a disk failure]

Failure-Handling Algorithm

Step 1: The disk Dd that detects the failure becomes the coordinator.

Step 2: Dd sends prepare-reconfiguration messages containing the identifier of the failed disk to all members of the cluster except the failed disk.

Step 3: If a member disk is not involved in another reconfiguration process and has no problem being a member of this reconfiguration, it returns a ready-reconfiguration message to Dd. Otherwise, it returns a reject-reconfiguration message.

Step 4: If Dd receives at least one reject message, it goes back to Step 2 after some time period. If Dd receives ready-reconfiguration messages from all members except the failed disk, it invokes a catch-up process that brings the backup fragments corresponding to the primary fragment of the failed disk up to date, and migrates the contents of the logically next disks, as depicted in Figure 6.

Step 5: Dd sends change-configuration messages to all members of the cluster except the failed disk.
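As noted at the beginning of this subsection, the experimental system treats a broken socket connection as a disk failure. A minimal sketch of how such detection could trigger the failure-handling algorithm is shown below; the PeerLink and FailureHandler names are illustrative only, not the prototype's classes.

```java
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.net.Socket;

// Sketch of failure detection in the experimental setting: an I/O exception on a socket
// during an ordinary operation is interpreted as a failure of the peer disk, which then
// triggers the failure-handling algorithm above.
final class PeerLink {
    interface FailureHandler {
        void onDiskFailure(int failedDiskId);   // e.g., start acting as coordinator Dd
    }

    private final int peerDiskId;
    private final Socket socket;
    private final FailureHandler handler;

    PeerLink(int peerDiskId, Socket socket, FailureHandler handler) {
        this.peerDiskId = peerDiskId;
        this.socket = socket;
        this.handler = handler;
    }

    void send(Object message) {
        try {
            ObjectOutputStream out = new ObjectOutputStream(socket.getOutputStream());
            out.writeObject(message);
            out.flush();
        } catch (IOException e) {
            // The connection failed: treat the peer as a failed disk and reconfigure.
            handler.onDiskFailure(peerDiskId);
        }
    }
}
```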
5 An Experimental System

We are now implementing an experimental system of autonomous disks. Because no disk-resident processor is yet programmable from the outside, we currently use Java on PCs connected to an ordinary LAN as emulators of autonomous disks, to evaluate the feasibility of autonomous disks and the reconfiguration cost of the cluster. Figure 7 illustrates the configuration of the experimental system. We developed six basic components in Java: the communication, lock, data, directory, log, and rule managers. Requests are queued between the communication and rule managers, and a separate queue is prepared for processing logs, to avoid deadlocks. The rules and the configuration information are stored on the disk and loaded into the system; the loaded configuration information is used by all components. When a reconfiguration process finishes, the configuration information both on disk and in memory is updated.
[Figure 7: A configuration of the experimental system. The communication, lock, rule, directory, data, and log managers, written in Java, are connected by send, receive, and log queues; the rules, configuration information, directories, primary and backup data, and logs reside on the disk.]

Using an experimental system constructed from a 100BaseT switching hub and Linux 2.2 PCs, each with a 700 MHz Celeron processor and 128 MB of memory, we are evaluating the performance of commands such as inserting, deleting, and retrieving data. For example, the response time of an insert request is about 6 to 8 ms. The details of this evaluation will be reported in another paper. Here, we describe a preliminary experiment to evaluate the cost of reconfiguration. Table 2 lists the average time for the add-single-disk algorithm of 4.1 while varying the number of disks in the cluster. The figures do not include the cost of data migration or of the disk accesses for writing the configuration information. Nevertheless, they demonstrate that the overhead of multi-phase synchronization is not large, even with commodity components.

Table 2: Time for adding a single disk
  The number of disks in a cluster:    2     3     4     5     6     7
  Time for reconfiguration (ms):     20.0  25.4  28.3  29.0  27.1  31.3
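Because the managers are coupled through queues, the message flow sketched in Figure 7 could be emulated roughly as follows. This sketch uses java.util.concurrent and hypothetical class names; it is our illustration of the design, not the actual implementation.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Rough sketch of the queue-based coupling between the communication manager and the
// rule manager; a separate queue for log processing avoids deadlocks.
final class RequestQueues {
    static final class Request {
        final String operation;   // e.g., "insert", "delete", "retrieve", "put-log"
        final byte[] payload;
        Request(String operation, byte[] payload) {
            this.operation = operation;
            this.payload = payload;
        }
    }

    final BlockingQueue<Request> receiveQueue = new LinkedBlockingQueue<>();  // network -> rule manager
    final BlockingQueue<Request> sendQueue = new LinkedBlockingQueue<>();     // rule manager -> network
    final BlockingQueue<Request> logQueue = new LinkedBlockingQueue<>();      // dedicated log path

    // The rule manager's main loop: take a request, fire the matching rules, queue replies.
    void ruleManagerLoop() throws InterruptedException {
        while (true) {
            Request request = receiveQueue.take();
            if ("put-log".equals(request.operation)) {
                logQueue.put(request);                           // log requests take the dedicated path
            } else {
                // ... trigger the ECA rules here, then enqueue any outgoing messages:
                sendQueue.put(new Request("reply", request.payload));
            }
        }
    }
}
```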
6 Related Work

Many techniques for automatically adding nodes to, and detaching them from, a network have been proposed. DHCP is one of the best-known protocols for handling IP network nodes; it uses a server in each network to allocate an IP address to a new node. It suits networks where many nodes join and leave frequently and is widely used in LAN environments. The other techniques also assume some kind of central server in the network, and they target rather light operations, such as simply adding and detaching network nodes. However, a central server is not suited to a cluster of autonomous disks, as described above. For example, recent SANs tend to provide a special server for managing storage space with pooled spare disks. Gathering access statistics is important for effective management, but it is difficult for such a server to collect access information from all nodes in the SAN, because accesses are distributed over all the disks. If all accesses went through the server first to collect this information, access performance would be severely reduced. In an autonomous disk cluster it is very simple to collect this type of information by circulating a message locally to gather statistics, because each disk has intelligence.

There are several other academic research projects that utilize a disk-resident processor and memory for executing application programs: the IDISK project at UC Berkeley [9] and the Active Disk projects at Carnegie Mellon [10] and UC Santa Barbara/Maryland [11]. They focus on the functions and mechanisms that let a combination of the disk-resident processor and a host execute storage-centric user applications, such as decision support systems, relational database operations, data mining, and image processing. However, they do not consider the management and reliability of the data.

The reliability of the system is very important in data processing. There are several approaches to preventing loss of data caused by a disk failure, such as mirroring, using error-correcting codes, and applying parity-calculation techniques, as described in papers on RAID [7, 8]. We have also proposed a highly reliable and scalable network-connected parallel disk system, called DR-nets, using parity-calculation techniques [12]. However, these approaches are less flexible. On the other hand, the method of autonomous disks can be applied to the controllers of multiple RAID systems.

The ISTORE project at UC Berkeley [13], which started almost simultaneously with and independently of our project, also considers disk failures and load balance. However, it focuses on the execution of applications inside the storage, like IDISK. We think that this application-specific-storage approach narrows its applicability. We limit the functions of autonomous disks to disk management, so autonomous disks can be used by a wider range of applications.

As another fault-tolerance technique, network monitoring, such as that proposed by Rodeheffer et al. [14], can be applied to detect failures in a network. However, they assumed special hardware to monitor the network. Because the aim of autonomous disks is to construct them using only off-the-shelf components, we currently assume no special hardware for the autonomous disks.
7 Conclusion

In this paper, we proposed a method for reconfiguring a cluster of autonomous disks. We first considered data placement in the cluster and the causes for reconfiguring it. There are eight types of reconfiguration in an autonomous disk cluster: involuntary detachment of data disks and of log disks, voluntary addition or detachment of data disks, voluntary addition or removal of log disks, and voluntary addition or removal of backup fragments. We then considered the modification of the configuration information and the reconfiguration algorithms for each reconfiguration type. In the algorithms, we adopted multiple phases and dynamic coordinators, similar to the commit protocols of distributed transactions. The algorithms can be applied online, accepting usual operations during reconfiguration. We then reported on our experimental system using Java on PCs connected via a LAN. The results of a preliminary experiment on the system indicate that the synchronization overhead is acceptably small. We will continue the experiments. Moreover, since the experimental system uses queues for message communication, we plan to generalize our approach to synchronization between components that communicate through message queues.
References

[1] Haruo Yokota. Autonomous Disks for Advanced Database Applications. In Proc. of the International Symposium on Database Applications in Non-Traditional Environments (DANTE'99), pages 441–448, Nov. 1999.

[2] Dennis R. McCarthy and Umeshwar Dayal. The Architecture of an Active Data Base Management System. In Proc. of SIGMOD Conf. '89, pages 215–224, 1989.

[3] J. Widom and S. Ceri (eds.). Active Database Systems: Triggers and Rules for Advanced Database Processing. Morgan Kaufmann Publishers, 1996.

[4] National Storage Industry Consortium (NSIC). Object based storage devices: A command set proposal. http://www.nsic.org/nasd/1999-nov/final.pdf, Nov. 1999.

[5] Haruo Yokota, Yasuhiko Kanemasa, and Jun Miyazaki. Fat-Btree: An Update-Conscious Parallel Directory Structure. In Proc. of the 15th Int'l Conf. on Data Engineering, pages 448–457, 1999.

[6] Jim Gray and Andreas Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.

[7] David A. Patterson, Garth Gibson, and Randy H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proc. of the ACM SIGMOD Conference, pages 109–116, Jun. 1988.

[8] Peter M. Chen et al. RAID: High-Performance, Reliable Secondary Storage. ACM Computing Surveys, 26(2):145–185, Jun. 1994.

[9] Kimberly Keeton, David A. Patterson, and Joseph M. Hellerstein. A Case for Intelligent Disks (IDISKs). SIGMOD Record, 27(3):42–52, Sep. 1998.

[10] Erik Riedel, Garth Gibson, and Christos Faloutsos. Active Storage for Large-Scale Data Mining and Multimedia. In Proc. of the 24th VLDB Conf., pages 62–73, 1998.

[11] Anurag Acharya, Mustafa Uysal, and Joel Saltz. Active Disks: Programming Model, Algorithms and Evaluation. In Proc. of the 8th ASPLOS Conf., Oct. 1998.

[12] Haruo Yokota. DR-nets: Data-Reconstruction Networks for Highly Reliable Parallel-Disk Systems. ACM Computer Architecture News, 22(4):41–46, Sep. 1994.
[13] A. Brown, D. Oppenheimer, K. Keeton, R. Thomas, J. Kubiatowicz, and D. A. Patterson. ISTORE: Introspective Storage for Data-Intensive Network Services. In Proc. of HotOS-VII, 1999.

[14] Thomas Rodeheffer and Michael D. Schroeder. Automatic Reconfiguration in Autonet. In Proc. of the 13th ACM Symposium on Operating System Principles, pages 183–187, 1991.