Automated Physical Storage Provision Using a Peer-to-Peer Distributed File System
Simon Chong, Paul A. Watters, Michael Hitchens
Department of Computing, Macquarie University, NSW 2109, AUSTRALIA
{simonc,pwatters,[email protected]}

Abstract
The allocation and management of physical storage structures in relational database systems is a time-consuming manual process undertaken by database administrators. If datafiles, redo logs and control files exceed the available disk space on a volume, then new space needs to be physically allocated on a new device. This reduces availability due to downtime, and arbitrary assumptions are often made about sizing requirements when new storage media are allocated. What is required is an infinitely extensible, logically striped volume that can request and access storage on demand. In this paper, we present a peer-to-peer distributed file system that creates a single, virtual striped volume that can be mounted as a normal logical file system through the Common Internet File System (CIFS). New peers contribute unused space as required by the dynamic and growing physical storage requirements of relational databases.
1. Introduction
Manually managing the physical storage of database files (datafiles, redo logs and control files) is a time-consuming and error-prone task. Disks are inevitably filled to capacity by ever-growing datafiles, especially in modern database systems that feature hot backups, local replication and so on. Even control files, historically much smaller than other database files and fairly constant in size, can now store many megabytes of data, and are constantly growing. Rather than taking a server down every time a new local disk needs to be added to a RAID array or similar, wouldn't it be easier if the file system that the datafiles were stored on grew automatically to meet current and anticipated requirements? In addition, utilizing existing disk space on other peers in a local area network would reduce the overall cost of server infrastructure.
Autonomic computing may provide a solution, through the automated discovery and allocation of storage resources on a local area network. The goal of autonomic computing is to create closed control loops [1], where control decisions about resources are based on automated monitoring of usage and prediction of future resource utilization requirements.
While there is a large existing literature on fault tolerance and dependable systems [2], the evolving requirement is to ensure the availability of resources on demand, without direct operator intervention. In terms of physical database file storage, this means an automated system for allocating underlying storage to datafiles. To meet this requirement, we have designed and implemented a peer-to-peer (P2P) distributed file system that provides RAID level 0 (striping) functionality. This presents a single logical disk volume to the operating system that is infinitely extensible, limited only by the number of peers who can contribute space. While initially conceived as an intranet-level project, the system works just as well across the Internet, as long as connectivity and bandwidth can be guaranteed. The peers are autonomous and loosely coupled, with the database server maintaining a file allocation table of all datafiles stored on the striped volume.
One potential solution to the growing demand for data storage is to utilise the unused space within existing networked computers. To utilise this space, we need systems that share and distribute the free storage space among the users within the network. As the network expands and more users and computers are added, it is not sufficient for systems simply to share and manage data; such systems need to be able to continually find and update the amount of storage available. Having networked computers capable of sharing and finding unused storage space then challenges us to develop more complex schemes that may remove the need for servers in a network. As the requirement for storage space grows, an agent requests more storage space from a pool of available peers on the local area network. Thus, dedicated file servers can be replaced with a distributed storage system based on a combination of current P2P technologies (such as Apple's Rendezvous) and existing RAID algorithms. We have created a distributed storage prototype, based on the Common Internet File System (CIFS), that utilises the unused free storage space on existing computers to offer a large contiguous storage space. The methods, techniques and results obtained through the creation of the prototype are discussed in this paper, as they relate to the automated physical storage of datafiles.
2. Distributed Storage Goals
The broad goal of distributed storage systems is to provide infinite capacity on demand, while making use of unused storage space on dedicated file servers or, in the case of P2P architectures, any unused space on a peer. Distributed storage can be implemented by harnessing unused local space, or over the network by sharing folders. The problem with utilising only local storage is that users are not always at the same computer, and would therefore find it difficult to access their files from remote computers. Shared folders, on the other hand, solve the problem of accessing data over the network, but raise concerns about the integrity, privacy and security of data. Other problems include data fragmentation, where distributed data blocks are forgotten or unevenly distributed within the network. This may overload certain workstations, making them unusable, while at the same time denying users access to their data.
A distributed storage system that manages unused storage space in an efficient manner is therefore required. Ideally, the system would be decentralised and self-managing, and would also be able to:
• Represent collective storage as one logical disk.
• Distribute all data efficiently and evenly.
• Automatically recover lost or missing data.
• Guarantee the integrity and security of data.
To date, no such system is in wide commercial use, although there are many centralised technologies in very common use, based on Sun's Network File System (NFS) [3] or Microsoft's Common Internet File System (CIFS) [4]. Many research prototypes of various levels of decentralization are available, as reviewed in Section 3, but these are not presently suited to meeting the integrity requirements of commercial data storage.
Building a distributed storage system using P2P protocols enables us to study the suitability of existing networks, and the throughput that can be gained from existing desktop systems. Having a basic distributed storage system also allows us to further research ways of making storage faster and more reliable for specific applications, such as storing datafiles. Algorithms that handle the distribution, backup and recovery of data can be studied and tested whilst adding to the functionality of the system. It is also envisaged that serverless distributed storage systems may replace dedicated file servers. Automatically configured computers could join a network and gain access to storage as needed, when a relational database server notifies a set of known or unknown peers. P2P node management protocols like Apple's Rendezvous already provide this functionality.
We have integrated RAID levels into the system to provide a logical volume with unlimited capacity. RAID 0, commonly known as data striping or interleaving, is
performed on datafiles by splitting their constituent data into equal portions, and then writing this data to multiple disks in parallel, as opposed to writing files to a single physical disk. This approach enables data to be written at much higher speeds than writing to a single disk. At the same time, striping allows for greater storage capacities, by transparently combining multiple volumes to create one single, inexpensive virtual volume [5]. It is important to note that, even though RAID 0 offers greater performance and speed, the risk of disk failure is also increased; this is why higher RAID levels provide redundancy through mirroring, and through sophisticated combinations of mirroring and striping.
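To make the striping idea concrete, the following is a minimal sketch in Java (the prototype's implementation language, see Section 5). The class name, stripe-unit size and use of raw output streams are illustrative assumptions rather than the prototype's actual code:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.List;

    // Minimal RAID 0 sketch: split an input stream into fixed-size stripe
    // units and write them round-robin across the peers' output streams.
    public final class Raid0Striper {
        private static final int STRIPE_UNIT = 64 * 1024; // 64 KB per unit (illustrative)

        public static void stripe(InputStream source, List<OutputStream> peers)
                throws IOException {
            byte[] unit = new byte[STRIPE_UNIT];
            int peerIndex = 0;
            int read;
            while ((read = source.read(unit)) != -1) {
                // Each full (or trailing partial) unit goes to the next peer in turn.
                peers.get(peerIndex).write(unit, 0, read);
                peerIndex = (peerIndex + 1) % peers.size();
            }
            for (OutputStream peer : peers) {
                peer.flush();
            }
        }
    }

Reconstruction (destriping) simply reads units back from the peers in the same round-robin order.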
3. Existing Distributed Storage Systems
There have been numerous past attempts at creating distributed storage systems that allow files to be shared across the network without the need for a server to coordinate and store the shared data. Two of the most successful example systems are FARSITE and OceanStore, which are outlined in this section.
3.1. FARSITE
Federated, Available, and Reliable Storage for an Incompletely Trusted Environment, or FARSITE, is a serverless distributed file system, similar to this project [6]. FARSITE aims to provide secure, reliable and fault-tolerant storage for corporate and academic environments through encryption, randomized replication, and Byzantine fault tolerance. In contrast, this project aims to produce a viable and efficient distributed storage system that works largely within a trusted network of peers. Although FARSITE is still being developed, its emphasis on encryption, randomised replication and Byzantine fault tolerance creates latencies and extra CPU work for client computers. All files in FARSITE are encrypted, and duplicate copies are detected and eliminated, so there is only a single instance of any file. This enables the efficient use of storage space and ensures that identical encrypted files do not saturate the network, although the duplicate detection itself creates extra work for the CPU. The downside is that FARSITE requires large amounts of processing power to encrypt data for an untrusted network, whereas local networks may already be trustworthy.
3.2. OceanStore
The OceanStore project aims to create an Internet-scale distributed archival system, allowing users to store files securely while providing high availability and durability. It is being developed around the ideas of replication, globally unique identifiers, updates through versioning, and dynamic optimisation [7].
Under OceanStore, server-based fault tolerance is achieved through a degree of self-maintenance and scalability. Files are updated by generating new versions, and all versions are read-only, allowing for easy repair. Message routing is handled by a scheme called Tapestry [8], which uses a peer-to-peer structure to route messages directly to the closest copy of a file. Tapestry makes it possible for OceanStore to adapt to changing network conditions and faults, while optimising message routing efficiency [9]. Pond, the OceanStore prototype, outperforms NFS by a factor of 4.6 on read operations, but underperforms NFS by a factor of 7.3 on write operations [10]. Read performance is increased through file replication, while write performance is sacrificed to ensure data security and safe archiving.
OceanStore is still in relative infancy, and more tests and significant modifications would be required before commercialisation. The key problem with OceanStore in the context of a local area network is that its write response time is slow; it has been designed as an archival system with an emphasis on fault tolerance, rather than real-time, user-based interaction. OceanStore's read-only versioning approach to updating files is also inappropriate, because it would produce files that consume large and unnecessary amounts of storage space.
4. Prototype Architecture
This section describes the high-level architecture of our system. The architecture was split into three encapsulated components: the P2P communication component, the File System component, and an Operating System component, as shown in Figure 1. Relational database servers interface primarily with the Operating System component. Each component handled different tasks, and each was intended to be independent, so that a change in one component would have no effect on the others. This design decision allows future upgrades to the architecture without restructuring the whole design. One example of an upgrade could be a new data distribution algorithm, or a different RAID level: changing specific parts of a single component would be enough to make the prototype behave in a new way.
The modular design of the architecture was echoed within the design of the individual components. Any major algorithm or function was assigned its own module, so that modules could be interchanged without changing the behaviour of other modules. This also allows the modules to be reused, as their basic functionality is encapsulated within each subcomponent.
All network communication was confined to the P2P communication component, to decrease the likelihood of creating unnecessary connections to other computers. Multiple connections to the same destination would congest the network, and with multiple computers using numerous connections, the network could be flooded. It was therefore imperative to have one component handle all network communication. The P2P communication component also handles incoming connections, and constantly maintains a list of all computers that are ready to share storage space; this removes the extra overhead of searching the network every time storage space is required.
Figure 1. Distributed Storage Architecture
The File System component contained all the functions necessary to track where data was being stored, and how to retrieve it. Processing data, and preparing it to be sent out to the network, was performed by this component. All information about the locations of stored files and their attributes was intended to be persistent; thus, this component either interfaces with a database, or with a file stored persistently on the local computer.
Microsoft Windows was chosen as the main operating system to be integrated with the distributed storage prototype, because of its widespread use and the range of protocols already built into its networking capabilities. Given that the architecture is generic, any operating system could be used, even with the specific implementation of the current prototype, as CIFS and Rendezvous are widely available for different platforms.
To store and retrieve files using the normal interfaces provided by the Windows operating system, a connection to the distributed storage system was implemented. The Operating System component handles filtering data, sending, receiving, and executing commands, as if it were a part of the operating system itself. This connection allows the Windows operating system's built-in graphical user interface to use the distributed storage system, but it would have to be implemented separately for each host operating system in a cross-platform environment. The reason behind this decision was to allow users to use the distributed storage system transparently, without any additional knowledge, training or guidance. It also allowed existing user interfaces to work with a new distributed storage system, without any changes to their implementation.
Connections between the components within the architecture were established using basic function calls. After the appropriate connections were established, pipes and data streams were used to communicate data between the components. The components process the data like a pipeline, with the resultant output transmitted to multiple computers.
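A minimal sketch of this pipeline idea using Java piped streams follows; the component roles in the comments are illustrative, not the prototype's actual classes:

    import java.io.IOException;
    import java.io.PipedInputStream;
    import java.io.PipedOutputStream;

    // Sketch of the pipeline idea: one component writes processed data into
    // a pipe while the next component reads from it on a separate thread.
    public final class ComponentPipeline {
        public static void main(String[] args) throws Exception {
            PipedOutputStream fileSystemOut = new PipedOutputStream();
            PipedInputStream p2pIn = new PipedInputStream(fileSystemOut);

            Thread p2pComponent = new Thread(() -> {
                try (p2pIn) {
                    byte[] buffer = new byte[4096];
                    int read;
                    while ((read = p2pIn.read(buffer)) != -1) {
                        // ... here the P2P component would transmit
                        // buffer[0..read) to a remote peer ...
                    }
                } catch (IOException ignored) {
                }
            });
            p2pComponent.start();

            // The File System component feeds processed (e.g. striped) data in.
            try (fileSystemOut) {
                fileSystemOut.write("striped block".getBytes());
            }
            p2pComponent.join();
        }
    }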
4.1. P2P communication component
The P2P communication component manages the search for free space, and the retrieval of data from the network. The current prototype uses Apple's Rendezvous for this purpose. Peer management can be achieved using existing P2P protocols implemented in either a user application, or a service running on the Windows platform. The P2P communication component must be able to do the following (a minimal interface sketch follows the list):
• Connect to other P2PFS nodes running the P2P communication component.
• Search for other computers running the P2P communication component within a specified network, i.e. locate P2PFS nodes.
• Request information on the availability of space.
• Satisfy requests for information on the availability of space.
• Request information on the files that are stored on a P2PFS node.
• Satisfy requests for information on files.
• Request and reserve space on any P2PFS node.
• Reserve and satisfy requests for space on the locally attached disk.
• Store the locations of files in a table that allows fast retrieval of the addresses corresponding to files.
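As a minimal sketch, these responsibilities might be captured by an interface like the following; the names and signatures are our illustration, not the prototype's actual API:

    import java.net.InetSocketAddress;
    import java.util.List;
    import java.util.Map;

    // Hypothetical interface for the P2P communication component's duties.
    public interface P2PCommunication {
        List<InetSocketAddress> locatePeers();              // discover P2PFS nodes
        long queryFreeSpace(InetSocketAddress peer);        // bytes available on a peer
        boolean reserveSpace(InetSocketAddress peer, long bytes);
        void store(InetSocketAddress peer, String fileId, byte[] stripeUnit);
        byte[] retrieve(InetSocketAddress peer, String fileId);
        Map<String, List<InetSocketAddress>> locationTable(); // file -> peer addresses
    }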
4.2. File System Component
The file system component handles the tasks of preparing data that is to be sent, reassembling data that has been received, and keeping track of files that are
spread across the network. The key functionality that the different approaches should incorporate is listed as follows (a sketch of a possible file-table entry follows the list):
• Read and write files to and from streams of data.
• Break down (stripe) files and then send them through the network to be stored in real time.
• Receive and store files to the local disk in real time.
• Persistently remember file locations and their attributes.
• Request and reconstruct (striped) files from the network.
• Communicate with the P2P component to search for storage.
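As an illustration of the persistence requirement, a file-table entry might look like the following; the field names and layout are our assumptions, not the prototype's actual schema. Serializing a map of such entries to a local file, or storing them as database rows, would satisfy the persistence requirement above:

    import java.io.Serializable;
    import java.net.InetSocketAddress;
    import java.util.List;

    // Hypothetical persistent File Table entry: enough information to find
    // and reassemble a striped file.
    public final class FileTableEntry implements Serializable {
        private static final long serialVersionUID = 1L;

        final String path;            // logical path as seen by the OS component
        final long sizeBytes;         // total file size
        final int stripeUnitBytes;    // size of each stripe unit
        final List<InetSocketAddress> stripeOrder; // unit i lives on stripeOrder.get(i % n)

        FileTableEntry(String path, long sizeBytes, int stripeUnitBytes,
                       List<InetSocketAddress> stripeOrder) {
            this.path = path;
            this.sizeBytes = sizeBytes;
            this.stripeUnitBytes = stripeUnitBytes;
            this.stripeOrder = stripeOrder;
        }
    }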
4.3. Operating System component
The main task of the Operating System component is to listen and respond to Windows operating system commands, and to provide an interface for user applications such as database servers. These commands consist of basic operations that allow the operating system to read and write data as if it were reading and writing from an attached local hard disk drive. The component can be designed around the CIFS (Common Internet File System) protocol, an interface to Windows drivers, the Windows API, or the numerous other protocols that Windows supports. For the Operating System component to function within the parameters required for the distributed storage prototype, it must provide the following functionality:
• It must be able to create and emulate a normal Windows disk, appearing to other programs and users like a normal logical disk drive.
• Commands from the operating system, such as reads and writes, must be handled by the component. It must act accordingly and fulfil the command request as if it were a part of the host operating system.
• It must be able to communicate with the File System component, to pass on data to be processed and distributed on the network.
4.4. CIFS Integration
CIFS is the current Windows network file and printer sharing protocol, and is widely used in Windows networking. CIFS, shown in Figure 2, can be utilized within a service or application to interface with Windows through the native TCP protocol. This allows the Operating System component to act as a server on the local machine, communicating all necessary commands and responses that are requested by the user from the operating system.
The advantage of this approach is that the CIFS protocol is widely used to share data among many operating systems other than Windows, allowing the distributed storage system to work with other operating systems without many modifications. Another advantage is that it allows the Operating System component to interface with the Windows platform without modifying the operating system code itself. This decouples the Operating System component from any particular operating system, so long as the host operating system supports the CIFS protocol.
The disadvantage of this approach is that the CIFS protocol has many versions and variations, making it difficult to implement with limited standard documentation. The other disadvantage is that the approach uses the network loopback interface to provide communication between the host operating system and the distributed storage system. In effect, the communication is analogous to having an English conversation translated into German so that it can then be translated back into English; this is inefficient, because the communication could be done directly through a simple operating system pipe.
The intended behaviour of this approach is outlined below (a sketch of the drive-mapping step follows the list):
• Reading and writing to and from the local disk is handled through the service or application. Any requests to store and retrieve data are made through the network using the CIFS protocol.
• Emulating a normal Windows disk is achieved by automatically mapping a network drive using the application or service. This mapped network drive is connected to the service or application, which acts as a mediator for transferring files through the distributed storage system.
• The processing and handling of data is done within the application or service. This allows striping, or any other algorithm such as encryption, to be implemented within the distributed storage system.
• Receiving and storing files in the distributed storage system is done by communicating directly with the distributed storage applications or services running on the other machines.
• Communication with the P2P component can be done using remote procedure calls.
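As a sketch of the drive-mapping step, the service could shell out to the standard Windows net use command against the loopback interface; the drive letter and share name here are hypothetical:

    import java.io.IOException;

    // Sketch: expose the local CIFS service (listening on the loopback
    // interface) as drive P:. The share name "p2pfs" is hypothetical.
    public final class DriveMapper {
        public static void mapDrive() throws IOException, InterruptedException {
            Process p = new ProcessBuilder(
                    "net", "use", "P:", "\\\\127.0.0.1\\p2pfs")
                    .inheritIO()
                    .start();
            if (p.waitFor() != 0) {
                throw new IOException("net use failed; is the CIFS service running?");
            }
        }
    }

Once mapped, ordinary Windows applications read and write P: as if it were a local disk, while the service behind the share stripes the data across peers.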
Figure 2. CIFS Overview
5. Prototype
The P2P communication component is broken up into a server that handles requests from other distributed storage nodes, a sender and receiver that communicate with the servers of other distributed storage nodes, and a module that discovers other distributed storage peers. These modules are necessary because communication between the nodes must handle the sending, receiving and serving of data, whilst being able to detect and find nodes. Node discovery, the selection of a target node, and new space allocation are currently implemented using Apple's Rendezvous protocol.
The File System component is broken down into three modules: the File Manager, the Striping module, and the Destriping module. The File Manager manages and maintains a File Table that stores all the information relating to file attributes and folder structure. The Striping and Destriping modules process file data and determine how files are to be distributed across the network. These modules are necessary because they keep track of files, as well as managing the way files are separated for distribution.
The Operating System component is divided into numerous modules that handle each command called from Windows through an operating system loopback component. The command modules process most of the Windows commands, and are used in combination with the File System and P2P components.
Java was chosen to implement the system, because of its strong networking capabilities. In the design, the bulk of the work done by the distributed storage system is done via the network, so being able to create connections, transfer data and search for other distributed storage hosts was important. Using Java also allows the distributed storage system to work on any platform: its platform independence allows the system to utilise the storage space and network resources of any platform that supports Java. Although the distributed storage system design is targeted at the Windows operating system, future developments could target all platforms that support Java.
6. Performance
The prototype was created to test the viability and scalability of the distributed storage system with respect to datafiles. After all, if the performance of the system is not comparable to a local RAID array or hard drive, then database administrators would be justifiably wary of using the system in production.
A comparison of the distributed storage prototype against several standard storage media was conducted. The tests compared the speed of sending and retrieving a 48MB datafile. The Disk Bench program was used to measure the time taken to copy and retrieve a file between two specified locations. Transfer rates were then calculated by dividing the size of the file transferred by the time taken to transfer it. The files were copied from the local hard disk drive.
Six other storage media were tested and compared with the distributed storage prototype: the local disk, a portable hard disk drive, a normal network drive, a file system network drive, a Secure Digital memory card, and a USB flash memory key. Details of the six media are listed below.
• Local hard disk. The local hard disk drive was in a Pentium 4, 1.6GHz laptop.
• USB 2.0 portable hard disk drive. This hard disk was connected to the Pentium 4, 1.6GHz laptop via a USB 2.0 port, and has a storage capacity of 40GB.
• Mapped network drive from the Pentium 4, 1.6GHz laptop to a Pentium 4, 2.4GHz desktop. The network drive was created using standard Windows file sharing protocols.
• Mapped network drive from the Pentium 4, 1.6GHz laptop to a Sun Solaris file server used by several hundred users in our local university network. The CIFS network protocol was used to map the network drive.
• SD (Secure Digital) memory card. An SD card is a standard medium that digital cameras use to store photos and videos. It was connected to the Pentium 4,
1.6GHz laptop using the laptop's integrated SD card reader.
• USB 1.1 flash memory key. The 128MB key was connected to the Pentium 4, 1.6GHz laptop using the standard USB interface.
The six media and the distributed storage prototype were each tested three times (A, B and C) for both the write and read tests, to ensure that a representative measure could be obtained. From those tests, an average was calculated, and all data was plotted in a column graph. The tests were designed to show the speed of striping and destriping in software whilst sending and receiving data. It was expected that the distributed storage prototype would be comparatively slower, because of the extra overhead required to send and receive data over the network.
Figure 3 shows the results of the file writing comparisons for the three runs and their average. While the average transfer rate for the local hard drive was 21.57 MB/s, our prototype achieved some 5.45 MB/s, with the P2P infrastructure incurring a small overhead compared to a standard CIFS server, which achieved 6.79 MB/s.
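To make the transfer-rate calculation concrete: at the prototype's measured average of 5.45 MB/s, writing the 48MB datafile takes roughly 48 / 5.45 ≈ 8.8 seconds, versus roughly 48 / 21.57 ≈ 2.2 seconds on the local disk.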
Figure 3. File Writing Performance
Figure 4 shows the results of the file reading comparisons for the three runs and their average. While the average transfer rate for the local hard drive was 20.09 MB/s, our prototype achieved some 4.83 MB/s, comparing favourably with a standard CIFS server at 4.81 MB/s.
Increasing the buffer to send larger amounts of data at a time increased performance. For example, increasing the buffer from 0.5KB to 3MB produced a twofold
increase in the transfer rate. This demonstrates the influence of free parameters in the system's configuration.
Utilising the free storage space of multiple computers to form one large contiguous storage space using the prototype appears to be feasible. Using striping to split data into fragments allowed the prototype to distribute files in a uniform fashion, so that free space could be utilised effectively. However, it was discovered that the key to a uniform distribution of data is the method by which peers are selected when data is distributed. If peers are selected on the basis of how much storage space they have, then the amount of free storage space will eventually even out across all computers. If peers are selected on the basis of speed, then transfers will run at optimal speed, but the distribution of files will be skewed toward the faster computers. A sketch of both policies follows Figure 4.
Figure 4. File Reading Performance
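The peer-selection trade-off just described can be made concrete with a small sketch; the Peer type and its fields are our illustrative assumptions, not part of the prototype:

    import java.util.Comparator;
    import java.util.List;

    // Sketch of the two peer-selection policies discussed above.
    public final class PeerSelection {
        record Peer(String host, long freeBytes, double observedMBps) {}

        // Policy 1: most free space first -> free space evens out over time.
        static final Comparator<Peer> BY_FREE_SPACE =
                Comparator.comparingLong(Peer::freeBytes).reversed();

        // Policy 2: fastest peer first -> best transfer rate, skewed placement.
        static final Comparator<Peer> BY_SPEED =
                Comparator.comparingDouble(Peer::observedMBps).reversed();

        static Peer select(List<Peer> peers, Comparator<Peer> policy) {
            return peers.stream().min(policy).orElseThrow();
        }
    }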
7. Discussion
Future work will focus on two issues: extending the support for RAID levels beyond striping, to ensure availability through highly distributed redundancy; and predicting future resource requirements based on past workloads. The latter issue is crucial, since a simple moving average of past storage requirements is unlikely to anticipate transient but significant increases in storage demand. We are currently investigating a number of pattern recognition techniques that may be able to provide "advance warning" of such shifts in demand, based on data collected on datafile utilisation. We are also exploring aspects of TCO, deployment and self-management, as performance is a critical but secondary goal for autonomic systems.
An important issue in P2P systems generally, but particularly in the current application area, is the extent to
which peers can be trusted. Our current work in this area is focusing on trust models for determining the extent to which peers who advertise their storage availability can be trusted, based on recommendations from other peers [11]. Again, being able to predict whether a peer will become untrustworthy after many successful interactions remains a significant problem requiring further investigation.
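Returning to the prediction issue raised above, even the simple moving-average baseline can be made explicit. The following sketch of an exponentially weighted moving average over observed datafile growth is our illustration, with an untuned smoothing factor; as noted, this is exactly the kind of predictor that misses transient spikes:

    // Baseline demand predictor: an exponentially weighted moving average
    // of observed storage growth per interval. The smoothing factor is an
    // illustrative assumption, not a tuned value.
    public final class GrowthPredictor {
        private final double alpha;   // weight on the newest observation
        private double estimateMB;    // current smoothed growth per interval

        public GrowthPredictor(double alpha, double initialMB) {
            this.alpha = alpha;
            this.estimateMB = initialMB;
        }

        public double observe(double growthMB) {
            estimateMB = alpha * growthMB + (1 - alpha) * estimateMB;
            return estimateMB; // predicted growth for the next interval
        }
    }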
8. References
[1] A.G. Ganek and T.A. Corbi, "The Dawning of the Autonomic Computing Era", IBM Systems Journal, vol. 42, no. 1, 2003, pp. 5-18.
[2] A. Avizienis, J.C. Laprie, and B. Randell, "Fundamental concepts of dependability", Research Report N01145, Laboratory for Analysis and Architecture of Systems of the National Center for Scientific Research (LAAS-CNRS), 2001.
[3] Sun Microsystems, "NFS: Network File System protocol specification", RFC 1094, March 1989.
[4] P. Leach and D. Perry, "CIFS: A Common Internet File System", Microsoft Internet Developer, November 1996.
[5] D.A. Patterson, G. Gibson and R.H. Katz, "A case for redundant arrays of inexpensive disks (RAID)", in Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, 1988, pp. 109-116.
[6] A. Adya, W. Bolosky, M. Castro, R. Chaiken, G. Cermak, J. Douceur, J. Howell, J. Lorch, M. Theimer, and R. Wattenhofer, "FARSITE: Federated, available, and reliable storage for an incompletely trusted environment", in Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), 2002.
[7] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells and B. Zhao, "OceanStore: An architecture for global-scale persistent storage", in Proceedings of ACM ASPLOS, 2000.
[8] K. Hildrum, J. Kubiatowicz, S. Rao, and B. Zhao, "Distributed object location in a dynamic network", in Proceedings of the Fourteenth ACM Symposium on Parallel Algorithms and Architectures, 2002, pp. 41-52.
[9] B. Zhao, L. Huang, J. Stribling, S. Rhea, A. Joseph, and J. Kubiatowicz, "Tapestry: A global-scale overlay for rapid service deployment", IEEE Journal on Selected Areas in Communications, 2003.
[10] S. Rhea, P. Eaton, D. Geels, H. Weatherspoon, B. Zhao, and J. Kubiatowicz, "Pond: The OceanStore prototype", in Proceedings of the USENIX Conference on File and Storage Technologies (FAST), 2003.
[11] H. Tran, M. Hitchens, V. Varadharajan and P.A. Watters, "A trust based access control framework for P2P file-sharing systems", in Proceedings of the Hawaii International Conference on System Sciences (HICSS-38), Honolulu HI, USA, 2005.