Shiping Chen, Surya Nepal, Jonathan Chan, David Moreland, John Zic. Networking Technologies Laboratory, CSIRO ICT Centre. Cnr Vimiera & Pembroke Rds, ...
Virtual Storage Services for Dynamic Collaborations
Shiping Chen, Surya Nepal, Jonathan Chan, David Moreland, John Zic
Networking Technologies Laboratory, CSIRO ICT Centre
Cnr Vimiera & Pembroke Rds, Marsfield, NSW 2122, Australia
{Firstname.Lastname}@csiro.au

collaborator as possible. That is, if a collaborator leaves the collaboration, the remaining collaborators should be able to create, store and share the data independent of who joins or leaves the collaboration.
Abstract
Dynamic coalitions formed by business partners who may be competitors yet still need to co-operate with each other require independent, secure and reliable storage for exchanging and sharing data among their members. This paper presents a virtual storage services architecture that addresses these requirements by abstracting physical storage space to provide a pool of logical storage space using virtualisation techniques. The abstraction of a variety of physical storage technologies and their capabilities is achieved by designing an XML-based Simple Distributed Storage Interface (SDSI). This paper also presents a case study of a secure distributed storage service to demonstrate how our architecture offers secure data storing and sharing for dynamic coalitions. We summarise our implementation, a Web Services-based prototype storage system using SDSI, and evaluate its performance against a number of alternative distributed storage technologies and data transport protocols.
• Secure, private and reliable: The information kept within a specific collaboration is valuable and confidential for its members. Typically, this information is bound by the access policies of the project and of the particular collaboration. For example, one collaboration between three parties A, B and C may allow information to be shared only while all three are present. If one leaves, the information for that collaboration may no longer be available to any of the members in the group. Another collaboration's policy may specify that the shared information is available only to those parties that are still within the collaboration and unavailable to the one that has left. A third policy may need to be put in place for a collaboration involving three partners, two of which are competitors [1], meaning that no one wishes to host the collaborative information, nor allow access to their own storage infrastructure. Thus the corresponding information must be stored, protected, and shared in a known, predictable, secure and reliable manner.
• Minimal management storage: In most cases, individual collaborating participants have little knowledge, support or skills to manage complex storage infrastructures. Therefore, the storage resource for collaborations should be easy to use and require minimal or zero management from individual collaborators. This enables individual collaborators to focus on the work at hand in the collaboration rather than on managing underlying infrastructures.
1. Introduction
Large scale projects (by definition) exist for a finite timespan, typically involve multiple parties from different organizations, and aim to achieve a set of common goals [1], such as producing a movie, constructing an airport or conducting a joint research project. Interoperation and collaboration between parties is paramount for such projects to succeed. Any information and communication technology (ICT) should fundamentally support the infrastructures required to facilitate and enable effective collaborations. Naturally, these infrastructures in turn support the exchange and sharing of information specific to that project and its collaborators, in a controlled, defined and agreed upon manner. A data storage system therefore becomes an essential component of the collaboration. In particular, the nature of dynamic collaborations imposes special requirements on data storage, as highlighted below:
• Independent and transparent: One of the features of a dynamic collaboration is that its members may join and leave at any time during the lifetime of the project, with some participating from beginning to end, others for only part of the project (and hence, part of the collaboration period), and others joining and leaving multiple times during the project, on a needs basis. This implies that the storage resources for the collaboration should be as independent and transparent from individual
Storage Service Providers (SSPs), such as BitTorrent [2] and Amazon S3 [3], are emerging to provide online data storage that meets these requirements for sharing information content. While both are successful business models, an individual SSP may be less likely to meet the requirements of emerging applications in dynamic collaborations due to factors such as limited capacity, poor performance for a specific application requirement, inappropriate interfaces, poor data management or the inability to meet security and confidentiality requirements. Even though clients can reduce the degree of risk by encrypting their data before storing it on an SSP, key management remains a problem due to the dynamic nature of coalitions [4]. We therefore propose a secure storage solution for dynamic collaborations and their applications that utilises existing, untrusted SSPs.
Our solution is inspired by the Virtual Network Operator (VNO) based business model [5], and aims to address the above issues by offering a Virtual Storage Operator (VSO) architecture as a new data storage business model for dynamic collaborations. In this architecture, individual SSPs own and manage the different physical data storage infrastructures that provide simple storage services, such as reading and writing raw data blocks. A VSO does not own or manage any storage infrastructure; instead, it constructs a new set of value added services on top of multiple SSPs. To this end, we develop an XML-based Simple Distributed Storage Interface (SDSI) as an abstract layer between a VSO and any SSP. This enables VSOs to deliver a variety of value added services based on SSPs via SDSI. A case study of a VSO is presented to demonstrate how our architecture offers secure data storing and sharing for dynamic coalitions. We have prototyped this architecture using Web Services technologies and conducted performance tests against a number of alternative data storage systems and data transport protocols.
as naming, discovery, subscription, publication and registration) is done in a common service registry that spans the top three layers.
• Storage Infrastructure Layer represents an assembly of heterogeneous storage systems offered by different vendors. Each of these storage systems may have a different configuration in terms of devices and protocols and so offer a different quality of storage service. This layer in our architecture supports the co-existence of heterogeneous storage systems and provides a unified framework for the easy and efficient use of storage facilities.
• Storage Service Provider (SSP) Layer is introduced to hide the complex and heterogeneous nature of the underlying storage infrastructures by providing a virtual global storage infrastructure. The Simple Distributed Storage Interface (SDSI) is used by each SSP to specify the contribution of its individual storage infrastructure and to establish a virtual storage pool. This leverages the underlying Storage Infrastructure Layer and provides storage services at a higher level of abstraction, effectively allowing clients, e.g. a VSO, to read and write blocks of data to the storage infrastructure. Moreover, the SDSI provides a logical interconnection between different storage systems.
• Virtual Storage Operator (VSO) Layer provides an opportunity for business entrepreneurs to build a variety of new storage services by combining services offered by one or more SSPs. These new services add value by meeting storage requirements that are not directly supported by an individual SSP.
• Application Layer (as usual) allows a variety of applications from different domains to transparently use the underlying storage infrastructure for their specific purposes via VSOs. Note that the linkage between the Application Layer and the VSO Layer in Figure 1 shows only their logical (business) relationship; clients need not store or retrieve data from SSPs physically via a VSO at runtime. For example, a client is likely to buy a specific storage service from a VSO by downloading its driver, and then put and get data via that driver, which is responsible for resource allocation and data distribution and connects directly to multiple SSPs.
• Services Registry (SR) is a central support component in this architecture, providing functionalities such as service publishing, naming, searching, and addressing through a standard interface across multiple layers. Its functions are based on those defined in the Web Services standard UDDI.
• Common Services are a set of supporting services that may be required by a VSO to deliver a specific storage service, such as a key management service, an authentication service, or a certificate issuing service.
2. Our Vision of Storage Services
We envisage that as network bandwidth becomes faster and cheaper, many SSPs will emerge to provide basic storage services. These services may be backed by a variety of dedicated storage hardware (RAID, SAN, NAS etc.). However, these heterogeneous infrastructures will be hidden from client applications. Since these storage infrastructures are operated by individual SSPs who compete with each other, the SSPs are reluctant (and thus unlikely) to connect directly to each other to meet a customer's requirements. Instead, it is envisaged that a set of third-party services will emerge to provide high level services, such as indexing, versioning, distribution, directory and authentication services. We also foresee that many more Virtual Storage Operators (VSOs) will emerge to use these SSPs to deliver new storage services that an individual SSP is unable to provide. For instance, a VSO can offer a highly reliable data-replication service that backs up client data onto multiple SSPs at different locations, reducing the risk of data being lost or unavailable due to a natural disaster in a specific area. Another VSO may offer a secure storage service by fragmenting client data and distributing the encrypted fragments across different SSPs. Such a VSO may deliver a variety of storage services required by dynamic collaborations based on a composition of storage services from one or more SSPs.
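The replication service mentioned above can be illustrated with a small sketch. It is not taken from the prototype: `InMemorySSP` and `ReplicatingVSO` are hypothetical names, the SSPs are simulated in memory, and the policy (write to every SSP, read from the first reachable copy) is only one of many a VSO might choose.

```python
class InMemorySSP:
    """Hypothetical stand-in for a remote SSP; keeps blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.online = True  # flip to False to simulate a regional outage

    def put(self, key, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        self.blocks[key] = bytes(data)

    def get(self, key):
        if not self.online:
            raise ConnectionError(f"{self.name} is offline")
        return self.blocks[key]


class ReplicatingVSO:
    """Assumed policy: write every block to all SSPs, read the first reachable copy."""

    def __init__(self, ssps):
        self.ssps = list(ssps)

    def put(self, key, data):
        stored = 0
        for ssp in self.ssps:
            try:
                ssp.put(key, data)
                stored += 1
            except ConnectionError:
                continue  # tolerate the outage of an individual SSP
        if stored == 0:
            raise OSError("no SSP accepted the block")
        return stored  # number of replicas actually written

    def get(self, key):
        for ssp in self.ssps:
            try:
                return ssp.get(key)
            except (ConnectionError, KeyError):
                continue  # try the next replica
        raise KeyError(key)
```

With three simulated SSPs, a block written through `ReplicatingVSO.put` stays readable through `get` after any one of them is marked offline, which is the property the replication service is meant to provide.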
3. VSO-based Storage Services
3.1 Architecture
Figure 1 illustrates our VSO-based storage services architecture, which consists of four layers: the Application Layer, the Virtual Storage Operator (VSO) Layer, the Storage Service Provider (SSP) Layer and the Storage Infrastructure Layer. Coordinating the access to the variety of services (such
Figure 1. VSO-based storage services architecture
Our architecture may be distinguished from other distributed storage architectures [6, 7, 8, 9] in that the SDSI of the SSP allows new storage services to be created and run as businesses, offering a variety of (possibly competing) storage services on the same underlying infrastructures.
3.2 SDSI API
SDSI defines the component structures and basic data types of the APIs. SDSI should expose sufficient information to the client interface (whether the client is a VSO, another SSP or even an application) so that each client can negotiate and purchase storage services with specific requirements, store and access data, perform enquiries and obtain status information related to its account. In order to cover all these aspects of interaction between client and SSP, the SDSI API provides functional and non-functional components.
3.2.1 Functional: The minimal requirement for a client is the ability to store data into, and retrieve data from, the abstracted storage infrastructure, through the following two APIs:
• put_data: This API is used to store data into the storage infrastructure of an SSP.
Inputs:
- authInfo: authorization information for a specific client, such as user name with password.
- accountToken: a unique identity of a user account.
- dataKey: a unique identity of the data, i.e. blockID.
- dataStream: a byte stream that a user wants to store.
• get_data: This API is used to retrieve data from the storage infrastructure.
Inputs:
- authInfo: authorization information, such as user name and password.
- accountToken: a unique identity of a user account.
- dataKey: a unique identity of the data to read, i.e. blockID.
Returns:
- dataStream: a byte stream that a user wants to read.
3.2.2 Non-Functional (Optional): An SSP should also support administrative APIs that are not directly related to operations on data (and are thus referred to as non-functional APIs). Our SDSI supports two types of non-functional APIs: Service-Level Agreement (SLA) and Account Management. The details of the APIs and corresponding protocols [12] are omitted due to limited space.
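To make the two functional APIs of Section 3.2.1 concrete, the sketch below restates them as a Python interface with a toy in-memory SSP behind it. It mirrors only the parameter names given above (authInfo, accountToken, dataKey, dataStream); the XML wire format of SDSI is not shown, and `SDSIService`, `SDSIError`, `InMemorySSP` and the account layout are hypothetical names and assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class SDSIError(Exception):
    """Hypothetical error type for rejected requests."""


class SDSIService(ABC):
    """The two functional operations named in Section 3.2.1."""

    @abstractmethod
    def put_data(self, auth_info, account_token, data_key, data_stream):
        """Store data_stream under data_key for the given account."""

    @abstractmethod
    def get_data(self, auth_info, account_token, data_key):
        """Return the byte stream previously stored under data_key."""


class InMemorySSP(SDSIService):
    """Toy SSP that keeps blocks in memory (illustration only)."""

    def __init__(self, accounts):
        # Assumed layout: account_token -> (user_name, password)
        self._accounts = dict(accounts)
        self._blocks = {}

    def _check(self, auth_info, account_token):
        if self._accounts.get(account_token) != tuple(auth_info):
            raise SDSIError("authorization failed")

    def put_data(self, auth_info, account_token, data_key, data_stream):
        self._check(auth_info, account_token)
        self._blocks[(account_token, data_key)] = bytes(data_stream)

    def get_data(self, auth_info, account_token, data_key):
        self._check(auth_info, account_token)
        try:
            return self._blocks[(account_token, data_key)]
        except KeyError:
            raise SDSIError(f"unknown dataKey: {data_key}") from None
```

A client such as a VSO would call `put_data(("alice", "secret"), "acct-1", "block-0", b"payload")` and later pass the same credentials and key to `get_data`; both the credentials and the key shown here are made-up values.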
4. Case Study: Secure Data Storing/Sharing
This section illustrates the VSO-based storage architecture and SDSI with a case study of secure data storing and sharing in dynamic collaborations. Transferring and storing data "as is" on untrusted SSPs is risky in terms of data integrity and privacy. However, if the data is first fragmented and then encrypted before it is stored on untrusted SSPs, this risk is reduced. The basis for this secure storage service is illustrated in Figure 2 and described as follows:
1) Information content that needs to be stored is fragmented and then encrypted with a VSO-certified driver, using a key that is generated for a specific collaboration and is valid for a specific period of time.
2) The encrypted fragments are distributed to multiple SSPs via SDSI so that no SSP has the complete data required to reconstitute the information.
3) A new pointer object containing the information required for accessing the stored data is generated when all of the fragments have been written to the SSPs. The pointer object is also encrypted, with a different key, to provide another level of protection and a data-sharing mechanism within the collaboration. The data owner can send the encrypted pointer object to the collaborators to share the data.
4) The collaborators can use the pointer object to retrieve the data contents via the certified VSO driver.
Figure 2. Using VSO for secured data storing/sharing
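The four steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the VSO driver itself: SSPs are modelled as plain dicts, fragments are placed round-robin, the pointer object is assumed to hold only fragment locations and nonces, and `keystream_xor` is a toy SHA-256-based stream cipher used purely as a placeholder for whatever vetted cipher a real driver would use. All function names are hypothetical.

```python
import hashlib
import json
import os


def keystream_xor(key, nonce, data):
    """Toy cipher (placeholder, NOT secure): XOR data with SHA-256(key|nonce|counter)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def store(data, data_key, pointer_key, ssps, fragment_size):
    """Steps 1-3: fragment, encrypt, distribute, then build the encrypted pointer."""
    entries = []
    for index, start in enumerate(range(0, len(data), fragment_size)):
        nonce = os.urandom(16)
        block_id = f"frag-{os.urandom(8).hex()}"
        ssp_index = index % len(ssps)  # assumed placement policy: round-robin
        fragment = data[start:start + fragment_size]
        ssps[ssp_index][block_id] = keystream_xor(data_key, nonce, fragment)
        entries.append({"ssp": ssp_index, "block": block_id, "nonce": nonce.hex()})
    pointer_nonce = os.urandom(16)
    body = json.dumps({"fragments": entries}).encode()
    # The pointer is protected with a different key from the data (step 3).
    return pointer_nonce + keystream_xor(pointer_key, pointer_nonce, body)


def retrieve(pointer, pointer_key, data_key, ssps):
    """Step 4: decrypt the pointer, fetch each fragment, decrypt and reassemble."""
    nonce, body = pointer[:16], pointer[16:]
    info = json.loads(keystream_xor(pointer_key, nonce, body))
    parts = []
    for entry in info["fragments"]:
        cipher = ssps[entry["ssp"]][entry["block"]]
        parts.append(keystream_xor(data_key, bytes.fromhex(entry["nonce"]), cipher))
    return b"".join(parts)
```

In this sketch a collaborator needs both the pointer key (to learn where the fragments are) and the collaboration's data key (to decrypt them), so revoking either one is enough to cut off access; how the keys are issued and expired is left to the key management service and is not modelled here.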
5. Prototyping and Benchmarking
A prototype SDSI has been implemented using Microsoft .NET 1.1 as an SSP web service. Microsoft SQL Server 2000 is used as the back-end storage system to host and manage data blocks. Four pairs of the SDSI web service and database are deployed onto four identical Dell machines for performance evaluation. The test driver was also prototyped using Microsoft .NET as a client of the SSPs and run on a separate Dell machine to read and write test data to the SSP storage servers. The detailed specifications of the testing environment are listed in Table 1.

4 x SSP Server
  Hardware: Dell, 1 x 3.0GHz Hyper-Threading P4 CPU, 1.0GB RAM, 1Gbps dedicated LAN
  Software: Microsoft Windows 2003 SP1, IIS, ASP.NET 1.1, SQL Server 2000 SP3
1 x SSP Client (VSO)
  Hardware: Dell, 4 x 1.6GHz Xeon CPU, 3.5GB RAM, 1Gbps dedicated LAN
  Software: Microsoft Windows 2003 SP1, .NET 1.1
Table 1. Specifications of the testing environment

Although multiple SSP Servers may be involved in data reading and writing, we intentionally tested the system performance using a single SSP Server first, to obtain the baseline value for an individual SSP for use in the scalability study. We read and wrote data of variable size (100KB, 1MB, 10MB, 100MB) with a single SSP via SDSI and measured the total time taken to complete the operations, from which the performance metrics defined in equations (1) and (2) were derived. The test results for writing are shown in Figure 3 (a) and (b). To evaluate the scalability of SDSI, we re-ran the above tests against 2 and then 4 SSPs. The results are summarized in Figure 3 (c) and (d). The following observations can be made based on these test results:
• These three data points (1, 2 and 4 SSPs) indicate that the SDSI-based storage service should scale and show reasonable performance improvements as the number of SSPs increases.
• A linear increase in performance is achieved by using dynamic and larger block sizes with more storage servers, which shows that distributed storage based on multiple servers is able to provide security, availability and reliability together with reasonably high performance.
• The performance benefit of using dynamic block sizes indicates the importance of data block sizes and dynamic scheduling, which could lead to interesting research.
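The scalability tests above write one data set to several SSPs at once. The sketch below shows one way such striping could look; it is an assumption-laden illustration, not the test driver. It uses a fixed block size only (the dynamic block size scheme is not described in this paper), models SSPs as dicts, and defines speedup in the conventional way as the single-SSP time divided by the n-SSP time, which is assumed rather than taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(data, block_size):
    """Cut data into consecutive blocks of at most block_size bytes."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def striped_write(data, ssps, block_size):
    """Write blocks round-robin to the SSPs, one worker thread per SSP."""
    blocks = split_blocks(data, block_size)

    def write_share(ssp_index):
        # Worker k handles blocks k, k + n, k + 2n, ... and touches only its own SSP.
        for block_index in range(ssp_index, len(blocks), len(ssps)):
            ssps[ssp_index][block_index] = blocks[block_index]

    with ThreadPoolExecutor(max_workers=len(ssps)) as pool:
        list(pool.map(write_share, range(len(ssps))))  # wait for all workers
    return len(blocks)


def striped_read(ssps, block_count):
    """Reassemble the data by fetching block i from SSP i mod n."""
    return b"".join(ssps[i % len(ssps)][i] for i in range(block_count))


def speedup(time_single_ssp, time_n_ssps):
    """Conventional speedup: baseline time divided by time with n SSPs."""
    return time_single_ssp / time_n_ssps
```

Under this definition, a write that takes 8 s against one SSP and 2 s against four SSPs has a speedup of 4, i.e. linear scaling; these timings are made-up numbers, not measurements from Figure 3.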
In addition to the above tests, we conducted a set of tests with fixed data sizes using the same machines against the following alternative distributed storage systems and transport protocols: Pastry v1.4 [16], FTP/FTPS, and HTTP/HTTPS. For FTP/FTPS testing, we installed a commercial FTP server on one server machine. We used the FTP client to 'put' and 'get' a large amount (3 GB) of data via the plain FTP protocol and the FTPS protocol. The performance metrics are calculated and normalized as follows:

\[ \mathrm{Latency} = \frac{\mathrm{WallTime}}{\mathrm{DataAmount}} \tag{1} \]

\[ \mathrm{Throughput} = \frac{1}{\mathrm{Latency}} \tag{2} \]

where WallTime represents the overall time for writing a certain amount of data into the distributed storage.
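Equations (1) and (2) translate directly into code. The helper names below are hypothetical, and the units (seconds and MB, giving sec/MB and MB/s) are assumed from the axis labels of Figure 4.

```python
def latency_sec_per_mb(wall_time_sec, data_amount_mb):
    """Equation (1): normalized latency = WallTime / DataAmount."""
    if wall_time_sec <= 0 or data_amount_mb <= 0:
        raise ValueError("wall time and data amount must be positive")
    return wall_time_sec / data_amount_mb


def throughput_mb_per_sec(wall_time_sec, data_amount_mb):
    """Equation (2): throughput = 1 / Latency."""
    return 1.0 / latency_sec_per_mb(wall_time_sec, data_amount_mb)
```

As a worked example with made-up numbers (not results from the paper): transferring 3072 MB in 600 s gives a latency of 600 / 3072 ≈ 0.195 sec/MB and a throughput of 1 / 0.195 ≈ 5.12 MB/s.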
[Figure 3 panels: (a) Latency for writing to single SSP (latency in sec vs. data size in bytes: 100K, 1M, 10M, 100M); (b) Throughput for writing to single SSP (throughput in MB/s vs. data size in bytes: 100K, 1M, 10M, 100M); (c) Speedup for writing and (d) Speedup for reading (speedup vs. number of SSPs: 1, 2, 4; series: fixed block size, dynamic block size)]
Figure 3. SDSI performance and scalability
The latencies and throughputs of different storage systems and transport protocols are compared in Figure 4 (a) and (b), respectively. As we can see, SDSI is capable of achieving performance similar to FTP/FTPS (latencies < 1 sec) for both put and get operations, with throughputs of 1.24~7.36 MB/s for the put operation and 1.04~9.43 MB/s for get. This performance is better than that of HTTP/HTTPS and Pastry, even though our SDSI implementation uses Web Services (SOAP over HTTP) and Pastry used the faster server machine as the client for the 'get' test. We infer the possible reasons for the above results as follows:
[Figure 4 (a): Latency for writing; latency in sec/MB on a logarithmic axis (0.01 to 100) by operation (Write/Put, Read/Get); series: SDSI min, SDSI max, FTP, FTPS, HTTP, HTTPS, Pastry]
The performance results for all tests are compared in Figure 4. As we can see, SDSI achieves good performance (latency < 1 sec/MB and throughput > 6 MB/s), comparable with FTP and FTPS and better than HTTP, HTTPS and Pastry. Pastry 1.4 was deployed on 3 server machines for its performance tests. To measure the performance of the writing operation, we started Pastry on one server machine as one peer. Then we started Pastry on another server machine as another peer, put one 1MB file to the group and measured the performance. After getting the key of that file, we ran another Pastry peer on the third server machine to join the group, got the data file with the key and measured the latency of the reading operation. Since all three peers used the server machines, the hardware configuration of this test is slightly different from, and better than, that of the other tests. Because Pastry was still slower than the other systems despite this advantage, the relative performance ordering and the comparisons remain valid.
[Figure 4 (b): Throughput for writing; throughput in MB/s on a logarithmic axis (0.01 to 100) by operation (Write/Put, Read/Get); series: SDSI max, SDSI min, FTP, FTPS, HTTP, HTTPS, Pastry]
Figure 4. Comparisons with alternatives
Overall, SDSI is a compelling distributed storage interface compared to these alternative storage systems and file transport protocols. Higher performance is expected as: (a) more SSPs are involved; (b) the network and servers become faster; and (c) SDSI is bound to faster network protocols and dedicated network tunnels [1].
• The Web Services transport protocol SOAP can use HTTP/HTTPS more efficiently than IIS;
• SDSI can scale its performance by using multiple storage servers, i.e. the performance shown as SDSI max;
• Compared to Pastry, SDSI is simpler and supports dynamic block sizes, allowing a tradeoff between concurrency and its cost.

6. Related Work
Distributed storage has attracted increasing attention and effort from both research communities and industry over the past few years, which has led to a large body of previous work related to this paper. We review this work in two aspects: architecture and interface.
FARSITE [6] is Microsoft's initiative to use a large number of desktop PCs as data storage to replace expensive storage devices. As a result, it is more likely to be used within an organization for non-mission-critical applications. OceanStore [8] and PAST [9] are both large-scale distributed storage systems that use P2P technologies [10] to automatically maintain data consistency. However, they may not be suitable for dynamic coalitions that require highly secure, high-performance storage for sharing large data sets. SRB [11] is a Data-Grid middleware widely deployed by the Grid Computing community in an attempt to provide a uniform interface to heterogeneous data storage resources and other distributed data management functionalities. While SRB strives to provide a single-box solution to online storage, we try to isolate each component/service in the virtual storage service architecture using SDSI and SOA. Our SDSI work is similar to iSCSI [13], FCIP [14] and iFCP [15] in terms of motivation and functionality, i.e. exposing and connecting storage devices over a WAN. But our work differs from these protocols in the following aspects:
• Different design principles: iSCSI, FCIP and iFCP were developed by extending existing protocols, and therefore inherit the heterogeneity, complexity and fixed functionality of their predecessors. SDSI is a completely new, simple and uniform interface, yet extendable for new storage and business functionalities.
• Working at higher transport layers: iSCSI, FCIP and iFCP are all tightly bound to TCP/IP, but SDSI is transport-independent and thus can work at higher transport layers, such as SOAP/HTTP.
• Requirement for hardware: Although iSCSI, FCIP and iFCP may transfer data faster than SDSI, they all require specific hardware, such as iFCP/FCIP gateways and switches, at both ends to support the corresponding commands/protocols. SDSI has no specific requirements for hardware/software, but provides similar functionality to the standard services.
There are two distributed infrastructures [7, 17] similar to our VSO architecture, but they differ from it in terms of scope, purpose and business model. While [7] is Google's internal distributed file system for storing and managing almost-read-only data, [17] is a content service provider that delivers a high-performance web content hosting service globally. However, some design decisions and the operational models of [7] and [17] can be referred to when building an instance of a VSO for a specific purpose.
7. Conclusion and Future Work
In this paper, we presented the requirements for data storage and envisaged the future storage services needed to facilitate dynamic coalitions. We then introduced our virtual storage architecture and a Simple Distributed Storage Interface (SDSI). SDSI plays an intermediate role in forming Internet-scale storage capacities and offers more opportunities to deliver new data storing/sharing services for a variety of dynamic collaborations. A secure data storing and sharing example is presented as a case study of a new virtual storage service using SDSI. A Web Services based SDSI was prototyped and tested as a proof of concept and for performance evaluation. The preliminary benchmarking results show reasonable performance and compelling scalability compared to the alternative distributed systems and data transport protocols. Currently we are working on incorporating the virtual storage prototype and other technologies into a secure dynamic collaboration framework prototype.
8. References
[1] J. Chan, G. Rogers, D. Agahari, D. Moreland, J. Zic. Enterprise collaborative contexts and their provisioning for secure managed extranets. 15th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE'06), pp. 313-318.
[2] BitTorrent. http://www.bittorrent.com/
[3] Amazon S3. www.amazon.com
[4] H. Yu, G. Wang, G. Zhang, X. Wang. The Second-Preimage Attack on MD4. CANS 2005: 1-12.
[5] E. Jopling and D. Neill. VNO Phenomenon Could Shake Up the World's Telecom Market. Gartner Research report G00131283, November 2005.
[6] A. Adya et al. FARSITE: Federated, available and reliable storage for an incompletely trusted environment. In Proc. of OSDI, Dec. 2002. http://research.microsoft.com/~adya/pubs/osdi2002.pdf
[7] S. Ghemawat, H. Gobioff and S.-T. Leung. The Google File System. In Proc. of the 19th ACM SOSP, Bolton Landing, NY, USA, 2003. http://labs.google.com/papers/gfs.html
[8] J. Kubiatowicz et al. OceanStore: An Architecture for Global-Scale Persistent Storage. In Proc. of ASPLOS, Nov. 2000. http://oceanstore.cs.berkeley.edu/publications
[9] A. Rowstron and P. Druschel. Storage management and caching in PAST – A large-scale persistent peer-to-peer storage utility. In Proc. of ACM SOSP, 2001. http://research.microsoft.com/~antr/PAST/past-sosp.pdf
[10] B. Y. Zhao, J. Kubiatowicz, A. D. Joseph. Tapestry: A fault-tolerant wide-area application infrastructure. Computer Communication Review 32(1): 81, 2002.
[11] A. Rajasekar, M. Wan and R. Moore. MySRB & SRB components of a data grid. In Proc. of HPDC-11, July 24-26, 2002. www.sdsc.edu/dice/pubs/hpdc11-mysrb.pdf
[12] S. Chen, et al. SDSI: Simple Distributed Storage Interface. CSIRO ICT Centre Technical Report No. 06/322, 2006.
[13] RFC 3720: Internet Small Computer Systems Interface (iSCSI), April 2004. http://www.ietf.org/rfc/rfc3720.txt
[14] RFC 3821: Fibre Channel over TCP/IP (FCIP). http://www.ietf.org/rfc/rfc3821.txt?number=3821
[15] RFC 4172: iFCP - A Protocol for Internet Fibre Channel Storage Networking. http://www.ietf.org/rfc/rfc4172.txt?number=4172
[16] Pastry v1.4. http://research.microsoft.com/~antr/Pastry/
[17] Akamai Technologies. Fast Internet Content Delivery with FreeFlow. Technical Report, April 2000. http://www.cs.washington.edu/homes/ratul/akamai/freeflow.pdf