Object Storage: Scalable Bandwidth for HPC Clusters

Garth A. Gibson, Brent B. Welch, David F. Nagle, Bruce C. Moxon
{ ggibson, bwelch, dnagle, bmoxon }@panasas.com
Panasas Inc., 6520 Kaiser Dr., Fremont, CA 94555

Abstract. This paper describes the Object Storage Architecture solution for cost-effective, high-bandwidth storage in High Performance Computing (HPC) environments. An HPC environment requires a storage system that scales to very large sizes and performance without sacrificing cost-effectiveness or ease of sharing and managing data. Traditional storage solutions, including disk-per-node, Storage Area Network (SAN), and Network-Attached Storage (NAS) implementations, fail to find a balance between performance, ease of use, and cost as the storage system scales up. In contrast, building storage systems as specialized storage clusters using commodity-off-the-shelf (COTS) components promises excellent price-performance at scale, provided that binding them into a single system image and linking them to HPC compute clusters can be done without introducing bottlenecks or management complexities. While a file interface (typified by NAS systems) at each storage cluster component is too high-level to provide scalable bandwidth and simple management across large numbers of components, and a block interface (typified by SAN systems) is too low-level to avoid synchronization bottlenecks in a shared storage cluster, an object interface (typified by the inode layer of traditional file system implementations) is at the intermediate level needed for independent, highly parallel operation at each storage cluster component under centralized, but infrequently applied, control. The Object Storage Device (OSD) interface achieves this independence by storing an unordered collection of named variable-length byte arrays, called objects, and embedding extensible attributes, fine-grain capability-based access control, and encapsulated data layout and allocation into each object. With this higher-level interface, object storage clusters are capable of highly parallel data transfers between storage and compute cluster nodes under the infrequently applied control of out-of-band metadata managers. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.

1 The HPC Storage Bandwidth Problem

A structural transformation is taking place within the HPC environment. Traditional low-volume, proprietary systems are being replaced with clusters of computers made from commodity, off-the-shelf (COTS) components and free operating systems such as Linux. These compute clusters deliver new levels of application performance and allow cost-effective scaling to 10s and, soon, 100s of Tflops. The large datasets and main memory checkpoints of such science-oriented cluster computations also demand record-breaking data throughput from the storage system. One rule of thumb is that 1 GB/sec of storage bandwidth is needed per Tflop in the computing cluster [SGSRFP01].

Complicating matters for the HPC community is the fact that storage bandwidth issues are given low priority by mainstream storage vendors, because it is expensive and difficult to provide high bandwidth using traditional storage architectures and there is a limited market for systems that scale to HPC levels. As an example of this challenge, BP's seismic analysis supercomputing, which cost as much as $80 million per Tflop in 1997, today costs about $2 million per Tflop. The 170 TB of storage on this 7.5 Tflop Intel/Linux cluster today costs about $15 thousand per TB. BP hopes to cut the cost of each Tflop and each TB by as much as 50% by the end of 2003 [Knott03]. Combining BP's example with the bandwidth rule of thumb and adjusting to round numbers gives us a simple model for science-oriented cluster computing requirements for early in 2004:

• per Tflop, a cluster roughly needs 10 TB of storage sustaining 1 GB/sec and costing $100,000
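For illustration only, this sizing model can be applied directly to a cluster of a given size; the constants below are the round-number targets stated above, and the 7.5 Tflop figure is used only as a hypothetical example.

# Back-of-the-envelope sizing from the early-2004 model above (illustrative only).
TB_PER_TFLOP = 10          # storage capacity target
GBPS_PER_TFLOP = 1.0       # sustained storage bandwidth target
DOLLARS_PER_TFLOP = 100_000  # storage budget target, USD

def storage_targets(tflops):
    """Capacity (TB), sustained bandwidth (GB/sec) and budget (USD) for `tflops` of compute."""
    return tflops * TB_PER_TFLOP, tflops * GBPS_PER_TFLOP, tflops * DOLLARS_PER_TFLOP

print(storage_targets(7.5))   # a hypothetical 7.5 Tflop cluster -> (75.0, 7.5, 750000.0)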


Capital equipment costs are only a part of the total cost of ownership. The cost of operating, or managing, a storage system often adds up to more than the capital costs over the lifetime of the system. Storage management tasks include installing and configuring new hardware, allocating space to functions or users, moving collections of files between subsystems to balance load and capacity, taking backups, replacing failed equipment and reconstructing or restoring lost data, creating new users, and resolving performance problems or capacity requests from users. The cost of storage management tasks is driven by the loaded labor rates of experienced Linux/Unix cluster, network, server and storage administrators and is typically calculated as cluster nodes per administrator or terabytes per administrator.

HPC clusters, in contrast to monolithic supercomputers, have many more subsystems to be managed. COTS clusters, which are typically built from comparatively small computers and storage subsystems, are likely to have the highest number of subsystems per Tflop provided. Additionally, computational algorithms for clusters usually decompose a workload into thousands or millions of tasks, each of which is executed (mostly) independently. This algorithmic strategy often requires the decomposition of stored data into partitions and replicas, whose placement and balancing in a cluster can be a time-consuming set of tasks for cluster operators and users, especially in large cluster and grid computing environments shared amongst a number of projects or organizations, and in environments where core datasets change regularly.

With these extra storage management difficulties compounding the cutting-edge demands of HPC cluster scalability and bandwidth, HPC cluster designers need to carefully consider the storage architecture they employ. In this paper we review common hardware and software architectures and contrast these with the new Object Storage architecture for use in HPC cluster storage. Qualitatively, we seek storage architectures that:

• Scale to PBs of stored data and 100s of GB/sec of storage bandwidth.

• Leverage COTS hardware, including both networking and disk drives, for cost-effectiveness.

• Unify management of storage subsystems to simplify and lower operational costs.

• Share stored files with non-cluster nodes to simplify application development and experiment pre- and post-processing.

• Grow capacity and performance incrementally and independently to cost-effectively customize to a cluster application's unique balance between size and bandwidth.

2 Storage Architectures for Cluster Computing

The fundamental tradeoffs in storage architectures are tied to two basic issues: 1) the semantics of the storage interface (i.e., blocks versus files), and 2) the flow of metadata and data (i.e., control and data traffic flow) between storage and applications. The interface is important because it defines the granularity of access, locking and synchronization and the security for access to shared data. Traffic flow fundamentally defines the parallelism available for bandwidth. The architectural flexibility and implementation costs of these two basic storage properties ultimately determine the performance and scalability of any storage architecture.

Consider the common block-based disk interface, commonly referred to as Direct Attached Storage (DAS) and Storage Area Network (SAN). DAS and SAN have historically been managed by separate file system or database software located on a single host. Performance is good at small scale, but bottlenecks appear as these systems scale. Moreover, fine-grained data sharing among different hosts is difficult with DAS, requiring data copies that significantly reduce performance. Therefore, most file systems are not distributed over multiple hosts, except to use a single secondary host for failover, in which the secondary host takes over control of the shared disks when the primary host fails.

The high-level network file service interfaces, including the NFS and CIFS protocols and broadly known as Network Attached Storage (NAS), overcome many of the block-level interface limitations. Presenting a file/directory interface, NAS servers can dynamically and efficiently mediate requests among multiple users, avoiding the sharing problems of DAS/SAN. The high-level file interface also provides secure access to storage and enables low-level performance optimizations, including file-based pre-fetching and caching of data.

Fig. 1. Traditional Scalable Bandwidth Cluster vs. Out-of-Band Scalable Bandwidth

However, the traditional NAS in-band data flow forces all data to be copied through the server, resulting in a performance bottleneck at the server's processor, memory subsystem, and network interface. To overcome this limitation, recent storage architectures have decoupled data and metadata access using an out-of-band architecture where clients fetch metadata from servers but access data directly from storage, avoiding the server bottleneck. Unfortunately, the clients' out-of-band data accesses must utilize the DAS/SAN block-based interface, working behind the NAS file-level interface and eliminating many of the security and performance benefits provided by NAS.

The new object-based storage architecture combines the performance benefits of DAS with the manageability and data sharing features of NAS. The object-based storage interface is richer than the block-based interface, hiding details like sectors by providing access to objects: named ranges of bytes on a storage device whose access requests are cryptographically signed to enable secure sharing among untrusted clients. Moreover, the object-based storage architecture efficiently supports out-of-band data transfer, enabling high-bandwidth data flow between clients and storage while preserving many of the performance and security enhancements available for NAS systems. In the following sections, we discuss the tradeoffs between different block-based, NAS-based and object-based systems with in-band and out-of-band data movement.

2.1 Scaling at the Disk Abstraction

Direct Attached Storage (DAS) and its evolution as Storage Area Network (SAN) storage are the dominant storage architectures in use today.

2.1.1 Disk Per Node

Because the commodity PC components used in the nodes of most HPC clusters usually come with a local disk, some cluster designers have chosen to use these disks as the cluster's primary storage system [Fineberg99]. In some ways this is a superb HPC solution because today each disk in a node provides 10 to 20 MB/sec of storage bandwidth and 80 to 200 GB of storage capacity for $1 to $2 per GB. Given that compute nodes today provide 2 to 10 Gflops, depending on the number and type of CPUs, a one-disk-per-node storage solution offers 0.2 to 0.5 GB/sec bandwidth per Tflop and 2 to 4 TB capacity per Tflop. With two to five disks per node, this approach to building cluster storage systems meets our 2004 storage system target bandwidth and capacity, and has capital costs that are much less than our 2004 target cost.


Fig. 2. HPC Cluster with Disk per Node Storage

The downside of this approach to cost-effective scalable storage bandwidth is its implications for programmer complexity, reliability and manageability. Programmer complexity arises because the data path between a disk and a processor is only fast and dedicated when both are on the same node. With a simple disk-per-node storage architecture there is no path at all when the data needed at a processor is on a disk on a different node. For these non-local accesses it is the application programmer's job to move the data as an integral part of an application. Algorithms have to be tailored to this restriction, often specializing the uses of the cluster to only a very few applications that have been appropriately tailored, such as out-of-core database sort machines [Fineberg99]. Storage-tailored applications unable to adapt computation on each node to where the data is dynamically found must first move the data. For instance, if input to a 32-node run is determined by output from a prior run on 16 different nodes, special transforming copy programs may be needed to transform the 16-node output files (on nodes 10-25, for example) into the 32-node input files (on nodes 4-35). This extra transforming work effectively reduces compute performance and storage bandwidth, and it costs scarce human development time.

With data stored across the nodes of the computing cluster, a compute node failure is also the loss of gigabytes of stored data. If applications must also be written to create, find and restore replicas of data on other nodes, development time may be significantly increased, and data dependability weakened. If a system service implements mirroring across nodes, then useful capacity and write bandwidth are at least halved. Additionally, disks inside a compute node add to the reasons that a node fails. Five disks significantly increase node failure rates, possibly causing users to take more frequent checkpoints, which also lowers storage bandwidth. Adding RAID across the disks inside a node can reduce the frequency with which a disk failure causes immediate failure of the node, but it lowers per-node capacity, lowers per-node bandwidth, and adds per-node cost for RAID controllers. And because the rate of individual disk failure is not changed by RAID, administrators still have to get inside cluster nodes and replace failed disks, a potentially disruptive and error-prone management activity.

Administrators are also saddled with the management of a relatively small file system on every node. Balancing the available capacity without impacting programmer decisions about where data is stored is hard. And because small file systems fill quickly, administrators are likely to be called far more often to inspect and manipulate 1000s of small file systems. Moreover, the compute capabilities and storage capabilities of the cluster are not independently changeable. Based on characteristics of the important applications, and the sizes of cost-effective disks and CPUs, there may be only a small number of reasonable ratios of disk to CPU, leading to over-designed systems, another way to pay more for what you need. For large clusters, this co-dependency leads to severe capacity increment scenarios. For example, to upgrade the overall storage capacity of a 1000-node cluster using 36 GB local disks, the smallest capacity increment may be 36 TB (upgrading a 36 GB drive to a 72 GB drive on each of 1000 nodes). In addition, there can be very extensive periods of downtime for the cluster, measured in weeks, to accomplish an upgrade of the disk in every node of a large cluster.


Finally, if all data is stored on the nodes of the cluster, then pre-processing or post-processing of experimental data, or application development, reaches into the nodes of the cluster even if the cluster is currently allocated to some other application. This effectively turns the cluster into a massively parallel file system server that also timeshares with HPC applications. Since file servers enforce access control to protect stored data from accidental or malicious unauthorized tampering, massively parallel file systems sharing COTS nodes with all applications suffer from the file system analog of pre-multi-tasking operating systems: just as executing all applications and the operating system in a single address space exposes the stability of all applications and the node itself to bugs in or attacks on any one application, executing the access control enforcement code on the same machines as the applications whose access is being controlled exposes the entire storage system to damage if any node has bugs in its file system code or is breached by an attack. Since COTS clusters exploit rapidly evolving code, often from multi-organizational open source collaborations, bad interactions between file system and operating system code, imperfect ports of system-level code, and trapdoors in imported code put the integrity of all stored data at risk.

Inevitably, these major inconveniences drive HPC cluster administrators to maintain the permanent copy of data off the cluster, in dedicated and restricted-function servers, devoting the disks on each compute node to replicas and temporaries. This means applications must perform data staging and destaging, where application datasets are loaded from a shared server prior to job execution, and results are unloaded back to a shared server when done. Staging and destaging adds to execution time, lowering effective storage bandwidth and compute performance again. In some environments, particularly multi-user or multi-project environments, staging and destaging can waste as much as 25% of the available time on the cluster. These issues, and the desire for a more manageable cluster computing environment, have driven many facilities to look at shared-storage cluster computing models, where a significant portion of, if not all, application data is maintained and dynamically accessed directly.

2.1.2 SAN Attached Disk per Node

Commercially, most high-end storage systems are Storage Area Network (SAN) storage: big disk arrays offering a simple disk block service interface (SCSI) and interconnected with other arrays and multiple host systems for fault tolerance, so that after a host fails another host can see the failed host's storage and take over the management of its data. SAN networking is usually FibreChannel, a packetized SCSI network protocol built on the same physical wires and transceivers as Gigabit Ethernet and transmitting at 128 or 256 MB/s. While the first generation of FibreChannel only expanded on parallel SCSI's 16 addresses with 126 addresses in an arbitrated loop, later versions, called Fabrics, can be switched in much larger domains.

Fig. 3. SAN Attached Disk per Node Cluster

SAN storage offers a technological basis for consolidating the disks of a disk-per-node solution outside of the nodes on a shared network. Using disk array controllers providing RAID and virtualized disk mappings, so that a set of N physical disks looks like a set of M logical disks, where M is the right number for the disk-per-node solution, allows SAN storage to overcome the reliability, availability and incremental growth problems of the disk-per-node solution. Unfortunately, SAN NICs (called host bus adapters), SAN switches and SAN RAID controllers and subsystems are far less cost-effective than PC disks. Because the high-end commercial market has relatively small volume and relatively small numbers of relatively big nodes, FibreChannel SAN equipment is expensive: factors of four higher capital costs for individual components are not unusual, and the external consolidated storage approach adds another switched network in addition to the cluster interconnection network, with additional NICs and switches, which the disk-per-node approach did not have. The cost of distributing a FibreChannel SAN storage system over all nodes of a COTS cluster is generally prohibitive. The recent IETF definition of iSCSI [Satran2003], a mapping of SCSI, the transport and command protocols used in FibreChannel SANs, onto the IETF's TCP/IP transport and network protocols, where it can be run on commodity Gigabit Ethernet, may renew interest in SAN storage systems. The cost of an iSCSI SAN may be much less than that of a FibreChannel SAN once iSCSI is deployed and in volume.

Traditionally, SAN storage executes any command arriving from any accessible node, because in the past the only nodes that could reach storage were the trusted primary and backup server hosts attached via a daisy-chained SCSI ribbon cable. With consolidated SAN storage, it becomes convenient to pool the storage of many computing systems on one SAN network where tape archive systems and SAN storage management servers are also available, exposing cluster storage to errors, bugs and attacks on non-cluster systems, and vice versa. For example, in a past release of a server operating system, new manageability code tried to help administrators by formatting any disk storage it could see, even if those disks were owned and in use by another host!

FibreChannel has provided a first-step improvement over the total lack of access control in SAN storage. Using the time-honored memory protection key scheme, where a requestor presents a key (its host ID, for example) with every request and the storage validates that this key is in the access list for a specific virtual disk, FibreChannel can detect accidental misaddressing of a request by an incorrect node. However, because the unit of access control, which is the granularity of accident detection, is large (an entire virtual disk), and because any node that might ever need to touch any part of a virtual disk must always be listed, FibreChannel access control is useful only for isolating completely independent systems attached to the same SAN. Finer-grained controls, and dynamically changing access rules, required for true data sharing with robust integrity for storage metadata and files, must be provided by host-based cooperating software.
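To make the coarseness of this check concrete, here is a minimal, purely illustrative sketch of the per-virtual-disk access list described above; the host and virtual-disk identifiers are hypothetical and this is not any vendor's implementation.

# Purely illustrative: FibreChannel-style access control is all-or-nothing per virtual disk.
ACCESS_LIST = {
    "vdisk-7": {"host-A", "host-B"},   # every host that might EVER touch vdisk-7 must be listed
}

def may_access(host_id, vdisk_id):
    """Coarse check: a listed host may issue ANY command to the whole virtual disk."""
    return host_id in ACCESS_LIST.get(vdisk_id, set())

print(may_access("host-A", "vdisk-7"))   # True: host-A can read or write anything on vdisk-7
print(may_access("host-C", "vdisk-7"))   # False: accidental misaddressing by host-C is caught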

2.1.3 Cluster Networking and I/O Nodes

A key limitation of the external shared storage approach, particularly evident in FibreChannel SAN architectures, is the cost of interconnecting all cluster nodes with external storage. The cost of a SAN NIC and SAN switch port for every cluster node is comparable to the node's cost. Moreover, HPC clusters usually already have a fully interconnected, high-performance switching infrastructure, usually optimized for small packets and low latency as well as high bandwidth, such as Quadrics, Myrinet or Infiniband. While these cluster-specialized networks are also expensive, they are needed for the computational purpose of the HPC cluster and are unlikely to be replaced by a storage-optimized network such as FibreChannel. Instead, cluster designers often seek to transport storage traffic over the cluster network. Because storage devices, particularly cost-effective commodity storage devices, are not available with native attachment to HPC cluster-specialized networks, cluster designers often designate a subset of cluster nodes as I/O nodes and equip these with both cluster and storage NICs, as illustrated in Figure 4a.


Fig. 4a. Storage Protocol Conversion in I/O Nodes

The primary function of such an I/O node is to convert a storage request arriving on its cluster network NIC into a storage request leaving on its storage network NIC. This I/O node architecture avoids the cost of provisioning two complete network infrastructures to each node in the cluster. In order to limit the fraction of the cluster compute capacity lost to providing I/O service, the whole I/O node is usually devoted to this function and its capacity for copying data from one protocol stack to another is fully exploited. With a SAN-attached disk-per-node storage architecture, an I/O node's protocol conversion is very simple: it terminates a connection in the cluster network's transport layer, collects the embedded disk request, then wraps that request in the storage network's transport layer and forwards it into the storage network. The cost of this protocol conversion is the I/O node, its two NICs (or more if the bandwidths of the cluster and storage networks are not matched), and a network switch port for each NIC. While this can be much less expensive than equipping each node with both networks, it can still be a significant cost if high bandwidth is sought, especially in comparison to disks embedded in each node.

The simple protocol conversion, or bridging, function of such I/O nodes lends itself to being offloaded into the networks themselves. As illustrated in Figure 4b, a multi-protocol switch contains line cards, or blades, with different types of network ports and employs hardware protocol conversion, allowing cluster networks such as Myrinet or Infiniband to efficiently and cost-effectively switch data with storage networks such as FibreChannel (FCP/FC) or Ethernet (iSCSI/GE) [Seitz02, Topspin360]. Instead of terminating a cluster network connection, parsing the embedded storage request, and proxying each request into the storage network, the storage connection can be "tunneled" through the compute cluster connection. In this approach the payloads of a compute cluster network connection are the routing-layer packets of a storage connection.
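The bridging role of an I/O node can be pictured with a small sketch. This is an assumption-laden illustration, not a real I/O node: it relays bytes between plain TCP sockets on both sides, whereas a real I/O node terminates a cluster transport such as GM and re-wraps requests in FCP or iSCSI; the addresses below are hypothetical.

# Minimal sketch of an I/O-node-style bridge: terminate a cluster-side connection and
# relay the request bytes over a second connection to the storage-side network.
import socket, threading

CLUSTER_SIDE = ("0.0.0.0", 9000)          # hypothetical listening address on the cluster network
STORAGE_SIDE = ("storage-target", 3260)   # hypothetical storage target (e.g., an iSCSI portal)

def relay(src, dst):
    # copy bytes one way until the sender closes; requests flow in, responses flow back
    while True:
        data = src.recv(65536)
        if not data:
            break
        dst.sendall(data)

def serve():
    listener = socket.socket()
    listener.bind(CLUSTER_SIDE)
    listener.listen()
    while True:
        cluster_conn, _ = listener.accept()
        storage_conn = socket.create_connection(STORAGE_SIDE)
        threading.Thread(target=relay, args=(cluster_conn, storage_conn), daemon=True).start()
        threading.Thread(target=relay, args=(storage_conn, cluster_conn), daemon=True).start()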

Fig. 4b. HPC Cluster Network with Bridging Switch


For example, Myricom's new M3-SW16-8E switch line card connects up to 8 Gigabit Ethernet ports into an 8-link Myrinet backplane fabric. This provides a seamless conversion between Ethernet's and Myrinet's physical layers, eliminating the need for multiple switch infrastructures and reducing by half the number of switch ports and NICs employed between the endpoint cluster node and storage device. To transport storage or internet data, a Myrinet client node encapsulates TCP/IP traffic inside Myricom's GM protocol. Received by the protocol conversion switch, the storage or internet TCP/IP packets are stripped of their GM headers and then forwarded over Ethernet to an IP-based storage or internet destination.

Multi-protocol switches such as the one described in this Myricom example, and similar products being introduced for Infiniband clusters from vendors such as Topspin, essentially eliminate the cost of I/O nodes that only convert storage requests from one network protocol to another. Because of the widespread use of TCP/IP on Ethernet, it is the "second" protocol in most of these new products. IP is a particularly appealing protocol because it is ubiquitous, has a universal address space, and is media independent, allowing multi-protocol switches to reach other non-storage links using the same networking protocols inside the same compute cluster tunnels. While FibreChannel line cards will also become available, Ethernet's cost-effectiveness and applicability to internet traffic as well as storage traffic makes iSCSI-based storage protocols very appealing.

2.1.4 Summary

Attaching a small number of disks to every node in an HPC cluster provides a low-cost, scalable-bandwidth storage architecture, but at the cost of:

• complexity for programmers to effectively use local disks,

• complexity for administrators to back up, grow and capacity-balance many small file systems,

• susceptibility to data loss with each node failure,

• competition for compute resources from pre- and post-processing, and

• susceptibility of data and metadata to damage caused by bugs in a cluster node's local software.

Network disk, or SAN storage, whether the dominant FibreChannel or the newcomer iSCSI, appears to each node as a logical disk per node, retaining the programming complexity, the integrity exposures, the competition with pre- and post-processing, and the weakness of managing 1000s of small file systems. But network disk storage does simplify physical configuration and growth complexity. Network disk storage also decouples node and storage failure handling. Unfortunately, network disk storage also requires a significant additional storage network infrastructure investment, although this is greatly ameliorated by multi-protocol switching and iSCSI transport of disk commands over TCP/IP and Ethernet.

2.2 Scaling at the File Abstraction

A shared file system is the simplest approach for users and programmers, and the most manageable approach for administrators, providing convenient organization of a huge collection of storage.

2.2.1 NAS Servers for Shared Repository

To date, the most common approach to providing a shared repository outside of the cluster entails the use of dedicated multi-TB Network-Attached Storage (NAS) systems. An external NAS system is almost the opposite of the disk-per-node solution; that is, a NAS server is a good solution for reliability, availability and manageability, but a weak solution for bandwidth.


Fig. 5. HPC Cluster with NAS Storage Repository

With a NAS repository, all data is accessed externally, so transforming the data layout to match the compute layout gets done implicitly with every access. Storage redundancy for reliability is offloaded from the cluster to the NAS system. Since a NAS system is usually built as a primary/backup front end for one or more fault-tolerant disk arrays, it is explicitly designed for high data reliability and availability. Moreover, because NAS systems simply distribute the single system image of a single file system, a single NAS system is generally taken as the benchmark standard for simple, inexpensive storage management, including incremental capacity growth independent of cluster characteristics. Finally, because a NAS system supports access-control-enforced file sharing as its most basic use case, application development, pre- and post-processing, and data staging and destaging can all occur in parallel without interfering with the applications running on the HPC cluster.

Unfortunately, a traditional NAS system delivers a fraction of 100 MB/sec per file, and aggregates of at most a few hundred MB/sec. When most NAS vendors advertise scaling, they mean that a few NICs and a few hundred disks can be attached to a single (or possibly dual-failover) NAS system. Getting one GB/sec from the files in one directory is virtually unheard of in NAS products. For HPC purposes, scaling NAS performance and capacity means multiple independent NAS servers. But multiple NAS servers re-introduce many of the problems of administering multiple file systems seen in the disk-per-node solutions, albeit with many fewer and bigger file systems. For example, with two near-capacity file systems online, an administrator would need to purchase an additional NAS server and migrate data (for both capacity and bandwidth balancing) from the other servers. This typically requires significant downtime while data is moved and applications are re-configured to access data on the new mount points. And any given file is on just one NAS server, so access is only faster if a collection of files is assigned into the namespaces of multiple NAS servers in just the right way (a few files on each NAS server), reducing the primary advantage of NAS, which is its manageability.

An external NAS repository for permanent copies of data staged to or destaged from an HPC cluster containing disks on each node is, unfortunately, not the best of both worlds. It does enable external sharing and management, while allowing cluster algorithms coded for disk-per-node to get scalable bandwidth. But the disks on each compute node still raise cluster node failure rates, still bind the unit of incremental bandwidth growth to the total size of the cluster and still present the administrator with 1000s of small file systems to manage, even if the disks at each node contain only replicas of externally stored data. And staging/destaging time can be significant because of NAS bandwidth limitations.


2.2.2 SAN File Systems

A SAN file system is the combination of SAN-attached storage devices and a multi-processor implementation of a file system, with a piece of the file system distributed onto every compute cluster node. SAN file systems differ from NAS systems by locating the controlling metadata management function away from the storage. A SAN file system overcomes the management inconveniences of a consolidated disk-per-node solution like SAN storage. Rather than 1000s of independent small file systems, a SAN file system is managed as a single large file system, simplifying capacity and load balancing. For large HPC datasets and main memory checkpoints that need unrivalled bandwidth, this direct data access between cluster node and SAN storage device can provide full disk bandwidth to the cluster, limited only by network bisection bandwidth. SAN file systems seek to provide the manageability of NAS with the scalability of SAN, but suffer from the poor sharing support of a block-based interface, which must be compensated for with messaging between the nodes of the cluster.

2.2.2.1 In-Band SAN File Systems

An in-band SAN file system, such as Sistina's GFS [Preslan99], is a fully symmetric distributed implementation of a file system, with a piece of the file system running all of its services on each node of the cluster and no nodes differentiated in their privileges or powers to change the contents of disk storage. Each node's piece of the file system negotiates with any other piece to gain temporary ownership of any storage block it needs access to. To make changes in file data, it obtains ownership and up-to-date copies of file data, metadata and associated directory entries. To allocate space or move data between physical media locations, it obtains ownership of the encompassing physical media and up-to-date copies of that media's metadata. Unfortunately, allocation is not a rare event, so even with minimal data sharing, inter-node negotiation over unallocated media is a common arbitration event. Because of the central nature of arbitrating for ownership of resources, these systems often have distributed lock managers that employ sophisticated techniques to minimize the arbitration bottleneck.

Fully symmetric distributed file systems have the same problems with data integrity as disk-per-node solutions. Bugs in compute node operating systems, or bad interactions between separately ported system services, can cause compute nodes to bypass access control and metadata use rules, allowing any and all data or metadata in the system to be damaged. This is perhaps controllable in a homogeneous HPC cluster with carefully screened updates to the cluster node operating systems, although COTS operating systems such as Linux evolve rapidly with independent changes collected by different integrators. Enabling pre- and post-processing from non-cluster systems concurrent with cluster computations is another problem. Enabling direct access from all non-cluster nodes to shared cluster storage is particularly prone to accidents because of the diversity of machines needing to run the in-band SAN file system software. Most often, SAN file systems run only on the server cluster, and not on all desktops and workstations in the environment, in order to limit the number of different ports that must be fully interoperably correct. But this forces at least some of the server cluster nodes to service storage requests from non-cluster nodes, re-introducing the I/O node approach for proxying a different set of storage requests.
2.2.2.2 Out-of-Band SAN File Systems

Out-of-band SAN file systems, such as IBM's SANergy and EMC's HighRoad [EMC03, IBM03], improve the robustness and administrator manageability of in-band SAN file systems by differentiating the capabilities of the file system code running on cluster nodes from the file system code running on I/O nodes: only I/O node file system software can arbitrate the allocation and ownership decisions, and only this software can change most metadata values. The file system software running on cluster nodes is still allowed to read and write SAN storage directly, provided it synchronizes with I/O nodes to obtain permission, up-to-date metadata and all allocation decisions and metadata changes. Because metadata control is not available in the data path from the cluster node to the storage device, these I/O nodes are called metadata servers or out-of-band metadata servers. Metadata servers can become a bottleneck because the block abstraction of SAN storage is so simple that many cluster write commands will require synchronization with metadata servers [Gibson98].

Unfortunately, the isolation of metadata control on the I/O nodes is by convention only; the SAN storage interface will allow any node that can access it for any reason to execute any command, including metadata changes. There is no protection from accidental or deliberate inappropriate access. Data integrity is greatly weakened by this lack of storage-enforced protection; the block interface doesn't provide the fundamental support needed for multi-user access control that is provided, for example, by separate address spaces in a virtual memory system. Out-of-band metadata file system software running on I/O nodes can also offer the same proxy file system access for non-cluster workstation or desktop clients. The proxy file system protocols used by non-cluster nodes are usually simple NAS protocols, leading the metadata servers to sometimes be called NAS heads.
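The out-of-band division of labor described above can be sketched as follows. This is an illustrative model only (the class and field names are invented), and it also shows why enforcement is by convention: the block device answers any requester.

# Rough sketch of an out-of-band SAN file system read: in-band metadata, out-of-band data.
class MetadataServer:
    """Hands out file layouts (and, in real systems, locks and leases); never in the data path."""
    def __init__(self, layouts):
        self.layouts = layouts                 # path -> list of (device_id, block_number)

    def get_layout(self, path):
        return self.layouts[path]

class BlockDevice:
    """A SAN virtual disk: it executes any request from any node that can reach it."""
    def __init__(self, blocks):
        self.blocks = blocks                   # block_number -> bytes

    def read_block(self, block_number):
        return self.blocks[block_number]       # no per-file or per-user check is possible here

def read_file(mds, devices, path):
    # step 1: ask the metadata server for the layout; step 2: read blocks directly from storage
    return b"".join(devices[dev].read_block(blk) for dev, blk in mds.get_layout(path))

devices = {"lun0": BlockDevice({0: b"hello ", 1: b"world"})}
mds = MetadataServer({"/data/file": [("lun0", 0), ("lun0", 1)]})
print(read_file(mds, devices, "/data/file"))   # b'hello world'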

Fig. 6. HPC Cluster with an Out-of-Band File System

2.2.3 Summary

While the file sharing interface provided by NAS is enjoyed by users, it has had difficulty scaling to meet the performance demands of the HPC environment. SAN solutions can provide good performance but are difficult and expensive to manage. SAN file systems can provide performance and data sharing, but the poor sharing support of the SAN block interface limits scalability.

2.3 Scaling at the Object Abstraction

Storage offering the Object Storage Device (OSD) interface stores an unordered collection of named variable-length byte arrays, called objects, each with embedded attributes, fine-grain access control, and encapsulated data layout and allocation [Gibson98]. The OSD interface, which is described in more detail in Section 3, coupled with an out-of-band storage networking architecture such as shown in Figure 6, improves the scalability of out-of-band SAN file systems because it encapsulates much of the information that an out-of-band SAN file system must synchronize with metadata servers. The OSD interface is richer than the block-based interface used in DAS and SAN, but not as complex as the file-based interface of a NAS (NFS or CIFS, for example) file server. The art in object storage architecture is finding the right level of abstraction for the storage device, one that supports security and performance in the I/O path without limiting flexibility in the metadata management.


Storage objects were inspired by the inode layer of traditional UNIX local file systems [McKusick84]. File systems are usually constructed as two or more layers. The lower layer, inodes in UNIX and objects in OSD, encapsulates physical layout allocation decisions and per-file attributes like size and create time. This simplifies the representation of a file in the upper layer, which handles directory structures, interprets access control and coordinates with environmental authentication services, layering these on top of object storage.

For example, consider file naming. An OSD does not implement hierarchical file names or content-based addressing. Instead, it allows those naming schemes to be implemented on top of a simple (group ID, object ID) naming system and an extensible set of object attributes. To implement a hierarchical naming scheme, some objects are used as directories while others are data files, just as a traditional UNIX file system uses inodes. The semantics of the directory tree hierarchy are known to unprivileged file system code, called clients, running on cluster nodes, and to privileged metadata managers. To the OSD, a directory is just another object. Object attributes include standard attributes like modify times and capacity information, as well as higher-level attributes that are managed by the metadata managers (e.g., parent directory pointer, file type). The OSD operations include operations like create, delete, get attributes, set attributes, and list objects. Most importantly for scalability, changes in the layout of an object on media and most allocations extending the length of an object can be handled locally at the OSD without changing any metadata cached by unprivileged clients.

To support various access control schemes, the OSD provides capability-based access enforcement. Capabilities are compact representations naming only the specific objects that can be accessed and the specific actions that can be done by the holder of a capability. Capabilities can be cryptographically secured. The metadata manager and the OSD have a shared key used to generate and check capabilities. For example, when a client wants to access a file it requests a capability from the metadata manager. The metadata manager enforces access control by whatever means it chooses. Typically it consults ownership and access control list information stored as attributes on an object. The metadata manager generates a capability and returns it to the client, which includes that capability in its request to the OSD. The OSD validates the capability, which includes bits that specify which OSD operations are allowed. Metadata managers can be designed to use various authentication (e.g., NIS, Kerberos, Active Directory) and authorization (e.g., Windows ACLs or POSIX ACLs) schemes to grant access, and rely on the OSD to enforce these access policy decisions without knowledge of the authentication or authorization system in use. Most importantly for scalability, capabilities are small, cacheable and revocable (by changing an attribute of the named object on an OSD), so file system client code can cache many permission decisions for a long time and an out-of-band metadata server can always synchronously and immediately control the use of a capability.
Metadata servers can even change the representation of a file, migrating its objects or reconstructing a failed OSD, with no more interruption to client access than is required to update the small capability the first time the client tries to use it in a way that is no longer valid. With this higher-level interface, object storage clusters are capable of highly parallel data transfers between storage and compute cluster nodes under the infrequently applied control of the out-of-band metadata managers. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.
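As a rough, hypothetical sketch of the control flow just described (not Panasas code and not the T10 protocol), the metadata manager makes the access-control decision once and hands back a small capability, while the OSD enforces that capability on every request.

class MetadataManager:
    """Makes the access-control decision once and issues a small, cacheable capability."""
    def __init__(self, acls):
        self.acls = acls                                   # (user, object_id) -> set of rights

    def open(self, user, object_id):
        rights = self.acls.get((user, object_id), set())
        if not rights:
            raise PermissionError("access denied")
        return {"object_id": object_id, "rights": frozenset(rights)}   # the capability

class OSD:
    """Stores objects and enforces the capability on every request, with no ACL knowledge."""
    def __init__(self, objects):
        self.objects = objects                             # object_id -> bytes

    def read(self, capability, object_id, offset, length):
        if object_id != capability["object_id"] or "read" not in capability["rights"]:
            raise PermissionError("capability does not permit this access")
        return self.objects[object_id][offset:offset + length]

# The client caches the capability and then transfers data without the manager in the path.
manager = MetadataManager({("alice", 0x47): {"read"}})
osd = OSD({0x47: b"checkpoint data"})
cap = manager.open("alice", 0x47)
print(osd.read(cap, 0x47, 0, 10))                          # b'checkpoint'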

2.4 Cost Analysis Example

In this section we examine typical costs for storage systems built from commodity, off-the-shelf (COTS) components, best-in-class server systems, and SAN storage components. Following the example of Beowulf-style HPC compute clusters, we show that Object Storage systems should also be built as a cluster of relatively small nodes, which we characterize as powerful disks rather than thin servers.


To illustrate the tradeoffs between standard practices for building a shared multiple-NAS-server storage system and a comparable clustered COTS storage system, in April 2003 we priced the state-of-the-art hardware needed for a 50 TB shared storage system providing 2.5 GB/sec of bandwidth. Our NAS solution prices were taken from Sun Microsystems' online store. For the COTS pricing we looked at a number of online stores; we report the best pricing, which was for Supermicro 1U rack-mount Intel servers from ASA Servers of Santa Clara, CA. The results are shown in Table 1.

For the multiple-NAS-server system we used five Sun V480 servers (4-way) with 4 GE ports each and built a SAN storage system for each server using 144 GB 10,000 rpm FibreChannel disk drives, Brocade 3800 FC switches and Qlogic 2340 adapter cards. For the COTS storage cluster to run an object storage file system, we priced five 2 GHz Xeon-based metadata servers and forty-one OSDs, each OSD with 6 ATA disks storing 200 GB and sustaining 10 MB/sec each. This is conservative; fewer metadata servers will be needed by most HPC workloads. The object storage cluster is connected through inexpensive 4-port gigabit Ethernet switches as concentrators and Ethernet-to-Myrinet protocol converter blades attaching into the compute cluster's interconnect.

Table 1. NFS NAS vs. COTS Object Storage Costs

                     Sun NFS server    COTS Object Storage
MetaData Server      $234,475          $18,630
50 TB storage        $508,090          $134,904
Ethernet switch      $3,000            $12,500
TOTAL                $744,565          $166,034

Table 1 shows that the object storage COTS hardware cost is 4.5 times lower than that of a multiple-NAS-server solution sustaining the same bandwidth. The largest cost difference is the FibreChannel storage system (including disks and FC switches) in the NAS solution versus ATA disk drives in the COTS hardware. Even if ATA drives are substituted into the NAS configuration (which is not done, for performance reasons), the cost of the NAS solution is $371,379, which is 2.2 times more than the object storage hardware. The second largest cost difference is the expensive NAS server bottleneck; 4-way multiprocessors are used to deliver 4 Gbit/sec of bandwidth per server. In contrast, object storage's metadata servers only require a single commodity Xeon processor per server, because all data movement is offloaded, performed directly between clients and storage.
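The capacity and bandwidth of the COTS configuration can be checked against the 50 TB / 2.5 GB/sec target using the figures quoted above; this is a simple illustrative calculation, not additional data.

osds = 41
disks_per_osd = 6
disk_capacity_GB = 200
disk_bandwidth_MBps = 10

capacity_TB = osds * disks_per_osd * disk_capacity_GB / 1000        # 49.2 TB, close to the 50 TB target
bandwidth_GBps = osds * disks_per_osd * disk_bandwidth_MBps / 1000  # 2.46 GB/sec, close to 2.5 GB/sec
cost_ratio = 744_565 / 166_034                                      # ~4.5, the ratio quoted above
print(capacity_TB, bandwidth_GBps, round(cost_ratio, 1))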

Table 2. Disk per Node + NAS vs. COTS Object Storage Costs

                          Disk per Node + NAS    COTS Object Storage
MetaData Server           $43,995                $18,630
50 TB storage             $508,090               $134,904
Extra 200 GB drive/node   $64,750                --
Ethernet switch           $3,000                 $12,500
TOTAL                     $619,835               $166,034

To reduce the cost of the original NAS system, we also priced a disk-per-node compute cluster with a shared NAS repository that is not capable of the required bandwidth. All data needed for a computation is copied onto 200 GB of local storage attached to each cluster node, allowing local storage to provide each cluster node with sufficient bandwidth. This disk-per-node plus shared NAS solution allows us to eliminate all but one of the NAS servers, reducing system cost to $619,835. With FibreChannel disks, this disk-per-node plus NAS solution is still over 3.5 times more expensive than the object storage system. Even replacing the FibreChannel disks with ATA disks, the object storage is still 33% cheaper. And by not having the disk-per-node data management problems, the object storage solution is more easily managed as well.

3 The Object Storage Architecture

The Object Storage Architecture (OSA) provides a foundation for building an out-of-band clustered storage system that is independent of the computing cluster, yet still provides extremely high bandwidth, shared access, fault tolerance, easy manageability, and robust data reliability and security. Most importantly, because object storage is designed for systems built from COTS components, its storage solutions will be cost-effective. There are two key elements of the OSA that enable exceptional scalability and performance: a high-level interface to storage for I/O, and an out-of-band interface for metadata management. The architecture uses a cluster of Object Storage Devices (OSDs) that is managed by a much smaller set of metadata manager nodes. I/O between the computing cluster and the OSDs is direct; the metadata managers are not involved in the main data path. This is shown in Figure 7.

3.1 High-Level OSD Interface

Each OSD provides a high-level interface to its storage that hides traditional storage details like sectors, partitions, and LUNs. Instead, an OSD is a server for objects, each of which has a range of bytes and an extensible set of attributes. Objects can represent files, databases, or components of files. The high-level interface is necessary in a large-scale system where storage devices are shared by many clients. Traditional block device interfaces have no support for data sharing and access control, making it more difficult to optimize I/O streams from multiple clients with block storage.

Fig. 7. Object Storage Architecture

For example, consider the case where multiple clients are reading large files from the storage cluster simultaneously. When each client issues a READ command for an object, the OSD knows exactly how big that specific object is and where it is located on its disks. The OSD can schedule read-ahead operations on behalf of its clients, and balance buffer space and queue depths among I/O streams. In contrast, in traditional storage systems the operating system or client manages read-ahead by issuing explicit read requests. That approach does not scale well in a distributed storage system. By implementing intelligent read-ahead logic on the storage device, the client is simpler, fewer network operations are required, and the storage device can stream data to multiple clients efficiently.

A WRITE command example may be even more illuminating. When a write is done beyond the end of a file in a block-based SAN file system, the writing client needs to synchronize with its metadata server to allocate additional space on media and modify the file's metadata in its cache and on disk. With object storage, however, the metadata server that issued the client the right to issue write commands to an OSD can mark the client's capability with a quota far in excess of the size of the file. Then the OSD can increase the size of the file without synchronizing with metadata servers at the time of the write, and still not bind actual media to the newly written data until performance optimizations such as write-behind decide it is time to write the media.

The Object Storage Architecture supports high bandwidth by striping file data across storage nodes. Clients issue parallel I/O requests to several OSDs to obtain an aggregate bandwidth in a networked environment that is comparable to the bandwidth obtained from a locally attached RAID controller. In addition, by distributing file components among OSDs using RAID techniques, the storage system is protected from OSD failures. Thus we see that the Object Storage Architecture lets us create a system where many clients are simultaneously accessing many storage nodes to obtain very high bandwidth access to large data repositories. In addition, balanced scaling is built into the system because each storage node has a network interface, processor, and memory, as well as disks.
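As an illustration of the striping idea only (the stripe unit, component names and round-robin layout below are assumptions for the example, not the Panasas layout):

STRIPE_UNIT = 64 * 1024                          # bytes per stripe unit (assumed)
COMPONENTS = ["osd0", "osd1", "osd2", "osd3"]    # hypothetical component objects, one per OSD

def locate(file_offset):
    """Map a file byte offset to (osd_index, offset_within_component) under round-robin striping."""
    stripe_index = file_offset // STRIPE_UNIT
    osd_index = stripe_index % len(COMPONENTS)
    component_offset = (stripe_index // len(COMPONENTS)) * STRIPE_UNIT + file_offset % STRIPE_UNIT
    return osd_index, component_offset

# A 1 MB request touches stripe units on all four OSDs, so the client can issue the reads in parallel.
touched = {locate(off)[0] for off in range(0, 1_048_576, STRIPE_UNIT)}
print(sorted(touched))                           # [0, 1, 2, 3]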

3.2 Objects and OSD Command Set

Drawing on the lessons of iSCSI and FibreChannel, the OSD protocol is designed to work within the SCSI framework, allowing it to be directly transported using iSCSI and providing cluster nodes with a standard protocol for communicating with OSDs. The OSD object model and command set are being defined by SNIA (www.snia.org/osd) and ANSI T10 (www.t10.org) OSD working groups. The basic data object, called a user-object, stores data as an ordered set of bytes contained within the storage device and addressed by a unique 96-bit identifier. Essentially, user-objects are data containers, abstracting the physical layout details under the object interface and enabling vendor-specific OSD-based layout policies. OSDs also support group-objects, which are logical collections of user-objects addressed using a unique 32-bit identifier. Group-objects allow for efficient addressing and capacity management over collections of user-objects, enabling such basic storage management functions as quota management and backup.

Fig. 8. Typical OSD SCSI CDB and Read Service Action

Associated with each object is an extensible set of attributes, which store per-object information. The OSD predefines and manages some attributes such as user-object size (physical size), object create time, object-data last modified time, and object-attribute last modified time. The OSD also provides storage for an extensible set of externally managed attributes, allowing higher-level software (e.g., file systems) to record higher-level information such as user names, permissions, and application-specific timestamps on a per user-object or group-object basis. All OSD attributes are organized into collections (called pages), with 2^16 attributes per page and 2^16 attribute pages; each attribute can be a maximum of 256 bytes. The OSD interprets attributes that are defined by the standard (e.g., last access time) while treating vendor- or application-specific attributes as opaque blobs that are read or updated by higher-level software. OSD operations include commands such as create, delete, get attributes, set attributes, and list objects, as well as the traditional read and write commands. Commands are transported over a SCSI extended CDB (i.e., operation code 0x7F) and include the command, the capability, and any application-defined attributes that are to be set or retrieved as a side effect of the command. When commands complete, they return a status code, any requested data and any requested attributes. This coupling of attribute get and set processing with data access enables atomic access to both data and attributes within a single command, significantly decreasing the complexity of higher-level applications while increasing overall performance by reducing the number of round-trip messages.

To ensure security, OSD commands include a cryptographically signed capability, granting permission to perform a specified set of operations on an object or set of objects. The capability is signed with an SHA-1 digital signature derived from a secret shared between the OSD and manager. The capability defines the minimum level of security (i.e., integrity and/or privacy on the command header and/or data) allowed, a key identifier that specifies which secret OSD key was used to sign the capability, a signature nonce to avoid replay attacks, an expiration time for the capability, a bitmap of permissible operations (e.g., read data, set attribute), and user-object information, including the user-object id, the length and offset of data over which the capability can be applied, and an object creation time. Embedding the object creation date ensures that after a user-object is deleted, any reuse of the object identifier does not create a security hole.

To illustrate the use of an OSD command, consider the following READ example that fetches 255 bytes from object 0x47. The CDB is comprised of a 10-byte header plus the Service Action Specified Fields; the OSD CDB also includes a security capability and specifies any attribute retrievals/updates that are done as a side effect of the command. The prototype READ command CDB is shown below. To issue this command, the initiator (i.e., the client) generates the CDB with the following information:

Byte 0        OPERATION CODE = 0x7F (OSD command)
Byte 6        IS_CDB = 0; IS_DATA = 0; PS_CDB = 0; PS_DATA = 0 (no on-the-wire tests)
Byte 7        Additional CDB length = 176
Bytes 8-9     SERVICE ACTION = 0x8805 (READ)
Byte 10       OPTIONS BYTE = 0x00
Bytes 12-15   OBJECT_GROUP_ID = 0x01
Bytes 16-23   USER_OBJECT_ID = 0x47
Bytes 28-35   LENGTH = 0xFF (255)
Bytes 36-43   STARTING ADDRESS = 0x00 (beginning of the file)
Bytes 44-55   GET ATTRIBUTE = {0x03, 0x01} (return the create time)
Bytes 56-75   SET ATTRIBUTE = {0x03, 0x04, time = 2/21/2003, 10:15 pm}
Bytes 76-176  CAPABILITY = {object id 0x47, accessPermission read object + write attribute {0x03, 0x04} + read attribute {0x03, 0x01}, version 0x123, nonce = 0x221} || SHA1(Secret || CAPABILITY)

The READ command specifies that after the object is read, the last access time (attribute page 0x03, attribute number 0x04) should be set, while the create time (attribute page 0x03, attribute number 0x01) should be returned along with the data. Appended to the command is a capability that defines which object can be accessed, what operations can be performed (accessPermission), and which attributes can be read or written. The entire capability is signed with a secret, allowing the OSD to verify that the capability has not been tampered with. Upon receiving this command, the OSD would: 1) verify the capability signature; 2) fetch the requested object's data; 3) update/fetch the specified attributes; and 4) return the data and specified attributes.
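A minimal sketch of the signature check described above, assuming a toy capability encoding; this is not the T10 wire format, and real capabilities also carry the fields listed earlier, such as the expiration time and nonce.

import hashlib

SHARED_SECRET = b"example-secret"        # shared between metadata manager and OSD (assumed value)

def sign(capability: bytes) -> bytes:
    # the paper's construction: SHA-1 over the shared secret concatenated with the capability
    return hashlib.sha1(SHARED_SECRET + capability).digest()

def osd_check(capability: bytes, signature: bytes, requested_op: str) -> bool:
    if sign(capability) != signature:
        return False                     # not issued by the manager, or tampered with
    permitted = capability.decode().split(",")   # toy encoding: comma-separated permission names
    return requested_op in permitted

cap = b"read_object,get_attr_create_time,set_attr_last_access"
print(osd_check(cap, sign(cap), "read_object"))   # True
print(osd_check(cap, sign(cap), "delete"))        # False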

4 Related Work

The Object Storage Device interface standardization effort can be traced directly to the DARPA-sponsored research project "Network Attached Secure Disks" (NASD) conducted between 1995 and 1999 at Carnegie Mellon University (CMU) by some of the authors of this paper [Gibson98]. Building on CMU's RAID research at the Parallel Data Lab (www.pdl.cmu.edu) and the Data Storage Systems Center (www.dssc.ece.cmu.edu), NASD was chartered to "enable commodity storage components to be the building blocks of high-bandwidth, low-latency, secure scalable storage systems."


From prior experience defining the RAID taxonomy at Berkeley in 1988 [Patterson88], the NASD team understood that it is industry adoption of revolutionary ideas that yields impact on technology, so in 1997 CMU initiated an industry working group in the National Storage Industry Consortium (now www.insic.org). This group, including representatives from CMU, HP, IBM, Seagate, StorageTek and Quantum, worked on the initial transformation of CMU NASD research into what became, in 1999, the founding document of the "Object Storage Device" working groups in the Storage Networking Industry Association (www.snia.org/osd) and the ANSI X3 T10 (SCSI) standards body (www.t10.org). Since that time the OSD working group in SNIA has guided the evolution of Object Storage Device interfaces, as member companies experiment with the technology in their R&D labs. Today the SNIA OSD working group is co-led by Intel and IBM, with participation from across the spectrum of storage technology companies.

CMU's NASD project was not the only academic research contributing to today's understanding of Object Storage. In the same timeframe as the NASD work, LAN-based block storage was explored in multiple research labs [Cabrera91, Lee96, VanMeter98]. Almost immediately, academics leapt to embedding computational elements in each smart storage device [Acharya98, Keeton98, Riedel98]. A couple of years later, more detailed analyses of transparency, synchronization and security were published [Amiri00, Anderson01, Burns01, Miller02, Aguilera03]. Initiated from CMU during the NASD project, Peter Braam's ambitious and stimulating open-source object-based file system continues to evolve [Lustre03]. And just this year, a spate of new object storage research has addressed scalability for both the object storage devices and the metadata servers [Azagury02, Brandt03, Bright03, Rodeh02, Yasuda03].

5. Conclusion

High Performance Computing (HPC) environments make exceptional demands on storage systems to provide high capacity, high bandwidth, high reliability, easy sharing, low capital cost and low operating cost. While the disk-per-node storage architecture, which embeds one or a few disks in each computing cluster node, is a very capital-cost-effective way to scale capacity and bandwidth, it has poor reliability, high programmer complexity, inconvenient sharing and high operating costs. SAN-attached virtual-disk-per-node storage, multiple NAS servers, NAS repositories with data replicas in disk-per-node storage, and in-band and out-of-band SAN file systems are alternatives with a variety of advantages and disadvantages, but none clearly solves the problem for HPC cluster storage. From a capital-cost viewpoint, it is clear that scalable storage should be constructed from COTS components specialized to the storage function and linked into the cluster through multi-protocol conversion in the inter-processor-optimized cluster switch.

Object Storage Devices (OSD) are a new storage interface developed specifically for scaling shared storage to extraordinary levels of bandwidth and capacity without sacrificing reliability or simple, low-cost operations. Coupled with a COTS cluster implementation, OSD storage systems promise a complete solution for HPC clusters. The key properties of a storage object are its variable-length, ordered sequence of addressable bytes, its embedded management of data layout, its extensible attributes and its fine-grain, device-enforced access restrictions. These properties make objects closer to the widely understood UNIX inode abstraction than to block storage and allow direct, parallel access from client nodes under the firm but infrequently applied control of an out-of-band metadata server. Object Storage Architectures support single-system-image file systems with the traditional sharing and management features of NAS systems and the resource consolidation and scalable performance of SAN systems.

References

[Acharya98]

Acharya, A., Uysal, M., Saltz, J., "Active Disks," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.


[Anderson01]

Anderson, D., Chase, J., Vahdat, A., “Interposed Request Routing for Scalable Network Storage,” Fourth Symposium on Operating System Design and Implementation (OSDI), ACM 2001.

[Aguilera03]

Aguilera, M., Ji, M., Lillibridge, M., MacCormick, J., Oertli, E., Andersen, D., Burrows, M., Mann, T., Thekkath, C., "Block-Level Security for Network-Attached Disks," USENIX Conference on File and Storage Technologies (FAST), April 2003.

[Amiri00]

Amiri, K., Gibson, G.A., Golding, R., "Highly Concurrent Shared Storage," Int. Conf. on Distributed Computing Systems (ICDCS00), April 2000.

[Azagury02]

Azagury, A., Dreizin, V., Factor, M., Henis, E., Naor, D., Rinetzky, N., Satran, J., Tavory, A., Yerushalmi, L., "Towards an Object Store," IBM Storage Systems Technology Workshop, November 2002.

[Bright03]

Bright, J., Chandy, J., “A Scalable Architecture for Clustered Network Attached Storage,” Twentieth IEEE / Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.

[Brandt03]

Brandt, S., Xue, L., Miller, E., Long, D., "Efficient Metadata Management in Large Distributed File Systems," Twentieth IEEE / Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.

[Burns01]

Burns, R. C., Rees, R. M., Long, D. D. E., "An Analytical Study of Opportunistic Lease Renewal," Proc. of the 16th International Conference on Distributed Computing Systems (ICDCS), IEEE, 2001.

[Cabrera91]

Cabrera, L., Long, D., "Swift: Using Distributed Disk Striping to Provide High I/O Data Rates," Computing Systems 4:4, Fall 1991.

[EMC03]

"EMC Celerra HighRoad," 2003, http://www.emc.com/products/software/highroad.jsp.

[Fineberg99]

Fineberg, S. A., Mehra, P., “The Record-Breaking Terabyte Sort on a Compaq Cluster,” Proc. of the 3rd USENIX Windows NT Symposium, July 1999.

[Gibson98]

Gibson, G. A., et al., "A Cost-Effective, High-Bandwidth Storage Architecture," International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.

[IBM03]

"Tivoli SANergy," 2003, http://www.ibm.com/software/tivoli/products/sanergy/.

[Keeton98]

Keeton, K., Patterson, D. A. and Hellerstein, J. M., "A Case for Intelligent Disks (IDISKs)," SIGMOD Record 27 (3), August 1998.

[Knott03]

Knott, T., "Computing colossus," BP Frontiers magazine, Issue 6, April 2003, http://www.bp.com/frontiers.

[Lee96]

Lee, E., Thekkath, C., "Petal: Distributed Virtual Disks," ACM 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1996.

[Lustre03]

"Lustre: A Scalable, High Performance File System," Cluster File Systems, Inc., 2003, http://www.lustre.org/docs.html.

[McKusick84]

McKusick, M. K., et al., "A Fast File System for UNIX," ACM Transactions on Computer Systems vol. 2, August 1984.

[Miller02]

Miller, E. L., Freeman, W. E., Long, D. E., Reed, B. C., "Strong Security for Network-Attached Storage," USENIX Conference on File and Storage Technologies (FAST), 2002.


[Patterson88]

Patterson, D. A., Gibson, G. A., Katz, R. H., "A Case for Redundant Arrays of Inexpensive Disks (RAID)," Proceedings of the International Conference on Management of Data (SIGMOD), June 1988.

[Preslan99]

Preslan, K. W., O'Keefe, M. T., et al., "A 64-bit, Shared Disk File System for Linux," Proc. of the 16th IEEE Mass Storage Systems Symposium, 1999.

[Riedel98]

Riedel, E., Gibson, G., Faloutsos, C., "Active Storage for Large-Scale Data Mining and Multimedia," VLDB, August 1998.

[Rodeh02]

Rodeh, O., Schonfeld, U., Teperman, A., "zFS - A Scalable Distributed File System Using Object Disks," IBM Storage Systems Technology Workshop, November 2002.

[SGSRFP01]

SGS File System RFP, DOE NNSA and DOD NSA, April 25, 2001.

[Seitz02]

Seitz, Charles L., "Myrinet Technology Roadmap," Myrinet User's Group Conference, Vienna, Austria, May 2002, http://www.myri.com/news/02512/.

[Topspin360]

"Topspin 360 Switched Computing System," 2003, http://www.topspin.com/solutions/topspin360.html.

[VanMeter98]

Van Meter, R., Finn, G., Hotz, S., "VISA: Netstation's virtual Internet SCSI adapter," ACM 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), October 1998.

[Yasuda03]

Yasuda, Y., Kawamoto, S., Ebata, A., Okitsu, J., Higuchi, T., “The Concept and Evaluation of X-NAS: a Highly Scalable NAS System,” Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003.

