Lerna¹: An Active Storage Framework for Flexible Data Access and Management

S. V. Anastasiadis*, R. G. Wickremesinghe*, J. S. Chase*, J. S. Vitter**
[email protected], [email protected], [email protected], [email protected]

* Department of Computer Science, Duke University
** School of Science, Purdue University

Duke CS Technical Report TR-2003-7, May 6, 2003

¹ Lerna: Haunt of a gigantic monster with nine heads (Greek legend).
Abstract

In the present paper, we examine the problem of supporting application-specific computation within a network storage server. Our objectives are: i) to introduce an easy-to-use yet powerful architecture for executing both custom-developed and legacy applications close to the stored data; ii) to investigate the potential performance improvement in I/O-intensive processing; iii) to exploit the I/O-traffic information available within the file server for more effective resource management. One main difference from previous active storage research is our emphasis on the expressive power and usability of the storage server interface. We describe an extensible active storage framework that we built in order to demonstrate the feasibility of the proposed system design. Our conclusions are substantiated through experimentation with a popular multi-layer map warehouse application.

1 Introduction

Recently deployed high-speed connectivity across wide-area networks encourages collaboration and resource sharing among autonomous groups at an unprecedented scale. Scientific and commercial data are widely recognized as invaluable resources, and several new systems have been proposed to facilitate wide-area information sharing in a secure, flexible, and cost-effective way [6, 7, 10, 15]. On the other hand, even though network bandwidth is no longer scarce, its cost is not negligible either. When latencies due to the speed of light are also considered, it makes sense to move data processing close to the storage servers.

To address these issues, researchers usually resort to on-demand replication of entire files at the sites where computational resources exist for running the applications [9]. Alternatively, some studies advocate running application code within the storage devices themselves, capitalizing on anticipated technological advances in processing power and memory density [27]. In general, we believe that downloading entire files is impractical in several cases, due to the excessive storage space required, the long delays involved in bulky data transfers, and the consistency issues arising when multiple file copies are created. Also, the promising approach of migrating application software into the disk drives is not likely to become practical soon, due to the challenges involved in enhancing the currently simple disk interface to allow execution of application-specific code in a general, safe, and inexpensive way.

Nevertheless, data communication through files has traditionally served as a convenient way for applications to receive input and generate output that is easily accessible. Distributed file systems have generalized this model across different hosts without the need to change the file system interface. Of course, transfer of control to local or remote computation follows its own separate path, usually through local or remote procedure calls. The picture is blurred in database systems and web servers, which dynamically interpret data requests and return the selected content to the user. However, this functionality is only reachable through specialized access interfaces or query languages, and applications must be custom-developed to take advantage of them. In contrast, offering remote computation through the general file system interface would enable any application that communicates through files to make use of distributed services.

In this paper, we demonstrate with a prototype implementation that running entire applications within the network file server is both feasible and desirable, leading to significant performance advantages and resource management flexibility in comparison to other approaches. We show that the widely used file system interface can be functionally and semantically enriched to create an environment for invoking application code as a side effect of I/O requests and returning the generated data to the client. Additionally, a network file server proves to be a critical junction where semantically rich I/O traffic can be intercepted and statistically analyzed in order to offer important insight into the usage pattern of the server resources and to help improve their management.
The remainder of this paper is organized as follows. In Section 2 we provide additional motivation for the functionality that we introduce. In Section 3 we detail the architecture that we propose, and in Section 4 we explain the operation of a multi-layer map warehouse. In Section 5 we describe the experimentation environment and present our experimental results. Finally, in Section 6 we discuss our plans for future work, and in Sections 7 and 8 we summarize previous related work and outline our conclusions.

2 Motivation

Experience with building non-trivial I/O-intensive applications shows that a significant amount of programming effort is typically spent designing system interfaces for reading and transforming data. The application user often does not control the specific format in which data become available, and has to run commodity or custom-built software, locally or remotely, to provide the application with a desirable view of the data. Typically, such software provides flow control and filtering, or aggregation functionality between the data-producing source and the data-consuming application. If the data are made available at a remote site, extra effort may be required to copy the data locally and keep the local replica up to date.

For solving the general problem of remotely accessing heterogeneous data, several alternative solutions have been proposed. For instance:

• The database community advocates placing middleware software close to the data storage to unify accesses through some relational or object-oriented data model [6, 26].

• The scientific community has developed approaches for automating the replication of entire files between sites that produce and consume the data [9].

• The systems community previously described object-oriented distributed file systems that can attach data transformation methods to the stored data, and migrate the execution close to the client or server depending on the run-time environment conditions [2, 32].

Despite the elegance of the existing approaches, most likely none of them is suitable for solving the data transformation problem in general, and individual users are still more comfortable developing customized solutions. Reasons include preference for a particular programming environment, interface limitations of the applications involved, software availability on specific hardware platforms, and performance degradation as a result of general data transformations. Under such circumstances, we believe that application development would be facilitated considerably if the underlying system provided the necessary framework to allow efficient execution of application-specific processing close to the stored data, direct access to the generated data by both legacy and custom-built applications, and resource utilization tracking for informed data reorganization.

We strive to achieve these objectives, given the limitations mentioned above, by hiding the data transformation operators behind regular file access requests. When an access request to a particular file name fails, the server can intercept the failure, decode the name appropriately, and produce the requested content on the fly. Additionally, for every request we can record its type and the time it takes to complete. A privileged user can register methods in advance with the server to recognize specific types of access failures. Subsequently, the method and the parameters of a remote execution are encoded within the file name of the requested access. The approach that we propose is: i) intuitively straightforward, because data transformation operators are triggered by ordinary file accesses; ii) efficient, due to the software maturity of the file system access paths; and iii) extensible, because failure interception modules can be added dynamically to the system. A client-side sketch of this naming convention appears at the end of this section.

Despite the similarity of the proposed functionality to that of dynamic-content web servers, there is a main difference between the two. The difference arises from the inherent interface simplicity of a network file server, which makes it appear as a local file system to the applications. As a result, user authentication and access authorization are simplified, while data transfers can be done efficiently. It is exactly this advantage of file servers that we are trying to preserve in our approach, so that the transformation and transfer of remote data can be both cost-effective for the applications and backwards compatible with legacy software. Nevertheless, several critical questions remain open: i) what is the system overhead involved in file access demultiplexing and data processing initiation; ii) how to provide to both the client and the stateless server a common view of the file system structure; iii) what type and amount of I/O traffic information makes sense to gather on the server in order to improve the data organization. It is the objective of the present paper to quantitatively examine these issues and clarify the resource tradeoffs involved.
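To make the naming convention concrete, the following is a minimal client-side sketch in which a command line is URL-encoded into a file name under an assumed /lerna mount point. The run/ prefix, the mount point, and the example command are our illustrations, not the exact encoding used by the prototype.

    // Client-side invocation sketch (hypothetical naming scheme).
    // Reading a nonexistent, URL-encoded file name triggers the
    // registered service on the server; the file contents returned
    // are the standard output of that service.
    #include <cctype>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Percent-encode a command line into a legal file name (RFC 1738).
    static std::string urlEncode(const std::string &s) {
        static const char hex[] = "0123456789ABCDEF";
        std::string out;
        for (unsigned char c : s) {
            if (isalnum(c) || c == '.' || c == '-' || c == '_') {
                out += c;
            } else {
                out += '%';
                out += hex[c >> 4];
                out += hex[c & 0xF];
            }
        }
        return out;
    }

    int main() {
        // "/lerna" stands for the client mount point of the server.
        std::string cmd = "mapserv -m great_lakes.map -l roads,counties";
        std::string path = "/lerna/run/" + urlEncode(cmd);

        std::ifstream result(path);   // lookup fails, server intercepts,
        if (result)                   // runs the service, returns output
            std::cout << result.rdbuf();
        return 0;
    }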
3 The Lerna System Architecture

[Figure 1: An active storage system consists of one or more nodes where data access and processing take place. Enhanced file system accesses can be used both for the communication among different nodes within the server and for data transfers to the outside world. The figure shows processing nodes and storage nodes of the active storage system connected over a WAN and a SAN.]

In this section we outline the general features of user-level file servers, and describe how we can extend their functionality to allow seamless remote application invocation through file accesses.

3.1 User-level Network File Servers

Even though, for performance and protection reasons, file systems typically run as part of the system kernel, there has been recent interest in implementing them in user space. Research in the context of the Direct Access File System (DAFS) has shown that the combination of remote direct memory access with user-level networking can lead to data throughput that saturates the underlying network links [23]. The user-level implementation allows data caching with lower overhead and block replacement that can be easily customized. Client nodes register memory regions with the local kernel before requesting direct I/O, while the server nodes control both the actual data transfers and the related buffer management.

Alternatively, the SFS toolkit proposed by Mazières [25] uses a user-level library to provide secure asynchronous communication with existing file systems via the Sun Network File System (NFS). Applications on the client side submit file access requests to the local NFS client. From there the requests are passed to a user-level client daemon and over the network to a user-level server daemon that communicates directly with the NFS server. The task of the NFS server is to translate the received requests into accesses of local file systems. The SFS toolkit makes it relatively easy to add stackable layers that modify the NFS server behavior. For instance, Mazières describes layers that fix bugs of the system kernel, support multiple mount points over a single server socket, or add data encryption to a local file system.

Given the above well-documented advantages of user-level file systems, we decided to explore another potential use which has not been studied previously: that of supporting application-specific computation. User-level file systems run outside the kernel and thus make the control transfer to application code straightforward. In addition, they have access to important information about user credentials, which can be used to offer basic protection against unauthorized file object access. More generally, the interface semantics of file systems retain important object attributes about the I/O traffic, which makes it easy to account for the access patterns of both individual objects and entire file systems. In combination with user-level extensibility, it is possible to infer resource usage in a configurable way, which can be used for reorganizing the data in the system.

In the rest of the present paper, we describe our experience with offering application-specific computation and resource usage accounting over a user-level file system. The implementation that we describe is based on extensions that we made to the publicly available source code of the SFS toolkit. One main reason for using SFS was the secure communication channel that it provides between the server and the client [19]. Even though our current prototype is limited to modifying the behavior of read accesses, and to measuring the number and server-side delay of read and write requests, we believe that other operations can be intercepted and enhanced in similar ways.

3.2 Internal System Structure

An active storage framework combines two different types of nodes in a multi-tier organization (Figure 1). Back-end storage nodes access data directly from storage devices and run application-specific data filtering or transformation tasks. Front-end processing nodes talk directly to clients, initiate data access requests to the storage nodes, and combine the results for final presentation to the clients.
[Figure 2: Lerna node structure. Each node of the Lerna active storage server consists of application-specific computation (ASC) modules and resource management modules running on top of a local user-level file system.]

[Figure 3: File access requests are intercepted on the server side either for resource accounting or for the dynamic creation of missing files. Lerna can access the local data either directly or through the user-level file system.]
Regardless of their different roles, both types of nodes can have the same general multi-layer structure. A user-level file system passes data access requests to vertically stacked resource management layers. At the top, we horizontally deploy application-specific computation (ASC) modules (Figure 2). To simplify the presentation, from now on we assume that processing and storage node functionality runs on the same physical host, unless otherwise specified.

Clients mount the active storage server as a remote file system (Figure 3). Access requests to existing files and directories appear to the client as normal file system accesses. However, every request can potentially be intercepted by the resource management layer in Lerna for resource accounting or scheduling. Application-specific computation is triggered by accesses to nonexistent files. The file name can pack the identifier and the parameters of the requested service, in ways similar to the naming conventions of uniform resource locators for dynamic web pages or of virtual directories [17]. In our prototype, we used the official URL-encoding format [8] to transform arbitrary command-line statements into permissible file name strings. Based on information encoded within the file name, the top resource-management layer demultiplexes the failed request to the invocation of an ASC module registered at the top.

Application Demultiplexing. When a name lookup fails, the demultiplexing layer of Lerna decodes the file name into the service identifier and parameters that will be used to create the missing file (Figure 2). The file is created according to the access permissions of the requesting user, and its file descriptor becomes the standard output of a new process that is spawned to run the requested service. Also, a callback is registered to be invoked when the service process exits. The callback resubmits the original lookup request to the local NFS server, and passes the returned file handle to the NFS client. At this point, the client can proceed with a read request to the NFS server, and the returned data block is delivered to the application. A blocked read call returns to the application as if the accessed file had been there before the request. This is the control sequence that we actually implemented and about which we report performance measurements later on. In our prototype, we left file writing unmodified, so that written data are passed verbatim to the specified file.
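The following is a minimal sketch of the spawning step described above, assuming POSIX process primitives. The function runService and its arguments are our illustration; the prototype instead drives this sequence from the SFS event loop, whose details are omitted here.

    // Server-side demultiplexing sketch: redirect the service's
    // standard output into the freshly created file; the caller
    // resubmits the original NFS lookup once the child exits.
    #include <string>
    #include <vector>
    #include <fcntl.h>
    #include <unistd.h>

    pid_t runService(const std::string &path,
                     std::vector<char *> &argv,  // null-terminated
                     uid_t uid, gid_t gid) {
        int fd = open(path.c_str(), O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        pid_t pid = fork();
        if (pid == 0) {                  // child: become the service
            if (setgid(gid) != 0 || setuid(uid) != 0)
                _exit(126);              // adopt the requester's credentials
            dup2(fd, STDOUT_FILENO);     // stdout -> the missing file
            close(fd);
            execvp(argv[0], argv.data());
            _exit(127);                  // exec failed
        }
        close(fd);
        return pid;    // parent registers an exit callback (e.g., waitpid)
    }

A caller could simply waitpid() on the returned process identifier before resubmitting the lookup; the actual prototype receives the exit notification asynchronously.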
Resource Usage Accounting. In our current implementation, we keep track of the number and the response time of individual block read and write requests. The total number of read and write requests, along with their average response time, is reported for each accessed file, categorized by file system. This information makes it evident which files are responsible for most of the system load, and can be used for distributing files across different storage devices to improve the system throughput. When application computation runs locally on the server, file data can be accessed directly for improved performance, or through the user-level file system for detailed resource accountability. We examine this tradeoff in our performance evaluation study.
Directory Tree Visibility. The client can only see remote file systems explicitly exported by servers, while each server receives requests for files identified through their NFS handles rather than their complete paths. When a nonexistent file has to be created, the server needs to reconstruct the entire file path and pass it on to the service application involved. One way to do that is by keeping track of lookup requests for individual path components and their position in the directory tree. Every time a new edge of the directory tree is successfully looked up, a new record can be created that associates the file handle with its name and a pointer to its parent. Thus, it is possible to gradually build a data structure on the server for looking up file handles and returning file path names. Alternatively, the path of the current working directory has to be explicitly passed to the server with each user request. A sketch of the former approach follows.
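This sketch assumes that every successful NFS lookup reply is observed by the server layer; the structures and names are our illustration, not the prototype's data structures.

    // Path reconstruction from observed lookups: child handle ->
    // (component name, parent handle).
    #include <string>
    #include <unordered_map>

    struct DirEntry {
        std::string name;     // last path component
        std::string parent;   // file handle of the parent directory
    };

    std::unordered_map<std::string, DirEntry> dirTree;

    // Record an edge of the directory tree on each successful lookup.
    void noteLookup(const std::string &parentFh,
                    const std::string &name,
                    const std::string &childFh) {
        dirTree[childFh] = DirEntry{name, parentFh};
    }

    // Walk parent pointers back to an exported root to rebuild the path.
    std::string pathOf(std::string fh) {
        std::string path;
        for (auto it = dirTree.find(fh); it != dirTree.end();
             it = dirTree.find(it->second.parent))
            path = "/" + it->second.name + path;
        return path;   // empty if the handle was never looked up
    }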
Internode Communication. In our execution model, an application can receive data input through files specified as parameters during its invocation. The invoking client places the input files in the file system already mounted from the server. When the application is called on the server, it stores the output in the file that the client requested to read. In a multi-tier organization, computation initiated in a processing node is likely to request files generated by storage nodes. In that case, the above access model can be generalized to also cover communication among different nodes of the server. Currently, a read access returns control to the client when the entire requested file becomes available. Our model can also handle the case (currently not implemented) where file blocks are generated in a streaming fashion, as a result of sequential access requests from the client. For this feature to work, the service application should support flow control in producing file blocks, which generally implies keeping state in the server across different client requests.

Quality of Service and Buffer Caching. Here, we briefly mention additional possibilities for resource management layers that we do not study further in the present paper but consider topics of future work. The utilization information accumulated in resource management modules can be used for making advanced decisions about the rate of forwarding requests to other parts of the system. Such forwarding can be based, for example, on the identifier of the user requesting the service or on the particular application invoked, and can be done selectively according to previously registered quality-of-service objectives. Another type of resource management module could organize the server buffer space to hold data generated by application modules according to user-specified parameters. Specialized block replacement policies can be used according to the particular workloads that the server is expected to handle. Blocking or asynchronous semantics can be offered to the clients for accessing files, depending on the desirable degree of flexibility and efficiency.

4 A Multi-Layer Map Warehouse

Online generation of geographical, astronomical, or biomedical maps can consume a large amount of processing and bandwidth resources, depending on the size of the datasets involved [5, 10, 30]. In the past, the problem of handling terabyte-sized datasets has been kept manageable by i) limiting the accessible number of layers to one, and ii) producing raster images of all the supported resolutions offline, so that map rendering is reduced to data retrieval. In that sense, large-scale management of multi-layer map warehouses remains an interesting and challenging problem with significant applications to both basic scientific research and everyday life. Offline preparation of multi-resolution images solves the problem only partially, because it can become impractical with frequently changing datasets, or when multiple layers of information can be selectively requested, due to the substantial online processing involved.

Mapserver is an open-source application developed in a NASA-sponsored project to provide configurable web access to geographical data in the form of map images [24]. Although not a fully-fledged geographical information system, it has sufficient functionality to integrate several different types of data, including raster or vector images and relational tables from commercial database systems. The Mapserver library can transform map configuration text files into raster images that render the specified data layers within a particular geographical region. The configuration files can be generated either manually or through a graphical user interface. The files corresponding to each data layer are accessed sequentially, and the geometrical objects that fall within a specified geographical region are selected and rendered in an internal raster image format. The objects from each layer are superimposed on the internal raster image in the order in which the layers appear in the configuration file. Depending on the amount of detail it contains, each data layer can occupy an arbitrary amount of storage space, reaching several gigabytes. Disk accesses of layer data can be reduced by splitting each layer into tiles and creating index structures that speed up the spatial searches. When the object rendering is complete, the image is transformed into a supported format specified by the user.
We used nine geographical datasets made publicly available by the U.S. Geological Survey (Table 1). Among them is a topographic map of North America in colormapped GeoTIFF raster format occupying 74 MB. It covers the region with longitude and latitude ranges [+166E,-4W] and [15N,83N] degrees, respectively. The remaining eight datasets contain U.S. geographical information in ESRI Shapefile vector format and have sizes between 7 MB and 75 MB. They span a region with approximate longitude and latitude ranges [+170E,-64W] and [17N,72N] degrees (as reported in the file header). Projection to a common system of coordinates (e.g., using the Lambert Azimuthal Equal Area method) allows aligned merging of different layers into a single map.

Table 1:

Data Layer Description          Format      Size (KB)
States                          Shapefile       6,750
Cities and Towns                Shapefile       8,832
County Boundaries               Shapefile      13,346
Hydrologic Unit Boundaries      Shapefile      20,132
Roads                           Shapefile      22,624
Federal Lands/Indian Reserv.    Shapefile      56,288
Streams and Waterbodies         Shapefile      46,908
Shaded Relief of N. America     GeoTIFF        73,794
Public Land Survey System       Shapefile      74,448
Total                                         323,122

These datasets contain U.S. geographical information and are made publicly available by the U.S. Geological Survey (Reston, VA). Their total storage footprint of 323 MB is large enough to create an I/O-intensive execution environment on a system with 256 MB of real memory.

[Figure 4: Example of a four-layer map of the Great Lakes region, generated from the following datasets: i) shaded relief, ii) roads, iii) state limits, and iv) county boundaries.]

In this paper, we use Mapserver as a testbed to study the performance advantage of processing proximity to the stored data, and to investigate the application and system structure implications of invoking computation through file accesses. Our experimentation is limited to servers consisting of a single node that offers both local access to secondary storage and processing power for image rendering. Our expectation is that a good understanding of the client communication interactions with single-node servers will provide important experience towards building more complex servers consisting of multiple tiers.
5 Performance Evaluation
In our performance evaluation study, we consider the system throughput and response time when creating maps with Mapserver by: i) running the application on the client and accessing the data from a local file system (Local); ii) running the application on the client and getting the data from an NFS-mounted remote file system (NFS); iii) invoking the application through Lerna on a remote file server that provides local access to the datasets (Lerna). We report how the system resource utilization is affected by system parameters related to the intensity of the workload and to the features of the communication system.

5.1 Experimental Environment

All the nodes that we used are Dell PowerEdge 4400 servers with 733 MHz Intel Xeon processors and 256 MB main memory, running FreeBSD 4.5. Each node is equipped with 100 Mbit/s and 1 Gbit/s network cards, and is connected to multiple Seagate Cheetah 10 KRPM 18.2 GB SCSI disks over two 160 MB/s channels. Each disk has a nominal formatted internal transfer rate in the range 26-40 MB/s, but in practice the individual disk throughput drops to about 12 MB/s when head seeks are involved. We control the round-trip delay of the client/server path through Dummynet [28] running on the server, while we leave the network bandwidth fixed at 100 Mbit/s.

The queried images are requested in compressed PNG format, have 600x300 pixels, and occupy a map space of 20x10 degrees. Each query involves generating an image map according to a text configuration and scanning the returned image on the client side. The total storage footprint of the dataset is 323 MB, which exceeds the 256 MB of real memory in the server. The queries are uniformly distributed over the entire region of the North America raster, and have a configurable number of layers specified in the individual experiments. Independent runs of each experiment are repeated until the half-length of the 95% confidence interval of the response time lies within 5% of the estimated mean value. Each run lasts 10 minutes, and the measurements of the first three minutes are discarded. Map requests arrive at the system following a Poisson process.

5.2 Baseline Measurements

We begin our performance experimentation with the baseline case of round-trip latency less than 1 ms between the client and the server. We experiment with two-layer map queries, where one layer is always the raster image of North America, while the other is uniformly chosen among the remaining eight datasets.
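The following sketch illustrates the arrival generation and the stopping rule stated above. It is our illustration of the methodology rather than the actual experiment harness, and it assumes a normal approximation for the 95% confidence interval.

    // Poisson request arrivals and the confidence-interval stopping
    // rule across independent runs (illustrative structure only).
    #include <cmath>
    #include <random>
    #include <vector>

    // Exponential inter-arrival times yield a Poisson arrival process.
    double nextInterarrival(std::mt19937 &rng, double ratePerSec) {
        std::exponential_distribution<double> exp(ratePerSec);
        return exp(rng);
    }

    // Repeat runs until the 95% CI half-length is within 5% of the mean.
    bool enoughRuns(const std::vector<double> &meanResponsePerRun) {
        size_t n = meanResponsePerRun.size();
        if (n < 2) return false;
        double sum = 0, sq = 0;
        for (double x : meanResponsePerRun) { sum += x; sq += x * x; }
        double mean = sum / n;
        double var = (sq - n * mean * mean) / (n - 1);
        double half = 1.96 * std::sqrt(var / n);  // normal approximation
        return half <= 0.05 * mean;
    }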
[Figure 5: Baseline performance, response time (s) versus request rate (maps/s) for Local, Lerna, and NFS. Response time becomes longer as the request arrival rate increases. We assume less than 1 ms of round-trip delay in the client/server path for Lerna and NFS. In the Local case, data are accessed directly from a disk attached to the client.]

[Figure 6: Baseline performance, throughput (maps/s) versus request rate (maps/s). The Local case achieves higher throughput across different request rates in comparison to Lerna and NFS, but the difference is not substantial even at high loads.]

[Figure 7: Baseline performance, CPU utilization (%) versus request rate (maps/s). The processor utilization of the client host is relatively higher for NFS, in comparison to the client processor utilization of the Local case and the server processor utilization of the Lerna case. The difference can be attributed to the network and protocol overhead involved in every remote block transfer with NFS.]

[Figure 8: Baseline performance, bandwidth (MB/s) versus request rate (maps/s). The disk bandwidth in the Local case becomes flat at increasing request rates, unlike the disk bandwidth of Lerna, which increases to higher than 8 MB/s. For NFS we show the output network bandwidth from the server to the client, which grows linearly with the system load. The somewhat higher throughput of the Local case reduces the number of jobs concurrently processed in the system, leading to improved locality in the data accesses and better caching behavior overall.]
In this section, we investigate how the response time and the throughput of the system are affected by the system load, and we try to explain our observations by measuring the processor utilization and the disk or network bandwidth. In order to keep the system behavior reasonable, we limit the request rate so as to avoid saturating the system.

From Figure 5 we notice that the average response time of generating a map and scanning it is highest for the NFS case, lowest for the Local case, and between the two for the Lerna case. This is not unreasonable, given the network and file server protocol processing involved in data accesses and image transfers with NFS and Lerna.

At low or moderate loads the throughput remains the same across the three cases. At high loads, the system throughput is higher in the Local case, with Lerna following closely and NFS distinctly beneath the other two, as shown in Figure 6.

Taking a closer look at the processor utilizations in Figure 7, we verify that the NFS client is almost saturated at an injected rate of 4 requests/s. Instead, the measured processor utilization on the Lerna server is close to that of the Local case. Part of the Lerna server processing is due to protocol transformations for the SFS toolkit, but during our experiments we observed the SFS server daemon to occupy 2-3% of the processor utilization. The file server overhead is incurred during transfers of complete maps to the clients, rather than during every data block transfer, as would be the case with plain NFS.
Quite interestingly, the disk bandwidth on the server flattens out at high loads in the Local case (Figure 8), while it increases almost linearly with the load when using Lerna. In fact, our measurements show that the number of concurrently processed requests in the Local case is significantly lower than with Lerna and NFS. This indicates that only a few datasets are needed in memory at the same time with Local processing, which translates into less contention for buffer space in the server and fewer disk requests overall. For NFS, things are more complicated, due to the data caching on both the server and the client side. As a result, we show the network bandwidth from the server to the client rather than the server disk bandwidth, because we consider the network bandwidth a more representative measure of the actual data traffic in the system in that case. As we see, the utilized network bandwidth of NFS is lower than the disk bandwidth in Lerna, which can be explained by the processor saturation with NFS.

Overall, we note that at minimal round-trip delay in the client/server path, Lerna demonstrates behavior that is similar in several respects to the Local case, unlike NFS, which is penalized by the overhead involved in individual data block transfers. The advantage of NFS in terms of double aggregate caching space across the server and the client does not seem to create a measurable benefit in our experiments.

5.3 Effects of Server Proximity

[Figure 9: Response time (s) versus round-trip delay (ms) for Local, Lerna, and NFS. Increasing the round-trip delay of the client/server path has a detrimental effect on the performance of a remotely mounted file system: in our experiments, the response time with NFS surges when the delay is set to 10 ms and 100 ms. On the contrary, Lerna remains relatively insensitive to the distance from the server, since only the output of the computation is directly accessed by the client.]

[Figure 10: Bandwidth (MB/s) versus round-trip delay (ms). Due to the synchronous I/O accesses, the utilized network bandwidth of NFS drops as the round-trip delays from the server become longer. This phenomenon is much less prevalent with Lerna, because only the relatively small output of the computation is scanned by the client, rather than the entire collection of datasets.]

Over a wide-area network, accessing large amounts of data directly from a remote file system might not be the best approach to processing, either because fine-granularity accesses can be affected by long latencies, or because local caching of gigabytes occupies valuable resources. Here, we show that running application computation remotely and close to the data is both feasible and beneficial, especially as the propagation delay of the client-server path increases beyond a few milliseconds. In Figure 9, we illustrate the response time for generating two-layer maps when the processing takes place either on a remote server through Lerna, or locally with the dataset accessed from a remote server via NFS. The Local case is added for comparison, although when client and server run on the same host, the round-trip delay obviously becomes irrelevant.

The most striking observation from that figure is the dramatic increase of the response time during NFS-based processing at delays of 100 ms. Such delays can be typical in research collaboration networks spanning multiple continents. The main reason for this behavior is the fact that most applications have not been designed for such environments, and therefore do not use asynchronous I/O to improve concurrency between communication and computation. Under such restrictions, Lerna is shown to perform much better, actually quite close to the Local case. The minor discrepancy between these two methods can be attributed to the scanning of the computation output that is unavoidably done by the client. Consistently with this explanation, the network throughput of NFS drops drastically with longer delays, as shown in Figure 10. In contrast, the disk bandwidth utilized by Lerna remains relatively insensitive to the delay increase.
5.4 Accounting and Data Reorganization

In this section, we examine the possibility of gathering accounting information related to the usage of the datasets served by Lerna. Based on the statistics that we accumulate when our entire dataset collection is stored on a single disk, we subsequently split our data across two separate disks, and measure the average utilized disk bandwidth on each of them. Our goal is to achieve load balancing within the storage system, in order to improve the response time.

Table 2:

Dataset        Datafile         Requests    ms/IO
Cities         citiesx020.shp      21,675   5.52376
States         statesp020.shp      60,554   7.71550
Counties       countyp020.shp     215,702   5.87547
Hydrologic     hucs00p020.shp     325,945   5.70801
Federal Lands  fedlanp020.shp     342,785   7.80292
Shaded Relief  shdrlfi020.tif     420,188   8.36483
Public Lands   plss00p020.shp     438,107   7.89520
Streams        hydrogp020.shp     486,331   8.97421
Other          files              258,795   N/A
Total          all              2,570,082   7.49016

These are statistics of file usage gathered during a two-hour system operation at a rate of 0.5 requests/s. We change the random-number generator seed every ten minutes. Each I/O request corresponds to an 8 KB data block. Based on the information shown, we moved the datasets related to the Public Land Survey System (plss00p020), the Streams and Waterbodies (hydrogp020), and the Counties (countyp020) to a separate disk.

In Table 2, we show the number of block read accesses across the nine main datafiles, recorded during an operation period of two hours at a rate of 0.5 requests/s. From Figure 8, we already know that the measured disk bandwidth exceeded 8 MB/s at high loads. This is relatively high in comparison to the maximum 12 MB/s that we expect from each disk when head seeks are involved in the accesses. Based on such evidence, we decided to move three of the datasets to a different disk: i) the public land survey, ii) the streams and waterbodies, and iii) the county boundaries. Together these three account for 1,140,140 of the 2,311,287 requests to the named datafiles, about 49%, so we anticipate the number of disk requests to be split roughly equally between the two disks. We ran experiments at rates of 1 and 4 requests/s before and after the data reorganization. As we notice in Figure 11, the response time at high load drops by more than 30%. More importantly, an almost equal amount of disk bandwidth is measured across the two data disks after the reorganization. We believe that instrumentation functionality should be a standard option in modern file systems. The valuable feedback that it provides can be critical for the maintenance and tuning of the system, whether that is done by professional administrators or automatically by system management software.

[Figure 11: Response time (s) versus request rate (maps/s), original versus reorganized layout. By reorganizing our datasets according to the statistics from Table 2, it was possible to reduce the response time at high load by about 30%.]

[Figure 12: Disk bandwidth (MB/s) versus request rate (maps/s) for the two disks. The statistics that we gathered in Table 2 correctly guided us in moving three datasets onto a different disk. After the move, the measured bandwidth is equally split across the two disks in the reorganized case.]

In our current implementation, we maintain all the file system statistics in a hash table in memory. Several optimizations are possible in terms of summarizing the accumulated information or storing the statistics on the disks.
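As an illustration of how such recorded statistics can drive the reorganization, the following sketch keeps per-file counters in a hash table and greedily assigns the busiest files to the emptier disk. The structures and names are ours, not the prototype's.

    // In-memory accounting table and a greedy two-disk split.
    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct FileStats {
        uint64_t reads = 0, writes = 0;
        double totalLatencyMs = 0;   // summed over 8 KB block requests
        double msPerIO() const {
            uint64_t n = reads + writes;
            return n ? totalLatencyMs / n : 0;
        }
    };

    // Keyed by a file identifier (e.g., the NFS file handle).
    std::unordered_map<std::string, FileStats> statsTable;

    void recordRead(const std::string &fh, double latencyMs) {
        FileStats &s = statsTable[fh];
        s.reads++;
        s.totalLatencyMs += latencyMs;
    }

    // Busiest files first, each placed on the currently emptier disk,
    // approximating an even split of the measured request load.
    std::vector<std::string> pickFilesForSecondDisk() {
        std::vector<std::pair<uint64_t, std::string>> load;
        for (auto &e : statsTable)
            load.push_back({e.second.reads + e.second.writes, e.first});
        std::sort(load.rbegin(), load.rend());   // descending by load
        uint64_t disk1 = 0, disk2 = 0;
        std::vector<std::string> moved;
        for (auto &f : load) {
            if (disk2 < disk1) { disk2 += f.first; moved.push_back(f.second); }
            else               { disk1 += f.first; }
        }
        return moved;
    }

Applied to the request counts of Table 2, a split of this kind selects a subset carrying close to half of the measured load, consistent with the roughly equal per-disk bandwidth observed after our reorganization.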
6 Discussion and Future Work

Lerna has been built upon the SFS toolkit by adding no more than 1000 lines of C++ code. Our experience with the system has been very positive in terms of robustness and stability during our experiments. In our future work,
we plan to investigate the streaming mode of operation, which will enable additional flexibility through partial access of remote files with reduced response time. In other words, clients should be able to initiate remote operations whose output gradually becomes available as processing proceeds. Another issue is the structure of distributed processing across multiple storage nodes that replicate the Lerna internal organization introduced here. A critical question that deserves further study is that of security. In our present implementation, the user-initiated computation is limited through the permissions assigned to the user by the file access request. It is important to examine in detail the authorization implications of this, and also how resources can be utilized by different users according to predefined agreements.
7 Related Work

In previous work, Wickremesinghe et al. introduced the notion of computation-enabled storage nodes called Active Storage Units (ASU) [33]. Their architecture allows processing capacity to scale with storage space, and data movement to be reduced by running data searching, filtering, or transcoding operations directly on the ASU. They also introduce a programming model based on streaming primitives for flexible resource management across heterogeneous environments. The present paper expands that work by examining the related system interfacing and performance evaluation issues.

Active Disks. In network-attached storage, Gibson et al. demonstrate improved file system scalability by transferring data directly from storage devices to clients [16]. The network server load is reduced by splitting the file system software between the file manager and the network-attached drives. Also, several studies propose active disk systems that run application-level code on disk drives in order to reduce data traffic and improve parallelism [1, 20, 27]. Related experiments show significant speedup with data mining and image processing applications. However, all these studies are fundamentally different from ours, because they advocate running file system or application code directly on the disk drive.

The performance and services of file server operations can be enhanced through a remote procedure call framework that allows running scripts securely on the server [29]. Information about the storage system can be exported to the file system running above it in order to improve parallelism and fault tolerance in the disk accesses [14]. These approaches differ from ours because they mainly focus on the enhancement of the file system itself rather than the facilitation of application code execution close to the server. In clusters of clients and servers, run-time support can be provided that keeps track of resource utilization across the system [2]. Application and file system functions can then be partitioned and automatically placed on the server or the client according to dynamically changing conditions of application behavior and resource availability.

Virtual file views extend traditional data files with application processing carried out at the storage device level [22]. Virtual disks are attached to block address ranges outside the physical storage space of existing devices. Subsequently, services associated with virtual files are bound to the block address space of the virtual disks. Even though application execution close to stored data is demonstrated to offer performance advantages, we find the direct attachment of virtual files to arbitrary virtual disk address spaces incompatible with the traditional file system interface, where device block allocation is transparent to the user. More importantly, the proposed virtual file model limits application processing to individual files, rather than allowing a service to access multiple files.

Data Grid. The Data Grid architecture specifies several design principles for managing large data sets potentially distributed over a wide area [13]. It defines the file as the basic storage unit, and introduces interfaces for accessing operations on data files and their metadata. Nest is a storage system specifically designed for Grid environments [7]. It supports multiple different data transfer protocols, models of concurrency, and request scheduling policies that can adapt to varying operating conditions. The Chimera architecture is another system for managing Grid data [15]. It includes relational data schemas for representing the entities involved in data derivations, and a data language for expressing the data derivations themselves. Experiments with physics simulation and astronomical imaging demonstrate the usability of the system in managing complex dependencies between application invocations. DataCutter is a middleware system that enables handling of spatial range queries and data filtering operations close to a multidimensional dataset archive [10].

Bester et al. proposed the GASS service for remote data accesses [9]. Local reference counts automate the replication of a remote file into the local storage space, or the flushing of the local copy. LegionFS is an object-oriented file system that offers location-independent object naming, secure object access, system load distribution across multiple resources, and service extensibility [32]. A collection of system-wide metadata allows adaptation of the system to dynamically changing conditions. The Storage Resource Broker middleware provides unified attribute-based access to data and metadata that are distributed across file systems, databases, and archives [6]. A centralized catalog allows accessing the metadata through a standard relational schema. The SDSS Science Archive provides access to astronomical object information via a three-tier architecture consisting of a user interface, an intelligent query engine, and a data warehouse [30]. A diverse set of I/O-intensive queries demonstrates several challenges involved in scanning, classifying, or processing the data in a stream-like fashion using scalable cluster-based infrastructure.

Extensible File Systems. Heidemann and Popek proposed stackable filing as a layered structure for constructing file system services [18]. Clean interfaces between different layers can lead to better code reusability and extensibility, while data and metadata caching coherence can improve system operation correctness and performance. Development of stackable file systems can be facilitated through the FiST language [34]. It allows high-level description of file system features, and automates the generation of the corresponding kernel modules across several operating systems. General data manipulation is also possible through extra layers that can be statically embedded within the file system stack. In the Hurricane File System, complex file structures can be composed using simple building blocks [21]. Even though all the above approaches potentially ease the development and portability of file system services, they are limited to the file system itself, and do not deal with efficient support of application code execution.

External Memory I/O and Database Middleware. In the past, several middleware systems have been proposed by the database research community to allow integrated data access across multiple types of data repositories through a unified object-oriented data model [12, 26]. The BeSS system is a database storage manager providing extensibility, database model configurability, and support for local or remote application execution [11]. Instead, we focus on specifying the systems infrastructure necessary for facilitating remote access to data, rather than enforcing a particular data model. I/O-efficient algorithms are designed for processing large datasets by minimizing the number of data block transfers between primary and secondary storage [31]. The performance of data-intensive applications can be improved by taking performance heterogeneities into account when designing their programming environment. A combination of distributed data queues and redundancy in the layout and access of disks has been shown to achieve robustness against system performance variations [3, 4].

8 Conclusions

In the present paper, we introduced an innovative approach for running application-specific computation close to stored data. We preserve the traditionally simple interface of file system accesses, and experimentally demonstrate significant performance advantages in comparison to alternative remote data access methods.

9 Acknowledgments

We are thankful to Christoph Spoerri, who provided useful feedback on current geographical information systems technology.

References

[1] Acharya, A., Uysal, M., and Saltz, J. Active Disks: Programming Model, Algorithms and Evaluation. In ACM ASPLOS Conference (San Jose, CA, Oct. 1998), pp. 81-91.

[2] Amiri, K., Petrou, D., Ganger, G. R., and Gibson, G. A. Dynamic Function Placement for Data-intensive Cluster Computing. In USENIX Annual Technical Conference (San Diego, CA, June 2000), pp. 307-322.

[3] Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., Culler, D. E., Hellerstein, J. M., and Patterson, D. High-Performance Sorting on Networks of Workstations. In ACM SIGMOD (Tucson, AZ, May 1997), pp. 243-254.

[4] Arpaci-Dusseau, R. H., Anderson, E., Treuhaft, N., Culler, D. E., Hellerstein, J. M., Patterson, D., and Yelick, K. Cluster I/O with River: Making the Fast Case Common. In ACM Workshop on I/O in Parallel and Distributed Systems (Atlanta, GA, May 1999), pp. 10-22.
[5] Barclay, T., Gray, J., and Slutz, D. Microsoft TerraServer: A Spatial Data Warehouse. In ACM SIGMOD (Dallas, TX, May 2000), pp. 307-318.

[6] Baru, C., Moore, R., Rajasekar, A., and Wan, M. The SDSC Storage Resource Broker. In IBM CASCON (Toronto, ON, Nov. 1998).

[7] Bent, J., Venkataramani, V., Leroy, N., Roy, A., Stanley, J., Arpaci-Dusseau, A. C., Arpaci-Dusseau, R. H., and Livny, M. Flexibility, Manageability, and Performance in a Grid Storage Appliance. In IEEE Intl Symp on High Performance Distributed Computing (Edinburgh, Scotland, July 2002), pp. 3-12.

[8] Berners-Lee, T., Masinter, L., and McCahill, M. Uniform Resource Locators (URL). Request for Comments 1738, Network Working Group, Dec. 1994.

[9] Bester, J., Foster, I., Kesselman, C., Tedesco, J., and Tuecke, S. GASS: A Data Movement and Access Service for Wide Area Computing Systems. In ACM Workshop on I/O in Parallel and Distributed Systems (Atlanta, GA, May 1999), pp. 78-88.

[10] Beynon, M., Ferreira, R., Kurc, T., Sussman, A., and Saltz, J. DataCutter: Middleware for Filtering Very Large Scientific Datasets on Archival Storage Systems. In IEEE Symposium on Mass Storage Systems (College Park, MD, Mar. 2000), pp. 119-133.

[11] Biliris, A., and Panagos, E. A High Performance Configurable Storage Manager. In IEEE Intl Conference on Data Engineering (Taipei, Taiwan, Mar. 1995), pp. 35-43.

[12] Carey, M. J., Haas, L. M., Schwarz, P. M., Arya, M., Cody, W. F., Fagin, R., Flickner, M., Luniewski, A. W., Niblack, W., Petkovic, D., Thomas II, J., Williams, J. H., and Wimmers, E. L. Towards Heterogeneous Multimedia Information Systems: The Garlic Approach. In IEEE RIDE-DOM (Mar. 1995), pp. 124-131.

[13] Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., and Tuecke, S. The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets. Journal of Network and Computer Applications 23 (2001), 187-200.

[14] Denehy, T. E., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Bridging the Information Gap in Storage Protocol Stacks. In USENIX Annual Technical Conference (Monterey, CA, June 2002), pp. 177-190.

[15] Foster, I., Vockler, J., Wilde, M., and Zhao, Y. Chimera: A Virtual Data System for Representing, Querying, and Automating Data Derivation. In Intl Conf on Scientific and Statistical Database Management (Edinburgh, Scotland, July 2002), pp. 37-46.

[16] Gibson, G. A., Nagle, D. F., Amiri, K., Chang, F. W., Feinberg, E. F., Gobioff, H., Lee, C., Ozceri, B., Riedel, E., Rochberg, D., and Zelenka, J. File Server Scaling with Network-Attached Secure Disks. In ACM SIGMETRICS Conference (Seattle, WA, June 1997), pp. 272-284.

[17] Gifford, D. K., Jouvelot, P., Sheldon, M. A., and O'Toole, J. W., Jr. Semantic File Systems. In ACM Symposium on Operating Systems Principles (Asilomar, CA, Oct. 1991), pp. 16-25.

[18] Heidemann, J., and Popek, G. Performance of Cache Coherence in Stackable Filing. In ACM Symposium on Operating Systems Principles (Copper Mountain Resort, CO, Dec. 1995), pp. 127-142.

[19] Kaminsky, M., Peterson, E., Fu, K., Mazières, D., and Kaashoek, M. F. REX: Secure, Modular Remote Execution through File Descriptor Passing. Tech. Rep. TR-884, MIT-LCS, Jan. 2003.

[20] Keeton, K., Patterson, D. A., and Hellerstein, J. M. A Case for Intelligent Disks (IDISKs). SIGMOD Record 27, 3 (Sept. 1998), 42-52.

[21] Krieger, O., and Stumm, M. HFS: A Performance-Oriented Flexible File System Based on Building-Block Compositions. ACM Transactions on Computer Systems 15, 3 (Aug. 1997), 286-321.

[22] Ma, X., and Reddy, A. L. N. Implementation and Evaluation of an Active Storage System Prototype. In Workshop on Novel Uses of System Area Networks, held in conjunction with HPCA-8 (Cambridge, MA, Feb. 2002).

[23] Magoutis, K., Addetia, S., Fedorova, A., Seltzer, M. I., Chase, J. S., Gallatin, A. J., Kisley, R., Wickremesinghe, R. G., and Gabber, E. Structure and Performance of the Direct Access File System. In USENIX Annual Technical Conference (Monterey, CA, June 2002), pp. 1-14.

[24] Mapserver 3.6. Environmental Resources Spatial Analysis Center, University of Minnesota. http://mapserver.gis.umn.edu/.

[25] Mazières, D. A Toolkit for User-Level File Systems. In USENIX Annual Technical Conference (June 2001), pp. 261-274.

[26] Papakonstantinou, Y., Garcia-Molina, H., and Widom, J. Object Exchange across Heterogeneous Information Sources. In IEEE Intl Conference on Data Engineering (Mar. 1995), pp. 251-260.

[27] Riedel, E., Faloutsos, C., Gibson, G. A., and Nagle, D. Active Disks for Large-Scale Data Processing. IEEE Computer 34, 6 (June 2001), 68-74.

[28] Rizzo, L. Dummynet: A Simple Approach to the Evaluation of Network Protocols. ACM Computer Communication Review 27, 1 (Jan. 1997), 31-41.

[29] Sivathanu, M., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. Evolving RPC for Active Storage. In ASPLOS (San Jose, CA, Oct. 2002), pp. 264-276.

[30] Szalay, A. S., Kunszt, P. Z., Thakar, A., Gray, J., Slutz, D., and Brunner, R. J. Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey. In ACM SIGMOD (Dallas, TX, May 2000), pp. 451-462.

[31] Vitter, J. S. External Memory Algorithms and Data Structures: Dealing with Massive Data. ACM Computing Surveys 33, 2 (June 2001), 209-271.

[32] White, B. S., Walker, M., Humphrey, M., and Grimshaw, A. S. LegionFS: A Secure and Scalable File System Supporting Cross-Domain High-Performance Applications. In ACM/IEEE Conference on Supercomputing (Denver, CO, Nov. 2001).

[33] Wickremesinghe, R., Chase, J. S., and Vitter, J. S. Distributed Computing with Load-Managed Active Storage. In IEEE Intl Symp on High Performance Distributed Computing (Edinburgh, Scotland, July 2002), pp. 13-23.

[34] Zadok, E., and Nieh, J. FiST: A Language for Stackable File Systems. In USENIX Annual Technical Conference (San Diego, CA, June 2000), pp. 55-70.