Cost/Performance Control in SNOWBALL Distributed File Manager

Radek Vingralek (Department of Computer Science, University of Kentucky, Lexington, KY 40506, E-mail: [email protected])
Yuri Breitbart (Department of Computer Science, University of Kentucky, Lexington, KY 40506, E-mail: [email protected])
Gerhard Weikum (Department of Computer Science, University of the Saarland, D-66041 Saarbruecken, Germany, E-mail: [email protected])

This material is based in part upon work supported by NSF under grant IRI-9221947.
Abstract

Networks of workstations are an emerging architectural paradigm for high-performance parallel and distributed systems. Exploiting networks of workstations for massive data management poses exciting challenges. We consider here the problem of managing record-structured files in such an environment. The file records are accessed by a dynamically growing set of clients based on a search key. To scale up the throughput of client accesses with approximately constant response time, the files and thus also their access load are dynamically redistributed across a growing set of workstations. The redistribution method supports explicit control of system cost/performance. Namely, the system maintains its cost/performance at a prescribed constant level for a wide spectrum of workloads, as confirmed by experimental simulation results. Consequently, the system is capable of providing soft guarantees on record retrieval times.
1 Introduction

The current computing environment is characterized by a variety of workstations and PCs interconnected by a LAN or WAN. Such an environment is frequently called a workstation farm or a network of workstations, NOW for short [ACP95]. The aggregate processing power of such networks of workstations is comparable to, and often exceeds, that of mainframes and
supercomputers. For example, a farm consisting of 50 Sun SPARCstation 20 workstations is capable of delivering approximately 2 GFLOPS, 2 GB of RAM, 100 GB of disk storage capacity, 500 MB/s of disk bandwidth, and 1 MIPS per $250, which is 100 times less than the cost of 1 MIPS on a mainframe [S95, ACP95, DG92] (these figures are based on 1992 pricing; today's workstations have likely gained an even stronger cost/performance advantage over mainframes). These costs are a consequence of a simple economic fact: the more computers are sold, the more easily the R&D costs can be amortized among them.

Besides these economic considerations, networks of workstations have an additional advantage. More and more applications become I/O bound, and overall system performance improvement is impeded by the relatively slow growth of disk bandwidth. I/O parallelism has emerged as one of the possible ways to bridge the performance gap between processors and disks. For this reason, developing distributed I/O systems that increase the available I/O bandwidth is crucial to most data-intensive applications.

To fully exploit the I/O capabilities of a network of workstations, any distributed file management system must be able to ensure cost/performance scalability. By cost/performance scalability [DG92] we understand the following: whenever the total volume of load in the system grows by a factor of N, then by increasing the system size by a factor of N the system throughput grows by a factor of N while the data access time remains approximately constant. If the system's cost/performance is not scalable, then the system might have to grow by a significant number of additional nodes to achieve even a small increase in its throughput. Consequently, the aforementioned cost/performance advantage of data distribution in networks of workstations would be lost. For most delay-sensitive applications such as multimedia or WWW storage servers [GC92, CK95, KMR95] or real-time databases [AG92, HJC93], cost/performance scalability is not sufficient. Namely, explicit cost/performance control is also necessary to deliver a required quality of service.

To address these points, we proposed in [VBW95] SNOWBALL (Scalable Storage on Networks Of Workstations with BALanced Load), a method for distributing keyed, record-structured files over the nodes of a network of workstations. Several records are grouped into variable-sized segments which are distributed over a dynamically changing set of servers. The records are accessed by clients which issue insert, delete, update or query operations on the records. We assume that the same workstation may act as both server and client. Each client sends an operation request to the server which, in the client's view, holds the requested record. The client's view of the system is represented by its private copy of the address table (the address table represents a mapping from the key address space to the server address space and consists of a dynamic indexing structure built
on top of a lookup table). Updates to the address table are lazily propagated to the clients whenever a client commits an addressing error. The address table information is updated autonomously by each server; there is no materialized "correct" view of the system. The servers process the clients' requests independently of each other, and thus the clients can benefit from inter-request parallelism. The servers dynamically balance the load of the system and adjust the system size by means of segment migrations, which are directed either onto lightly loaded or onto previously unused servers (in the latter case the system size is expanded). Such dynamic system expansion is especially important for applications like WWW or multimedia document storage systems, where both bandwidth and storage space requirements might grow substantially over time. The decision on whether the system should be expanded with a new server aims at keeping the average server load (and thus also the average query response time) at a constant, prespecified level. In this way, the system provides explicit control of its cost/performance. Both addressing maintenance and load management are fully distributed; there is no single point of failure in SNOWBALL.

The SNOWBALL data distribution method seems to be the only one that considers data load balancing and system size adjustment with explicit cost/performance control in a shared-nothing architecture while remaining robust to skewed and time-evolving data access patterns. It builds on ideas presented in [SPW90, LNS93, LNS94, JK93, Dev93, KW94, VBW94, BVW95, BS85, TLC85, ELZ86, LLM88, WSZ91, SWZ93, SWZ94, SGK85, OCD88, HKM88].

The rest of the paper is organized as follows. Section 2 gives an overview of the new SNOWBALL method. Section 3 introduces a simulation testbed that we developed for performance evaluation. Section 4 presents experimental results on performance and cost/performance control in SNOWBALL. Finally, Section 5 concludes the paper.
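To make the addressing scheme more concrete, the following minimal Python sketch illustrates a client resolving a key through its private address table and lazily correcting the table after an addressing error. All names are hypothetical, and the variant shown here (the client retries after receiving a correction) is only one possible realization; the actual SNOWBALL protocol is specified in [VBW95].

```python
import bisect

class ClientAddressTable:
    """Client's private, possibly stale view of the key-to-server mapping
    (a lookup table of key-range boundaries; hypothetical sketch)."""

    def __init__(self, boundaries, servers):
        # Keys below boundaries[0] map to servers[0]; keys in
        # [boundaries[i-1], boundaries[i]) map to servers[i], and so on.
        self.boundaries = list(boundaries)
        self.servers = list(servers)

    def lookup(self, key):
        # Binary search stands in for the dynamic index built on top
        # of the lookup table.
        return self.servers[bisect.bisect_right(self.boundaries, key)]

    def apply_correction(self, boundaries, servers):
        # Lazily install addressing information returned by a server
        # that detected the client's addressing error.
        self.boundaries, self.servers = list(boundaries), list(servers)

def send_request(table, servers, key, op):
    """Send op to the server the client believes holds the key; if the
    view was stale, apply the piggybacked correction and retry once."""
    reply = servers[table.lookup(key)].handle(key, op)
    if reply.get("addressing_error"):
        table.apply_correction(*reply["correction"])
        reply = servers[table.lookup(key)].handle(key, op)
    return reply
```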
2 Overview of the SNOWBALL Algorithm
2.1 Server Load Management Policies
We assume that the load of each server can grow up to a certain threshold which we call the maximal server load and denote by l_L. Exceeding l_L on a server s results in a response time degradation exceeding the given performance constraints, and therefore the load of the server must be alleviated by migrating some of its segments (and thus also the load associated with them) onto another server. The migration is directed onto a new, previously unused server s' whenever the average server load (or global system load), with an additional server factored in, is above a prespecified threshold called the requested global system load and denoted by l_G (this indicates that the load in the system is already high and growth of the system onto a new server is necessary for performance reasons). If, however, the global system load is below or at the prespecified threshold l_G, the least loaded server s0 is selected and the excess load of server s is redistributed equally between s and s0. In general, a segment migration onto a new server s' can be triggered even before the load of any server reaches l_L: whenever the most loaded server in the system discovers that the global system load (with a new server factored in) is above the threshold l_G, it migrates some of its segments onto s'. Servers can also be released from the system: if the least loaded server in the system finds that the global system load (with one server factored out) is below l_G, then it attempts to migrate all its segments onto the second least loaded server s1 (the migration is not realized if s1 cannot sustain the load of all segments of the least loaded server). By maintaining the global system load at a constant level l_G, the system guarantees a constant average query response time irrespective of the system size and client load. To this end, it is absolutely necessary that the interference of segment migrations with the processing of client requests is kept to a minimum (Section 2.2 describes the necessary segment migration mechanisms). While such an approach might not be suitable in systems requiring hard (i.e., worst-case) guarantees for purely sequential retrieval of continuous media (as in, e.g., video-on-demand systems), we believe that the soft performance guarantees provided in SNOWBALL are more appropriate in systems requiring efficient random access to record-structured files characterized by a priori unknown and highly dynamic access patterns, such as WWW servers [KMR95] or multimedia databases [GC92, CK95]. Furthermore, providing soft performance guarantees leads to better cost/performance since the disk bandwidth is not reserved based on the worst-case scenario.

Each server needs to find out the load of all other servers before it can decide where to migrate its segments. There are several ways in which a server s can learn about the load of other servers. In SNOWBALL we have adopted an approach in which each server distributes its load to the other servers whenever it changes by more than δ (the load report granularity) from the last distributed value. Moreover, each server can adjust its value of δ based on the network utilization, to prevent the load reports from degrading the request response times. Although more sophisticated policies (e.g., [BS85]) could be easily accommodated in SNOWBALL, such an increase in complexity does not lead to a better response time for our parameter settings.

The policy which determines which segments should be migrated from server s to another server aims at an equal distribution of load among all servers. Since finding the optimal solution is an NP-complete problem [GJ79], we use a heuristic algorithm, BSF (Biggest Segment First), which is an extension of the algorithm described in [Gr69]. A detailed description of the BSF algorithm can be found in [VBW95].
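To make the policy concrete, the following Python sketch (with hypothetical names and simplified bookkeeping; the precise policy and the BSF segment selection are given in [VBW95]) illustrates the decision an overloaded server might take, as well as the check for releasing a server.

```python
def choose_migration_target(server_loads, l_G):
    """Decide where a server whose load exceeded l_L should shed load.

    server_loads: dict mapping server id -> current load (e.g., 0.0 - 1.0)
    l_G:          requested global system load
    Returns ("expand", None) to acquire a previously unused server s',
    or ("redistribute", s0) to shed excess load onto the least loaded
    server s0.
    """
    n = len(server_loads)
    total = sum(server_loads.values())
    # Global system load with one additional server factored in.
    if total / (n + 1) > l_G:
        return ("expand", None)
    s0 = min(server_loads, key=server_loads.get)
    return ("redistribute", s0)

def may_release(server_loads, l_G):
    """Check run by the least loaded server: the system may shrink only
    if the global load with one server factored out stays below l_G."""
    n = len(server_loads)
    if n <= 1:
        return False
    return sum(server_loads.values()) / (n - 1) < l_G
```

For example, with loads {"A": 0.85, "B": 0.40, "C": 0.35} and l_G = 0.6, the average load with a fourth server factored in is 0.40, so the overloaded server redistributes its excess onto the least loaded server rather than expanding the system.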
2.2 Server Load Management Mechanisms
Once a server triggers a segment migration to another server, it is important for scalability and control of cost/performance that the interference of the migration with the processing of client operation requests is minimal. In addition, we require that segments remain available during the migration process and that the access time to these segments is not affected to any significant extent. To achieve the former objective, we require that accesses to the critical resource of each server (its disk) done on behalf of segment migrations have lower priority than accesses on behalf of client requests. It could, however, happen that the requests on the migration source server arrive at such a rate that its disk approaches saturation and the migration would be postponed indefinitely. To guarantee that each migration completes once it has started, we designed a policy which dynamically increases the priority of migration disk accesses (with respect to the operation request disk accesses) depending on the server load [VBW95]. The latter objective, i.e., segment availability during its migration, is achieved by keeping a copy of each migrated segment on the source server and processing all requests to the migrated segment on the copy until the entire segment has been received by the destination server. All updates, inserts and deletes of records which have already been migrated to the destination server are forwarded along with the other migrated data. Additional details can be found in [VBW95].
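As an illustration of the first mechanism, the sketch below shows a disk request queue in which migration I/Os normally yield to client I/Os but are boosted once the server load crosses a threshold, so that a started migration cannot be starved indefinitely. The threshold-based boost is a simplification chosen here for brevity; the actual load-dependent priority adjustment is described in [VBW95].

```python
import heapq
import itertools

CLIENT, MIGRATION = 0, 1   # lower value = higher priority

class PriorityDiskQueue:
    """Disk queue in which migration I/Os have lower priority than client
    I/Os, with a load-dependent boost so migrations eventually complete."""

    def __init__(self, boost_load=0.9):
        self._heap = []
        self._seq = itertools.count()   # preserves FIFO order within a class
        self.boost_load = boost_load    # hypothetical boost threshold

    def submit(self, request, kind, server_load):
        prio = kind
        if kind == MIGRATION and server_load > self.boost_load:
            # Boost: treat the migration I/O like a client I/O so that a
            # saturated server cannot postpone the migration forever.
            prio = CLIENT
        heapq.heappush(self._heap, (prio, next(self._seq), request))

    def next_request(self):
        """Return the next disk request to serve, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```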
2.3 Load Tracking
We define the server load as the sum of the loads of all segments located at the server. Similarly to [SWZ93, SWZ94], we define the segment load as the sum of the ratios of arrival and service rates of each operation class (i.e., record insertion, deletion, update and query). Consequently, the load of a segment is given as
    \sum_{o \in O} \frac{\lambda_o}{\mu_o}                                (1)
where O is the set of all operation classes on the segment, λ_o is the arrival rate of operation o, and μ_o is the service rate of operation o, given as the reciprocal of its service time (o ∈ O). The load of a server is the accumulated load of its segments. The arrival rate of each operation o ∈ O is estimated by maintaining a "sliding window" of the arrival times of the last n operations o performed on the segment [SWZ93]. The arrival rate λ_o is thus approximated as
    \lambda_o \approx \frac{n}{t - t_1}                                   (2)
where t is the current time and t_1 is the arrival time of the oldest operation kept in the window. Whenever fewer than n operations have been performed on the segment, the arrival rate λ_o is not defined. The load of the segment is defined whenever the arrival rate of at least one operation class is defined. For efficiency, the load of each segment is recalculated only when an operation request to that segment has been received, or at fixed time periods in the absence of any request arrivals. Information on the selection of the parameter n can be found in [VBW95].
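A minimal Python sketch of this load tracking scheme is given below; class and parameter names are hypothetical, and the choice of n is discussed in [VBW95].

```python
from collections import deque
import time

class SegmentLoadTracker:
    """Sliding-window estimator of per-operation-class arrival rates and
    of the segment load of equation (1): sum over classes of lambda_o/mu_o."""

    def __init__(self, service_rates, n=20):
        # service_rates: dict mapping operation class -> mu_o
        #                (the reciprocal of its service time)
        self.service_rates = service_rates
        self.n = n
        self.windows = {op: deque(maxlen=n) for op in service_rates}

    def record_arrival(self, op, now=None):
        """Remember the arrival time of the latest operation of class op."""
        self.windows[op].append(time.time() if now is None else now)

    def arrival_rate(self, op, now=None):
        """Equation (2): lambda_o ~ n / (t - t1); undefined (None) until
        n operations of class op have been observed."""
        window = self.windows[op]
        if len(window) < self.n:
            return None
        t = time.time() if now is None else now
        return self.n / (t - window[0]) if t > window[0] else None

    def load(self, now=None):
        """Equation (1); None if no operation class has a defined rate."""
        rates = {op: self.arrival_rate(op, now) for op in self.service_rates}
        terms = [lam / self.service_rates[op]
                 for op, lam in rates.items() if lam is not None]
        return sum(terms) if terms else None
```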
3 Simulation Model

We developed a detailed simulation testbed for performance experimentation based on CSIM [Sch92]. A detailed description of the testbed can be found in [VBW95]; here we describe only its key characteristics. We model a homogeneous system in which the CPUs and disks of all servers have identical characteristics. Namely, each server has a disk which can fetch a block of size size_block within time t_IO, and a processor with MIPS rate rate_CPU. We assume a uniform CPU cost for all servers to service a client request (instr_req) and to prepare a packet for transmission (instr_msg). The network is shared by all servers and clients.

We model a system storing large records (size_record) such as WWW or multimedia documents. Consequently, the data is written to disk in large contiguous blocks (size_block) to efficiently utilize the disk bandwidth. Since we concentrate here on efficient data redistribution on the servers' disks, we assume, for simplicity, no main memory caching of records on either servers or clients. The hardware setup (rate_CPU, t_IO, bandw_net and size_max_pack) aims at what we expect to be a commodity network of workstations in five to ten years (i.e., a 100 MIPS processor, a 10 GByte disk with an access time of 15 ms for reading or writing an entire track, and communication over an ATM-based LAN with a bandwidth on the order of 1 Gb/sec).

Each client generates requests with interarrival times exponentially distributed with arrival rate λ_req. We concentrate on workloads consisting only of query and insert operations, since we assume that this is the typical scenario in most WWW or multimedia repositories. The workloads are dominated by queries, while the fraction of inserts in each workload is fixed at p_I. The keys of both insert and query requests are generated within the range [min, max]. The keys of queries follow a Zipf-like distribution [Knu73] with a skew parameter θ corresponding to an 80-20 skew. This captures, e.g., the situation where a multimedia repository contains some hot data (e.g., the NCSA WWW home page) and cold data (e.g., NCSA's copyright information). The access patterns might evolve over time, i.e., previously hot data might become cold and vice versa, which is modeled by shifting the range [min, max]. Each experiment starts with a single server containing no records and a single client.
Notation        Meaning                               Value
rate_CPU        server CPU instruction rate           100 MIPS
instr_req       CPU cost to serve a client request    10000 instructions
instr_msg       CPU cost for each message             5000 instructions
t_IO            mean single-block disk I/O time       15 msec
size_block      I/O block size                        50 KB
size_record     record size                           10 KB
lat_net         network latency                       0.1 msec
bandw_net       network bandwidth                     1 Gb/sec
size_max_pack   maximum packet size                   100 KB
l_G             required global system load           40% - 80%
l_L             maximal allowed server load           50% - 90%
δ               load report granularity               1% - 100%
n_c             total number of created clients       225
λ_c             client growth rate                    0.5% - 2% of λ_req
λ_req           request arrival rate                  5.797 1/sec
p_I             probability of insert                 0.1
θ               skew of query key distribution        0.1386

Figure 1: Simulation parameters.
Throughout each experiment, the workload is increased by generating additional clients. The time between the appearance of new clients is exponentially distributed with arrival rate λ_c. The clients generate requests with the fixed arrival rate λ_req until a total of n_c clients has been created. The experiment terminates after all requests have been acknowledged. In [VBW95] we also consider workloads with a decreasing number of clients to test the system's ability to shrink its size. The table in Figure 1 summarizes the configuration parameters and their values for the simulation model considered in this paper. In [VBW95] we consider a broader spectrum of applications and various hardware setups. [VBW95] also contains additional experimental results on the cost/performance scalability of various load management policies, a detailed study of the load management mechanisms, and a study of the effects of the most important parameters on system performance.
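For illustration, the sketch below generates a single client's request stream under these assumptions. The inverse-transform generator for the 80-20 skew (with θ = log 0.80 / log 0.20 ≈ 0.1386) is only one common approximation of the Zipf-like distribution of [Knu73]; all function names are hypothetical.

```python
import math
import random

THETA = math.log(0.80) / math.log(0.20)   # ~0.1386, the 80-20 skew parameter

def skewed_key(min_key, max_key, theta=THETA, rng=random):
    """Draw a query key with an 80-20 style skew over [min_key, max_key]:
    about 80% of the draws fall into the lowest 20% of the range."""
    frac = rng.random() ** (1.0 / theta)
    return min_key + int(frac * (max_key - min_key))

def client_request_stream(lambda_req, p_insert, min_key, max_key, rng=random):
    """Yield (interarrival_time, operation, key) tuples for one client:
    exponential interarrivals with rate lambda_req, inserts with
    probability p_insert, queries (with skewed keys) otherwise."""
    while True:
        gap = rng.expovariate(lambda_req)
        if rng.random() < p_insert:
            yield gap, "insert", rng.randint(min_key, max_key)
        else:
            yield gap, "query", skewed_key(min_key, max_key, rng=rng)
```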
4 Experimental Results

Load management must reconcile two contradictory goals: to maximize the system's performance and at the same time to minimize the costs paid for achieving that performance. The performance in our model is expressed in terms of the average query response time (we assume that the record query response time is most critical to the majority of delay-sensitive applications). The lower the response time, the better the performance. The costs are expressed by the number of acquired servers for a fixed number of clients. The more servers are acquired, the lower the global system load and thus the better the query response time. However, if the number of acquired servers is large, any further increase does not lead to any appreciable improvement in query response time, and consequently the cost/performance deteriorates. At the same time, if the number of acquired servers is small, the global system load is high, and thus any further decrease in the number of servers creates a substantial increase in the query response time without comparable savings in the number of servers. Again, the cost/performance degrades.

To be able to quantify the cost/performance, we need to define an explicit metric. Typically (e.g., [G91]), the performance of a system is given by its throughput. We thus define the performance as the average query throughput of a server (the aggregate query throughput grows with the number of servers), which is given as the reciprocal of the average query response time. The costs are typically expressed by the five-year cost of ownership [G91]. Since this is relatively difficult to express in our model, we approximate such a cost by the average number of servers acquired per 100 clients. Consequently, the cost/performance metric (cost divided by performance) in our model is given by

    cost/performance = (#servers per 100 clients) \times (average query response time)        (3)

In [VBW95] we analyze the cost/performance metric on an analytical level. Namely, we derive a lower bound on cost/performance based on an M/G/1 model. The lower bound assumes a perfectly balanced load and ignores the extra costs of segment migrations. Figure 2 shows the lower bound on cost/performance as a function of the global system load. The figure also shows the measured cost/performance for several settings of the requested global system load l_G and the maximal allowed server load l_L. The results indicate that SNOWBALL is indeed capable of controlling its cost/performance and, in addition, that its achieved cost/performance is relatively close to the derived lower bound. Consequently, the load management mechanisms described in Section 2.2 are indeed able to absorb the extra costs of segment migrations without significant effects on the system's cost/performance. However, the extra migration costs grow together with the global system load. Firstly, the same increase in the global system load results in a more significant query response time increase when the level of the global load is high. Secondly, the higher the requested global system load l_G, the more migrations onto the acquired servers are needed to achieve it. Finally, the same server load imbalance affects the query response time more significantly when the global system load is high.
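As a concrete illustration, the sketch below evaluates the metric of equation (3), using the standard Pollaczek-Khinchine formula for the mean response time of an M/G/1 queue as a stand-in for the response-time estimate; this is the textbook formula, not the specific lower-bound derivation of [VBW95], and the numbers in the example are purely illustrative.

```python
def mg1_response_time(service_time, load, cv2=1.0):
    """Mean M/G/1 response time (Pollaczek-Khinchine):
    E[R] = E[S] + load * E[S] * (1 + cv2) / (2 * (1 - load)),
    where cv2 is the squared coefficient of variation of the service time."""
    assert 0.0 <= load < 1.0
    return service_time + load * service_time * (1.0 + cv2) / (2.0 * (1.0 - load))

def cost_performance(servers_per_100_clients, avg_query_response_time):
    """Equation (3): cost (servers acquired per 100 clients) divided by
    performance (per-server throughput, i.e. 1 / response time)."""
    return servers_per_100_clients * avg_query_response_time

# Purely illustrative numbers (not taken from the paper's experiments):
# 15 msec mean disk service time, 60% global system load, 15 servers per 100 clients.
rt_msec = mg1_response_time(service_time=15.0, load=0.6)
print(cost_performance(servers_per_100_clients=15.0, avg_query_response_time=rt_msec))
```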
Figure 2: Cost/performance control. Measured cost/performance (curves G1-G4 for l_G = 40%, 50%, 60%, 70%, each with l_L = 80%) and the derived lower bound, plotted against the measured global system load (%).

Delay-sensitive applications must also consider the quality of performance control in addition to cost/performance control. Figure 3 shows the average query response time for several different values of the requested global system load l_G. For each setting of l_G, the figure shows the lower bound on the response time as obtained using the M/G/1 model. It confirms the cost/performance control results in that the quality of control (in terms of the difference between the measured and predicted response times) is excellent for moderate values of the global system load, but degrades as the global system load grows. The same argument regarding the increasing costs of segment migrations mentioned in the cost/performance considerations above applies here as well. Figure 3 also shows that the predictability of the system's performance (in terms of the query response time variance) degrades as the global system load grows. This is a basic consequence of the M/G/1 model. In addition, Figure 3 shows the scalability of the query response time, which remains practically at the same level as the system size increases (more detailed results on cost/performance scalability can be found in [VBW95]).

Figure 3: Performance control. Average document retrieval time (msec) as a function of the number of clients, for l_G = 40%, 55%, and 70%, each with l_L = 80%.
5 Concluding Remarks

In this paper we described a novel approach to distributed data management, the SNOWBALL method, which is completely decentralized and provides scalability and explicit control of cost/performance in the presence of access skew and evolving access patterns. It involves dynamic load redistribution by means of segment migrations and directly considers the global system load. SNOWBALL provides only soft performance guarantees in that it keeps the average query response time at a constant level. Experiments conducted for a large spectrum
of workloads confirm that the costs of segment migrations can indeed be successfully absorbed by an appropriate design of the load management mechanisms for low and moderate global system loads. Consequently, the achieved average query response time is very close to the predicted value under these conditions. In our future work we plan to extend our approach to a multilevel storage model where each server can store data in its main memory, on disk, or on tertiary storage units. Since some delay-sensitive applications require hard query guarantees (e.g., for continuous media), we intend to integrate the provision of such guarantees into our model.
References

[AG92] R. Abbott, H. Garcia-Molina, Scheduling Real-Time Transactions: A Performance Evaluation, ACM Transactions on Database Systems, Vol. 13, No. 3, 1992.
[ACP95] T. E. Anderson, D. E. Culler, D. A. Patterson, and the NOW team, A Case for NOW (Networks of Workstations), IEEE Micro, Vol. 15, No. 1, 1995.
[BS85] A. Barak, A. Shiloh, A Distributed Load Balancing Policy for a Multicomputer, Software Practice & Experience, Vol. 15, No. 9, September 1985, pp. 901-913.
[BVW95] Y. Breitbart, R. Vingralek, G. Weikum, Load Control in Scalable Distributed File Structures, Technical Report 254-95, Department of Computer Science, University of Kentucky, Lexington, KY, February 1995.
[CK95] S. Christodoulakis, L. Koveos, Multimedia Information Systems: Issues and Approaches, in W. Kim (ed.), Modern Database Systems, ACM Press, 1995.
[CACR95] P. E. Crandall, R. A. Aydt, A. A. Chien, D. A. Reed, Input/Output Characteristics of Scalable Parallel Applications, available at http://www-pablo.cs.uiuc.edu.
[Dev93] R. Devine, Design and Implementation of DDH: A Distributed Dynamic Hashing Algorithm, 4th International Conference on Foundations of Data Organization and Algorithms (FODO), Chicago, 1993.
[DG92] D.J. DeWitt, J.N. Gray, Parallel Database Systems: The Future of High Performance Database Systems, Communications of the ACM, Vol. 35, No. 6, June 1992, pp. 85-98.
[ELZ86] D.L. Eager, E.D. Lazowska, J. Zahorjan, Adaptive Load Sharing in Homogeneous Distributed Systems, IEEE Transactions on Software Engineering, Vol. 12, No. 5, May 1986, pp. 662-675.
[GJ79] M.R. Garey, D.S. Johnson, Computers and Intractability, Freeman and Co., 1979.
[GC92] J. Gemmel, S. Christodoulakis, Principles of Storage and Retrieval for Delay Sensitive Data, ACM Transactions on Information Systems, 1992.
[Gr69] R.L. Graham, Bounds on Multiprocessing Timing Anomalies, SIAM Journal on Applied Mathematics, Vol. 17, No. 2, 1969, pp. 416-429.
[G91] J. Gray (Editor), The Benchmark Handbook for Database and Transaction Processing Systems, Morgan Kaufmann, 1991.
[HJC93] D. Hong, T. Johnson, S. Chakravarthy, Real-Time Transaction Scheduling: A Cost Conscious Approach, ACM SIGMOD Conference, 1993.
[HKM88] J.H. Howard, M.L. Kazar, S.G. Menees, D.A. Nichols, M. Satyanarayanan, R.N. Sidebotham, Scale and Performance in a Distributed File System, ACM Transactions on Computer Systems, Vol. 6, No. 1, 1988.
[HLY93] K.A. Hua, C. Lee, H.C. Young, Data Partitioning for Multicomputer Database Systems: A Cell-based Approach, Information Systems, Vol. 18, No. 5, 1993, pp. 329-342.
[JK93] T. Johnson, P. Krishna, Lazy Updates for Distributed Search Structure, ACM SIGMOD Conference, Washington, 1993.
[Knu73] D. Knuth, The Art of Computer Programming, Addison-Wesley, 1973.
[KW94] B. Kroll, P. Widmayer, Distributing a Search Tree Among a Growing Number of Processors, ACM SIGMOD Conference, Minneapolis, 1994.
[KMR95] T. T. Kwan, R. E. McGrath, D. A. Reed, User Access Patterns to NCSA's World Wide Web Server, available at http://www-pablo.cs.uiuc.edu.
[LLM88] M.J. Litzkow, M. Livny, M.W. Mutka, Condor - A Hunter of Idle Workstations, 8th International Conference on Distributed Computing Systems (DCS), San Jose, 1988.
[LNS93] W. Litwin, M.-A. Neimat, D.A. Schneider, LH* - Linear Hashing for Distributed Files, ACM SIGMOD Conference, Washington, 1993; extended version published as Technical Report HPL-93-21, Hewlett-Packard Labs, 1993.
[LNS94] W. Litwin, M.-A. Neimat, D.A. Schneider, RP*: A Family of Order-Preserving Scalable Distributed Data Structures, VLDB Conference, Santiago de Chile, 1994.
[OCD88] J.K. Ousterhout, A.R. Cherenson, F. Douglis, M.N. Nelson, B.B. Welch, The Sprite Network Operating System, IEEE Computer, Vol. 21, No. 2, 1988.
[P94] C. Partridge, Gigabit Networking, Addison-Wesley, 1994.
[SGK85] R. Sandberg, D. Goldberg, S. Kleiman, D. Walsh, B. Lyon, Design and Implementation of the Sun Network File System, Usenix 1985 Summer Conference, 1985.
[SWZ93] P. Scheuermann, G. Weikum, P. Zabback, Adaptive Load Balancing in Disk Arrays, Int. Conf. on Foundations of Data Organization and Algorithms (FODO), Chicago, 1993.
[SWZ94] P. Scheuermann, G. Weikum, P. Zabback, Disk Cooling in Parallel Disk Systems, IEEE Data Engineering Bulletin, Vol. 17, No. 3, September 1994, pp. 29-40.
[SPW90] C. Severance, S. Pramanik, P. Wolberg, Distributed Linear Hashing and Parallel Projection in Main Memory Databases, VLDB Conference, Brisbane, 1990.
[Sch92] H. Schwetman, CSIM Reference Manual (Revision 16), Microelectronics and Computer Technology Corporation, Austin, 1992.
[S95] Sun Microsystems, Inc., SPARCstation Desktop Product Line Overview, available at http://www.sun.com, 1995.
[TLC85] M. M. Theimer, K. A. Lantz, D. R. Cheriton, Preemptable Remote Execution Facilities for the V-System, Proceedings of the 10th ACM Symposium on Operating Systems Principles, 1985.
[VBW94] R. Vingralek, Y. Breitbart, G. Weikum, Distributed File Organization with Scalable Cost/Performance, ACM SIGMOD Conference, Minneapolis, 1994.
[VBW95] R. Vingralek, Y. Breitbart, G. Weikum, SNOWBALL: Scalable Storage on Networks of Workstations with Balanced Load, Technical Report, University of Kentucky, 1995.
[WSZ91] G. Weikum, P. Scheuermann, P. Zabback, Dynamic File Allocation in Disk Arrays, ACM SIGMOD Conference, Denver, 1991.