DERBY: A Memory Management System for Distributed Main Memory Databases

James Griffioen   Todd Anderson   Yuri Breitbart
Department of Computer Science, University of Kentucky, Lexington, KY 40506

Radek Vingralek
Department of EECS, Northwestern University, Evanston, IL 60208
Abstract
This paper describes a main memory data storage system for a distributed system of heterogeneous general purpose workstations. We show that distributed main memory storage managers are qualitatively different from distributed disk-based storage managers. Specifically, we show that load balancing, which is crucial to disk-based systems [26], has little effect on the performance of a memory-based system. On the other hand, we show that saturation prevention, in cases where a server exceeds its memory capacity or becomes overloaded, is crucial to smooth performance. Finally, we show that distributed memory-based storage delivers performance more than one order of magnitude better than that of its disk-based counterparts.
1 Introduction
In recent years, workstations and personal computers have become extremely powerful and have seen impressive increases in storage capacities. The cost effectiveness of these systems combined with emerging high-speed network technologies has led to the development of high-performance data processing environments based on networks of workstations. A typical network environment consists of hundreds of workstations capable of delivering several GFLOPS and storing upwards of 10 GB of memory resident data. The aggregate capacity and processing power of these systems are grossly underutilized. Indeed, some studies indicate that about 91% of the time the majority of workstations sit idle [8]. This not only wastes valuable processing power (CPU time) but also high-performance storage space (memory). (This work was supported in part by NSF grant numbers IRI92121301, CCR-9309176, CDA-9320179, and CDA-9502645.)
Huge amounts of available and underutilized memory space coupled with high bandwidth local area networks have made memory-to-memory communication much faster than memory-to-disk transfers. Such massive high-performance memory systems are ideally suited for distributed main-memory database systems. However, a key to the success of such systems lies in the development of a highly efficient data storage system that fully utilizes the aggregate memory capacities of individual workstations. Moreover, the assumptions on which conventional algorithms such as load balancing are based do not hold for memory-based systems and must be reinvestigated. To guarantee high transaction throughput in a memory resident design, disk accesses should be minimized or totally eliminated from the critical path of user applications. This is a difficult task because modifications and insertions to the database must be stored with guaranteed persistence (i.e., a copy of the data must reside on disk).

This paper describes the design of the DERBY data storage system used to support a distributed memory-resident database system. DERBY utilizes unused capacity throughout the network to provide high transaction throughput. It dynamically redistributes the load via a saturation prevention algorithm that differs from the conventional practice of load balancing. DERBY guarantees data persistence, yet removes disk accesses from the storage system's critical paths, significantly improving the response time of insert, update, and delete operations. This is achieved by supplying a subset of the nodes of the network with Uninterruptible Power Supplies (UPS) to temporarily hold data while propagating it to disk. Finally, DERBY supports sequential data consistency when accessing records concurrently.
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 introduces the DERBY architecture, while section 4 describes the system's operation and algorithms. Section 5 examines load balancing and saturation prevention in memory-based systems. Finally, section 6 describes our simulation model and results.
2 Related Work
Memory-resident database systems and distributed memory systems have been widely studied. An excellent survey of issues pertaining to memory-resident databases is given in [11]. The survey discusses the impact of memory-resident databases on concurrency control, commit processing, access methods, data representation, query processing, recovery, performance issues, the application programming interface, and protection. Prototypes of several memory resident database systems have been described in [5, 24, 23, 12, 14, 15, 13]. [10] introduces the use of client caches to reduce transaction response time in a disk-based database system. Several papers have investigated methods for using underutilized CPU capacities of workstations [18, 20, 8, 6, 7] or load balancing for distributed database applications [25]. Many distributed shared memory systems provide consistent shared access to memory resident data [16, 3, 1, 21, 4]. [22, 17] and others have proposed using battery backed memory or uninterruptible power supplies to improve write performance. Our design extends these past works by reinvestigating main-memory database systems designed around a general purpose distributed computing environment. Unlike large main memory database systems or dedicated server designs, our design makes use of existing general purpose workstations to create a massive, cost-effective main memory storage system. Furthermore, our design completely eliminates disk access bottlenecks from the system via new algorithms and cost effective hardware.
3 DERBY Architecture Overview
The DERBY system consists of general purpose workstations connected via a high-speed network. Although we assume the network contains a large number of hosts, the network will typically be restricted to a local area radius to limit the network latency between any two machines. DERBY assumes that each workstation has one or more users. Users execute general purpose programs at random times, including the DERBY database application. The DERBY database application is a client of the DERBY storage management system. Rather than differentiate, we call all applications executed by users of a workstation client applications. If the workstation is currently executing any client applications, we say the workstation is acting as a client. Servers execute on machines with idle capacity and provide the storage space for the DERBY memory storage system. Servers have the lowest priority and are essentially "guests" of the machines whose resources they borrow. At any given time, each workstation may be operating as a client, a server, both, or neither, and its role may change over time. When clients need the local machine's resources, the server component will shrink or disappear. Likewise, when no clients are active, the server will acquire the idle resources. This configuration models a realistic data processing environment where users of the workstations come and go over time.

The primary role of a DERBY client is to forward read/write requests from the database application to the DERBY server where the record is located. After receiving the data from the server, the client caches the record to speed future local references. The primary role of the DERBY servers is to keep all records memory resident and to respond to client requests without incurring any disk accesses. Servers insure long-term persistence by eventually propagating newly written or modified records to disk. However, to insure short-term persistence without disk storage, we propose that approximately 20% of all workstations be outfitted with Uninterruptible Power Supplies (we call these workstations WUPS; 20% represents an insignificant system cost increase yet provides sufficient non-volatile storage capacity). Although there are other ways to insure persistent storage, DERBY uses WUPS for several reasons. First, we want a persistent storage mechanism that has the same performance as standard remote workstation memory. Conventional disks and many flash memories/disks do not provide the desired performance. Second, we want a cost effective way to achieve reasonably large amounts of persistent storage. Alternatives such as non-volatile RAM or flash memory are not yet cost effective and have limited capacity. Uninterruptible power supplies essentially change all the memory in the workstation into non-volatile RAM. Moreover, because the sole purpose of the UPS is to give the workstation adequate time to write the memory contents to disk and shut down (not to keep the machine running until power is restored), the most inexpensive UPS will suffice (e.g., a standby power supply). To provide high-speed persistent storage, we associate a set of n WUPS (denoted SUP_S) with each
server S. Each WUPS may be a member of several different SUP_S sets. Each SUP_S provides temporary stable storage for the server it is associated with.

Figure 1: The DERBY Architecture. Node 5 serves as the UPS for nodes 1, 2, 3, and 4.

The DERBY architecture is depicted in Figure 1. We assume that users are discouraged from using the UPS node. This assumption is not germane to our system, but ensures a more responsive system. Additional studies are planned to analyze the cost/benefit of this assumption. All other nodes dynamically partition their memory between client and server roles depending on the overall use of the system. The logical DERBY organization and interactions are shown in Figure 2.
Figure 2: The logical organization and interaction of DERBY components
3.1 DERBY Storage System Services
DERBY's basic data storage abstraction provides simple key-based fixed-length record storage. Any
database can be built on this basic storage abstraction. The abstraction assumes that there is exactly one server responsible for each record. The server where the record is stored is called the primary location of the record. Because servers come and go, the primary location of a record may change over time.

To guarantee data consistency and regulate concurrent access, DERBY provides a basic locking mechanism. The lock granularity is a single record. Larger granules are feasible, but system performance with larger granules is yet to be investigated. Clients support read-lock caching [10]. If a client holds a read-lock on behalf of one of its database applications, the client may grant the lock to another of its applications without communicating with the primary server. All write-lock requests go to the primary server. If a client requests a write-lock from a primary server, the server will send release-lock requests to all clients that currently have the record cached in their memory. DERBY does not support write-lock caching [9].

Recovery from lost data is implemented on top of the DERBY storage manager. The majority of the recovery log is kept on server disks. However, recently modified log buffers are stored on SUP_S much like write requests and thus can be written with low delay. Transaction locks are released as soon as the transaction log records its changes into k of the n SUP_S log buffers. Log buffers are then asynchronously propagated to disk. Since the data is distributed among many servers, the recovery mechanism is designed for individual site recovery. We assume that if a site fails for any reason, all its memory is lost. We use log records from disk and the WUPS' log buffers to restore the database state.
3.2 The DERBY Data Model
Records on each server are grouped into segments to limit the space and CPU overhead associated with managing records (e.g., managing address tables). Segments are the basic unit of load and vary in size but never exceed the size of the server's memory. The "memory load" of a segment is simply its size. The "processing load" on a segment is derived from the average number of accesses to records within the segment. Thus, segments measure the CPU and memory load generated by users' data requests. If the load on a segment exceeds some predetermined limit, the segment is subdivided into two or more subsegments which may be moved to other servers. Each time a new segment is created, its server address is recorded in the address table.

For any given record, DERBY uses a two-level process to find its associated server. DERBY first applies an addressing function to convert the record's key to a segment number. It then consults an address table to map the segment to a server address [25]. Every client and server maintains a copy of the address table. However, these copies may be inconsistent and thus may require request forwarding to locate the desired record. When a server receives a request for a record it no longer has, it uses its address table to find a new address for the segment containing the record. After a limited number of forwardings the correct address for the record will be found [2]. Each time an addressing error occurs, the client's or server's address table is updated with the correct information.
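As a rough illustration of this two-level scheme, the following sketch shows one possible structure. The hash-based addressing function, the segment count, and the class and method names are our assumptions for illustration, not DERBY's actual interfaces.

    # Hypothetical sketch of DERBY-style two-level record addressing.
    NUM_SEGMENTS = 1024                    # assumed size of the segment space

    def segment_of(key):
        # Level 1: the addressing function maps a record key to a segment.
        return hash(key) % NUM_SEGMENTS

    class AddressTable:
        # Level 2: a possibly stale map from segment number to server id.
        def __init__(self):
            self.segment_to_server = {}

        def lookup(self, key):
            return self.segment_to_server[segment_of(key)]

        def correct(self, segment, server):
            # Applied whenever an addressing error reveals newer information.
            self.segment_to_server[segment] = server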
4 Algorithms
4.1 General Operation
In the simplest case (i.e., the case void of complications such as lost messages, concurrent accesses, migrated records, etc.) the DERBY storage system functions as follows. Clients desiring read access to a particular record locate the primary server for the record and transmit a query request. The server locates the desired record in its memory and returns it to the client, where the data is cached for future accesses. Clients desiring write access to a record begin by locating the primary server for the record. A write-lock request is issued by the client and granted by the server. Each time a client writes data item d to primary server S, it also multicasts d to all WUPS in SUP_S to insure persistence of the newly written record. The client considers the data written only after it has received an acknowledgment from S and from k of the workstations in SUP_S. Data item d is then asynchronously migrated to S's disk. Periodically S informs the SUP_S of committed data. Only after d reaches S's disk will the SUP_S discard d from their caches. Both k and n are tunable parameters that can be adjusted to ensure the desired level of reliability. In most cases k = n = 1 will suffice.
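A minimal sketch of this write path follows, assuming a simple synchronous messaging layer; send, wait_ack, and wait_ack_any are illustrative names of our own, not DERBY's actual API.

    # Hypothetical sketch of DERBY's k-of-n write acknowledgment rule.
    def write_record(client, d, S, sup_s, k):
        client.send(S, ("write", d))        # write to the primary server
        for ups in sup_s:                   # multicast d to all n WUPS
            client.send(ups, ("write", d))
        client.wait_ack(S)                  # the primary must acknowledge
        acks = 0
        while acks < k:                     # plus any k of the n WUPS
            client.wait_ack_any(sup_s)
            acks += 1
        # d is now short-term persistent; S migrates it to disk asynchronously.

Raising k (up to n) trades response time for reliability; as noted above, k = n = 1 suffices in most cases.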
4.2 Local Buffer Management
We define C_c and C_s as the amounts of memory allocated to client and server processes respectively. If the client requests additional storage space in excess of C_c, the space must be stolen from the server or allocated on disk. Similarly, when a server exceeds its allocated space it must either migrate data to another server or move it to disk. Several possible strategies exist for rearranging the memory allocation when the client or server requires additional memory. The decision depends largely on the cost of migrating data to another server and the cost of moving data to disk. As we show in section 6, migrating data to another server can be done much more quickly than moving data to disk. However, migration overhead imposes a load on the servers and the network and can increase a client's request response times. Thus, the system must be careful not to move the memory boundary too often. In particular, it should avoid reducing the server allocation to the point where load balancing would be required. Note that the memory required by the client has larger fluctuations over time than the memory requirements of the server. Consequently, our current algorithm allows clients to temporarily borrow unused space from C_s. This can be done at no cost. Beyond borrowing, our algorithm takes a simple approach based on page faults. If the client begins to experience a large number of page faults, we assume C_c is insufficient and move the boundary to provide the client with additional space. This may result in migration of some data from the server. However, it appears that such migrations resulting from boundary movements do not occur too often and thus only minimally affect clients' response time. As server usage increases, the algorithm examines the client page fault rate and the amount of free memory to determine whether C_s can be increased. Setting limits on the page fault rate avoids frequent oscillations between the client and the server. Within a region the system employs an LRU-k replacement algorithm to reclaim space [19]. We are currently investigating other alternatives to decide when to move the boundaries and by how much.
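One way to realize this page-fault-driven policy is sketched below. The high/low thresholds and step size are illustrative parameters of our own choosing; the paper does not specify values.

    # Hypothetical sketch of the Cc/Cs boundary adjustment heuristic.
    def adjust_boundary(node, high=100, low=10, step=4 * 2**20):
        # high/low are faults-per-second thresholds; step is bytes moved.
        # Using two thresholds (hysteresis) avoids frequent oscillation.
        rate = node.client_fault_rate()
        if rate > high:
            # Client is starved: grow Cc, which may force the server to
            # migrate some data to another server.
            node.Cc += step
            node.Cs -= step
            node.server.migrate_excess()
        elif rate < low and node.server_wants_memory():
            # Client is quiet: let the server reclaim the space.
            node.Cc -= step
            node.Cs += step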
4.3 Clients
When a database application submits a request for a record identified by a key K, the client first applies its addressing algorithm to determine the node address where the record resides and sends the request to that server. When a server receives a request for a record identified by K, it applies the same algorithm to K to determine whether the requested record is resident in its memory. If the record is in its memory, the server performs the requested operation and sends an acknowledgment (which might contain a record) to the client. If the record has been migrated and is no longer resident in the server's memory, it forwards the request to another server and then sends its addressing table to the client that committed the addressing error.

To guarantee sequential data consistency, every client must hold locks for the records it operates on. DERBY uses lock requests to obtain read and write locks (for efficiency, lock request messages can be combined with the corresponding read/write request messages). To limit the communication overhead, clients cache read-locks across transaction boundaries until the read-locks are revoked by a server or are no longer needed by the client (e.g., when the record is evicted from the client's cache). Clients do not cache write-locks since others have shown that write-lock caching often degrades system throughput [9]. Whenever a client receives a query request from a transaction, it first checks whether the requested record is resident in its local cache. If so, the query is served locally from the client's cache. If the record is write-locked by another local transaction, the query request is queued by the client until the write lock is released. Upon a cache miss, the record is fetched from a server and installed in the client's cache (possibly evicting some records as determined by LRU-k). Update, delete, and insert requests, on the other hand, cannot be satisfied locally since write-locks must be obtained from the server. Once a client obtains a write lock, it sends the requested operation both to the server and to the n WUPS servers. The transaction issuing the operation will not commit until acknowledgments from the server and k WUPS are received.
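The client-side query path described above can be summarized in a short sketch; the cache, address table, and fetch helpers are illustrative names under our assumptions.

    # Hypothetical sketch of the client query path with read-lock caching.
    def client_query(client, key):
        entry = client.cache.get(key)
        if entry is not None:                    # local cache hit
            if entry.write_locked_by_other_txn():
                client.wait_until_unlocked(key)  # queue behind local writer
            return entry.value                   # served under cached read-lock
        # Cache miss: fetch from the primary server, implicitly obtaining
        # a read-lock, and install the record (LRU-k may evict others).
        server = client.address_table.lookup(key)
        entry = client.fetch_with_read_lock(server, key)
        client.cache.install(key, entry)
        return entry.value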
4.4 Primary Servers
To provide high-speed read access, the server uses all its available memory space to store records. To support conventional database models that provide transactions, concurrent access, and other features, the underlying storage system must provide consistency and locking features. To ensure consistency and mutual exclusion for writes, the server maintains with each record a list of the clients caching the record as read-only and as write-locked. Read-locks are granted implicitly in response to a read request, assuming the client will read the record again in the near future. When a client wishes to modify a particular record, it must first obtain a write lock by sending a write-lock request to the server. When the server receives a write-lock request, it sends a release-lock request to all clients currently read-caching the record. If the record is currently write-locked by another client, the server enqueues the write-lock request and waits for the other client to unlock the record. If the record is not write-locked and all read-only copies have been released by clients, the write lock is granted to the client. When the server receives a read request, it checks whether a write lock or pending write exists for the record. If so, the client is placed on a read-waiting queue. Otherwise, the record is retrieved from memory and sent to the client, and the client is added to the list of read-lock holders for the record. When a server receives a write request, it verifies
that a write lock has already been granted and places the data in memory. It then immediately returns an acknowledgment to the client indicating that the record has been stored in the server's memory. Sometime after the acknowledgment has been sent to the client, the server transfers the newly written record to its disk (typically writing several records at a time). As soon as the record has successfully reached the disk, the server sends a message to the associated UPS servers informing them that they may remove the record from their memory because the record is now on disk. Although the server can delay writing data to the disk, it should attempt to transfer records to disk relatively soon after receiving the write request to avoid overflowing the UPS servers' memory.
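A compact sketch of this server-side write handling and of the lazy flush that releases the UPS copies follows; the daemon structure and helper names are our assumptions, not the authors' implementation.

    # Hypothetical sketch of server write handling with asynchronous flush.
    def server_write(server, client, key, value):
        assert server.write_lock_holder(key) == client  # lock granted earlier
        server.memory[key] = value
        server.ack(client)                  # acknowledge before any disk I/O
        server.dirty.add(key)               # queue the record for lazy flush

    def flush_daemon(server):
        while True:
            batch = server.take_dirty_batch()   # several records per write
            server.disk.write(batch)            # now long-term persistent
            for key in batch:
                for ups in server.sup_s:        # UPS copies are now redundant
                    server.send(ups, ("discard", key))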
5 Load Balancing in a Memory-based System
Load imbalances are a potential problem in any distributed system and may require redistribution of load in order to achieve optimal performance. Conventional load balancing systems try to equitably redistribute the processing load. Memory-based systems differ from conventional distributed systems in that both the memory space and the processing power (CPU speed) are limited resources. That is, server load consists of two factors: the number of read/write operations per second executed at a server, and the amount of data stored in a server's memory. Given the dynamic nature of a system where machines may be clients, servers, neither, or both at any given time, the system must be able to quickly and effectively react to dynamic changes in workstation roles. In addition, because the access patterns of clients change over time, the distribution of load may become uneven, causing some servers to be heavily loaded while others are very lightly loaded. Because remote memory access times are so fast in a memory-based system, load balancing overhead can easily become a dominant cost.

Achieving a distribution of records that optimizes for both space and processing load is a difficult problem. In addition, both the load imposed by clients and the resources available to servers are dynamically changing. Obtaining an optimal load distribution is simply not practical. In light of this fact, we modified the algorithm presented in [25], which has been shown to approximate the optimal distribution with acceptable accuracy. Our initial algorithm considered processing load to be a more important factor than memory usage. Thus the algorithm attempted to optimize processing load constrained by (but not otherwise influenced by) memory space availability. That is, the load balancing algorithm tries to find the first acceptable distribution of processing load that does not exceed a predefined percentage (mem_threshold) of the available memory space on any machine. The mem_threshold percentage prevents thrashing. The algorithm also migrated data when memory capacity was reached.
5.1 Saturation Prevention vs. Load Balancing
During initial testing of our load balancing algorithm, we noticed that the response time of heavily loaded servers was not significantly reduced after offloading portions of their load (i.e., after load balancing). As a result, our load balancing algorithm produced slightly higher average response times than when no balancing occurred (see Figure 3). After further analysis, it became clear that the maximum possible performance improvements were not significant enough to justify load balancing.

Load balancing in a memory-based storage system plays a fundamentally different role than load balancing in a disk-based storage system. In particular, disk-based storage systems can achieve a significant performance boost by dynamically balancing the load among servers. However, this is not true of memory-based systems. As we show in the following, the performance difference between a heavily loaded memory server and a lightly loaded memory server is insignificant. Consequently, memory-based systems only need to consider reassigning or migrating load when a limited resource such as memory or the CPU is near saturation. We call this saturation prevention.

To understand this difference, consider the time required for a server to process a request. Disk access times dominate costs in a disk-based system, resulting in processing times on the order of tens of milliseconds. In a memory-based system, server processing time is limited only by memory speeds and is typically on the order of microseconds. Also note that network communication costs are (at least) an order of magnitude less than disk access times and (at least) an order of magnitude more than memory access times. As a result, the roundtrip time to send a request and receive the response is dominated by server processing time in a disk-based model, whereas network latency is the dominant cost in a memory-based model. Consequently, the difference in performance between a heavily loaded disk-based server and a lightly loaded disk-based server can be significant (as much as a factor of 20, or 2000% [26]). In this case, load balancing can yield significant performance improvements. However, the difference in performance between a heavily loaded memory-based server and a lightly loaded memory-based server is relatively insignificant (our results show that a lightly loaded server with a processing load of 22% has an average response time of 1.17 ms, while a heavily loaded server with a processing load of 80% has an average response time of 1.57 ms, a factor of only 1.34, or 34%). As a result, even the most efficient load balancing algorithm (one that produces optimal balance with minimal overhead) will, in the best case, produce insignificant improvements in performance.
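To make this concrete, consider a rough roundtrip model with illustrative figures consistent with the orders of magnitude above (these numbers are ours, not measurements from the paper):

    roundtrip ~ 2 x (network latency) + (server processing time)

    disk-based:   2 x 0.1 ms + 10 ms   ~ 10.2 ms  (server time dominates)
    memory-based: 2 x 0.1 ms + 0.01 ms ~ 0.21 ms  (network dominates)

In the disk-based case, halving the server processing time through load balancing nearly halves the roundtrip; in the memory-based case, eliminating server processing time entirely saves under 5%.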
5.2 Saturation Prevention
Although an imbalanced load does not affect memory-based systems as dramatically as disk-based systems, a memory-based system should still incorporate a load migration mechanism to avoid saturation. When a resource becomes saturated, performance can degrade quickly. Saturation can occur for a variety of reasons. First, users may return to their workstation at any time and demand use of their previously idle resources. To maintain acceptable performance in this situation, the system must migrate the records to another machine. Second, as workloads change, the distribution of keys may become skewed so that a server no longer has sufficient memory space to hold the records it has been assigned. The server must either migrate records to another server or store them on the local disk. Given the high cost of accessing a disk, migrating the data to another server typically results in better performance than swapping data to the local disk. Finally, if the rate of requests being sent to a server exceeds the server's CPU capacity (i.e., the maximum number of requests per second that the server can process), queueing delays can quickly become significant and may warrant migrating records to another server. All of these cases describe a situation where the load exceeds the capacity of some server resource. Whenever a server resource becomes saturated, performance can degrade rapidly. To avoid this situation, the system should migrate load from one server to another before the server becomes saturated. That is, the system should try to adjust the load across servers to prevent any server from becoming saturated. This has similarities to the goals of load balancing systems. However, saturation prevention does not necessarily need to "balance" the load. A saturation prevention algorithm is successful as long as it migrates the load before saturation occurs and keeps the overhead and the amount of time spent migrating data to a minimum. These requirements may be met without achieving a balanced load.
5.3 The Saturation Prevention Algorithm
As we have shown, the load (either processing load or memory usage) on a server has relatively little effect on the server's performance unless the load saturates either the processor or the memory storage capacity. Consequently, our system employs a saturation prevention algorithm rather than a load balancing algorithm. The saturation prevention algorithm uses a processing load saturation point and a data storage saturation point. We define these points as the points where performance begins to degrade rapidly. The data storage saturation point is easily defined as the size of the memory; any additional data implies disk accesses, which degrade performance quickly. The processing load saturation point is not as clearly identified; however, all our tests show that it is greater than 100% CPU load, in many cases significantly higher. Simply stated, the saturation prevention algorithm attempts to keep each server from reaching its saturation points. To evaluate the best time for saturation prevention to occur, we incorporated a danger level parameter into our algorithm that defines how close to saturation we will allow a resource to come before migrating some portion of the server's load. Whenever a server exceeds the danger level, the load redistribution algorithm described below is invoked. The algorithm also monitors the system for sustained processor loads of more than 100% (but less than the processing load saturation point), as this can eventually lead to significant performance degradation.

Unlike our early algorithm, the current redistribution algorithm considers memory usage (not processing load) to be the primary factor influencing load redistribution. There are several reasons for this. First, performance plummets as soon as disk access becomes necessary, whereas performance degrades at a much slower rate when the processing load of a server exceeds 100%. Second, the hashing function used to address records tends to evenly distribute the processing load among available servers so that, with as few as 20% of the machines acting as servers, the processing load rarely rises above 100%. Even at levels slightly above 100%, response time does not degrade rapidly. Third, although a system consisting of as many as 80% clients may have difficulty saturating the servers' processing capacity, the active portion of a large database can easily be large enough to stress the large but limited aggregate memory capacity of the workstations acting as servers.

Given memory space as the primary factor, the redistribution algorithm takes a sender-initiated approach that attempts to equitably distribute a server's excess data to underutilized servers. To keep the size of addressing tables in check, the algorithm applies bucket splitting similar to that used in [25], but uses the size of the bucket rather than the load on the bucket as the determining factor. The algorithm also incorporates a check that keeps lightly loaded servers from trying to migrate even if they are above the danger level. Servers above the danger level may migrate data to other servers above the danger level provided the difference in their loads is above some threshold delta. If a server remains above the danger level after migrating, the algorithm employs an exponential backoff before trying to migrate again to avoid thrashing.
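The decision logic just described might look roughly as follows; danger, delta, and the helper methods are illustrative names for the parameters in the text, and this is a sketch under our assumptions, not the authors' implementation.

    # Hypothetical sketch of the sender-initiated saturation prevention check.
    def maybe_migrate(server, peers, danger=0.90, delta=0.20):
        if server.mem_usage() < danger or server.in_backoff():
            return                        # not near saturation, or backing off
        # Prefer peers below the danger level; a peer above it qualifies only
        # if it is at least `delta` less loaded than this server.
        targets = [p for p in peers
                   if p.mem_usage() < danger
                   or server.mem_usage() - p.mem_usage() > delta]
        if not targets:
            return
        # Split by bucket *size* (not access load) to bound addressing tables.
        bucket = server.split_largest_bucket()
        server.migrate(bucket, min(targets, key=lambda p: p.mem_usage()))
        if server.mem_usage() >= danger:
            server.start_exponential_backoff()  # avoid migration thrashing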
6 Simulation Results
To evaluate the performance of our memory-based storage architecture, we constructed a simulator (implemented using CSIM) that models a network of workstations where each machine functions as either a client or a server. In the simulation, each machine has a processor speed of 100 MIPS and 64 MB of memory for storage of data. The machines are interconnected by an ATM network with 155 Mbps links between the switch and each machine and a latency of 100 µs. For all the experiments described here, we hardwired each machine as either a client or a server. All simulations were run with 90 client machines and anywhere from 20 to 70 server machines. The clients generate a workload consisting of 90% queries and 10% inserts. We have used other distributions of query, update, insert, and delete operations and obtained similar results. Each client generates 200 10 KB requests per second (i.e., 2 MB/s), which is comparable with other main memory database systems [24].
6.1 Memory Based Performance
Overall, the average access times are substantially better than those of disk-based systems [25]. Table 1 shows the average response time for query and insert operations without load balancing. With as few as 20 machines (18% of the total) acting as servers, the average query response times are only a factor of 1.6 higher than the minimal network roundtrip time, and at least one order of magnitude better than the best performance of disk-based systems [26]. With 40 servers (31%), performance is within a factor of 1.3 of the minimum network roundtrip time. Additional servers provided little improvement. Thus DERBY was capable of delivering reasonably high performance with as few as 18% of the total machines acting as servers, which is well below the measured fraction of idle machines in real distributed systems [8].

                    number of servers
    op. type    20    30    40    50    60    70
    query      1.57  1.31  1.23  1.20  1.18  1.16
    insert     1.90  1.62  1.56  1.53  1.50  1.49
    Table 1: Request response times (in ms).

Finally, note the difference in performance between a lightly loaded server (i.e., the 70 server case) and a heavily loaded server (i.e., the 20 server case). Unlike disk-based systems, where heavily loaded servers execute several times slower than lightly loaded servers, a heavily loaded server executes only 1.35 times slower than a lightly loaded server under our system.
6.2 Load Imbalances
To determine the effect of load balancing on performance, we measured the average response time of a system that uses load balancing against a system that does not. For the experiments shown, all servers had sufficient memory to store all data. Consequently, load balancing occurred only as a result of processing load imbalances. A server invokes the load balancing algorithm when it exceeds a processing load threshold. We used 40%, 60%, 80%, and 100% as the threshold values.

Figure 3: Average query response time (in ms) with and without load balancing, for 20, 40, and 60 servers at migration thresholds of 40%, 60%, 80%, and 100% (W/O = never).

Query response times with and without load balancing are shown in Figure 3. Given our configuration of 90 clients transmitting at a rate of 200 requests per second, the clients were able to almost saturate 20 servers. Consequently, the 20 server case resulted in thrashing for thresholds less than 100%. In the 40 and 60 server cases, servers never reached 80% of their processing capacity, so no migrations occurred at the 80 and 100 percent settings. For each run, we recorded the average response time of requests issued during migration periods vs. the average response time of requests issued when no migrations were occurring. Table 2 lists the average (and maximum) response times of requests issued during a migration interval vs. requests issued when no migration was occurring.

    Num Servers       20 (100%)  40 (60%)  60 (60%)
    Avg Time   MG        1.67      1.35      1.23
               NMG       1.49      1.24      1.18
    Max Time   MG       13.40      6.67      9.71
               NMG       8.31      4.07      4.05
    Table 2: The effect of load balancing on response time. Average and maximum query response times measured while migration was occurring (MG), as opposed to periods of no migration (NMG), are shown (in ms). The number in parentheses indicates the point at which load balancing was triggered.

We draw two observations from the results. First, the overhead of load balancing has relatively little effect on response time during a load balancing interval; the average response time increases by less than 12% (see Table 2). Second, although the overhead of load balancing is minimal, the performance gains are not significant enough to offset the overhead. The average response time during non-migration intervals is at best 5% faster than the average access time of a system that never migrates, which does not offset the 12% overhead. In all cases, not load balancing resulted in better performance than load balancing.

6.2.1 Preventing Saturation
The previous section shows that load balancing is not cost-effective in our simulation environment. However, as discussed in section 5, the system must redistribute the load when any of the server's resources becomes saturated. To evaluate the best time for saturation prevention to occur, we ran the simulator using danger levels ranging from 40% to 99%. To make sure saturation occurs, we set the number of inserts so that 95% of the aggregate memory capacity of the system is required to store all the records. Consequently, saturation prevention occurred in all runs. Lower danger levels caused more frequent load migrations but typically moved smaller amounts of data (records) with each migration. Higher danger levels resulted in fewer migrations but moved significantly larger amounts of data. Table 3 shows the average response time in each case. The results indicate that the response time is not significantly affected by the danger level setting. In all but the 20 server case, the best response times occur when prevention is delayed until the last possible moment (99%). Lower settings increase the number of migrations and add to the overhead but do little to improve access times.

                 Danger level that causes migration
    Num Servers    40%   60%   80%   90%   99%
        20        1.59  1.59  1.56  1.60  1.60
        40        1.26  1.26  1.25  1.25  1.24
        60        1.21  1.21  1.21  1.20  1.18
    Table 3: The average response time (in ms) when migration occurs at various danger levels.
7 Conclusions
In this paper we have presented a memory-based storage system designed to capitalize on the massive computing power and memory storage capacity of existing networks of general purpose workstations. The system removes disk accesses from the critical path via a small number of UPS units and a commit algorithm with a tunable paranoia parameter. We show that a main memory storage management system is fundamentally different from its disk-based counterpart. In addition to the need for new protocols to insure reliable storage of data, we show that load balancing, which is crucial to a disk-based system, has little impact on the performance of memory-based systems. Instead, a saturation prevention algorithm is required. Finally, we show that memory-based systems can deliver performance more than one order of magnitude better than that of disk-based systems with as few as 20% of the machines being idle.
References
[1] Brian Bershad, Matthew Zekauskas, and Wayne Sawdon. The Midway Distributed Shared Memory System. In Proceedings of the IEEE CompCon Conference, 1993.
[2] Y. Breitbart, R. Vingralek, and G. Weikum. Load Control in Scalable Distributed File Structures. Technical Report 254-95, Dept. of Computer Science, Univ. of Kentucky, February 1995.
[3] John B. Carter, John K. Bennett, and Willy Zwaenepoel. Implementation and Performance of Munin. In Proceedings of the 13th ACM Symposium on Operating Systems Principles, pages 152-164. ACM SIGOPS, October 1991.
[4] G. Delp. The Architecture and Implementation of MemNet: A High Speed Shared Memory Computer Communication Network. PhD thesis, Dept. of Computer Science, Univ. of Delaware, 1988.
[5] D. DeWitt, R. Katz, F. Olken, L. Shapiro, M. Stonebraker, and D. Wood. Implementation Techniques for Main Memory Database Systems. In Proceedings of the SIGMOD Conference, 1984.
[6] F. Douglis and J. Ousterhout. Process Migration in the Sprite Operating System. In Proceedings of the 7th International Conference on Distributed Computing Systems, pages 18-25, 1987.
[7] D. L. Eager, E. D. Lazowska, and J. Zahorjan. Adaptive Load Sharing in Homogeneous Distributed Systems. IEEE Transactions on Software Engineering, 12(5):662-675, May 1986.
[8] K. Efe and V. Krishnamoorthy. Optimal Scheduling of Compute-Intensive Tasks on a Network of Workstations. IEEE Transactions on Parallel and Distributed Systems, 6(6):668-673, 1995.
[9] M. Franklin and M. Carey. Client-Server Caching Revisited. In Proceedings of the International Workshop on Distributed Object Management, August 1992.
[10] M. Franklin, M. Carey, and M. Livny. Global Memory Management in Client-Server DBMS Architectures. In Proceedings of the VLDB Conference, 1992.
[11] H. Garcia-Molina and K. Salem. Main Memory Database Systems. IEEE Transactions on Knowledge and Data Engineering, 4(6):509-516, December 1992.
[12] D. Gawlick and D. Kinkade. Varieties of Concurrency Control in IMS/VS Fast Path. Data Engineering Bulletin, 8(2):3-10, 1985.
[13] N. Gehani, D. Lieuwen, and S. Sudarshan. MMode: A Main Memory Database System. 1994.
[14] T. Lehman and M. Carey. A Study of Index Structures for Main Memory Database Management Systems. In Proceedings of the VLDB Conference, 1986.
[15] T. Lehman and M. Carey. A Recovery Algorithm for a High-Performance Memory-Resident Database System. In Proceedings of the ACM SIGMOD Conference, pages 104-117, 1987.
[16] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, 7(4):321-359, November 1989.
[17] B. Liskov and L. Shrira. Escaping the Disk Bottleneck in Fast Transaction Processing. In Proceedings of the Third Workshop on Workstation Operating Systems, 1992.
[18] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor - A Hunter of Idle Workstations. In Proceedings of the 8th International Conference on Distributed Computing Systems (ICDCS), San Jose, 1988.
[19] E. O'Neil, P. O'Neil, and G. Weikum. The LRU-K Page Replacement Algorithm for Database Disk Buffering. In Proceedings of the ACM SIGMOD Conference, 1993.
[20] J. Ousterhout, A. Cherenson, F. Douglis, M. Nelson, and B. Welch. The Sprite Network Operating System. Computer, pages 23-36, February 1988.
[21] Umakishore Ramachandran and M. Yousef A. Khalidi. An Implementation of Distributed Shared Memory. Technical Report GIT-ICS-88/50, School of Information and Computer Science, Georgia Institute of Technology, December 1988.
[22] S. Akyurek and K. Salem. Management of Partially Safe Buffers. IEEE Transactions on Computers, 44(3):394-407, 1995.
[23] K. Salem and H. Garcia-Molina. System M: A Transaction Processing Testbed for Memory Resident Data. IEEE Transactions on Knowledge and Data Engineering, March 1990.
[24] H. V. Jagadish, D. Lieuwen, R. Rastogi, A. Silberschatz, and S. Sudarshan. Dali: A High Performance Main Memory Storage Manager. In Proceedings of the VLDB Conference, 1994.
[25] R. Vingralek, Y. Breitbart, and G. Weikum. Distributed File Organization with Scalable Cost/Performance. In Proceedings of the ACM SIGMOD Conference, 1994.
[26] R. Vingralek, Y. Breitbart, and G. Weikum. SNOWBALL: Scalable Storage on Networks of Workstations with Balanced Load. Technical report, Dept. of Computer Science, Univ. of Kentucky, 1995.