A Continuously Available and Highly Scalable Transaction Server: Design Experience from the HypRa Project (1)
Svein O. Hvasshovd (2), Tore Sæter (2), Øystein Torbjørnsen (3), Petter Moe (3), Oddvar Risnes (2)

Abstract

HypRa is a multiprocessor database server designed to meet very high requirements for continuous service availability and scalability. The continuous service availability is ensured through carefully designed failure detection, fault masking, recovery, and repair mechanisms. The paper focuses on the scheduling of different types of transactions when the goal is to offer continuous availability of simple transaction services. HypRa uses a combination of multiple data replicas, distributed location transparent locking and logging, and fuzzy mechanisms for synchronizing the different replicas. This is shown to enable a scheduling which allows non-simple transactions to run in parallel with continuously available simple transactions. Non-simple transactions are: complex, DB maintenance, reconfiguration, and server self-repair transactions. The paper also describes how scalability is implemented in HypRa by showing how basic resources are organized and utilized, and by showing how simple and complex transactions scale. The scalability of the debit/credit and the joinABprime benchmarks is shown to be linear. A comparison of the HypRa design with other database servers and systems is also given.
1 Introduction

HypRa is a multiprocessor SQL database server designed to meet requirements set by the most demanding application areas like telecommunications and high-end transaction systems. HypRa is specified as a fault tolerant system ([18]). In this context, HypRa may be viewed as a server composed of servers, where basic servers are termed resources. The HypRa database server offers a service specified by:

(1) Supported by the Norwegian Telecom and the Royal Norwegian Council for Scientific and Industrial Research
(2) Database Technology Group, SINTEF DELAB, N-7034 Trondheim, Norway
(3) Division of Computer Systems and Telematics, Norwegian Institute of Technology, N-7034 Trondheim, Norway
1. Service semantics divided into standard and failure semantics according to the ISO-SQL ([32]) standard.
2. Performance specifications divided into response time and workload.
3. Stochastic specification given by the mission time, the service availability, the maximum service unavailability, and the disaster concept.

The HypRa development started out with a customer requirement for a fault tolerant transaction server with open interfaces. In detail the requirements were: an ISO-SQL interface, performance specifications according to the debit/credit benchmark ([24]) when scaled to 1000 transactions per second, and a maximum unavailability of 1 hour during a 30 year mission time. A HypRa feasibility study documents that a 64 node server can offer these capabilities ([13]). These requirements define what we term a continuously available SQL database server.

The performance and stochastic specifications together give the server's capabilities. A server is scalable if the server's capabilities can be enhanced by adding more resources to the server while keeping its service semantics. A server is linearly scalable if there is a linear relationship between the capability enhancement and the amount of added resources of the limiting resource(s). A server is online scalable if the scaling is done without interrupting the service availability. The scalability is of fine granularity if resources can be added in small portions relative to the current total amount of server resources. A server has:

1. Scaleup ([22]) if it can handle an increased workload by adding server resources (the response time and service availability are fixed).
2. Speedup ([22]) if it can reduce the response time of requests by adding server resources (the workload and service availability are fixed).
3. Scalable availability if the service availability can be increased by adding server resources (workload and response time are fixed).

The workload on a server is a mix of basic types of workloads like:

1. Database volume.
2. Number of simultaneously active database users.
3. Simple transaction (ST) volume (STs/second).
4. Non-simple transaction volume (complex transactions (CTs), database maintenance transactions (DMTs), self-repair transactions (SRTs), and configuration management transactions (CMTs) per second).

No distinction is made between ad-hoc queries and batch queries; both types of queries are regarded as operations in complex transactions. STs and CTs are user induced workload, DMTs and CMTs are requested by the operating organisation, and SRTs are caused by the failure intensity of the server resources. Database benchmarks ([24], [9]) are concrete examples of such workload mixes.

A server can be scaled by replacing server resources. This can be done by adding server resources or by upgrading server resources. The main units of replacement in HypRa are the
node boards and the individual discs. These are fine granularity replacements and can be performed online by the customer.

This paper presents the overall HypRa design and then focuses on the design decisions mainly affecting the service availability and the server scaleup. Service availability and scaleup are emphasized because the primary goal of the HypRa development has been to build a continuously available server for simple transactions. The HypRa workload will be dominated by the simple transaction volume and the self-repair transactions; it is assumed that the normal workload (simple transactions) and the failure induced workload (self-repair transactions) will dominate during normal operations.

From our own previous DBMS design experience and from experience with other DBMSs, the following bottlenecks are known to prevent current DBMS servers from giving continuous database service and from being scalable.

Points threatening the service availability:

1. A hardware architecture with a single point of failure and shared components like CPU, bus, or memory.
2. Offline hardware and software maintenance routines (replacing hardware, installing new software releases).
3. Lack of online failure masking, online fault recovery, and online self-repair for hardware and software failures.
4. Scheduling unable to handle a mix of different types of transactions while giving continuous simple transaction availability.

Points threatening the scalability:

1. Bottlenecks in basic services like processing power, memory allocation, persistent storage, internal communication, and external communication.
2. Bottlenecks in DBMS functions like transaction processing, data locality, data processing, transaction coordination, resource locking, lock distribution, and relational algebra.
2 Overall HypRa design

Given the HypRa design requirements, it was decided to use the following implementation mechanisms:

1. A homogeneous and balanced architecture (hardware and software) to achieve fault tolerance and scalability, to reduce complexity, and to avoid bottlenecks.
2. Coarse grained, asynchronous, shared nothing parallelism based on message passing to achieve performance, scalability, isolation, and the basic redundancy needed for fault tolerance ([5]).
3. Failure detection in hardware; fault masking, recovery, and self-repair in software ([29]).
Figure 1: The HypRa database client/server architecture. The server is drawn as a simplified 3D-hypercube. In the shown configuration a communication co-processor is connected to each node. (Figure labels: application platform, application, SQL, HypRa client, RDA, HypRa server.)
4. Implicit load distribution through hash based data distribution ([10], [16], [11], [19]).
5. Internal use of a nested transaction mechanism to ensure consistency and correctness in the internal resource management ([28], [26]).

HypRa fits into the server role in a typical client/server architecture (Figure 1). The multiprocessor is interconnected in a hypercube topology. In the current design, each node in the multiprocessor has an Intel i486 CPU, 16 MByte DRAM, up to four SCSI bus I/O channels, and discs. The neighbour communication in the hypercube is implemented either with dual ported mail box memory or with FIFOs interconnected with fiber optics ([4]). Figure 2 shows the depends-on relations ([18]) of the HypRa hardware architecture.

The HypRa DBMS is a homogeneous distributed database management system with one DBMS per node. All database tables are distributed over all nodes using hash-based horizontal fragmentation. A table fragment has a primary replica and one or more loosely synchronized hot-standby replicas. Table fragment replicas with their primary index, locks, and log are clustered, i.e. stored at the same node. The transaction processing uses inter- and intra-transaction parallelism. The node OS and communication SW are special purpose and adapted to the HypRa DBMS. Figure 2 shows the depends-on relations of the HypRa software architecture.
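To make the hash-based data distribution concrete, the following is a minimal sketch in Python. The fragment count, the eight-node active set, and the neighbour-based (chained-declustering-style) replica placement are illustrative assumptions; this is not HypRa's actual distribution code.

```python
from zlib import crc32

def fragment_of(key, num_fragments):
    """Map a tuple key to a table fragment by hashing the key."""
    return crc32(str(key).encode()) % num_fragments

def replica_nodes(fragment_id, nodes):
    """Place the primary replica on one node and a hot standby replica on its
    neighbour in the active node set (a chained-declustering-style layout)."""
    primary = nodes[fragment_id % len(nodes)]
    hot_standby = nodes[(fragment_id + 1) % len(nodes)]
    return primary, hot_standby

if __name__ == "__main__":
    nodes = list(range(8))      # assumed 8-node active set
    num_fragments = 32          # assumed number of fragments per table
    frag = fragment_of("account-4711", num_fragments)
    print("fragment", frag, "replicas on nodes", replica_nodes(frag, nodes))
```

Because every table is hashed over all nodes in the same way, an even key distribution gives an even load distribution, which is what the implicit load distribution in point 4 relies on.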
3 Continuous Service Availability

The availability of simple transaction services is threatened from a number of sources, both during normal operations and as an effect of hardware and software failures. Failures reduce the processing and communication capacity, and can cause table fragment unavailability. For a presentation of the fault tolerance aspects of HypRa, see [29]. During normal operations the continuous availability of simple transaction services is threatened by the mix of simple and non-simple transactions. Mechanisms that solve the conflict between data availability and consistency ([23], [35]) in this context are given in this paper.
Figure 2: The depends-on relations of the HypRa database server hardware and software architecture. The figure shows how hardware and software servers are organized into server groups and the depends-on relation between servers. Also shown are the units of online replacement, and whether a replacement can be done by a customer or by a field engineer. Note that the software servers called DBMS and Com make up two global server groups (illustrated by the ring of nodes making up a server group). The servers are: power supply units (P), cooling fans (F), node boards (B), central processing units (CPU), main memories (Mem), neighbour communication channels (NC), disc channels (D), cabling units (C), operating system servers (OS), communication software servers (Com), and node DBMS servers (DBMS).
This section concentrates on the HypRa design decisions taken to fulfill the simple transaction service availability for different workloads. More details are given in sections 3.1 to 3.5. In short the workloads are:

1. A mix of simple and complex transactions. Simple and complex transactions accessing the same data items may threaten the service availability of simple transactions due to the long lifetime and numerous data accesses of the complex transactions.

2. A mix of simple and DB maintenance transactions. Online DB maintenance transactions may increase both the response time and the abortion rate of the simple transactions due to the DMTs' coarse granularity locking and their possibly massive abortion of simple transactions.

3. A mix of simple and configuration transactions. A configuration management transaction is part of planned hardware and/or software maintenance. The scheduling of online configuration transactions must avoid massive abortion of simple transactions active on the servers involved.

4. A mix of simple and DBMS group recovery and self-repair transactions. A DBMS group recovery transaction recovers the consistent state of the DBMS server group and data availability after a hardware or software server failure. Online self-repair transactions reestablish the initial fault tolerance level after a failure and may represent a similar threat as configuration transactions. Group recovery may delay and cause massive abortion of simple transactions active at a failed server.
3.1 A Mix of Simple and Complex Transactions

The major threat to service availability during normal operation is incompatible accesses to the same data items by concurrent transactions. This may cause unacceptably long transaction response times or unacceptable transaction abortion rates. HypRa handles these threats by multi granularity, semantically rich lock types and by multiple replica based scheduling.

Simple transactions have a short lifetime, require access to few tuples, and their concurrent number is very high. To avoid conflicts among simple transactions, a finer-than-tuple locking granularity is provided so that each transaction only locks what is logically required. To enhance concurrent access to hot-spot data items, a delta lock type equivalent to increment and decrement locks is supported ([8]). HypRa uses strict, well-formed two-phase locking among simple transactions. HypRa uses a loosely synchronised primary/hot standby transaction execution scheme to maintain the hot standby replicas. In a distributed locking scheme, finer-than-tuple locking is favourable to tuple locking due to lower message overhead in the lock distribution.

Complex transactions have different characteristics from simple transactions. They have a longer lifetime and require read access to a high number of tuples, but their concurrent number is low. If a complex transaction is allowed access to the same data item replicas as simple transactions, this may cause significant delays for several simple transactions. To reduce the read versus write conflict between complex and simple transactions, read accesses of CTs can be issued on a hot standby replica if so specified in the dictionary. If a hot standby replica can be read, the SQL compiler decides if an operation is executed on the hot standby or the primary replica.

Simple and complex transactions are globally serialised. To allow complex transactions to read from hot standby replicas, a particular scheduling scheme has been developed to support globally serialised execution of operations on primary replicas with read operations performed on hot standby replicas. Primary locks, i.e. locks on primary replica data items, are downgraded to retainer locks at transaction commit time. The retainer locks are removed when the hot standby operations are completed, i.e. when the transaction has been executed on the hot standby replicas. Transactions accessing only the primary replica of a fragment do not pay attention to retainer locks. Transactions reading hot standby replicas perform conflict analysis on both primary and retainer locks when accessing primary replicas.

The scheduling of hot standby operations preserves, per tuple, the execution sequence from the primary replica execution. Hot standby write and delta locks are set by the hot standby operations. Hot standby locks do not conflict with each other since the sequence of operations is already serialised. Read locks may however conflict with hot standby write and delta locks. Hot standby operations of an active simple transaction may be allowed to precede operations by a complex transaction to resolve a deadlock if this preserves global serialisability; otherwise the complex transaction is aborted.

HypRa also supports non-serialisable table read operations (fuzzy read). A read of this type will only latch a block while reading its tuples. This does not require multiple table replicas.
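The asymmetric treatment of retainer locks can be illustrated with a small sketch. The lock classes, the simplified conflict test, and the function names are assumptions made for illustration; the point is only that primary-only transactions ignore retainer locks while hot standby readers check them.

```python
# Sketch of the retainer-lock rule: primary-only transactions ignore retainer
# locks, transactions that also read hot standby replicas do not.
PRIMARY, RETAINER = "primary", "retainer"

def conflicts(mode_a, mode_b):
    """Simplified conflict test: any pair of modes involving a write conflicts."""
    return "write" in (mode_a, mode_b)

def can_access(item_locks, requested_mode, reads_hot_standby):
    """item_locks: list of (lock_class, mode) currently held on the data item."""
    for lock_class, mode in item_locks:
        if lock_class == RETAINER and not reads_hot_standby:
            continue  # primary-only transactions pay no attention to retainer locks
        if conflicts(mode, requested_mode):
            return False
    return True

held = [(RETAINER, "write")]  # a committed transaction awaiting hot standby completion
assert can_access(held, "write", reads_hot_standby=False)      # simple transaction proceeds
assert not can_access(held, "read", reads_hot_standby=True)    # hot-standby reader must wait
```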
3.2 A Mix of Simple and DB Maintenance Transactions

A DB maintenance transaction is characterised by: one or more dictionary update operations (DDL [32]); long lifetime; infrequent execution; and usually read access to numerous tuples. Dictionary tables are horizontally fragmented and fully replicated with one primary and multiple hot standby read replicas. If strict synchronisation among dictionary replicas is used, DB maintenance transactions can cause massive abortion of simple transactions. They can also cause similar response time delays as complex transactions if executed on the same table replicas as simple transactions. HypRa avoids these situations by using a particular scheduling strategy for DDL operations. This scheduling ensures continuous availability of data and dictionary replicas to simple transactions at the expense of DB maintenance transaction response time. If required, DDL operations can be executed with ordinary priority to gain lower response times at the expense of simple transaction service availability.

The scheduling of DB maintenance transactions views the hot standby replicas of a dictionary fragment as belonging to one of two replica regions. The active node set of a HypRa server is divided into two node regions. A replica region coincides with a node region. A transaction coordinator reads dictionary data from hot standby replicas within one node region. DDL operations are executed on primary dictionary replicas during the transaction, and node region serialised ([35]) on the hot standby replicas during the commit processing.

HypRa supports precompiled queries and transactions. Recompilation of the primary query or transaction replica is done as part of a DB maintenance transaction. Installation of the recompiled query or transaction into the hot standby replicas is included in the commit processing. This prevents simple transactions from being blocked while waiting for recompilation, as would be the case if on-the-fly recompilation were used ([17], [31]).

Some DDL operations will produce a derived DB object, e.g. a secondary index. Traditionally, the consistency of the derived DB object is guaranteed by locking the source for the derived object, e.g. the indexed table ([31]). In HypRa a general non-blocking fuzzy production mechanism has been developed for derived objects which allows access to the source object while the derived object is created. This mechanism also maintains the consistency of the derived object until all transactions become aware of its existence.
3.3 A General Fuzzy Non-blocking Production Mechanism

Figure 3: The general fuzzy production mechanism: the different activities (fuzzy read, log read, lock snapshot, update channel operations, and ordinary transaction operations) are partially executed in parallel between the begin-fuzzy-production log mark and the removal of the update channel.

This mechanism is used for production of derived DB objects from a table ([26]) and is based on similar principles as fuzzy snapshots ([8]). A create index example of use
is given in section 3.3.2. The fuzzy production mechanism includes a combination of a fuzzy snapshot, a lock snapshot, and an update channel.

The fuzzy snapshot is produced by (see Figure 3):

1. Setting a begin-fuzzy-production log mark.
2. Doing a fuzzy read of the involved table, and generating the derived object using a relational algebra operation ([15]).
3. Reading the table's log previous to the begin-fuzzy-production log mark concurrently with the fuzzy table read, and producing a derived log using an algebra operation. The derived log must be able to handle node and transaction failures related to the derived object.
4. Setting an end-fuzzy-production log mark when both the derived object and the derived log are produced.

A lock snapshot produces a derived copy of the locks relating to the table at the begin-fuzzy-production log mark point (see Figure 3). These are the initial locks for the derived object. Since locks are not logged, the copy must be consistent. The lock snapshot is produced concurrently with the fuzzy snapshot. The locks set on the derived object are of the primary replica lock type, i.e. the derived object has primary status.

An update channel is established between the table and the derived object. Its function is to transmit the state changes made to the table which are of relevance to the derived object, from the begin-fuzzy-production log mark onwards (see Figure 3). This is done by sending the log records produced after the begin-fuzzy-production log mark so that they can be applied to the derived object if needed. The received log records constitute the derived log. The received log is self-contained so that transaction and node failures can be handled locally to the derived object.

When the fuzzy snapshot is established, the operations in the derived log can be applied on the derived object. A tuple identifier and a state identifier are used to resolve the operations needed on the derived object. Each tuple contains a tuple identifier and a state identifier, and a derived tuple inherits both properties from the original tuple. The state identifier identifies which state-changing operations have been applied to the tuple and is used to determine if an operation must be executed on the derived object. The operation is not applied if the state identifier of the operation is less than or equal to the state identifier of the derived tuple.

A current image of the derived object exists when the update channel is empty at the receiver side. The receiver must have additional capacity compared to the sender to be able to reach a current image. The fuzzy production strategy requires node independent log records and locks. When a current image of the derived object has been produced, the derived object is opened for ordinary transaction processing in addition to processing of the operations sent through the update channel.

A version number is included in the dictionary representation of every table and in every operation. A compensation mechanism for the transactions based on the original dictionary version uses the update channel. If a state-changing operation is based on the original dictionary version and is thus performed on the source object, then a derived log record is sent through the update channel to reflect the remaining update on the derived object. The update channel is kept until all transactions that were active at the commit time of the DB maintenance transaction, at every node, are terminated. This is a conservative approach.

This general mechanism is, in addition to DDL operations, used for configuration transactions (e.g. SW replacement) and self-repair transactions (e.g. production of a new fragment replica replacing an unavailable replica).
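A small sketch may clarify how the state identifier decides whether an update-channel record must be applied to the derived object. The record layout and the dictionary used for the derived object are assumptions for illustration only.

```python
# Sketch of applying update-channel (derived log) records to a derived object.
# An operation is skipped when the derived tuple's state identifier shows that
# the operation has already been reflected in it.

def apply_channel_record(derived, record):
    """derived: dict tuple_id -> (state_id, value); record: one derived log record."""
    tid, op_state = record["tuple_id"], record["state_id"]
    current = derived.get(tid)
    if current is not None and op_state <= current[0]:
        return                              # already reflected in the derived tuple
    if record["op"] == "delete":
        derived.pop(tid, None)
    else:                                   # insert or update
        derived[tid] = (op_state, record["value"])

derived_index = {"t1": (3, "old")}          # tuple t1 copied at state 3 by the fuzzy read
apply_channel_record(derived_index, {"tuple_id": "t1", "state_id": 3, "op": "update", "value": "dup"})
apply_channel_record(derived_index, {"tuple_id": "t1", "state_id": 4, "op": "update", "value": "new"})
assert derived_index["t1"] == (4, "new")    # the duplicate was skipped, the newer update applied
```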
3.3.1 Commit Processing of DB Maintenance Transactions
The commit processing of the DB maintenance transactions is done in two stages, one to each node region. This gives simple transactions continuous access to node region serialised dictionary data. In the first stage, the DB maintenance transaction requests dictionary read locks on the nodes in the first node region and dictionary write locks on the nodes in the second node region. A consequence of this is that commit processing of DB maintenance transactions is globally strictly serial. When the locks are granted, the operations are executed on the hot standby replicas in the write locked node region. In the second stage of the commit processing, the roles of the two node regions are reversed. Since transactions have read dictionary data from within one node region only, and thus have set an intention-read dictionary lock on the nodes involved, the dictionary write locks are not granted before all active transactions reading dictionary data from the node region are terminated. New transactions requesting dictionary reads from a dictionary write locked node turn to a node in the other node region.
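The two-stage structure of this commit processing can be sketched as follows. The lock and DDL calls are stubs (assumed callables), and lock release bookkeeping is omitted; only the role swap between the two node regions is shown.

```python
# Sketch of the node-region serialised commit of a DB maintenance transaction:
# one region stays readable while the other is write locked and updated, then
# the roles of the regions are swapped.

def commit_dmt(region_a, region_b, lock_dict, apply_ddl):
    for readers, writers in ((region_a, region_b), (region_b, region_a)):
        for node in readers:
            lock_dict(node, "read")    # this region remains available for dictionary reads
        for node in writers:
            lock_dict(node, "write")   # granted only after active readers there terminate
        for node in writers:
            apply_ddl(node)            # install DDL on the hot standby dictionary replicas
```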
3.3.2 Create Index Example
The create index operation uses the fuzzy production mechanism as follows. It write locks the primary dictionary replica representation of the table to be indexed. Then the index is produced by the fuzzy production mechanism. Concurrently, the primary replicas of the precompiled queries and transactions affected by the index are locked and recompiled. The update channel is needed to compensate for the remaining index updates resulting from operations from transactions based on the original version of the dictionary representation of the table. The commit processing of the DB maintenance transaction is done to the two node regions in a fixed sequence.
3.4 A Mix of Simple and Configuration Transactions

Configuration transactions are used when performing planned online field replacements of hardware and software servers. Customer HW replacements are handled as online self-repair transactions. Configuration transactions do not involve data reconfiguration if the maximum duration of the resulting reduced fragment fault tolerance level is within the acceptable time limits. If not, data reconfiguration is done as for self-repair transactions. A configuration transaction is soft if no active transactions are aborted or delayed by the configuration transaction; otherwise it is hard.

Online field replacement of SW is planned, does not involve data reconfiguration, and is either soft or hard. Soft SW replacements do not affect the availability of simple transaction services. Hard SW replacements can make simple transaction services unavailable because active transactions are aborted and automatically restarted after the replacement is done. SW replacements can be done locally on a node, on a subset of nodes, or globally. The units of software replacement are the OS server, the communication server, the node DBMS server, or a mix of them, see Figure 2.

Online, planned, field replacement of HW is soft and requires data reconfiguration since no guarantee can be given of its duration. For more details on data reconfiguration see [29]. Configuration transactions involve table fragment replicas changing status, i.e. a replica either changes status from primary to hot standby, or the other way around.
3.4.1 Soft SW replacements
A soft SW replacement requires upward compatibility between the involved versions of communication protocols and formats. A soft SW replacement uses the update channel mechanism (see section 3.3) while performing a fuzzy primary/hot standby status change. The status of all primary replicas at the node is changed to hot standby. A node region serialised update of the dictionary is performed when modifying the status of the replicas. When all transactions using the old fragment replica status information are terminated, the new software is loaded, and the node is taken out of the active set. After the SW replacement is done, the node is reentered into the active set, and the status of all replicas is changed back to the original. No transactions are deliberately aborted during this software replacement operation, and negligible overhead is added to the active transactions.
3.4.2 Hard SW replacements
If compatibility cannot be maintained over a SW replacement, a hard SW replacement must be used. This aborts all transactions active at the nodes where the replacement takes place. The simple transactions are automatically restarted after the replacement is done. A hard software replacement on multiple nodes is executed as a two-phased transaction. First, all nodes prepare for the replacement by producing two succeeding fuzzy checkpoints, loading the new software version, and declaring willingness to perform the replacement. In the second transaction phase the active transactions are aborted, the software is replaced, and the new active set is formed.

The transaction response time added by a temporary abort is caused by the software replacement, the transaction rollback, and the transaction redo. This threatens the service availability while the replacement takes place. The threat depends on the percentage of transactions active on the nodes where the replacement takes place. For simple transactions the percentage of transactions active on a node decreases with an increasing number of nodes.
3.5 A Mix of Simple and Self-repair Transactions

Hardware and software failures represent a significant threat to continuously available transaction services. Massive transaction aborts and delayed response times may result from a failure, and multiple failures may cause unavailable fragments. This section focuses on recovery and self-repair from node crashes. Node crashes activate two different types of transactions: DBMS group recovery and fault tolerance self-repair transactions.
3.5.1 DBMS Group Recovery Transactions
A DBMS group recovery transaction is activated after a node or communication channel failure. The group recovery transaction will establish a new consistent active node set, and perform a consistent status modification of the primary/hot standby replicas. The transaction uses the acknowledge-based processor membership protocol presented in [3]. The transaction service availability may be reduced during a DBMS group recovery by a combination of temporarily unavailable primary fragment replicas and the abortion of transactions active at the failed node. All transactions active at a failed node are aborted as a consequence of the primary/hot standby transaction execution. The fragments with the primary replica at the failed node are temporarily unavailable due to the status change during the execution of the DBMS group recovery transaction.

During the primary/hot standby transaction execution, hot standby operations set locks. A transaction is not committed before its hot standby log records are logged and the corresponding hot standby operations have requested their locks. This implies that a hot standby replica can change status to primary without being delayed by recovery work. Redo and undo recovery work can be done concurrently with the transaction processing. The temporary unavailability of fragments is therefore limited by the duration of a DBMS group recovery transaction.

Simple transactions aborted by a node failure are automatically restarted. To provide continuous simple transaction services during the DBMS group recovery, the number of aborted STs must either be lower than the allowed abortion rate, or the total response time of the restarted STs must be within the response time requirements. The added response time for the aborted STs includes the DBMS group recovery transaction, the undo of the original transaction, and the re-execution time.
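What a group recovery transaction does to the replica status can be outlined as follows. The data structures are illustrative assumptions; membership agreement, logging, and the restart of aborted transactions are not shown.

```python
# Sketch of the status modification performed by a DBMS group recovery
# transaction after a node failure: form a new active set, promote a surviving
# hot standby replica for every fragment whose primary was on the failed node,
# and collect the transactions that must be aborted and restarted.

def group_recovery(active_set, failed_node, fragments, active_transactions):
    """fragments: fragment_id -> {"primary": node, "hot_standbys": [nodes]}."""
    new_active = [n for n in active_set if n != failed_node]
    for frag in fragments.values():
        if frag["primary"] == failed_node:
            # No redo/undo is needed before the promotion, because hot standby
            # operations have already requested their locks before commit.
            new_primary = next(n for n in frag["hot_standbys"] if n in new_active)
            frag["hot_standbys"].remove(new_primary)
            frag["primary"] = new_primary
    to_restart = [t for t in active_transactions if failed_node in t["nodes"]]
    return new_active, to_restart
```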
3.5.2 Self-repair transactions
Self-repair transactions reestablish the initial data fault tolerance level after a node failure. HypRa is designed for a single or higher level of data fault tolerance by providing two or more replicas of every table fragment. This fault tolerance level is used when masking hardware and software server failures as presented above. HypRa is in addition designed with online self-repair capabilities to reestablish the initial data fault tolerance level after a failure. When a table replica becomes unavailable, a new replica is produced online by a self-repair transaction to reestablish the initial number of fragment replicas. See [29] for a presentation of the self-repair policy of HypRa.

While the self-repair transaction is executed, the fragments are vulnerable to further node failures which would make the fragments unavailable. To provide continuous transaction services, the time window of reduced fault tolerance must be small enough so that the expected unavailability due to multiple node failures is within the acceptable time limits. The self-repair transactions must be given enough priority so that the self-repair is completed within the repair time window. It is also a requirement that the probability of a double failure after the repair is similar to the original. This determines how subfragmentation and replica allocation must be done ([29], [30]).
4 Scalability

Scaleup and scalable availability were main goals in the design of HypRa. This section discusses the mechanisms used to achieve this by presenting the methods used to achieve scalability of basic services, and scalability of simple and non-simple transaction processing. The main mechanism used in HypRa to achieve scalability is to increase the number of nodes online while keeping an even load distribution among the nodes.
4.1 Scaling basic services

Basic services are those offered by the operating system servers, the communication servers, and the hardware servers. The scalability of these services is crucial for the overall scalability of the HypRa database server. A main unit of replacement in HypRa is the node board. Based on the anticipated workload, the node board has been designed to provide a balanced capacity of the basic services ([13]). Scaling basic services is done by adding node boards. This ensures a balanced scaling of basic services. The node board integrates CPU, memory, disc channels and communication channels (Figure 2). The number of communication channels gives the maximum number of neighbouring nodes per node and therefore also sets a restriction on the maximum number of nodes in a server configuration. These are restrictions of a specific node board design and not of the architecture. If the upper limit on the number of nodes for the server configuration is reached, the server configuration can be upgraded. This is an online field operation.
4.1.1 Processing power
Scaling processing power is achieved by adding new node boards. The added processing power only enhances the server's capabilities in the absence of bottlenecks in both hardware and software. A typical bottleneck that could block scalability is commit processing. In HypRa, group commit ([8]) is used to avoid this disc I/O bottleneck.
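The idea of group commit can be shown with a short sketch: commit records that arrive within a small time window are forced to disc with a single log write instead of one write per transaction. The window length, the queueing, and the stubbed disc write are assumptions for illustration.

```python
import threading
import time

class GroupCommitLog:
    """Sketch of group commit: batch the commit records of concurrently
    committing transactions into one forced log write."""

    def __init__(self, window=0.005):
        self.window = window
        self.pending = []                 # (record, event) pairs awaiting the flush
        self.lock = threading.Lock()
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record):
        done = threading.Event()
        with self.lock:
            self.pending.append((record, done))
        done.wait()                       # committed once the whole group is on disc

    def _flusher(self):
        while True:
            time.sleep(self.window)
            with self.lock:
                batch, self.pending = self.pending, []
            if batch:
                self._force_log([r for r, _ in batch])   # one disc I/O for the group
                for _, done in batch:
                    done.set()

    def _force_log(self, records):
        pass                              # stand-in for the forced sequential log write
```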
4.1.2 Memory allocation
Memory is shared among the different tasks running in the database server. After software servers have allocated their memory space, the remainder is allocated to transactions according to a priority scheme given by the HypRa DBMS. Simple transaction processing has first priority. In some cases higher priority tasks can abort lower priority ones. A minimum amount of memory is always allocated to handle the specified simple transaction workload. Non-simple transactions share the remaining memory. Complex transactions usually execute faster the more memory they can get ([12]). They are allowed to allocate up to a certain amount of memory each, but must wait if memory is limited. A non-simple transaction involves several nodes and must wait until there is sufficient memory available on all nodes involved. A transaction mechanism is used to coordinate the startup. This mechanism is used to control the number of complex operations running in parallel, since these operations have both speedup and scaleup depending on the amount of memory available relative to the size of the involved tables.

Adding nodes increases both memory and processor capacity. The memory allocation strategy is designed to utilise this added memory and processor capacity to scale the server. Simple transaction capacity scales because of the even load distribution among nodes. Complex transaction capacity scales due to algorithms which use constant time when memory, processing power and problem size increase proportionally ([22], [20]).
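The all-or-nothing memory reservation for non-simple transactions can be sketched as below. The quotas and the reservation API are illustrative assumptions; in HypRa the coordination is done with a transaction mechanism, which is not reproduced here.

```python
# Sketch of the memory allocation policy: a minimum is always kept for the
# specified simple transaction workload, and a complex transaction starts only
# when it can reserve sufficient memory on every node it involves.

class NodeMemory:
    def __init__(self, total, st_reserve):
        self.free = total - st_reserve    # st_reserve is never given to non-simple work

    def try_reserve(self, amount):
        if self.free >= amount:
            self.free -= amount
            return True
        return False

    def release(self, amount):
        self.free += amount

def start_complex_transaction(nodes, per_node_need):
    """Reserve memory on all involved nodes, or on none (the CT must then wait)."""
    reserved = []
    for node in nodes:
        if node.try_reserve(per_node_need):
            reserved.append(node)
        else:
            for done in reserved:         # back out: some node lacked memory
                done.release(per_node_need)
            return False
    return True
```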
4.1.3 Persistent storage
Persistent storage must be scalable with respect to online data volume and access capacity. An increased demand for online data volume on a HypRa server can be met by increasing the number of nodes, the number of disc channels per node, the number of discs per channel, or by using larger discs. A higher number of smaller discs will, due to the higher number of disc arms, give higher access capacity and transfer capacity than a smaller number of larger discs. The access capacity and transfer capacity will scale until they reach the capacity of the disc channel and the sum over all the disc channels reaches the capacity of the processor. Transactions use disc I/O services with priority according to the transaction priority.
4.1.4 Internal communication
Scaling the number of processors might introduce a cooperation problem. The HypRa design has reduced this problem by selecting a hardware architecture with a good balance between network complexity and capacity. The hypercube topology is selected in HypRa because of its regular interconnection structure with many alternative paths, and its good compromise between communication capacity and network complexity ([16]). The complexity is of order O(n log n) and the maximum distance between two nodes (in hops) is O(log n), where n is the number of nodes. If a node or communication channel fails, messages are rerouted through alternative paths.

A full hypercube has a number of nodes which is a power of 2. For scalability reasons this is not attractive, since a 10 percent increase in workload might demand a doubling of the number of nodes. A finer granularity of scaling is implemented by allowing incomplete hypercubes. The fault tolerant routing algorithms permit incomplete hypercubes. Extended e-cube routing ([38]) has been developed to handle incomplete hypercubes. It provides a shortest path, static routing scheme which distributes traffic over the available communication channels. The routing tables are updated as part of DBMS group recovery transactions. A multilevel priority mechanism for messages is designed and adapted to the transaction types.

Although the network complexity increases with the number of nodes, this is not significant in the configurations relevant for HypRa ([13]). Network complexity is small compared to the overall server cost. Communication capacity is of more importance. The hypercube neighbour concept is used when allocating primary and hot standby replicas to nodes. This is done so that only neighbour communication is used for primary/hot standby synchronisation (known as "chained declustering" in [25]).
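For reference, plain e-cube routing in a complete binary hypercube can be sketched in a few lines: the bits in which two node addresses differ identify the dimensions that separate them, and the route corrects those bits in a fixed order. The extended e-cube routing of [38], which in addition handles incomplete hypercubes and failed channels, is not reproduced here.

```python
def ecube_route(src, dst, dimensions):
    """Return the node sequence from src to dst in a 2**dimensions hypercube,
    correcting differing address bits from the lowest dimension upwards."""
    path, current = [src], src
    for d in range(dimensions):
        if (current ^ dst) & (1 << d):    # bit d still differs from the destination
            current ^= (1 << d)           # hop to the neighbour across dimension d
            path.append(current)
    return path

# In a 3D hypercube (8 nodes), routing from node 0b001 to node 0b110 takes 3 hops:
assert ecube_route(0b001, 0b110, 3) == [0b001, 0b000, 0b010, 0b110]
```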
4.1.5 External communication
The server must be able to scale the client/server communication capacity. The client/server communication uses communication co-processors connected to nodes in the server. Each communication co-processor is able to handle a maximum number of concurrent users and has an upper limit on the transaction load. The total number of communication co-processors can be scaled by adding co-processors online to hypercube nodes. Distributing the communication co-processors among the nodes in the hypercube avoids creating bottlenecks in the external communication.
4.2 Scaling simple transaction capacity

To scale simple transaction capacity, a bottleneck free design of inter transaction parallelism is crucial. The main issues are: data distribution, transaction coordination, index distribution, lock distribution, log distribution, and dictionary distribution. An increased simple transaction workload will be met by increasing the number of nodes. In the debit/credit benchmark the processing capacity per data volume remains constant when the number of nodes is scaled to an increased workload, and this gives scaleup.

A transaction is coordinated from the node where it enters the database server, which ensures distribution of the transaction coordinators. If the entry node is overloaded with transaction coordination, another node is chosen round robin from the active set. The coordinator initiates subtransactions on the nodes involved in the transaction. The coordinator uses a presumed abort two-phase-commit protocol ([34]). This coordinator allocation strategy ensures that the coordinator capacity scales with the number of nodes.

Secondary indices are horizontally fragmented and distributed over all available nodes (as tables in general). The distribution is based on the hash value of the indexed attributes. Secondary indices are mappings from index attribute to primary key (or to tuple identifier if no primary key has been defined for the table). Secondary indices are therefore independent of table fragment replica relocations (and vice versa). Read locks are only set locally on a tuple or index tuple. Write locks are set on all primary instances of an attribute group, i.e. both on the primary fragment replica and on the secondary indices involving some of the attributes. For locking of hot standby replicas, see subsection 3.1. This fine granularity locking reduces the lock distribution compared to tuple locks. Global deadlock detection is performed by a distributed, hierarchical deadlock detection algorithm ([28]).

Logging is done locally at the node where a table or index fragment replica is stored. If the debit/credit assumption applies, the work involved in logging, transaction recovery, and node crash recovery is independent of the number of nodes. The dictionary is fully replicated to avoid read hot-spots. This implies that the dictionary read load is constant when the number of nodes scales. The simple transaction dictionary reads are given priority over DDL operations, see subsection 3.1.
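The presumed abort flavour of two-phase commit used by the coordinator can be sketched as follows, in the spirit of [34]. Messaging and logging are stubs (assumed callables); the essential point is that only a commit decision is forced to the coordinator's log, while the absence of a log record is interpreted as abort.

```python
def coordinate(participants, send, force_log):
    """participants: node ids; send(node, msg) -> reply; force_log(record) -> None."""
    votes = [send(node, "prepare") for node in participants]
    if all(vote == "yes" for vote in votes):
        force_log("commit")               # the only forced coordinator log write
        for node in participants:
            send(node, "commit")
        return "committed"
    for node, vote in zip(participants, votes):
        if vote == "yes":
            send(node, "abort")           # no abort record is forced: abort is presumed
    return "aborted"
```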
4.3 Scaling complex transaction capacity

Complex transaction processing capacity depends on the query execution strategy and the level of intra transaction parallelism. The main scaling issues are intra transaction parallelism in relational algebra and in sorting algorithms. HypRa uses hash-based methods for the relational algebra operations ([11], [19], [15]). These have been shown to scale almost perfectly linearly with the number of nodes ([12], [20]). HypRa employs the parallel sorting algorithm described in [6] and [7]. This shows the same linear scalability as the relational algebra operations.
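The hash-based join idea behind these algorithms can be illustrated with a small simulation: both relations are redistributed over the nodes by hashing the join attribute, and each node then joins its local partitions with a build/probe hash join. The in-memory "nodes" and the tuple layout are assumptions for illustration; the actual algorithms of [11] and [19] are more elaborate.

```python
from collections import defaultdict
from zlib import crc32

def partition(relation, key_index, num_nodes):
    """Redistribute tuples over nodes by hashing the join attribute."""
    parts = defaultdict(list)
    for tup in relation:
        parts[crc32(str(tup[key_index]).encode()) % num_nodes].append(tup)
    return parts

def local_hash_join(build_part, probe_part, build_key, probe_key):
    """Build a hash table on one input and probe it with the other."""
    table = defaultdict(list)
    for tup in build_part:
        table[tup[build_key]].append(tup)
    return [b + p for p in probe_part for b in table.get(p[probe_key], [])]

def parallel_join(r, s, num_nodes=4):
    r_parts, s_parts = partition(r, 0, num_nodes), partition(s, 0, num_nodes)
    result = []
    for node in range(num_nodes):         # each node joins its partitions independently
        result += local_hash_join(r_parts[node], s_parts[node], 0, 0)
    return result

A = [(1, "a"), (2, "b")]
B = [(1, "x"), (3, "y")]
assert parallel_join(A, B) == [(1, "a", 1, "x")]
```

Because matching join keys always hash to the same node, no inter-node communication is needed during the local join phase, which is why the capacity scales when memory, processing power and problem size grow proportionally.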
4.4 Scaling availability

Data availability is scaled by increasing the fault tolerance level of tables. The fault tolerance level of tables can be scaled linearly, online, and per table by increasing the number of hot standby replicas of the table. Reconfiguration transactions are used for this. Power supply and cooling fans are organised so that their fault tolerance level scales with the table fault tolerance level (Figure 2).
5 Conclusions

This and other papers have described how the HypRa database server is designed to offer continuous database services in the presence of hardware and software failures ([29]). The fault tolerance design philosophy of HypRa is more cost-effective than alternative strategies. This is achieved through the use of standard components (e.g. HypRa uses standard SCSI discs and controllers as opposed to systems using dual ported discs ([21])). All basic resources in HypRa give full effect (as opposed to fault tolerance architectures based on replicated hardware, where some components only check what others have performed ([2])).

The HypRa design includes an online self-repair capability which sets this design above what is commercially available today. This will give higher service availability than offered by systems that only do online fault masking ([21], [2]). The capability of HypRa to do planned server maintenance, as opposed to manual, corrective maintenance (which is the standard for all commercially available database servers), will reduce the cost of operating the server.

The design of the scheduling system of the HypRa database server has been shown to give uninterrupted simple transaction service in the presence of complex transactions, database maintenance transactions, server reconfigurations, and fault tolerance self-repair transactions. This type of uninterrupted service is not available in current database systems, as far as we know. The HypRa design has been shown to be bottleneck-free and to have linear scalability for a combination of database benchmarks. This scalability can, for the individual benchmarks, be found in other database servers ([24], [20]).
6 Status of the HypRa development

The HypRa development started as a project financed by the Norwegian Telecom in 1989. Several other projects have established the foundations for HypRa, most notably several projects focusing on parallel mechanisms for basic database operations ([11], [14]), several projects on distributed database systems ([37], [27]) and variants of control mechanisms ([36]), and last but not least experience from several industrial database system implementation projects ([1], [33]).

The first HypRa project carried out an overall system design and a feasibility study. This was completed with a positive answer in December 1989. During 1990 HypRa has been taken through a second, detailed design phase covering all major system components. HypRa was set up as a commercial company in December 1990 and the implementation of the commercial database server product has started. The implementation project is estimated at approx. 100 man-years and the first product releases are planned for 1993.

The HypRa team is currently assembling its fourth prototype database server based on the Intel i486 processor, and this prototype will be used as the development platform for the industrial software. The current prototype is based on Intel 80386 processors and has been running for more than a year as a development platform for OS, communication software, relational algebra methods, and sorting algorithms. The published results ([12], [6]) were achieved on a previous Intel 80186 based prototype.
References

[1] Techra User Manual. Technical report, Kongsberg Våpenfabrikk avd. Trondheim, 1984.
[2] Stratus XA2000 Series 200 Product Brief. Stratus Computer, Inc., 55 Fairbanks Boulevard, Marlboro, MA 01752, USA, April 1990.
[3] A. El Abbadi, D. Skeen, and F. Cristian. An Efficient Fault-Tolerant Protocol for Replicated Data Management. In Proceedings of the 4th ACM SIGACT-SIGMOD Symposium on Principles of Database Systems, pages 215-228, March 1985.
[4] Ole John Aske. HypRa Hardware and System Software Overall Design. Research Report STF40 F89201, ELAB-RUNIT, SINTEF Group, 1989.
[5] Ole John Aske, Kjell Bratbergsengen, and Tore Sæter (editor). HypRa DBMS Detailed Design Volume I: Basic Software. Research Report STF40 F90131, ELAB-RUNIT, SINTEF Group, 1990.
[6] Bjørn Arild W. Baugstø and Jarle Fredrik Greipsland. Parallel Sorting Methods for Large Data Volumes on a Hypercube Database Computer. In H. Boral and P. Faudemay, editors, Proceedings from the Sixth International Workshop on Database Machines, IWDM '89, Lecture Notes in Computer Science 368, pages 127-141. Springer-Verlag, June 1989.
[7] Bjørn Arild W. Baugstø, Jarle Fredrik Greipsland, and Joost Kamerbeek. Sorting Large Data Files on POOMA. In Proceedings from CONPAR 90, September 1990.
[8] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley Publishing Company, Inc., 1987.
[9] D. Bitton, David J. DeWitt, and C. Turbyfill. Benchmarking Database Systems - A Systematic Approach. In Proceedings of the Ninth Very Large Database Conference, October 1983.
[10] Kjell Bratbergsengen. Sorting in a System Where the Storage Modules Have Connections to a Fixed Number of Other Storage Modules. Project working note in Norwegian, ASTRA working note 16, Norwegian Institute of Technology, Division of Computer Science, 1978.
[11] Kjell Bratbergsengen. Hashing Methods and Relational Algebra Operations. In Proceedings of the 10th International Conference on Very Large Databases, pages 323-333, 1984.
[12] Kjell Bratbergsengen. Algebra Operations on a Parallel Computer - Performance Evaluation. In Fifth International Workshop on Database Machines, 1987.
[13] Kjell Bratbergsengen. HypRa Database Server Feasibility Study. Research Report STF40 F89199, ELAB-RUNIT, SINTEF Group, 1989.
[14] Kjell Bratbergsengen. The Development of the Parallel Database Computer HC16-186. In Proceedings of The Fourth Conference on Hypercubes, Concurrent Computers and Applications, March 1989.
[15] Kjell Bratbergsengen. Relational Algebra Operations. In Pierre America, editor, Proceedings of the PRISMA Workshop, Philips Research Laboratories, Eindhoven, The Netherlands, September 24-26 1990.
[16] Kjell Bratbergsengen, Rune Larsen, Oddvar Risnes, and Terje Aandalen. A Neighbor Connected Processor Network for Performing Relational Algebra Operations. In The Fifth Workshop on Computer Architecture for Non-numeric Processing, Pacific Grove, CA, USA, March 1980. SIGMOD.
[17] Stefano Ceri and Giuseppe Pelagatti. Distributed Databases: Principles and Systems. McGraw-Hill, 1986.
[18] Flaviu Cristian. Understanding Fault-Tolerant Distributed Systems. Research Report RJ6980, IBM Almaden Research Center, August 1989.
[19] David DeWitt and Robert Gerber. Multiprocessor Hash-Based Join Algorithms. In Proceedings of the 11th Conference on Very Large Data Bases, Stockholm, August 1985.
[20] David J. DeWitt, Shahram Ghandeharizadeh, Donovan A. Schneider, Allan Bricker, Hui-I Hsiao, and Rick Rasmussen. The Gamma Database Machine Project. IEEE Transactions on Knowledge and Data Engineering, 2(1):44-62, March 1990.
[21] C. I. Dimmer. The Tandem Non-Stop System. In T. Anderson, editor, Resilient Computing Systems. William Collins Sons & Co. Ltd, 1985.
[22] Susanne Englert and Jim Gray. Performance Benefits of Parallel Query Execution and Mixed Workload Support in NonStop SQL Release 2. Tandem Systems Review, 6(2), 1990.
[23] H. Garcia-Molina and B. Kogan. Achieving High Availability in Distributed Databases. IEEE Transactions on Software Engineering, 14(7), 1988.
[24] Tandem Performance Group. A Benchmark for Non-Stop SQL on the Debit Credit Transaction. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, USA, June 1988.
[25] H. I. Hsiao and D. J. DeWitt. Chained Declustering: A New Availability Strategy for Multiprocessor Database Machines. In Proceedings of the Sixth International Conference on Data Engineering, February 1990.
[26] Svein Hvasshovd. HypRa/TR - A Recovery Method for a Shared Nothing DDBMS. The Norwegian Institute of Technology, Division of Computer Systems and Telematics, 1991. Dr.Ing. thesis, in preparation.
[27] Svein Hvasshovd and Ketil Albertsen. Overall Design of the TelSQL System. Research Report STF14 A88049, RUNIT, SINTEF Group, 1988.
[28] Svein O. Hvasshovd, Ole John Aske, Tore Sæter, and Torgrim Gjelsvik. HypRa DBMS Detailed Design Volume II: Basic DBMS Software. Research Report STF40 F90132, ELAB-RUNIT, SINTEF Group, 1990.
[29] Svein O. Hvasshovd, Øystein Torbjørnsen, and Tore Sæter. Critical Issues in the Design of a Fault-Tolerant Multiprocessor Database Server. In Proceedings of the 1991 Pacific Rim International Symposium on Fault-Tolerant Systems, September 1991.
[30] Svein O. Hvasshovd and Tore Sæter. HypRa DBMS Detailed Design Volume IV: DBMS Support Software. Research Report STF40 F90134, ELAB-RUNIT, SINTEF Group, 1990.
[31] IBM. IBM Database 2 Version 2, Application Programming and SQL Guide, second edition, September 1989.
[32] ISO-ANSI. Information Processing Systems - Database Language SQL with Integrity Enhancement, 1989. Second edition.
[33] Bo Kahler and Oddvar Risnes. A Proposal on a Distributed Version of MIMER. Research Report STF14 F85023, RUNIT, SINTEF Group, 1985.
[34] C. Mohan and B. Lindsay. Efficient Commit Protocols for the Tree of Processes Model of Distributed Transactions. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing, 1983.
[35] Mads Nygaard, Steinar Haug, Ole Hjalmar Kristensen, and Tore Sæter. ATCCIS - with Varying Degrees of Availability. Research Report STF40 F90066, ELAB-RUNIT, SINTEF Group, 1990.
[36] Mads Nygård. Non-Serializability with Wander-Transactions in Skeleton Databases. Research Report STF14 A88056, RUNIT, SINTEF Group, 1988.
[37] Oddvar Risnes. Database Snapshots: A Mechanism for Replication of Data in Distributed Databases. The Norwegian Institute of Technology, Division of Computer Systems and Telematics, 1987. Dr.Ing. thesis.
[38] Øystein Torbjørnsen. Shortest Path Routing in a Failsoft Hypercube Database Machine. In Proceedings of The Fifth Distributed Memory Computing Conference, volume 2, pages 839-844. IEEE, April 1990.