Virtual Full Replication: Achieving Scalability in Distributed Real-Time Main-Memory Systems
Gunnar Mathiason, Sten F. Andler
Department of Computer Science, University of Skövde
P.O. Box 408, SE-541 28 Skövde, Sweden
{gunnar,sten}@ida.his.se

Abstract

To achieve better scalability in a fully replicated distributed main-memory database, we propose support for virtual full replication. Full replication is often necessary for availability and predictability in critical embedded applications. In a fully replicated database, however, all updates are sent to all nodes, regardless of whether the data is ever used at every node. Virtual full replication is a concept that improves scalability without changing the application's assumption of having access to a fully replicated database. We support virtual full replication by segmenting the database and allowing segments to have individual degrees of replication. This decreases the replication effort, lowers the overall memory requirements for data, and decreases node recovery time. Typical scenarios include distributed databases with many nodes, where often only a small number of the nodes need to share the same subset of information. We have previously defined a segmentation syntax for specifying important application semantics and outlined an implementation. Here, we analyze the potential scalability improvements in such an architecture.

1 Introduction

A fully replicated distributed database system replicates the entire database to all the nodes in the system. There are different approaches for how updates are replicated between these nodes. In a globally consistent database, updates become visible in the same order with respect to other transactions (one-copy serializable [3]) at all replicas. This property has a high cost in locking and synchronization, since normally all replicas of the database are locked during an update. When the requirement of global consistency is relaxed and eventual consistency [4, 2] between replicas is allowed, locking is performed only at the node where the transaction executes, allowing simultaneous updates at other nodes.

Full replication of the database with eventual consistency enables predictable local execution times of transactions, since only local data is accessed and there is no need to access or lock other nodes during an update. Often this results in excess replication, however, since many applications actually use only a subset of the replicated data. Full replication is also costly in storage requirements, particularly for main-memory databases, which require enough memory to store a full database copy at each node. With virtual full replication [2], data is replicated only to the nodes where it is used. The database is divided into segments with individual degrees of replication, and an update to a data object within a segment is replicated only to the subset of nodes where the segment has been allocated. Thus, virtual full replication maintains the application's view of a fully replicated database without suffering from excess replication. Our work focuses on the scalability of replication effort and storage requirements, in terms of the number of nodes in the system and the number of data objects in the database; a segmented database scales much better in both respects. Improved scalability allows more nodes to be added or the database size to be increased without saturating the network or requiring excessive amounts of main memory. We show that predictable and scalable replication effort and storage requirements can be achieved for a distributed real-time database by using partial replication of segments of the database, while still supporting the same degree of fault tolerance and data availability as a fully replicated database.
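The difference in update propagation between full and virtual full replication can be sketched as follows. This is an illustration of the concept only, not DeeDS code; all names, segments, and node sets are our own assumptions.

```python
# Sketch: update propagation under full vs. virtual full replication.
# All identifiers here are illustrative assumptions, not the DeeDS API.

ALL_NODES = {"n1", "n2", "n3", "n4", "n5"}

# Segment allocation: each segment is replicated only to the nodes listed.
segment_of = {"x": "s1", "y": "s2"}
allocation = {"s1": {"n1", "n2"}, "s2": {"n1", "n3", "n4"}}

def replica_targets(obj, origin, virtual=True):
    """Nodes that must receive an update to `obj` made at node `origin`."""
    if virtual:
        nodes = allocation[segment_of[obj]]   # only where the segment lives
    else:
        nodes = ALL_NODES                     # full replication: every node
    return nodes - {origin}                   # the origin already has the update

print(sorted(replica_targets("x", "n1")))                 # ['n2']
print(sorted(replica_targets("x", "n1", virtual=False)))  # ['n2', 'n3', 'n4', 'n5']
```

Under virtual full replication, the update to `x` generates one message instead of four, while any node hosting `s1` still sees a locally complete database.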
In previous work [9] we developed several concepts and solutions for central issues in segmentation, such as a syntax for setting up segments in a fully replicated database based on data usage patterns, and algorithms for building segments and creating meta-information for replication control. In this paper we focus on segmentation as a principle for improving scalability, but we also suggest further steps in the exploration of segments, such as scheduled and prioritized replication of updates in distributed real-time databases, dynamic segment allocation, and segment allocation policies based on application criteria other than availability. Section 2 elaborates issues for distributed real-time databases and the DeeDS database system in particular. Section 3 describes a usage scenario from the WITAS project. Section 4 states the scalability problem more precisely. Section 5 describes segmentation in more detail.

2 Platform: The DeeDS database system

For our experimental research, we use a fully replicated real-time database, DeeDS [2]. DeeDS allows eventual consistency between replicas, which may temporarily become mutually inconsistent. The replicas converge to consistency when the effect of an update is eventually present at all nodes and no potentially conflicting updates remain. With eventual consistency, updates are replicated between nodes asynchronously from transactions. Since predictability is the most important characteristic of a real-time database, the consistency constraint has been relaxed to achieve predictability of local real-time database operations on each node. Key features of DeeDS are: main-memory residency – there is no persistent storage on disk, which removes the unpredictability of disk access; full replication with eventual consistency – as described above, which removes the unpredictability of network delays and network partitioning; recovery and fault tolerance – supported by node replication and timely recovery from live nodes; and active functionality – with rules that have time constraints. Since DeeDS is fully replicated, it suffers from the previously mentioned drawback of excess replication of updates for data that is never used. Our implementation of virtual full replication in DeeDS is intended to remedy this and increase scalability.

3 Scenario: DeeDS in the WITAS project

The WITAS project [6] aims at developing Unmanned Aerial Vehicles (UAVs) that can be given high-level commands for surveillance missions, autonomously fly to a site to collect information or perform a task, and later return and report the results of the mission. Besides the UAVs, there are also ground-based vehicles for communication and coordination with a central Command Center. Communication between the aerial and ground-based vehicles and the Command Center is required to have real-time properties, which can be supported by the DeeDS real-time distributed database system. Thus, DeeDS is suitable as an infrastructure between the vehicles and the Command Center and has been selected for use in simulating communication between UAVs and ground vehicles [5]. A typical WITAS simulation has many participants, and large amounts of data are transferred between them through the real-time database. With full replication, much data may be replicated to many nodes without actually being used by the participants, resulting in an unnecessarily high replication effort.

4 The problem of scalability

A fully replicated distributed database scales poorly with the number of nodes and data objects due to excess replication. In a distributed database with n nodes, an update to a data object at one node triggers an update at the remaining n − 1 nodes. Thus, in general, the replication effort for updating m data objects at any node is proportional to m(n − 1), or O(mn). If we further assume that an increase in the number of nodes in the system results in a proportional increase in the number of updates in the distributed database, then the scalability of full replication is O(n²). We elaborate this discussion in the later section on complexity.

5 Segmentation issues

A segment is a group of data objects that share properties, capturing some aspects of the application semantics, and is allocated to a specified subset of the nodes (possibly temporarily inconsistent with each other). We use the segment properties of node allocation and replication degree to support virtual full replication. Other segment properties to explore include timeliness and consistency requirements, which allow for increased predictability and concurrency in data replication [9]. Segmentation is most successful in applications where we can assume node cohesion, where clients at a few nodes share only a known subset of the data in the replicated database. Locality and hotspot models for distributed data suggest that often only a few data objects are shared among many nodes, while many data objects are shared by only a small number of nodes [7]. We assume the following for segments in a replicated database: 1) Data objects in a segment all share the same segment properties (degree of replication, allocation, timeliness, consistency, etc.) that are assigned to the segment. A data object can only be assigned to one segment at a time. 2) The number of segments, their allocations, and their properties are assumed to be fixed throughout the execution of the database. These assumptions are reasonable in most mission-critical real-time applications. In future work, we aim to explore the removal of the latter assumption and support dynamic allocation of segments to allow other types of applications. The replication degree d_i of a specific segment satisfies 1 < d_i < n, where n is the total number of nodes. We use g for the number of segments and the term size of a segment, s_i, for the number of data objects in a segment, while node allocation defines the nodes that a segment is allocated to.

The database size, S, is the sum of the sizes of the g segments in the database (S = Σ_{i=1}^{g} s_i). The database is accessed by means of transactions, which read or write data objects in segments and have specific requirements, depending on the process using them. We also use the general term client (application, process or transaction) for any entity that accesses the database and has requirements on properties of the data.
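The segment model above can be summarized in a short sketch; the class and field names are our own illustration, not part of any DeeDS interface.

```python
# Sketch of the segment model: each segment has a size s_i (number of data
# objects), a node allocation, and a replication degree d_i. Illustrative only.

from dataclasses import dataclass

@dataclass
class Segment:
    name: str
    objects: set   # data objects in the segment; s_i = len(objects)
    nodes: set     # node allocation of the segment

    @property
    def size(self):       # s_i
        return len(self.objects)

    @property
    def degree(self):     # d_i = number of replicas
        return len(self.nodes)

def database_size(segments):
    """S = sum of s_i over all g segments."""
    return sum(seg.size for seg in segments)

segs = [Segment("s1", {"a", "b"}, {"n1", "n2"}),
        Segment("s2", {"c"}, {"n1", "n3", "n4"})]
print(database_size(segs))   # 3
```

Note that a data object appears in exactly one segment, so S is simply the total number of distinct data objects in the database.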

5.1 A segment set-up algorithm

As a basic approach to defining segments, we use access information about which data objects are accessed at which nodes. This information originates from a manual analysis of the data objects accessed by the clients running at each node and their transactions. In [9] we define a syntax for capturing application knowledge from this manual analysis, as well as an algorithm for setting up segments and an allocation schema for the resulting segments.
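One simple way to realize such a set-up step is to group data objects by the exact set of nodes that access them, so that each distinct node set becomes one segment. This is our simplified sketch of the idea; the actual algorithm in [9] may differ in detail.

```python
# Sketch: derive segments from (node, data object) access information by
# grouping objects accessed by exactly the same set of nodes. Simplified
# illustration of the set-up step; not the algorithm from [9] verbatim.

from collections import defaultdict

accesses = [            # (node, data object) pairs from the manual analysis
    ("n1", "a"), ("n2", "a"),
    ("n1", "b"), ("n2", "b"),
    ("n3", "c"),
]

def build_segments(accesses):
    # 1. Collect, per data object, the set of nodes that access it.
    readers = defaultdict(set)
    for node, obj in accesses:
        readers[obj].add(node)
    # 2. Objects with identical node sets fall into the same segment,
    #    and that node set becomes the segment's allocation.
    segments = defaultdict(set)
    for obj, nodes in readers.items():
        segments[frozenset(nodes)].add(obj)
    return dict(segments)

for nodes, objs in build_segments(accesses).items():
    print(sorted(nodes), "->", sorted(objs))
```

Here objects `a` and `b`, both accessed only from n1 and n2, form one segment with replication degree 2, while `c` forms a singleton segment at n3.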

5.2 Complexity of replication and scalability

We define a model to reason about how segmentation improves scalability and replication efficiency. Since we aim at reducing the replication effort as well as the storage requirements, we need to calculate the scalability of the data sent over the network for replicating updates and the scalability of the space needed. We define replication effort as "a measure to express the effort of making the database consistent after an update to any data object." We make the following assumptions: 1) For every update of a data object replica, we use one network message, and all update messages are of the same size. 2) All data objects have the same size in bytes. 3) Our basic evaluation model is intended to be decoupled from any particular application; thus, modelling of access patterns, distributions, and frequencies of updates is not included. In Section 4 we noted that the effort to replicate a distributed database has complexity O(mn) for update size m (number of objects to replicate) and number of nodes n. When segments are introduced, n is reduced to practically a constant, since fewer nodes share the same data objects. We thus argue that with a segmented database the replication effort is O(m). The storage requirements for a fully replicated database are O(Sn), where S is the size of the database. With segments, the storage requirement of each segment is O(s_i d_i), where s_i is the size of the segment and d_i is its degree of replication. If d_i is reduced to practically a constant, the total storage requirement is O(S).
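Under the model's assumptions (one equally sized message per replica update), the two regimes can be compared by direct counting; the concrete numbers below are a hypothetical illustration.

```python
# Sketch: replication effort (message count) under the model above.
# Full replication: each of m updated objects goes to n-1 nodes -> m*(n-1), O(mn).
# Segmented: each object goes to d_i - 1 replicas; if every d_i is bounded by a
# small constant, the effort is O(m).

def effort_full(m, n):
    """Messages to replicate m object updates in a fully replicated n-node system."""
    return m * (n - 1)

def effort_segmented(degrees):
    """Messages when object i is updated and its segment has replication degree d_i."""
    return sum(d - 1 for d in degrees)

def storage_segmented(sizes_and_degrees):
    """Total object copies: sum of s_i * d_i over all segments (O(S) for bounded d_i)."""
    return sum(s * d for s, d in sizes_and_degrees)

n, m = 100, 10
print(effort_full(m, n))            # 990 messages
print(effort_segmented([3] * m))    # 20 messages, with every d_i = 3
```

With 100 nodes and 10 updated objects, bounding the replication degree at 3 cuts the replication effort from 990 messages to 20, and the gap widens linearly as nodes are added.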

6 Related work

Our solution bears some resemblance to the grouping of data described in [10]. However, instead of allocating data only to the most optimal locations, we allocate data to all nodes where it is used. The Ficus file system [8] uses an approach to partial, optimistic replication of files similar to our approach for data objects in a replicated real-time database, but without using the concept of segments as the granularity of replica allocation, even though Ficus volumes can in some sense be regarded as similar to segments. Much work on partial replication can be found in the area of distributed file systems. Several approaches exist in which granularity is used to replicate smaller amounts of data in order to support mutual consistency more efficiently. The Andrew File System and the Coda file system [11] use a set of servers with full mutual consistency, to which mobile nodes can connect and synchronize in a client/server model. In the Ficus file system [8], the nodes are peers that use optimistic replication with conflict detection and conflict resolution for resolving conflicting updates. HARP [1] is a protocol for optimistic replication to certain neighboring nodes in a distributed system, as opposed to systems with full replication where all nodes are updated. In our work we also replicate to certain nodes only, but the choice is based not on distance from the updated node but on the need for the data to be available. Thus, the maximum delay of an update in virtual full replication is one propagation delay.

7 Conclusions

We consider the following to be our most important current and future contributions: 1) Exploration of segments to support virtual full replication. We elaborate on the initial ideas of virtual full replication, as presented in [2], by examining how segmentation may support a fully replicated database. We show how the replication effort is improved while the level of local availability of data and the real-time properties of the database are maintained. 2) Replication control. We present an architecture for replication control that uses the specification of segment properties, together with algorithms, to show that segments with data of different properties and different requirements on consistency and timeliness may coexist in the same distributed database. 3) Evaluation model. Our evaluation model for segmented databases, and our discussion of scalability and replication efficiency, show how to measure the improvement in replication effort for segmented databases.

7.1 Future work

We see a number of possible extensions to our work, and this paper is intended as a research proposal in the area. An implementation of segmentation in DeeDS is necessary for a full validation of the initial work and for a better understanding of segmentation and its limitations. As an immediate next step we will propose a low-level design for the implementation in DeeDS, for the purpose of investigating the effect on replication effort and scalability in a typical application, such as WITAS, together with an analysis of which parameters influence replication effort and scalability in practice.

Further, the proposed segment properties must be validated as sufficient for the application. We have chosen a small set of segment properties, but an application may require a larger set to support its semantics. Increasing the set of segment properties risks an exponential increase in the number of segments; to avoid this, properties may need to be grouped into meaningful application profiles. A deeper analysis of the needs of different applications could result in a more comprehensive set of segment properties. We have previously outlined concepts for supporting levels of consistency between segment replicas and other useful segment properties. In particular, higher replication predictability may be achieved by scheduling propagation [9] and using bounded replication, based on timeliness requirements from the applications at different nodes.

Currently we focus on a static description of segments, their properties, and their allocation. For many applications the need for data changes with different modes of operation, which motivates dynamic allocation of segments to nodes. Allocation and de-allocation of segments could be done in a way similar to how virtual memory is handled in an operating system, by using database recovery techniques. Segment recovery is an issue that needs to be connected to recovery of distributed databases in general; we may recover segments incrementally in priority order (and from various sources) for startup and recovery of the database system. In our syntax for the specification of segments we have defined the recover-from and storage keywords [9], but we have not actually used this information yet. By explicitly specifying the storage for a segment, we can support disk-based segments and segments that can be swapped in and out of memory. Once we support dynamic allocation of segments, segment storage can be handled more easily.

A more detailed model of replication effort is necessary for a better understanding of the potential efficiency improvement. Our current model is limited to a simple definition of replication throughput, accompanied by a discussion of the parameters that influence scalability. The model needs to be refined so that update access patterns can be described in greater detail for particular applications. Factors such as arrival rates, arrival distributions, and the size of updates are application dependent. Architectural factors also influence the replication effort, such as how many network messages are used for propagating updates (batch updates, broadcast updates, etc.). An extensive evaluation model also needs to consider object and segment sizes and the actual size in bytes of update messages. Replication effort may need to incorporate parameters other than the number of messages over the network: the database system may replicate over a variety of network links of different quality and cost, and for that reason we may need to add network propagation cost, network delays, or other parameters to our model of replication effort.
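A refined replication-effort model of the kind discussed above could weight each update by such application- and architecture-dependent factors. The following sketch is entirely our own illustration; every parameter (batch size, link cost, byte sizes) is an assumed placeholder, not a measured or proposed value.

```python
# Sketch of a refined replication-effort model (all parameters illustrative).
# Effort is no longer a bare message count: each update is weighted by its
# arrival rate, message size, batching factor, and a per-link network cost.

def replication_effort(updates, batch_size=1, link_cost=1.0,
                       bytes_per_object=64, header_bytes=32):
    """updates: list of (replication_degree, updates_per_second) per object.
    Returns (weighted messages/s, bytes/s) under the stated assumptions."""
    msgs = 0.0
    payload = 0.0
    for degree, rate in updates:
        replicas = degree - 1                  # origin node is already consistent
        msgs += rate * replicas / batch_size   # batching amortizes messages
        payload += rate * replicas * bytes_per_object
    total_bytes = payload + msgs * header_bytes
    return msgs * link_cost, total_bytes

m, b = replication_effort([(3, 10.0), (5, 2.0)])
print(m, b)   # 28.0 2688.0
```

Even this simple parameterization shows how batching trades message overhead against propagation delay, which matters for the timeliness-based bounded replication mentioned above.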

7.2 Acknowledgements

We extend our thanks to members of the DRTS group, in particular Sanny Gustavsson and Marcus Brohede.

References

[1] N. Adly, M. Nagi, and J. Bacon. A hierarchical asynchronous replication protocol for large scale systems. In Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems, pages 152–157, 1993.
[2] S. Andler, J. Hansson, J. Eriksson, J. Mellin, M. Berndtsson, and B. Eftring. DeeDS towards a distributed and active real-time database system. SIGMOD Record, 25(1):38–40, March 1996.
[3] P. Bernstein and N. Goodman. The failure and recovery problem for replicated databases. In Proceedings of the 2nd ACM Symposium on Principles of Distributed Computing, pages 114–122, Montreal, Quebec, Aug 1983. ACM, New York.
[4] A. Birrell, R. Levin, R. Needham, and M. Schroeder. Grapevine: an exercise in distributed computing. Communications of the ACM, 25(4):260–274, April 1982.
[5] M. Brohede and S. Andler. Distributed simulation communication through an active real-time database. In Proc. 27th Annual NASA Goddard Software Engineering Workshop (SEW27 2002), Greenbelt, MD, USA, 4–6 December 2002.
[6] P. Doherty, G. Granlund, K. Kuchcinski, E. Sandewall, K. Nordberg, E. Skarman, and J. Wiklund. The WITAS unmanned aerial vehicle project. In Proc. 14th European Conf. Artificial Intelligence (ECAI), pages 747–755, Berlin, August 2000.
[7] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, 1993.
[8] R. Guy, J. Heidemann, W. Mak, T. Page Jr., G. Popek, and D. Rothmeier. Implementation of the Ficus replicated file system. In USENIX Conf. Proc., pages 63–71, June 1990.
[9] G. Mathiason. Segmentation in a distributed real-time main-memory database (HS-IDA-MD-02-008). Master's thesis, University of Skövde, Sweden, 2002.
[10] R. Mukkamala, S. C. Bruell, and R. K. Shultz. Design of partially replicated distributed database systems: an integrated methodology. In Proceedings of the 1988 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 187–196. ACM Press, 1988.
[11] M. Satyanarayanan. Distributed file systems. In S. Mullender, editor, Distributed Systems, chapter 14. Addison-Wesley, 2nd edition, 1994.