Bumper: Sheltering distributed transactions from conflicts

Future Generation Computer Systems 51 (2015) 20–35



Nuno Diegues ∗, Paolo Romano
INESC-ID, Rua Alves Redol 9, Lisbon, Portugal
Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, Portugal
∗ Corresponding author at: INESC-ID, Rua Alves Redol 9, Lisbon, Portugal. E-mail address: [email protected] (N. Diegues).
http://dx.doi.org/10.1016/j.future.2015.04.002

Highlights
• Cloud data stores that expose distributed transactions suffer from transaction aborts.
• We identify that many aborts can be avoided while preserving strong consistency.
• With Bumper, we reduced transaction aborts to nearly 0% in many workloads.
• Performance and scalability improved up to 3× in conflict-prone applications.
• Our approach uses a novel distributed protocol that scales to hundreds of servers.

Article info
Article history:
Received 6 February 2015
Received in revised form 15 March 2015
Accepted 3 April 2015
Available online 30 April 2015

Keywords:
Distributed transactions
Spurious aborts
1-copy serializability
High scalability

Abstract

Large scale cloud applications are difficult to program due to the need to access data in a consistent manner. To lift this burden from programmers, Deferred Update Replication (DUR) protocols provide serializable transactions with both high availability and high performance in read-dominated workloads. However, the inherently optimistic nature of DUR protocols makes them prone to thrashing in conflict-intensive scenarios: existing DUR schemes, in fact, avoid any synchronization during transaction execution; thus, these schemes end up aborting any update transaction whose reads are no longer up to date by the time it attempts to commit. To tackle this problem, we introduce Bumper, a set of innovative techniques to reduce transaction aborts in high-contention scenarios. At its core, Bumper relies on two key ideas. First, we spare update transactions from spurious aborts (i.e., unnecessary aborts of serializable transactions) by attempting to serialize the transactions in the past, using a novel distributed concurrency control scheme that we call Distributed Time-Warping (DTW). Second, we avoid aborts due to contention hot spots (which cannot be tackled by DTW) via a programming abstraction that we call delayed actions. These allow for efficiently serializing, in an abort-free fashion, the execution of conflict-prone data manipulations. By means of an extensive evaluation on a distributed system of 160 nodes, we show that Bumper can boost performance by up to 3× in conflict-intensive workloads, while imposing negligible (about 2.5%) overheads in uncontended scenarios.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

The advent of the cloud computing paradigm has empowered programmers with the ability to scale out their applications easily to hundreds of nodes in a distributed system. However, developing applications capable of effectively exploiting the computational capabilities of large scale distributed cloud platforms is far from being a trivial task. Data management systems help programmers to deal with this by providing the abstraction of serializable distributed





transactions. A well-established approach to implement this abstraction is that of Deferred Update Replication (DUR) [1,2]: servers replicate data, and clients submit transactions to them; servers synchronize only at the commit of a transaction, to either atomically update data across the servers or abort. This type of protocol follows an optimistic approach to concurrency control [3]: accesses within transactions are performed without enforcing synchronization, and serializability is ensured at the commit operation, during the validation of the transaction. The literature is rich in enhancements to optimistic concurrency control, for instance by relying on multi-versions [4–6]: these allow efficient read-only transactions, sparing them from any aborts and remote validations. Another key property to enhance scalability is that of genuine partial replication. Under this property, the execution

Fig. 1. Examples of executions that cause spurious aborts when using typical DUR protocols [12,4,5,13,14]: (a) an abort that time-warping can avoid; (b) an abort that delayed actions can avoid. These aborts can be prevented by using the two techniques that Bumper encompasses, namely time-warping and delayed actions.

of a transaction can only involve nodes that replicate data items it accessed [2].

1.1. Identified problems

The aforementioned DUR systems were shown to perform well, even at large scale, while providing strong semantics in the form of transactions. However, as we shall see later in the paper, the scalability of these systems can be critically challenged in conflict-prone scenarios. The main factors constraining the scalability of these systems are of a twofold nature: they relate both to the algorithms used to regulate concurrency among transactions and to the degree of parallelism admitted by the applications:

• State of the art DUR protocols rely on overly conservative validation schemes. These schemes abort an update transaction whenever any of its reads is no longer up to date by the time it requests to commit. This mechanism gained wide adoption because it can be implemented efficiently. However, we note that it does not represent a necessary condition to detect non-serializable histories [7], and, as we will show, it can induce a high number of spurious (i.e., unnecessary) aborts.
• It is well understood that the maximum degree of parallelism (and hence, of scalability) admitted by any transactional system is deeply affected by the data access patterns exhibited by the applications deployed over it [8,9]. For instance, several standard online transaction processing profiles are characterized by contention hot spots: frequently updated data items, such as the warehouse balance counters in the well-known Transaction Processing Performance Council Benchmark C (TPC-C) [10]. Transactions accessing such data items are not only inherently non-parallelizable; they are also prone to undergo repeated aborts, which can have detrimental effects on the system's throughput and user-perceived responsiveness.

1.2. Contributions

We address the issues discussed above by introducing Bumper: a set of mechanisms aimed at sheltering transactions from conflicts, thus enhancing scalability in conflict-prone scenarios while ensuring strong consistency (1-copy serializability [11]). At its core, Bumper relies on two novel mechanisms to prevent different types of conflicts: distributed time-warping (time-warping for the sake of brevity) and delayed actions.

The idea at the basis of time-warping is to prevent (a type of) spurious aborts that do not threaten the serializability of transactions. We illustrate this type of abort in Fig. 1(a): many typical DUR protocols [12,4,5,13,14] abort the update transaction T in the example; the reason is that it read item x, and transaction A concurrently committed a write to x, making the read of T stale. The intuition behind this common approach, which leads to such aborts, is that in DUR protocols update transactions are required to commit in the logical present, i.e., the snapshot observed by transaction

T must be valid taking into account every transaction committed before T. Looking back at Fig. 1(a), we see that such an approach leads to a spurious abort of T: in this scenario, it is possible to safely serialize T before A, and thus spare its abort.

A key property of time-warping is its efficiency. From a theoretical perspective, it is straightforward to design an algorithm capable of accepting every serializable history: it suffices to track the full graph of dependencies between all transactions and to ensure its acyclicity [3]. Unfortunately, this is an unbearably onerous approach, especially in a large scale system. Conversely, time-warping uses a novel, lightweight validation mechanism, which tracks only the direct dependencies developed by a transaction during its execution (such as the ones shown in Fig. 1). Not only does this mechanism prevent spurious aborts that would be caused by the validation schemes employed by traditional systems; it can also be implemented very efficiently and in a genuine fashion (i.e., by only collecting information at nodes that are involved in the distributed transaction). One of the key contributions of this article is precisely the design of an efficient implementation of time-warping. We show that this technique can effectively reduce aborts, while introducing minimal additional overhead.

Clearly, there exist limits to the aborts that time-warping can avoid. An example is shown in Fig. 1(b), where transactions C and D read and write the same data item z. Since they mutually miss each other's writes, neither one can be time-warp committed and serialized before the other. To cope with these challenging conflict patterns, we introduce a programming abstraction for distributed transactions, complementary to time-warping, which we name delayed action: a code fragment to be executed transactionally, but whose side-effects/outputs are not observed elsewhere in the encompassing transaction. By allowing programmers to wrap conflict-prone code within a delayed action, Bumper can postpone its execution until the transaction's commit procedure, at which point it can guarantee that the snapshot observed will not be invalidated by a concurrent transaction. This ensures that a delayed action cannot cause the abort of its encompassing transaction, while guaranteeing that it is atomically executed in the scope of the transaction that triggered it.

The key ideas at the basis of Bumper (i.e., time-warping and delayed actions) are applicable to several systems, such as SCORe [5], P-Store [2], Spanner [6] or D2STM [12]. In this paper, we demonstrate their practicality by integrating them with SCORe [5], a highly scalable DUR protocol that employs genuine partial replication and a decentralized multi-versioning algorithm. Our evaluation, employing 4 well-known benchmarks and 160 nodes, shows that Bumper can boost performance up to 3× in conflict-intensive workloads, with negligible (2%) overheads in uncontended scenarios.

The rest of this paper is structured as follows. We start by presenting the system model in Section 2. Section 3 presents the mechanisms at the base of Bumper, whose correctness we discuss in Section 4. We further refine the proposed solution in Section 5, and present its experimental evaluation in Section 6. Finally, we overview related work in Section 7 and conclude in Section 8.


Table 1
Description of the notations used in the paper.

Symbol             Description
Π = {n1, …, nn}    Nodes in the distributed system
owners(k)          Set of nodes that replicate item k
origin(T)          The node where transaction T begins
participants(T)    Set of nodes that replicate items written or read by transaction T
local(k)           Whether the data item k is replicated at the local node
A → T              Transaction T reads-from transaction A; i.e., A writes to some item k, A commits, and then T reads k and obtains the version installed by A
B ⇢ T              Transaction B misses transaction T; i.e., B reads some item k and, during the execution of B, transaction T writes (in writeSet(T)) a new version for k

2. System model and assumptions

In the following we introduce several notations, which we also summarize in Table 1 for the sake of clarity. We consider a classical asynchronous distributed system model consisting of a set of processes (also called nodes) Π = {n1, …, nn} that communicate via message passing and can fail according to the fail-stop (crash) model.

We assume a simple key-value model for data. For each data item k there is a chain of versions, as in typical Multi-Version Concurrency Control (MVCC) schemes [3]. Each version of k is a tuple ⟨k, val, ts⟩, where k is the key, val is its value, and ts is a scalar that totally orders the chain of k. We also assume partial replication, in which node ni stores only a partial copy of the key-value storage. Each data item is replicated by r processes in Π, and we assume that among any r replicas there exists at least one that is correct (i.e., does not crash). We denote with owners(k) the set of processes that replicate k.

We model transactions as a sequence of read and write operations on data items, preceded by a begin, and followed by a commit or an abort operation. Formally, a transaction T is a tuple ⟨id, ts, rs, ws⟩. The id is a unique identifier for T and ts is a timestamp that defines the versions visible to T. Finally, rs corresponds to the readSet(T) of data items read by T and ws is the writeSet(T) of buffered values that T wants to commit. As we will discuss in detail later, in the proposed solutions we extend this tuple with another timestamp. We define participants(T) = ∪_{k∈F} owners(k), where F = writeSet(T) ∪ readSet(T) (i.e., the footprint of T). A transaction T originates on a process ni ∈ Π, which we call origin(T), and can read/write any data item (even if not replicated locally).

We additionally define two relations between transactions (similar to those in [11]). We say that transaction T reads-from A when transaction A writes to k, commits, and then T reads k and obtains the version installed by A (we denote this by A → T). Transaction B misses transaction T when B reads k and, during the execution of B, transaction T writes (in writeSet(T)) a new version for k (we denote this by B ⇢ T).

We assume the correctness criterion of 1-copy serializability, in which every concurrent execution of committed transactions is equivalent to a serial execution involving the same transactions [11]. We assume no blind writes [3], i.e., every write to some k by a transaction is preceded by a read of k in that transaction. This simplifies the presentation of our ideas, which can be trivially extended to consider blind writes. Processes communicate via reliable FIFO-ordered channels, i.e., every message sent is eventually received in the send order as long as both the sender and the receiver are correct.
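To make this model concrete, the following is a minimal Java sketch of the version chain and transaction tuple just described. All class and field names here are illustrative assumptions, not identifiers from SCORe's code base.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the data model: each key maps to a chain of versions,
// totally ordered by a scalar timestamp ts.
final class Version {
    final Object value;  // val: the value stored by this version
    final long ts;       // ts: totally orders the version chain of the key
    Version(Object value, long ts) { this.value = value; this.ts = ts; }
}

// A transaction T = (id, ts, rs, ws): unique id, snapshot timestamp,
// read set (keys and the timestamps of the versions read), and buffered write set.
final class Tx {
    final long id;
    final long tsS;
    final Map<String, Long> readSet = new HashMap<>();
    final Map<String, Object> writeSet = new HashMap<>();
    Tx(long id, long tsS) { this.id = id; this.tsS = tsS; }
}
```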

3. Bumper: techniques to reduce aborts

We describe Bumper in the next sections by delving into the proposed conflict reduction mechanisms. We first discuss how Bumper can be applied to SCORe, a highly scalable state of the art protocol that provides 1-copy serializability and abort-free read-only transactions. By integrating Bumper with SCORe, we obtain a protocol that excels both in the management of read-dominated and write-intensive workloads. Before introducing Bumper's mechanisms, we give an overview of SCORe to provide sufficient background. Next, Section 3.2 establishes the principles of distributed time-warping transactions and how Bumper implements them. Then, we present delayed actions to bypass hot spots of contention in Section 3.3. Finally, we discuss the integration of Bumper in alternative transactional replication protocols in Section 3.4.

3.1. A primer on SCORe

We briefly overview SCORe [5], the baseline protocol chosen to present Bumper. SCORe is a scalable genuine partial replication protocol: each data item k is replicated only in a subset of the nodes. We provide a high-level overview of this layout in Fig. 2. While abstracting over many details, this example illustrates the normal flow of transactions in SCORe. A client submits a transaction T to any node, known as origin(T), and this node issues the read and write operations contained in T. This may entail remote read operations for some item k, in case origin(T) does not replicate k locally. As SCORe is a genuine partial replication protocol, T is guaranteed to only access nodes that maintain data that T accessed. This is particularly important for the efficiency of the commit phase – which validates and atomically updates the keys modified by the transaction – as it requires multiple communication steps among the nodes that participate in this phase.

In more detail, the concurrency control mechanism employed locally at each node in SCORe is similar to classic multi-versioning [3,15]: each node ni maintains multiple versions (per data item) in a chain, where each version is tagged with a scalar timestamp. This timestamp is used to totally order the commit events of transactions that update data items replicated by ni. At the core of its concurrency control, SCORe manages a distributed timestamp scheme that establishes which versions are visible to a transaction, as well as the serialization order of update transactions. The versions of the data items that are visible to T are determined via a timestamp associated with T, representing its snapshot (called tsS). This is defined when the transaction starts, by reading a logical clock local to origin(T). From that moment on, any read operation by T is allowed to observe the most recently committed version of the data item having timestamp less than or equal to T.tsS. In the proposed algorithms, we abstract this visibility rule of versions under a function called localReadSCORe.
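As a sketch of this visibility rule, localReadSCORe can be thought of as the following scan over a newest-first version chain. This reuses the illustrative Tx and Version classes from the sketch above and is only an approximation of SCORe's actual read path.

```java
import java.util.List;

final class SnapshotReads {
    // Sketch of the visibility rule abstracted as localReadSCORe: return the most
    // recently committed version whose timestamp does not exceed the snapshot tx.tsS.
    static Version localReadSCORe(List<Version> chain, Tx tx) {
        for (Version v : chain) {            // chain assumed ordered newest to oldest
            if (v.ts <= tx.tsS) {
                return v;                    // latest version visible in tx's snapshot
            }
        }
        throw new IllegalStateException("no version visible for snapshot " + tx.tsS);
    }
}
```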


Fig. 2. High-level overview of transaction processing in SCORe. Data items are partially replicated in a subset of the nodes. As a result, transactions (such as T) execute in a specific node chosen by the client and may access (i.e., read/write) data maintained by other nodes. Transaction T only contacts nodes that replicate data that T accessed: in this example, node n1 is not involved in transaction T because it does not replicate items k3 or k4.

Finally, the commit of SCORe also assigns a commit timestamp to T, called T.tsC, that establishes its serialization point. This commit procedure relies on the Two-Phase Commit (2PC) protocol. Upon receiving the prepare message from the transaction coordinator (noted as origin(T)), each participant ni acquires read/write locks on the keys read/written by T that ni replicates. Next, each participant ni validates T and sends its vote to the coordinator. If T is successfully validated at ni, then ni proposes a serialization point for T (the value of T.tsC) that is later than any serialization point it has previously observed. To this end, the coordinator merges the votes received from the participants by choosing the maximum proposed tsC for T, or aborting T if any vote is negative. Lastly, the commit message is sent to every participant ni. We abstract this final part in a function called finalizeSCORe. This function eventually mutates the items in the storage, according to the write-set being committed, and makes them available for new transactions to read. Every node in SCORe has a commit thread responsible for this procedure; this thread serially writes back committed transactions, respecting their total order at the node (defined by their tsC). Although we did not experience this to be a bottleneck, it would be possible to improve it by employing parallel commit procedures [16], which are orthogonal to the solutions proposed in this paper.

We conclude this overview by presenting Fig. 3. This execution flow illustrates the different communication steps and functions described above. We note that this execution is the same as that overviewed in Fig. 2 for transaction T. This level of detail helps in understanding the different phases of the transaction, which we shall modify to use the techniques of Bumper.

3.2. Distributed time-warping

The objective of time-warping is to allow an update transaction T, which missed some transaction A, to commit successfully. Denote with γ the set of transactions missed by an update transaction T, and let A ∈ γ be the transaction in γ with the earliest commit in real time. Time-warping tries to serialize T before A and to install the writes of T so that they are visible to transactions that serialize after A — in this case we say that T time-warp commits. This way (recalling Fig. 1(a)), Bumper would successfully time-warp commit T. In such scenarios, traditional validation would abort T, because it merely checks whether the snapshot read by T is still up to date at commit time.

In SCORe, T.tsC is used to serialize T after the transactions that T depends on. When a transaction time-warps, we additionally compute T.tsW, a scalar used to order T before the transactions it missed. Moreover, we now tag versions by assigning T.tsW to their timestamp — if a transaction does not time-warp, then T.tsW = T.tsC.

There is, of course, a limit to the conflicts we can bypass using time-warping while still ensuring serializability. For this, we define an abort condition based on a structure created by the misses relation, which we call a triad. A triad exists whenever B ⇢ T ⇢ A (where, possibly, A = B). We call such a T a pivot, because it links the transactions in the triad. The time-warp abort rule is then that a transaction fails its validation if, by committing, it would create a triad whose pivot time-warp commits. Arguments for the safety of this validation rule are provided in Section 4. The intuition is that the transaction that completes the triad witnesses a history in which the pivot is not contained, which may contradict the fact that the pivot time-warp commits.
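As a compact illustration, the rule can be reduced to two Boolean flags per transaction, as the algorithms in Section 3.2.1 do. The snippet below is only a restatement of the abort condition, not the full validation logic.

```java
final class TimeWarpRule {
    // mustTW: tx missed a committed transaction (tx ⇢ A) and must time-warp.
    // cannotTW: a concurrent transaction read data that tx overwrites (B ⇢ tx).
    // Both together would make tx the pivot of a triad B ⇢ tx ⇢ A: abort.
    static boolean passesTimeWarpValidation(boolean mustTW, boolean cannotTW) {
        return !(mustTW && cannotTW);
    }
}
```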


In Fig. 4 we show two examples of executions that illustrate how time-warping operates. On the left, in Fig. 4(a), we can see that Bumper commits T, whereas SCORe (and other traditional protocols) would abort it. This is done by serializing T before both A and B, despite T being an update transaction that commits in real time after A and B. Then, on the right, in Fig. 4(b), we illustrate a scenario in which a triad is detected: T detects that it is the pivot of a triad during validation, which causes it to abort. We note that this triad condition may still generate some spurious aborts. Indeed, it would be possible to serialize the three transactions in the order B, T, A. However, it is well known that designing a scheduler capable of accepting all (and only) serializable histories is prohibitively expensive, especially in distributed settings [7]. Hence, we argue that this triad-based check represents a sweet spot in the design space of distributed concurrency control schemes: it reduces spurious aborts without introducing large overheads. We note that, in existing DUR protocols [5,14,12], a transaction T is aborted (upon its validation) if, by committing, it would develop even one miss relation — thus rejecting many more serializable executions than Bumper. In fact, such protocols would abort T in both the examples shown in Fig. 4.

3.2.1. Detailed algorithm description

In the following we describe in detail how to implement distributed time-warping by extending SCORe. To this end, the main change consists in the computation of the timestamp T.tsW. We add to each key k an associated stamp (k.readStamp) that represents the latest access of any transaction that read k. A key's read-stamp contains two scalars: the stamp itself and the identifier of a transaction. Moreover, each version tuple now also has a Boolean, stating whether it was installed by a time-warped transaction, and the scalar tsC from the transaction that installed it.

We present our ideas together with Algorithms 1–3. We highlight the line numbers corresponding to the extension of Bumper over SCORe, using different symbols next to the number to distinguish between time-warping extensions (vertical bar |) and delayed action extensions (circle ◦). Moreover, we also summarize in Table 2 the data structures used, along with their descriptions.

3.2.2. Transaction execution

We start with Algorithm 1. In the begin operation of a transaction tx, we assign a unique identifier and a snapshot timestamp tsS to tx. In the write operation we just buffer the values locally for deferred update (as in SCORe).

The read operation for key k first checks for a potential read-after-write. Then, it sends a request to the nodes that replicate k and waits for a reply. If k is replicated locally, the communication step is skipped. The only extension required by the time-warping mechanism in this phase is line 18. We first note the following: with time-warping, the prefix of versions visible to tx according to tx.tsS is not stable, because a concurrent transaction U may time-warp commit and serialize before the point in time established by tx.tsS (by having U.tsW ≤ tx.tsS). Thus, that line in the algorithm serves to guarantee that tx cannot miss any update transaction that time-warp commits and serializes before tx.tsS, by ensuring either that:
1. tx safely reads-from U (despite being a concurrent transaction);
2. or, alternatively, the commit procedure of U notices that tx ⇢ U, which, as we will see, forbids U from time-warp committing.


Fig. 3. Communication steps involved in processing and committing transaction T. As in Fig. 2, T reads k3 and k4 and writes to k4.

Fig. 4. Examples of executions where T develops miss relationships (e.g., T ⇢ A and T ⇢ B): in (a), T is committed by serializing it before A and B, whereas SCORe (like other typical protocols) would abort T; in (b), a dangerous triad is detected and T is aborted, as in typical protocols.

Table 2
Description of the data structures used in the algorithms. The underlined fields were not originally present in SCORe.

Type         Field       Description
Transaction  id          Unique identifier for the transaction
             ro          Whether the transaction is read-only
             tsS         Timestamp of the snapshot visible to read operations
             tsC         Timestamp reflecting the commit order of the transaction
             tsW         Timestamp reflecting the serialization order of the transaction
             mustTW      Whether the transaction must time-warp to commit
             cannotTW    Whether the transaction cannot time-warp to commit
             writeSet    The set of keys and values written by the transaction
             readSet     The set of keys and timestamps read by the transaction
             delayed     Delayed actions that are pending execution
Key          versions    Set of committed versions for the key
             readStamp   Timestamp of the last time the key was read
Version      tsC         Timestamp reflecting the commit order of the transaction that committed this version of the key
             tsW         Timestamp reflecting the serialization order of the transaction that committed this version of the key
             timeWarped  Whether it was committed by a time-warped transaction

In the second case, tx witnessed a serialization order that did not include U. Hence, if we let U time-warp commit and serialize before tx, we would obtain a non-serializable history. This step is only performed for read-only transactions at this point, whereas update transactions perform it during the distributed commit (as we shall discuss further). To achieve the required safety guarantees, we use the tuple ⟨stamp, id⟩ from k.readStamp. Then, in function updateReadStamp, we derive a new stamp from the last commit timestamp known at nodei, and advance the stamp of k along with the transaction identifier (line 28). In case more than one transaction updates a given stamp, the corresponding identifier becomes φ (line 29), as sketched below.
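The following Java sketch restates this read-stamp maintenance (Algorithm 1, lines 23–30). The ReadStamp class and the WILDCARD constant are illustrative stand-ins for the ⟨stamp, id⟩ tuple and φ.

```java
final class ReadStamp {
    static final long WILDCARD = -1L;  // stands in for φ: several readers
    long stamp;                        // latest logical time at which the key was read
    long readerId;                     // single known reader at this stamp, or WILDCARD

    // Advance the key's read-stamp to the node's last committed timestamp; if the
    // stamp cannot advance, another transaction already read at this logical time,
    // so the reader identifier collapses to the wildcard (φ).
    synchronized void update(long nodeLastCommit, long txId) {
        if (nodeLastCommit > stamp) {
            stamp = nodeLastCommit;
            readerId = txId;
        } else {
            readerId = WILDCARD;
        }
    }
}
```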

Together with the visibility rules for versions, these actions ensure that read-only transactions always observe a consistent (1-copy serializable) snapshot, despite time-warping transactions. Therefore, read-only transactions skip the distributed commit that we describe next for update transactions.

To conclude this section, we revisit the execution presented earlier in Fig. 1(a). In that execution, transaction T is aborted by SCORe, because it reads an item that is written by a concurrent transaction A that commits before T. In Fig. 5 we show that, by employing the proposed algorithm, T can be serialized before A, even though T commits after A in real-time order.

3.2.3. Distributed commit

When an update transaction tx requests to commit, a Two-Phase Commit (2PC) is triggered by sending the prepare message to all participants(tx) (line 57), i.e., the nodes replicating keys accessed by tx.


Algorithm 1 Bumper pseudo code 1/3.
1:  begin(Transaction tx, bool ro) in nodei = origin(tx)
2|    tx.mustTW ← tx.cannotTW ← false
3|    tx.ro ← ro                                  ◃ optimize for read-only txs
4|    tx.id ← getUniqueId()
5|    tx.tsS ← nodei.lastCommit                   ◃ obtain the snapshot of visible versions
6|    tx.tsW ← ⊥                                  ◃ initialize special value: no time-warp
7:  write(Transaction tx, Key k, Value v) in nodei = origin(tx)
8:    tx.writeSet ← tx.writeSet ∪ ⟨k, v⟩          ◃ defer write to commit-time
9:  read(Transaction tx, Key k) in nodei = origin(tx)
10:   if k ∈ tx.writeSet then
11:     return tx.writeSet.get(k)                 ◃ read-after-write case: return deferred write
12:   send ReadReq[k, tx] to all nj ∈ owners(k)
13:   wait ReadReply[val] from any nj ∈ owners(k)
14:   return val
15◦ delayAction(Transaction tx, Action code) in nodei = origin(tx)
16◦   tx.delayed ← tx.delayed ∪ code              ◃ delay the action similarly to writes
17: upon receive ReadReq[k, tx] in nodej ∈ owners(k)
18|   if tx.ro then updateReadStamp(tx, k)        ◃ make the read access visible
19:   acquireLock(k, Shared)
20:   val ← localReadSCORe(k, tx)                 ◃ delegate the local read to SCORe
21:   releaseLock(k)
22:   reply ReadReply[val]
23| updateReadStamp(Transaction tx, Key k) in nodei ∈ owners(k)
24|   atomically do {
25|     ⟨stamp, id⟩ ← k.readStamp
26|     newStamp ← nodei.lastCommit               ◃ make read visible at present time
27|     if newStamp > stamp
28|       k.readStamp ← ⟨newStamp, tx.id⟩         ◃ update timestamp for one tx
29|     else k.readStamp ← ⟨stamp, φ⟩             ◃ φ = several readers
30|   }

Algorithm 2 Bumper pseudo code 2/3.
31: upon receive prepare[tx] in nodei ∈ participants(tx)
32:   for all k ∈ tx.writeSet : local(k) do
33:     acquireLock(k, Exclusive)
34|     ⟨stamp, id⟩ ← k.readStamp
35|     ◃ check if any concurrent transaction B read data item k (B ⇢ tx)
36|     if stamp ≥ tx.tsS ∧ id ≠ tx.id then
37|       tx.cannotTW ← true                      ◃ the concurrent B missed this tx
38:   for all ⟨k, ts⟩ ∈ tx.readSet : local(k) do
39:     acquireLock(k, Shared)
40:     ◃ check concurrently installed versions missed by tx (i.e., ∃A : tx ⇢ A)
41|     for all V ∈ k.versions : V.tsC > tx.tsS ∧ V.ts ≠ ts
42|       if V.timeWarped ∨ k ∈ tx.writeSet then
43|         reply vote[no, tx]                    ◃ completes a triad; may not be serializable
44|         return
45|       tx.mustTW ← true                        ◃ tx missed some A (i.e., ∃A : tx ⇢ A)
46|       tx.tsW ← min(tx.tsW, V.tsC)             ◃ compute time-warp timestamp for tx
47|     updateReadStamp(tx, k)                    ◃ make tx's reads visible at commit-time
48◦   if tx.delayed not empty ∧ tx.mustTW then
49◦     reply vote[no, tx]                        ◃ restart with eager delayed actions
50◦     return
51◦   for all action ∈ tx.delayed : local(action) do
52◦     for all k ∈ action.keySet() do
53◦       acquireLock(k, Delayed)                 ◃ prepare to execute delayed actions
54:   tx.tsC ← fetchAndInc(nodei.nextId)          ◃ increment logical present time
55:   reply vote[yes, tx]
The prepare phase at nodei incorporates the proposed time-warp validation (to find dangerous triads), whose outcome is sent along with the vote of the participant. We recall that tx is aborted if it completes a triad in which the pivot would time-warp commit. For this, we use the flags tx.mustTW and tx.cannotTW — a dangerous triad exists if both flags are true. Triads can be detected in two cases: when tx time-warp commits and becomes a pivot, the triad is detected by the coordinator of the 2PC upon merging the votes; when tx is not the pivot, but instead completes a dangerous triad, this is detected in line 42. The computation of those flags takes place after the acquisition of the locks associated with the keys (as explained for SCORe: shared mode for reads and exclusive mode for writes):
1. In lines 32–37, we check for a possible miss B ⇢ tx by verifying whether k.readStamp was increased concurrently with the execution of tx (this also checks that the only reader is not tx itself). In such a case, tx cannot time-warp commit, because B already witnessed the absence of tx at that logical time.
2. In lines 38–47, we check for a possible tx ⇢ A for every k read by tx. To do so, we obtain the versions of k installed concurrently with the execution of tx and verify whether tx read them. If tx did not read such a version, tx must time-warp commit to correctly serialize before A.

Given this second case, we note that we immediately abort tx if we can already deduce the existence of a dangerous triad at that point (line 42): (1) tx would complete a dangerous triad where A is the pivot; or (2) tx and A form a cycle, as both read and write k concurrently, which is a particular case of the definition of a triad.

Otherwise, we update tx.tsW in line 46, which represents the point in time at which the writes of tx will be installed. To respect the fact that tx ⇢ A, we take the minimum of tx.tsW and A.tsC, so that tx serializes before A (hence why we keep tsC in the installed versions, in line 80): this ensures that the resulting time-warp serializes tx before the set of transactions it missed.

At this point (line 47) update transactions also call updateReadStamp, which read-only transactions invoke during execution (since they have no commit phase). This is required to prevent an update transaction U from updating the read stamp of a datum k before validating a possible write that it has on k; otherwise, U could ‘‘shadow’’ a concurrent read stamp update and fail to notice a miss relationship.

After conducting this novel validation at participant nodei, we let SCORe propose tx.tsC by atomically incrementing a local scalar. Then, the participants reply to the coordinator, and the votes are merged in lines 59–64. The validation flags are also merged: this allows the coordinator to check whether participants(tx) detected the dangerous triad that we disallow. Next, if tx did not miss any transaction, we set tx.tsW = tx.tsC. This corresponds to the case in which typical validation schemes (such as the original one in SCORe) would not abort tx, so we do not time-warp commit. Upon receiving the commit decision, a participant nodei relays this event to SCORe. Eventually, this invokes the write-back mechanism, which we also extended: we tag versions using tx.tsW and add the other metadata that we described.
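To make the coordinator's merge step concrete, here is an illustrative Java restatement of lines 59–70 of Algorithm 3 below. The Vote and Coordinator classes are hypothetical containers, not part of SCORe's or Bumper's actual code.

```java
import java.util.List;

final class Vote {
    final boolean yes;
    final long tsW, tsC;
    final boolean mustTW, cannotTW;
    Vote(boolean yes, long tsW, long tsC, boolean mustTW, boolean cannotTW) {
        this.yes = yes; this.tsW = tsW; this.tsC = tsC;
        this.mustTW = mustTW; this.cannotTW = cannotTW;
    }
}

final class Coordinator {
    long tsW = Long.MAX_VALUE;   // stands in for ⊥: no time-warp yet
    long tsC = 0;
    boolean mustTW, cannotTW;

    // Merge participant votes: abort on any negative vote or on a dangerous triad
    // (the transaction must time-warp but some participant forbids it); otherwise
    // commit with the maximum proposed tsC (SCORe rule) and the minimum tsW.
    boolean mergeVotesAndDecide(List<Vote> votes) {
        for (Vote v : votes) {
            if (!v.yes) return false;
            tsW = Math.min(tsW, v.tsW);
            tsC = Math.max(tsC, v.tsC);
            mustTW |= v.mustTW;
            cannotTW |= v.cannotTW;
        }
        if (mustTW && cannotTW) return false;  // dangerous triad: abort
        if (!mustTW) tsW = tsC;                // no time-warp: serialize at the present
        return true;
    }
}
```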


Fig. 5. Detailed execution of transactions in Fig. 1(a) with Bumper. Transaction T is able to commit due to time-warping.

Algorithm 3 Bumper pseudo code 3/3.
    ◃ commit procedure: join the votes of the participants and decide
56: commit(Transaction tx) in nodei = origin(tx)
57:   send prepare[tx] to all nj ∈ participants(tx)
58:   for all nj ∈ participants(tx) do
59:     wait vote[_, votej] from nj               ◃ _ may be YES or NO
60|     tx.tsW ← min(votej.tsW, tx.tsW)           ◃ possibly time-warp to the past
61|     if votej.mustTW then
62|       tx.mustTW ← true
63|     if votej.cannotTW then
64|       tx.cannotTW ← true
65:   if (∃ vote[no, votej]) ∨ (tx.mustTW ∧ tx.cannotTW) then
66:     send abort[tx] to all nj ∈ participants(tx)
67:     return abort
68:   tx.tsC ← max(votes.tsC)                     ◃ use the ‘‘most present’’ known (SCORe rule)
69|   if not tx.mustTW then                       ◃ if tx does not have to time-warp
70|     tx.tsW ← tx.tsC                           ◃ then tx serializes at the present
71:   send commit[tx] to all nj ∈ participants(tx)
72: upon receive commit[tx] in nodei ∈ participants(tx)
73:   atomically do {
74◦     for all action ∈ tx.delayed : local(action) do
75◦       action.execute()                        ◃ execute delayed actions; guaranteed abort-free
76:     finalizeSCORe(tx)                         ◃ eventually invokes the writeBack function described next
77:   }
    ◃ invoked for each write of tx when it is ready to commit
78: writeBack(Transaction tx, Key k) in nodei : local(k)
79|   newVersion.ts ← tx.tsW                      ◃ version the write with the serialization timestamp
80|   newVersion.tsC ← tx.tsC
81|   newVersion.timeWarped ← tx.mustTW
82|   k.prependNewVersion(newVersion)

3.3. Delayed actions

A delayed action corresponds to a part of the application code that is encapsulated in a transaction, but that can be executed outside the normal flow of execution of that code and postponed until the transaction's commit (line 16). This is possible whenever the output generated, or the state updated, by a portion of code of the transaction is not required (i.e., read) elsewhere within that transaction.

To help understand how delayed actions can be used in realistic applications, we illustrate in Fig. 6 how they can be employed. For that, we use the Payment transaction profile of the well known TPC-C benchmark [10]. In this transaction there is a contention hot spot on the warehouse yield, as this information is updated by all transactions that target a given warehouse. As a matter of fact, we note that the warehouse yield does not affect the outcome of the payments — it is only important that it be updated within the transaction, to ensure correct accountability checks, and possibly be made available after the transaction commits (e.g., to display it). We address situations like this by exploiting delayed actions. These are executed at the end of the distributed commit, after the lock acquisition phase — this way

Fig. 6. Example of a contention hot spot in the Payment transaction of TPC-C [10]. Avoiding aborts requires only extracting the contending piece of code into a function and using the API for delayed actions.
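The figure itself is an image; the following hypothetical Java fragment sketches the refactoring it depicts, using the delayAction operation of Algorithm 1. The Warehouse type, the accessor names, and the lambda-style action are all assumptions made for illustration.

```java
final class PaymentExample {
    // The warehouse yield update is the contention hot spot, so it is extracted into
    // a delayed action and postponed to commit time, where it runs over the freshest
    // committed data and can no longer be invalidated by a concurrent commit.
    void payment(Tx tx, String warehouseKey, double amount) {
        // ... conflict-free part of the Payment logic executes normally in tx ...
        delayAction(tx, () -> {
            Warehouse w = (Warehouse) delayedRead(tx, warehouseKey);
            w.yield += amount;                   // hot-spot update, now abort-free
            delayedWrite(tx, warehouseKey, w);
        });
    }

    // Assumed primitives, mirroring the operations of Algorithms 1 and 5:
    static void delayAction(Tx tx, Runnable action) { /* register action in tx.delayed */ }
    static Object delayedRead(Tx tx, String key) { /* read in the action's snapshot */ return null; }
    static void delayedWrite(Tx tx, String key, Object value) { /* buffer delayed write */ }
    static final class Warehouse { double yield; }
}
```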

we can guarantee that the delayed reads observe the most recent version available at the time at which the transaction commits. Hence, no concurrent transaction can commit and invalidate those reads, thus making delayed actions abort-free. In the example in Fig. 6, we show that the programmer merely needs to inspect the code and extract such contention hot spots into delayed actions.

A first challenge is how to ensure that delayed actions can be executed efficiently in a partially replicated setting. Since they are executed during the distributed commit, it is desirable to ensure that, whenever a delayed action is executed at a node ni, it only accesses data locally stored by ni, to avoid involving more nodes in the commit. We achieve this by having delayed actions abide by the following programming paradigm. The programming interface of delayed actions requires implementing a method that can possibly return some value (analogously, e.g., to the Callable interface in Java). When defining a delayed action, programmers are further required to identify (a possible over-approximation of) the keys it manipulates. We assume that these keys can all be read or written, and based on them we compute the executors, the set of nodes where the delayed execution is to be performed. In the prepare phase, the delayed action is registered at those nodes by piggybacking the corresponding keys on the prepare message. During its execution at node ni, a delayed action may only access data locally maintained by ni; otherwise the transaction is aborted (line 49) and restarted, this time executing the delayed actions ‘‘eagerly’’, i.e., within the transaction, without postponing them. Note that there already exists a need for co-location of data, to take advantage of the genuineness of the partially replicated system, so we argue that this constraint is not overly restrictive in practice. Moreover, we assume deterministic computations, as these are executed by different replicas. Finally, we allow the delayed action instances (one per node in executors) to return results, which are then reduced at the coordinator via a programmer-defined operator.
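A plausible shape for this interface, analogous to Java's Callable as the text suggests, is sketched below; the names are illustrative, not taken from the actual implementation.

```java
import java.util.Set;

// The programmer provides the action body plus (an over-approximation of) the keys
// it may read or write; these keys determine the executor nodes and are piggybacked
// on the prepare message. The body must be deterministic, as every replica runs it.
interface DelayedAction<R> {
    Set<String> keySet();
    R execute();
}
```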


Fig. 7. Execution of two transactions C and D, which manipulate a contention hot spot z via delayed actions. This detailed execution is similar to the overview in Fig. 1(b). We omit other reads and writes in these transactions for simplicity of presentation.

Table 3
Mutual exclusion between the different lock modes in Bumper. The new delayed mode allows multiple transactions to concurrently acquire the lock in that same mode, while preventing normal read and write operations of other transactions from executing concurrently with the execution of delayed actions (× marks mutually exclusive modes, ✓ compatible ones).

Lock mode   Read   Write   Delayed
Read        ✓      ×       ×
Write       ×      ×       ×
Delayed     ×      ×       ✓

A second challenge is to regulate concurrency between transactions encompassing delayed actions and regular transactions. Suppose that k is incremented concurrently by transactions L1 and L2 using delayed actions, and by a regular transaction T. Intuitively: (1) we want L1 and L2 to proceed concurrently (ensuring that their effects are serializable); (2) we want T to detect the concurrent read and write of k performed by the delayed actions; and, finally, (3) we do not want either L1 or L2 to abort because T committed first and manipulated k.

To correctly address this challenge, we delegate the processing of delayed actions to the commit thread available at every node, which is responsible for serially writing back the updates produced by transactions in the total order defined by their commit timestamps (tsC). This allows the distributed commit of L1 and L2 to progress concurrently during the execution of 2PC, until their delayed actions are executed sequentially within the commit thread, before their write-back (line 75). To guarantee correctness when some transaction T conflicts with L1, we rely on the lock acquisition for the keys to be manipulated in the delayed actions. Differently from normal writes, which acquire an exclusive lock at prepare time, these keys might be locked by several transactions (line 53) and written by their delayed actions (within the commit thread, in line 75). Therefore we safely allow L1 and L2 to share locks over the keys to be manipulated by their delayed actions. For this reason, we created a delayed mode for the locks associated with each key, which may be shared by delayed actions and is mutually exclusive with both the read and write modes. Note that delayed actions encapsulate only conflict-prone portions of the code that would otherwise abort the transactions containing them. As such, by extracting these portions into delayed actions, we can effectively allow concurrency between transactions that would otherwise be sequentialized due to conflicts. The rules for mutual exclusion of these lock modes are summarized in Table 3.

To summarize the protocol described in this section, we present a detailed execution for the contending transactions of Fig. 1(b). Recall that those transactions C and D manipulate a contention hot spot. In Fig. 7, we detail how delayed actions allow both transactions to commit successfully. We can see that both transactions register the need to execute a delayed action in the metadata of the transaction. Furthermore, they also register the keys that will be accessed within those delayed actions. Then, in the prepare phase, the locks for the delayed actions are acquired in delayed mode. Lastly, upon the reception of the commit message, the delayed actions are executed at the node that replicates the manipulated data (n1, which replicates k1).
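The compatibility rules of Table 3 boil down to a small check. The sketch below assumes standard shared read locks, with the new delayed mode shared only among delayed actions.

```java
enum LockMode { READ, WRITE, DELAYED }

final class LockTable {
    // Two holders may share a lock only in Read/Read or Delayed/Delayed mode; every
    // other combination is mutually exclusive (see Table 3).
    static boolean compatible(LockMode held, LockMode requested) {
        return (held == LockMode.READ && requested == LockMode.READ)
            || (held == LockMode.DELAYED && requested == LockMode.DELAYED);
    }
}
```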

Recall that delayed actions are executed sequentially by the commit thread running on each node.

Finally, we consider a transaction T that must time-warp commit (because it has at least one miss). If T has any pending delayed actions to be executed, this can pose a problem: the time-warp commit serializes the transaction at a past point in time, but the delayed actions are serialized in the present. So there is a duality between the intents of the two mechanisms: time-warping tries to deal with stale data, whereas delayed actions are designed to always operate over fresh data. We further study these challenges in Section 5, where we extend the protocol with time-warpable delayed actions. To isolate such complexity from the contributions presented so far, for the moment we reconcile these two mechanisms by simply aborting transactions that must both time-warp and execute delayed actions — this causes the delayed actions to be executed in the normal way (thus eagerly) upon re-execution (line 49). A more efficient reconciliation is then presented in detail in Section 5.2.

3.4. Integration with alternative protocols

To highlight the generality and portability of the conflict reduction mechanisms of Bumper, we briefly discuss how they may be adapted to the case of full replication, or to systems that do not use MVCC. In fully replicated systems, all state is available in every node. In this case we note that it is possible to relax some of the constraints on the programming paradigm of delayed actions, as it would no longer be necessary to act upon just a group of keys that is known to be co-located. Hence, a priori knowledge of the footprint of delayed actions would not be needed.

On the other hand, it is possible that the local concurrency control does not support MVCC, and instead maintains only a single version of each data item. In that case, the time-warping mechanisms can still be applied, as long as there exists a timestamping mechanism, since this allows for reasoning about the concurrency of events. This is typically available in optimistic concurrency control schemes such as those used in most of the systems that we mentioned [2,5,6]. For delayed actions, we also consider that there may exist multiple threads performing the write-back of non-conflicting transactions in parallel. In this case one can resort to alternative locking mechanisms to correctly serialize delayed actions that contend. Further, delayed actions can also be used in systems with lower isolation levels (e.g., Snapshot Isolation [17]), although this research direction is outside the scope of this paper. As an example, we have also applied delayed actions to a transactional distributed index [18].

In summary, the ideas proposed here assume an underlying system that uses a timestamp-based concurrency control mechanism. As long as the system provides 1-copy serializability, as is the case for the protocol that we extend in this paper (SCORe [5]),


transitively), and thus A.tsC ≤ B.tsS. Then, because T.tsS < A.tsC and A → B (even if transitively), we obtain T.tsS < B.tsS. These conclusions are also visible in the timeline drawn in Fig. 8. Because T.tsS < B.tsC, T can only read-from B if B time-warps to before T.tsS — which implies that B misses some transaction C. At this point we can summarize the restrictions devised so far and reach a contradiction that forbids B → T: we have T.tsS < B.tsS and, for B ⇢ C, it must be that B.tsS < C.tsC; since B.tsW is derived from the tsC of the transactions that B missed, we get T.tsS < B.tsS < B.tsW, so B cannot time-warp to before T.tsS, and B → T is impossible. Thus all executions in H_B are serializable, and Bumper preserves 1-copy serializability.

4.2. Delayed actions

then Bumper preserves that criterion and reduces conflicts that lead to aborts of transactions. This intuition shall become clearer as we discuss the correctness of the proposed solution in the following section by reasoning on how Bumper reorders transactions depending on their timestamps in order to reduce spurious aborts. 4. Discussion of correctness We now show that applying Bumper to SCORe preserves the latter’s isolation level (1-copy serializability) when considering time-warping. In Section 4.2, we reason on the correctness of delayed actions. Finally, we discuss the impact of Bumper on failure recovery. 4.1. Time-warping We start by considering the fail-free executions that SCORe accepts (named H G ), and show that Bumper also accepts them. Then we consider the fail-free executions that Bumper accepts, but that SCORe rejects (named H B ), and show that they are necessarily serializable. We begin by noting that SCORe always aborts a transaction T whenever Bumper tries to time-warp commit T . Then, we can show for H G that, when Bumper aborts, this implies that SCORe also aborts: lines 66 and 42, where Bumper aborts tx, correspond to the cases in which tx has at least a miss and so SCORe aborts tx. We now consider an arbitrary execution, which can be extended to match every execution in H B , and show that it is serializable. Recalling classic serializability theory results [7], we say that an execution is serializable if the graph of read-from and miss relations (edges) between transactions (nodes) is acyclic. In the following we shall derive a set of restrictions on this arbitrary execution until we reach an absurd, which we also exemplify in Fig. 8. Since we are considering the set of executions that are accepted by Bumper, but rejected by SCORe, they must contain one timewarped transaction T that missed transaction A. This implies that some transaction A commits before transaction T does so (such that T notices the miss and time-warps) and also that A is concurrent with T (otherwise no miss would occur). Also note that A cannot have time-warp committed as well, as that would have triggered the abort of T in line 42. The same line (but its second condition instead) would be triggered if A missed T (which would have created a cycle between A and T ). It is also impossible for T to readfrom A because they are concurrent and T has already missed A; this would require A.tsW ≤ T .tsS , but we already stated that A cannot time-warp commit, so we actually have T .tsS < A.tsW = A.tsC . Now consider some transaction B that misses T : this would create a triad (detected either by B in line 42 or by T in line 66). So this arbitrary execution can only contain a cycle if T reads-from B. Consider, without loss of generality, that B reads-from A (perhaps

When considering delayed actions, we explain why we can disregard the validation of these postponed manipulations, and what guarantees the consistent evolution of the state of the replicas of the same set of keys. We start by recalling that the delayed actions of a transaction T are executed after a commit decision for T arrives at the commit thread of a node nodei ∈ participants(T) — thus the corresponding locks will have been acquired in the prepare phase of the commit procedure. This ensures mutual exclusion, when accessing some key k, between (1) the accesses inside delayed actions and (2) the validations of transactions that access k normally. In addition, because the execution takes place in the commit thread, delayed actions cannot observe concurrent transactions. Given that we restricted T not to time-warp when it has delayed actions, we have T.tsW = T.tsC; therefore, both the normal and the delayed portions of T are serialized at T.tsC. Finally, the consistent evolution of the state of the various replicas of a key is ensured given that (i) SCORe ensures that replicas of the same keys apply updates according to the same total order [5], and (ii) we assumed delayed actions to be deterministic.

4.3. Dealing with failures

Bumper, when layered upon SCORe, inherits its virtues and issues for what concerns fault-tolerance. We refer to the original SCORe paper [5] for a detailed discussion on how to deal with failures. We briefly recall that, due to the reliance of SCORe on Two-Phase Commit, it is necessary to adopt additional mechanisms (such as replication techniques ensuring high availability of the coordinator's state [19]) in order to avoid blocking in spite of failures of the coordinator. Note that Bumper does not introduce any additional complexity or drawback for what concerns failure handling: this enhances the relevance and practicality of the proposed solution.

5. Time-warpable delayed actions

So far we have presented distributed time-warping and delayed actions to reduce the conflicts in serializable distributed transactions. We now additionally study, from the protocol perspective, whether it is possible to reconcile transactions that both time-warp and exploit delayed actions. To do so, we explain the challenges in Section 5.1, describe the solution in Section 5.2, and present correctness arguments for the proposed solution in Section 5.3.

5.1. Preliminary considerations

We begin in Fig. 9 with an execution example portraying the problems that arise when allowing a transaction to both time-warp and execute delayed actions. We highlight two facts: transaction B executes a delayed action whose footprint intersects with transaction T; further, T time-warp commits such that T.tsW < B.tsC. Hence T exemplifies the scenario in which a transaction both time-warps and exploits delayed actions.


Fig. 9. Example of a transaction that uses both time-warp and delayed actions.

Note that the execution of the delayed action by T is not defined in the figure, because we have not yet specified the behavior of time-warpable delayed actions. This simple case illustrates the challenge we must address. If the delayed action of T executes in the present logical time (i.e., at T.tsC), it shall read z:1 and produce z = 2. However, this breaks the atomicity of T, because it publicizes a write to y with version T.tsW and a write to z with version T.tsC. Naturally, this may allow concurrent transactions to read only a subset of the writes of T, which is undesirable. The alternative is for T to execute its delayed action in the snapshot it time-warp commits into (i.e., at T.tsW). In that case it shall read z:0 and produce z = 1. Although this is coherent for that snapshot, it ignores the fact that B has already created a more recent version of z (according to the logical time-warp order). As a result, we would generate a non-serializable history (one missing an increment to z), which is also undesirable.

5.2. Reconciling time-warping and delayed actions

We now present a protocol that supports time-warpable delayed actions. In the following algorithms, we extend the metadata used by Bumper (already listed in Table 2) with some additional fields, which we describe in Table 4. In order to tackle the aforementioned challenges, we propose a conceptually simple solution:

• The delayed actions of T are executed in the snapshot of T.tsW.
• If a write of such an action produces a version V that is not the latest one, then there is a logically more recent version V′. In that case, T retro-actively fixes this by re-running the delayed action that had initially created V′.

Hence, we allow time-warpable delayed actions to modify data that had already been committed. Naturally, it is not always possible to perform this fix. If V′ had not resulted from a delayed action, we should prevent T from time-warping. Otherwise, a number of issues would arise: for instance, it would not be possible to reconstruct the value produced by B, had B been serialized after T. Further, V′ may have already been externalized, via read operations issued either by B or by other transactions that read V′ from B.

To avoid these issues, we allow transactions to time-warp and still use delayed actions, but we impose a restriction on the footprint of their delayed actions: it may only have been accessed concurrently by delayed actions that have a void return type (i.e., that do not return any value), which we call Silent Delayed actions. This check can be safely performed during the prepare phase, after having performed the validation of the non-delayed part of the transaction.

We point out that this use case is exactly the most desirable one. The footprint of a delayed action is most likely highly contended, for which reason it should be accessed most of the time within delayed actions, to avoid conflicts. If that is the case, then we allow those transactions to still time-warp when they meet conflicts on other data items in the normal course of the transaction. This is possible because:

Algorithm 4 Time-Warpable Delayed Actions pseudo code 1/2.
 83: upon receive prepare[tx] in nodei ∈ participants(tx)
 84:   ◃ omitted lines 32–48, as in Algorithm 2, as they remain unchanged
 85:   for all action ∈ tx.delayed : local(action) do
 86:     for all k ∈ action.keySet() do
 87:       acquireLock(k, Delayed)
 88:       ⟨stamp, id⟩ ← k.readStamp
 89:       if stamp ≥ tx.tsS ∧ id ≠ tx.id then
 90:         tx.cannotTW ← true ◃ this delayed action forbids time-warp
 91:       if ∃ V ∈ k.versions : V.tsC > tx.tsS ∧ ¬V.silent then
 92:         tx.cannotTW ← true ◃ this delayed action forbids time-warp
 93:   ◃ omitted lines 54–55, as in Algorithm 2, as they remain unchanged

 94: upon receive commit[tx] in nodei ∈ participants(tx)
 95:   atomically do {
 96:     for all action ∈ tx.delayed : local(action) do
 97:       action.snapshot ← tx.tsW ◃ assign serialization point to delayed action
 98:       tx.currentAction ← action ◃ delayed action being executed
 99:       action.execute() ◃ triggers delayedRead and delayedWrite
100:     while tx.toFix not empty ∧ tx.mustTW do
101:       actionToFix ← tx.toFix.removeFirst()
102:       tx.fixed ← tx.fixed ∪ actionToFix ◃ used to avoid repeated replay
103:       tx.currentAction ← actionToFix ◃ delayed action being executed
104:       actionToFix.execute() ◃ replay the delayed action to fix its time-warp
105:     finalizeSCORe(tx)
106:   }

We now describe the changes to the pseudo-code presented earlier, shown in Algorithms 4 and 5. To simplify the presentation, we show only the relevant lines, i.e., the functions that have changed. We begin by changing the prepare phase of the distributed commit for some transaction tx. Lines 88–92 augment the management of delayed actions in that phase with additional validations: if the footprint of the delayed actions of tx was accessed by a concurrent transaction, via normal reads or writes, then tx is prevented from time-warping. Upon receiving a commit decision, we also augment the algorithm with additional management of delayed actions. Besides executing the delayed actions of tx (lines 96–99), we now also conduct a fix procedure (lines 100–104) before finalizing the transaction. The idea is that executing a delayed action can add other delayed actions to the list tx.toFix, which is iteratively processed until nothing is left to fix. In the normal case, where a delayed action does not time-warp, this part of the procedure is never triggered.


Table 4
New fields added to the data structures to support time-warpable delayed actions. These are in addition to those already listed in Table 2.

Type          Field           Description
Transaction   toFix           Set of Actions to be replayed in the commit
Transaction   fixed           Set of Actions already replayed in the commit
Transaction   currentAction   The Action being replayed in the current commit
Action        snapshot        Timestamp assigned to serialize the delayed action
Version       readActions     Set of delayed actions that read this version during a replay (via a delayedRead operation)
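As a direct restatement of Table 4, the following Java sketch shows how these fields could be carried by the transaction, action, and version records; the class shells are illustrative, while the field names and their roles follow the table:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative shells only: fields and comments follow Table 4.
    class Action {
        long snapshot;                              // serialization point (tx.tsW)
    }

    class Transaction {
        Deque<Action> toFix = new ArrayDeque<>();   // actions to replay in the commit
        Set<Action> fixed = new HashSet<>();        // actions already replayed
        Action currentAction;                       // action under replay
    }

    class Version {
        Set<Action> readActions = new HashSet<>();  // silent delayed actions that
                                                    // read this version in a replay
    }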

Algorithm 5 Time-Warpable Delayed Actions pseudo code 2/2.
107: delayedRead(Transaction tx, Key k) ◃ used during delayed execution
108:   for all V ∈ k.versions do
109:     if V.ts ≤ tx.dAction.snapshot then
110:       if ¬tx.dAction.isSilent() then
111:         updateReadStamp(tx, k)
112:       else if tx.dAction ∉ V.readActions then
113:         V.readActions.add(tx.dAction)
114:       return V.value

115: delayedWrite(Transaction tx, Key k, Value v) ◃ in delayed execution
116:   V ← k.versions.first()
117:   while V.ts > tx.dAction.snapshot do
118:     previousV ← V ◃ find the write spot in the versions...
119:     V ← V.next ◃ ...to time-warp this delayed write
120:   if V.ts < tx.dAction.snapshot then
121:     newVersion.ts ← tx.tsW ◃ inserting a new version
122:     newVersion.tsC ← tx.tsC
123:     newVersion.timeWarped ← tx.mustTW
124:     if previousV = ⊥ then
125:       ◃ This is the newest version added to the version chain.
126:       k.prependNewVersion(newVersion)
127:     else
128:       ◃ New version inserted ‘‘in the past’’.
129:       previousV.next ← newVersion
130:       newVersion.next ← V
131:   else
132:     ◃ Updating an existing version during a re-execution.
133:     newVersion.value ← v
134:   for all action ∈ V.readActions do
135:     if action ∉ tx.fixed then
136:       tx.toFix ← tx.toFix ∪ action

We additionally show the algorithms for the read and write operations of delayed actions (namely, the delayedRead and delayedWrite functions in Algorithm 5). The delayedRead returns a version that is coherent with the snapshot used for the delayed action. In lines 110–111 we further ensure that a non-silent delayed action, say T_D, stamps the keys it reads. This prevents concurrent transactions that write to such a key k (either during the normal course of the transaction or inside a delayed action) from time-warp committing before T_D: T_D cannot be safely re-executed because, being non-silent, it may have externalized a return value affected by the read of k. If the read operation is instead issued by a silent delayed action, we additionally register the action with the version returned (in case it is not already registered there). The idea is to track the actions that read, and thus depend on, each version, so that if the version ever changes (in line 133) the fix procedure can use the registered silent delayed actions to re-compute the state.

The delayedWrite writes in-place (recall that this happens after the commit decision, with locks acquired). It must consider the case where it writes to a version that already exists (line 133), due to a fix procedure. Additionally, it must consider the case where it inserts a new version (lines 120–130): either the new version is the most recent one (line 126), or it is changing the past (lines 129–130). At the end of a delayed write we also trace which delayed actions have read the version that is being overwritten. This is necessary in case the delayed action is being executed by a time-warp committed transaction T, in order to trigger the re-execution of any silent delayed actions that got serialized after T (see lines 100–104).

Finally, we point out that T need not necessarily be prevented from time-warping in lines 90 and 92. We can instead optimize by collecting the minimum snapshot to which T can time-warp, to be used by the coordinator when deciding. In this optimization, each participant collects the maximum snapshot at which the delayed footprint was read or written (the checks in lines 89 and 91) as the restriction on how far in the past T can time-warp. We have omitted this from the pseudo-code for ease of presentation.
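The following is a minimal Java sketch of this optimization, under assumed accessor interfaces (KeyView and VersionView are illustrative, not SCORe's API), and omitting, for brevity, the check that the read stamp was not produced by tx itself (cf. line 89):

    import java.util.List;

    // Assumed accessor interfaces, for illustration only.
    interface VersionView { long tsC(); boolean isSilent(); }
    interface KeyView { long readStampTs(); List<VersionView> versions(); }

    class TimeWarpBound {
        // Each participant returns a bound instead of the boolean veto of
        // lines 89-92; the coordinator may then accept any time-warp whose
        // snapshot lies above the maximum bound across all participants.
        static long footprintBound(long txTsS, List<KeyView> delayedFootprint) {
            long bound = txTsS; // no restriction observed yet
            for (KeyView k : delayedFootprint) {
                bound = Math.max(bound, k.readStampTs());   // cf. line 89
                for (VersionView v : k.versions()) {
                    if (v.tsC() > txTsS && !v.isSilent()) { // cf. line 91
                        bound = Math.max(bound, v.tsC());
                    }
                }
            }
            return bound; // tx.tsW must exceed this bound to time-warp safely
        }
    }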

5.3. Revisiting correctness

We now present arguments for the correctness of the algorithm that allows time-warpable delayed actions. In the following, we consider that k is a key that some transaction T accesses during one of its delayed actions. We first highlight that T can only time-warp, and still use delayed actions, if the footprint of its delayed actions obeys certain conditions: for each key k, we make sure that k was not read or written, concurrently with the execution of T, by some T′ outside of a silent delayed action. We recall that we forbid this because T may serialize before T′ (depending on the time-warp to be computed by the distributed prepare phase). In that case, the accesses performed by T′ should have observed the writes of T, which they did not, because T only finishes (in natural time order) after T′. As a result, it is impossible for T to time-warp, as T′ may have already externalized the absence of T's execution.

The execution of delayed actions by T happens at the logical time corresponding to T.tsW. If T does not time-warp, then we simply execute the delayed action at the present logical time. Otherwise, T executes the delayed action as if it had happened at logical time T.tsW, which is equivalent as long as no other transaction T′ wrote to some key k of the footprint. We are thus left with the situation in which some T′ wrote to k, and T must time-warp. As highlighted above, T′ may only have performed such a write via a silent delayed action, as otherwise T would have been prevented from time-warping. In this case, however, we allow T to time-warp, because we can leverage the silent delayed actions to replay their executions and reproduce a consistent state of the data store. The intuition behind this is that no transaction has leaked that state, due to the validation ensuring that no T′ read it outside of a silent delayed action. Hence, the data remains in an undetermined state, which can thus be modified by transactions even after it was committed. Therefore, for each delayed action D that T executes, and that creates or changes versions in snapshot T.tsW, we trigger the fix procedure for every silent delayed action D′ that (i) accessed versions overwritten by D, and (ii) had been committed by some T′ concurrently with the execution of T. For this to work, two things must happen:


1. The commit procedure not only executes the silent delayed actions, but also stores them associated with the versions they read, along with their snapshot version, for future use — they are garbage collected similarly to versions.
2. The writes executed during a delayed action may be new (when the delayed action is executed for the first time), or repeated (when re-executing a delayed action to fix the timeline due to a time-warp). The first case corresponds to the creation of a new version in the data store, whereas the second case overwrites the value of an existing version. This is safe because the key is locked, which prevents concurrent accesses, and also because we have the assurance that no transaction has read this version outside of a silent delayed action.

6. Experimental evaluation

We integrated Bumper into a publicly available implementation of SCORe [5], which is based on Infinispan, a mainstream in-memory data management system developed by Red Hat. This allows us to evaluate the benefits of Bumper using as baseline a highly scalable, strongly consistent, genuine partial DUR protocol. We measure both overall throughput and abort probability (note that read-only transactions do not abort). Every run uses a replication degree of 2 for fault-tolerance — hence data is considered durable once a transaction is committed. This experimental study aims at answering the following questions:

1. How much can Bumper enhance SCORe's scalability in a conflict-prone scenario?
2. To what extent can it reduce the transactions' abort rate?
3. What overheads does Bumper introduce in conflict-free workloads?

We used four well-known benchmark applications, which we will briefly describe while presenting the results. We list these benchmarks in Table 5, specifying how each benefits from Bumper. This allows us to understand which features of Bumper can be evaluated by means of the various benchmarks used in this study.

Table 5
Benchmarks used in the evaluation of Bumper. We detail whether each one of them benefits from time-warping and/or delayed actions.

Benchmark       Section   Time-warp   Delayed actions
Skip-List       6.1       ✓           ×
YCSB [20]       6.1       ✓           ×
TPC-C [10]      6.2       ×           ✓
Vacation [21]   6.4       ✓           ✓

Each reported result is the average of ten runs with exclusive access to all the machines used. We use the geometric mean whenever showing averages over normalized results. We conducted the tests on top of OpenStack, a cloud computing infrastructure, deployed in a dedicated cluster of 20 machines. Each machine is equipped with two 2.13 GHz Quad-Core Intel(R) Xeon(R) E5506 processors and 40 GB of RAM, and the machines are interconnected via a private Gigabit Ethernet. The VMs were instantiated via OpenStack and provided with 1 physical core plus 4 GB of RAM. This represents a common deployment scenario in cloud infrastructures, where customers acquire several virtual machines equipped with relatively modest physical resources. For all tests we varied the number of VMs from 10 to 160, such that they were always uniformly distributed across the physical machines. As such, we allocate up to 8 VMs per machine, leaving 8 GB of RAM to the host operating system; we verified that this amount was sufficient to ensure the management of the hosted VMs. Finally, the virtualized operating system was Ubuntu 12.04 and our prototype ran on Java HotSpot version 1.6.0_38.


6.1. Distributed time-warping

We begin with two benchmarks and workloads without any obvious contention hot spots, and thus without using any delayed action. This ensures that the benefits achieved are exclusively due to the time-warping mechanism. We start with a micro-benchmark originally proposed to evaluate transactional memory systems, which exercises a skip list data structure — a building block for many applications. The skip list maintains an ordered set of integers with an average size of 256 elements and a range of 65 thousand possible keys. This means that structural conflicts may often arise when manipulating the list, but transactions should rarely attempt to insert or remove the same element. Fig. 10(a) shows the results for a workload with 50% read-only transactions (which check the existence of a given element) and update transactions that insert and remove items. The results show a peak speedup of 2.23× at 160 nodes. As we will consistently witness, these gains are due to a considerable reduction of aborts: here, the average abort rate of update transactions drops from 15% to 0.9%.

We then consider YCSB (Yahoo! Cloud Serving Benchmark) [20], which was designed to benchmark NoSQL key-value storage systems and generates data access patterns that mimic the skewed workloads of real applications. We used a workload containing 50% update transactions, which uniformly access 16 keys and modify up to 4 keys, and 50% short read-only transactions that access a single data item. Encompassing several accesses within a transaction is similar to a recent proposal to enhance YCSB to evaluate transactional data stores [22]. The results (Fig. 10(b)) show considerable gains due to the avoidance of many conflicts via the time-warping mechanism; Bumper yielded an average speedup of 2.8× over SCORe, reaching a peak throughput of almost 13 k txs/s against 4.8 k txs/s at 160 nodes in this conflict-prone and update-intensive workload.

Fig. 10. Benefits of time-warping (no delayed actions were used in these benchmarks): (a) SkipList; (b) YCSB (with up to 160 nodes).

6.2. Delayed actions

We now move on to evaluate the benefits achievable by using delayed actions. To this end we consider a port of the TPC-C benchmark, a well-known transactional benchmark that was adapted to run on top of transactional key-value stores (and used, in previous works [4,5], to evaluate the performance of strongly consistent partial replication protocols). This benchmark portrays the activities of a wholesale supplier and contains a number of easily identifiable contention hot spots. Specifically, one of its transaction profiles, the so-called Payment transaction, updates the balances of a warehouse and of its district whenever an order for an item stored in that warehouse is processed. The balances of each warehouse and its district are maintained by a distinct pair of keys. These balances quickly turn into contention hot spots as the scale (and consequently the parallelism level) increases, but they are only updated and not used elsewhere in the transaction. We encapsulated the update of the warehouse/district balance into a delayed action, and injected a workload containing 50% update transactions. The results of this experiment, depicted in Fig. 11, clearly demonstrate the benefits deriving from the ability of delayed actions to avoid contended hot spots. The average abort probability of an update transaction is 38% for SCORe, whereas Bumper reduces it to 0.8%. As a result, Bumper scales up to 6.4 k txs/s, with a 3.4× speedup.

Fig. 11. Benefits of delayed actions in TPC-C (time-warping was disabled in these experiments).

6.3. Overhead assessment

To assess the overhead of the mechanisms at the core of Bumper, we also conducted experiments in uncontended scenarios, in which there are no benefits from the usage of either time-warping or delayed actions. We resorted again to YCSB and used a workload with 50% update transactions. To guarantee the absence of contention, we altered the data access pattern of transactions to update a single key selected from disjoint sets. In this experiment we measured a negligible average overhead of 2.5%, which results from the additional validations that Bumper computes at commit time while holding the locks. These results are shown in Fig. 12, where both protocols exhibit nearly identical trends.

Fig. 12. Low contention scenarios for overhead assessment: (a) YCSB; (b) TPC-C (with up to 160 nodes).

6.4. Mixed evaluation

Finally, we also used the Vacation benchmark from the STAMP [21] suite of transactional memory applications. Vacation simulates an online travel agency in which several types of resources can be manipulated by customers or by the agency. We used a port of this benchmark for distributed key-value stores. This benchmark showcases the benefits of both time-warping and delayed actions. Note, however, that the current prototype does not support the time-warping of transactions that encompass delayed actions. The latter are used to work around contention hot spots associated with keys that maintain the number of free/used travel resources of various kinds. Several invariant checks are performed around these statistics to ensure that the consistency of the application is not broken. We show three workloads for Vacation in Fig. 13:

1. Vacation-TW, in which the transactions that manipulate the contention hot spots are disabled;
2. Vacation-Delayed, in which the previous transactions are disabled and replaced by transactions that manipulate the contention hot spots;
3. Vacation-Mix, in which half of the update transactions affect contention hot spots.

All workloads had 50% update transactions. These figures show once again a reduction of the abort percentage, with direct gains in the throughput and scalability of the system. The Vacation-Delayed workload clearly shows that the lightweight, but highly conflicting, transaction profile can scale up to 15 k txs/s by exploiting delayed actions, corresponding to a peak improvement of 3×. Vacation-Mix shows the benefits of both our contributions in this benchmark, with an average abort reduction from 15% to 1% and a speedup of 2.1× at 160 machines.

Fig. 13. Variants of workloads in Vacation with time-warping, delayed actions and a mixed workload: (a) Vacation-time-warping; (b) Vacation-delayed; (c) Vacation-mix.

6.5. Exploiting total-order

Finally, we also experimented with variants of SCORe (and, respectively, Bumper) in which the prepare messages are delivered via a genuine Total Order Multicast (TOM). This kind of technique has been shown to be adequate for conflict-prone scenarios, as it achieves deadlock avoidance [23]. The point of this experiment is to investigate whether, by integrating Bumper into a TOM-based genuine partial replication scheme, one would still obtain significant advantages. To this end, we conducted an experiment in a contended scenario with YCSB, using update transactions that manipulate several keys (as described above), with 50% update transactions. We present this experiment in Fig. 14, where we show not only Bumper and SCORe, but also the augmented variants that use TOM. On average, the TOM variants improve 50% over their counterparts, which can be affected by deadlocks (and thus spuriously abort due to timeouts). Still, Bumper-TOM also improves, on average, 2.76× over SCORe-TOM. Overall, these results highlight that the mechanisms used in Bumper are orthogonal to total ordering techniques, as they address orthogonal causes of aborts (observation of stale snapshots, vs. contention hot spots, vs. distributed deadlocks), and that the two can complement each other to enhance performance in conflict-prone scenarios.

Fig. 14. YCSB contended scenario prone to deadlocks: (a) speedup in YCSB with TOM; (b) abort rates in YCSB with TOM.

7. Related work

The idea of decoupling the real-time ordering of commit events from their actual serialization order is similar in spirit to the idea at the basis of Virtual Time [24]: allowing speculative out-of-order processing of events, as long as they are reconcilable along the Global Virtual Time. In Jefferson's work, however, Time Warp is used to consistently roll back a stale process to a safe global state. Here, instead, the time-warp commit mechanism is used to inject ‘‘back in time’’ the versions produced by a transaction that observed an obsolete snapshot; this is done with the ultimate goal of reducing aborts.
Open nesting has also been used to reduce conflicts in Distributed Transactional Memories [25]: it requires the programmer to change the application to take advantage of nested transactions and, more importantly, to define compensating transactions. This makes the programming model more complex, and has shown only modest improvements. Quite differently, distributed transactional schedulers [26,27] can be seen as oracles that use the current history to determine when a transaction should be allowed to execute. Broadly, a transactional scheduler determines the ordering of transactions so that conflicts are either avoided altogether or reduced. Time-warping, instead, reconciles conflicting transactions to yield serializable histories, and can be used orthogonally to any a priori scheduling strategy.

Database replication has also been widely researched. In the last decade, several works have studied full replication techniques, typically layered on top of Total-Order primitives [28–30]. In particular, in [1], a technique for deterministically reordering totally-ordered transactions was proposed to reduce conflicts. PaxosCP [31] improves full replication by allowing concurrent transactions to succeed, where a naive Paxos would abort, by detecting accesses that do not conflict. Archie [30] uses speculation to avoid performing any complex operation after the establishment of the total order, thus shortening the commit phase's critical path. More recently, and closer in scope to Bumper, a number of works have proposed genuine partial replication techniques that are best fit for very large scale data stores (e.g., intra data-center deployments with thousands of nodes). S-DUR [14] proposes a technique to scale conflict-prone workloads by reducing the communication steps and the coordination among replicas, goals shared with partial replication. RAM-DUR [13] targets another angle, using heterogeneous nodes, some acting as caches and others as persistence nodes. None of these solutions tolerates the type of conflicts tackled by Bumper, and they may hence be complemented with the techniques that we propose in this paper.

Finally, H-Store [32] promulgated a minimalist, concurrency-control-free design that relies on a single-threaded execution engine. By serializing transaction processing at each node, H-Store can reduce data contention and avoid the abort of transactions that access only locally stored data. On the other hand, in the presence of transactions that span data maintained by multiple nodes, H-Store incurs processing stalls, which can lead to under-utilization of resources and hamper performance. Also, H-Store serializes the processing of entire transactions, whereas delayed actions allow encapsulating and serializing fine-grained code blocks (typically much smaller than the whole transaction).

Centralized database systems have been proposed around the work on SSI [33], with different validation schemes for conflict reduction, and assuming that the underlying system ensures Snapshot Isolation [17]. The concept of triads, presented here to support distributed time-warping, is similar in spirit to the dangerous structures of SSI. Recent work has applied analogous ideas to a distributed database with full replication [34], and to Transactional Memory (TM) [35]. Our work is substantially different, as it exploits partial replication to enable larger scalability levels. The abstraction of per-tx boxes [36] pursues goals analogous to delayed actions, namely reducing contention by delaying the execution of conflict-prone code until commit time. Yet, per-tx boxes were proposed for shared-memory TMs and rely on a shared commit lock. The idea of transactional boosting [37] decreases conflicts in data structures by relying on the commutativity of operations. In Bumper we consider a different, more challenging system model, a shared-nothing cluster, in which neither of these techniques is applicable.

8. Conclusion

This paper addressed the issue of maximizing the scalability of deferred update replication protocols in the presence of conflict-intensive workloads.

We did so by introducing two innovative mechanisms aimed at reducing the transaction abort rate in orthogonal ways: Distributed Time-Warping and Delayed Actions. The first mechanism avoids the spurious aborts caused by conventional validation schemes: whenever an update transaction misses updates of concurrently committed transactions, we try to serialize it in the past, provided that such an execution exists in some consistent sequential history. The second mechanism is a programming abstraction that postpones until the commit phase the execution of code manipulating contention hot spots, thus avoiding contention, while still guaranteeing that this code executes atomically with the transaction that triggered it.

The mechanisms that compose Bumper can be plugged into various transactional replication protocols to enhance their robustness in high-contention scenarios. We presented in detail how Bumper can be integrated with SCORe, a recent, highly scalable deferred update protocol that provides genuine partial replication and relies on a multi-versioning concurrency control scheme. We evaluated the benefits achievable with Bumper by conducting an experimental study using four well-known benchmarks. Our experiments show throughput improvements of up to 3× in conflict-intensive workloads, with negligible overheads in the absence of contention.

Acknowledgments

This work was supported by national funds through Fundação para a Ciência e Tecnologia (FCT) with reference UID/CEC/50021/2013, by the specSTM project (PTDC/EIA-EIA/122785/2010) and by the GreenTM project (EXPL/EEI-ESS/0361/2013).

References

[1] F. Pedone, R. Guerraoui, A. Schiper, The database state machine approach, J. Distrib. Parallel Databases 14 (1) (2003) 71–98.
[2] N. Schiper, P. Sutra, F. Pedone, P-Store: genuine partial replication in wide area networks, in: Proceedings of the Symposium on Reliable and Distributed Systems, SRDS, 2009, pp. 214–224.
[3] P.A. Bernstein, V. Hadzilacos, N. Goodman, Concurrency Control and Recovery in Database Systems, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1987.
[4] S. Peluso, P. Ruivo, P. Romano, F. Quaglia, L. Rodrigues, When scalability meets consistency: Genuine multiversion update-serializable partial data replication, in: Proceedings of the International Conference on Distributed Computing Systems, ICDCS, 2012, pp. 455–465.
[5] S. Peluso, P. Romano, F. Quaglia, SCORe: A scalable one-copy serializable partial replication protocol, in: Proceedings of the International Middleware Conference, Middleware, 2012, pp. 456–475.
[6] J.C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J.J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, D. Woodford, Spanner: Google's globally-distributed database, in: Proceedings of Operating Systems Design and Implementation, OSDI, 2012, pp. 251–264.
[7] C.H. Papadimitriou, The serializability of concurrent database updates, J. ACM 26 (4) (1979) 631–653.
[8] P.S. Yu, D.M. Dias, S.S. Lavenberg, On the analytical modeling of database concurrency control, J. ACM 40 (1993).
[9] P. Di Sanzo, B. Ciciani, F. Quaglia, P. Romano, A performance model of multi-version concurrency control, in: Proceedings of Modeling, Analysis and Simulation of Computers and Telecommunication Systems, MASCOTS, 2008, pp. 1–10.
[10] TPC Council, TPC-C Benchmark, http://www.tpc.org/tpcc.
[11] A. Adya, Weak consistency: a generalized theory and optimistic implementations for distributed transactions (Ph.D. thesis), Massachusetts Institute of Technology, 1999.
[12] M. Couceiro, P. Romano, N. Carvalho, L. Rodrigues, D2STM: Dependable distributed software transactional memory, in: Proceedings of the Pacific Rim Symposium on Dependable Computing, PRDC, 2009, pp. 307–313.
[13] D. Sciascia, F. Pedone, RAM-DUR: In-memory deferred update replication, in: Proceedings of the Symposium on Reliable and Distributed Systems, SRDS, 2012, pp. 81–90.
[14] D. Sciascia, F. Pedone, F. Junqueira, Scalable deferred update replication, in: Proceedings of the Conference on Dependable Systems and Networks, DSN, 2012, pp. 1–12.
[15] M. Bravo, N. Diegues, J. Zeng, P. Romano, L. Rodrigues, On the use of clocks to enforce consistency in the cloud, IEEE Data Eng. Bull. 38 (1) (2015) 18–31.

[16] S. Fernandes, J. Cachopo, Lock-free and scalable multi-version software transactional memory, in: Proceedings of Principles and Practice of Parallel Programming, PPoPP, 2011, pp. 179–188.
[17] H. Berenson, P. Bernstein, J. Gray, J. Melton, E. O'Neil, P. O'Neil, A critique of ANSI SQL isolation levels, in: Proceedings of SIGMOD, 1995, pp. 1–10.
[18] N. Diegues, P. Romano, STI-BT: A scalable transactional index, in: Proceedings of the International Conference on Distributed Computing Systems, ICDCS, 2014, pp. 104–113.
[19] J. Gray, L. Lamport, Consensus on transaction commit, ACM Trans. Database Syst. 31 (1) (2006) 133–160.
[20] B.F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, R. Sears, Benchmarking cloud serving systems with YCSB, in: Proceedings of the Symposium on Cloud Computing, SoCC, 2010, pp. 143–154.
[21] C.C. Minh, J. Chung, C. Kozyrakis, K. Olukotun, STAMP: Stanford transactional applications for multi-processing, in: Proceedings of the International Symposium on Workload Characterization, IISWC, 2008, pp. 35–46.
[22] A. Dey, A. Fekete, R. Nambiar, U. Rohm, YCSB+T: Benchmarking web-scale transactional databases, in: Proceedings of the International Conference on Data Engineering Workshops, ICDEW, 2014, pp. 223–230.
[23] P. Ruivo, M. Couceiro, P. Romano, L. Rodrigues, Exploiting total order multicast in weakly consistent transactional caches, in: Proceedings of the Pacific Rim International Symposium on Dependable Computing, PRDC, 2011, pp. 99–108.
[24] D.R. Jefferson, Virtual time, ACM Trans. Program. Lang. Syst. 7 (3) (1985) 404–425.
[25] A. Turcu, B. Ravindran, On open nesting in distributed transactional memory, in: Proceedings of the Systems and Storage Conference, SYSTOR, 2012, pp. 1–12.
[26] J. Kim, B. Ravindran, On transactional scheduling in distributed transactional memory systems, in: Proceedings of Stabilization, Safety, and Security of Distributed Systems, SSS, 2010, pp. 347–361.
[27] J. Kim, B. Ravindran, Scheduling closed-nested transactions in distributed transactional memory, in: Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS, 2012, pp. 179–188.
[28] Y. Lin, B. Kemme, M. Patiño Martínez, R. Jiménez-Peris, Middleware based data replication providing snapshot isolation, in: Proceedings of SIGMOD, 2005, pp. 419–430.
[29] B. Kemme, G. Alonso, Don't be lazy, be consistent: Postgres-R, a new way to implement database replication, in: Proceedings of the Conference on Very Large Data Bases, VLDB, 2000, pp. 134–143.
[30] S. Hirve, R. Palmieri, B. Ravindran, Archie: a speculative replicated transactional system, in: Proceedings of the International Middleware Conference, Middleware, 2014, pp. 265–276.
[31] S. Patterson, A.J. Elmore, F. Nawab, D. Agrawal, A. El Abbadi, Serializability, not serial: Concurrency control and availability in multi-datacenter datastores, Proc. VLDB Endow. 5 (11) (2012) 1459–1470.

[32] M. Stonebraker, S. Madden, D.J. Abadi, S. Harizopoulos, N. Hachem, P. Helland, The end of an architectural era: (it's time for a complete rewrite), in: Proceedings of the Conference on Very Large Data Bases, VLDB, 2007, pp. 1150–1160.
[33] M.J. Cahill, U. Röhm, A.D. Fekete, Serializable isolation for snapshot databases, in: Proceedings of SIGMOD, 2008, pp. 729–738.
[34] H. Jung, H. Han, A. Fekete, U. Röhm, Serializable snapshot isolation for replicated databases in high-update scenarios, Proc. VLDB Endow. 4 (11) (2011) 783–794.
[35] N. Diegues, P. Romano, Time-Warp: lightweight abort minimization in Transactional Memory, in: Proceedings of Principles and Practice of Parallel Programming, PPoPP, 2014, pp. 167–178.
[36] J. Cachopo, A. Rito-Silva, Versioned boxes as the basis for memory transactions, Sci. Comput. Program. 63 (2) (2006) 172–185.
[37] M. Herlihy, E. Koskinen, Transactional boosting: a methodology for highly-concurrent transactional objects, in: Proceedings of Principles and Practice of Parallel Programming, PPoPP, 2008, pp. 207–216.

Nuno Diegues has been a Ph.D. student in Information Systems and Computer Engineering at Instituto Superior Técnico in Portugal since 2012. He is also a researcher affiliated with the INESC-ID research laboratory. His main research interests are transactional systems, both in shared memory and in distributed systems. His focus is on the creation of efficient systems, considering the dual concern of traditional performance metrics and energy efficiency.

Paolo Romano received the Ph.D. degree in Computer Engineering from Sapienza University of Rome in 2007. He is currently an assistant professor at the Computer Engineering department of Instituto Superior Técnico, University of Lisbon. He is also a senior researcher affiliated with the INESC-ID research laboratory. His interests span dependable distributed systems, performance modeling and evaluation, autonomic systems, and parallel computing.