Light-weight, Runtime Verification of Query Sources

Tingjian Ge, Stan Zdonik
Computer Science Department, Brown University
115 Waterman St., Providence, RI 02912, USA
{tige, sbz}@cs.brown.edu

Abstract— Modern database systems increasingly make use of networked storage. This storage can be in the form of SANs or in the form of shared-nothing nodes in a cluster. One type of attack on databases is arbitrary modification of data in a database through the file system, bypassing database access control. Additionally, for many applications, ensuring strict and definite authenticity of query source and results is required or highly desirable. In this paper, we propose a lightweight approach for verifying the minimum information that a database server needs from the storage system to execute a query. The verification is definite and produces high-confidence results because of its online manner (i.e., the information is verified right before it is used). It is lightweight in three ways: (1) we use the Merkle hash tree data structure and fast cryptographic hash functions to ensure the verification itself is fast and secure; (2) we verify the minimum number of bytes needed to ensure the authenticity of the source related to the query result; and (3) we achieve high concurrency of multiple reader and writer transactions and avoid delays due to locking by using the compare-and-swap primitive. We then prove the correctness and progress guarantees of the algorithms using concepts from the theory of distributed computing. We also analyze the performance of the algorithms. Finally, we perform a comprehensive empirical study of various parameter choices and of the system performance and concurrency with our approaches.

I. INTRODUCTION

Traditionally, networked database servers have been confined to data centers in physically secured locations. Nowadays, DBMS servers are increasingly networked, allowing many potential points of attack. With the availability of high-speed LANs and storage networking protocols such as FCIP [28] and iSCSI [30], these networks are becoming virtualized and open to access from user machines. Attackers might be able to access the storage devices directly, or data might be attacked while in transit on a network between storage devices. Further, the current trend toward virtualization and outsourcing, such as Amazon's Elastic Compute Cloud (EC2) web service [37], allows companies to eliminate or reduce the expense of running their own data center. Database applications are beginning to migrate to this type of environment. The potential downside of this approach is that data is no longer under the physical control of its owner. Thus, an attack on a remote virtualization cluster (e.g., by a malicious insider who may be able to obtain a root password or who may simply be able to corrupt the physical medium) can compromise data integrity. Our techniques are important

to give customers confidence that such exploits will not go undetected. Another example, related to electronic commerce and digital rights management, is provided in [16].

Our work addresses arbitrary modification of data, including changes that may bypass the database software. For example, after gaining certain passwords, a student changes his grade, a dishonest speculator alters the financial data of a company, or an identity thief modifies the personal information of some victim through access to the file system. Analogous to data encryption support in a database system, we propose lightweight data integrity protection support in a database.

Note the functional difference between encryption and integrity protection. Encryption is typically heavier-weight and protects sensitive data (e.g., medical data, salary, etc.) from being accessed by un-trusted parties, for privacy. While encryption also provides integrity protection to some degree, it is typically more computationally expensive than pure integrity protection mechanisms and cannot guard against attacks that simply swap two encrypted values or roll back a value to its old version. In addition, encryption requires the database server to manage a secret encryption key, which we try to avoid in this work. Moreover, since the storage systems at the server are susceptible to attacks, it is necessary to provide separate lightweight data integrity support as an orthogonal mechanism.

Since the binaries of database software as well as important configuration files are generally much smaller and more centralized than the data stored in the database, they can be more easily protected or verified (e.g., by a hashing mechanism similar to ours). Any dynamically linked code is typically very small (e.g., on the order of tens of KB). The attacks that we focus on in this work are those that modify a subset of the user data in a purposeful way. As in previous work (e.g., [2, 6]) and the classical security model for database encryption (e.g., [36]), we trust that the database software "does its job", i.e., it executes the code as it is supposed to, and the runtime memory is secure. Ensuring this (which by itself is a separate security issue requiring different techniques) is the minimum basic starting point for making use of any software.

Because the storage system of the database may not be trusted, our goal in this work is to let the database server dynamically verify the authenticity (i.e., correctness and completeness) of the source data that it needs to run the query. The server verifies the data as it reads it from the disk (i.e., there is no separate I/O cost). By authenticating the source, the server

ensures that the query result is correct and complete. We shall show that doing this in a DBMS not only can be efficient and lightweight, but also gives the server complete confidence in the authenticity of the source and query results. This is in contrast to verifying data offline rather than during query execution. In the offline scenario, verifying whole files in a database is costly, requiring separate I/O operations, and an adversary might launch the attack between verification and query execution.

There has been some recent work on secure file systems (e.g., [15, 35]) that offer similar functionality to ours. However, since those systems are very general, they cannot exploit the characteristics and semantics of database access. First, we observe that many DBMS's are not built on top of a standard file system, but rather, for performance reasons, build on raw disk. In this context, the DBMS would need to perform the verification. As we have said, this presents opportunities to take advantage of database semantics. Given that, we note two fundamental enhancements over secure file systems that this work provides: (1) The file system approach must verify all the bytes in each page read or written. In contrast, we need only verify a subset of the data. We take advantage of the data semantics by introducing MAS and MVS (Section III). This reduces overhead (Figure 11). (2) In the presence of many concurrent transactions, we provide a fine-grained, lock-free concurrency control scheme that eliminates potential locking overhead and has better progress guarantees.

Note that what we attempt to accomplish is different from the large body of work on outsourced databases (e.g., [13]), in which the server is completely un-trusted, requiring it to create a verification object (VO) for the returned results that the client must verify. Nonetheless, even in the outsourced database scenario, it would still be useful for the server to apply the techniques in this work. The server operator certainly has the incentive to behave honestly, largely because it would risk being caught and thus losing business. Therefore, it is in the server operator's interest to first check the integrity of the source data it uses to run the query. This saves the server the resource consumption of executing "doomed" queries; it also lets the server provide evidence to the client that it indeed behaves honestly and that it is the source that is corrupted. Another point to note is that while the client authentication of query results in previous work is thus far limited to a few relatively simple query types, our approach of authenticating data sources is independent of query types and works for arbitrary queries. The verification object approach (e.g., [13]) also requires the existence of an index structure such as a B+ tree, which ours does not. The VO approach also incurs additional storage and communication costs, and the verification cost is shifted to the client side, which may be resource constrained.

The basic idea of our work is that we build one or more Merkle hash tree(s) for each sensitive database object (e.g., specific attributes and their associated indexes). Then at query execution time, as we read data from the disk, we verify that a

minimum subset of "referenced" bytes (Section III) from the sensitive database objects is authentic, thus proving that the data source and query result are both correct and complete relative to the initial "good state" from which the Merkle trees were built. Our algorithm needs a separate data structure called a Hierarchical Tag Table (HTT) for this purpose. We show that these data structures are very compact: the HTT takes one page at most. We achieve high concurrency of multiple reader and writer transactions and avoid delays and other problems due to locking (detailed in Section IV-A.2) by using ideas from the theory of distributed computing [10]. In particular, we use the compare-and-swap primitive instead of mutual exclusion (locks) to accomplish the wait-free property (Section IV). We prove the correctness and progress guarantees of our algorithms and analyze their performance.

To sum up, the contributions of this paper are:
• A framework for lightweight source verification as the server is executing the query, to ensure high confidence in the result.
• Techniques for minimizing the number of bytes hashed in order to have minimal impact on performance.
• The use of the compare-and-swap primitive to achieve the wait-free property between multiple writers and readers. We also prove the correctness and progress guarantee of our algorithm and analyze its performance.
• A comprehensive empirical study of (1) various parameter choices and (2) the performance and concurrency of our system.

The rest of the paper is organized as follows. In Section II, we introduce our verification framework and the data structures we use. Section III shows how we can minimize the amount of hashing that must be done to improve performance, yet still preserve correctness. In Section IV, we first introduce the primitive and theory we use from the field of distributed computing; we then present our reader's and writer's algorithms, prove their correctness and progress guarantees, and analyze their performance. We perform a comprehensive empirical study in Section V and discuss related work in Section VI. Finally, Section VII concludes the paper.

II. FRAMEWORK AND DATA STRUCTURES

A. Background: Merkle Hash Tree

Fig. 1 Illustrating a Merkle hash tree: leaf hashes hi = H(ri), internal hashes such as h12 = H(h1|h2), and the root signature σ = sign(h1..8, SK).

The Merkle hash tree [19] is a tree structure originally proposed for efficient authentication of equality queries in a database sorted on the query attribute (Figure 1). Every

database record corresponds to a distinct leaf node. Each leaf node stores a hash value of the binary representation of its record. The tree is constructed bottom-up, with each internal node storing the hash value of the concatenation of the hash values of its children. The owner signs the hash value stored in the root of the tree. Using a tree structure and lightweight cryptographic collision-resistant hash functions, it avoids costly signature operations on all the data to be verified. Instead, only a single value, the root of the tree, needs to be verified, say, against its signature. Merkle hash trees have been extended to have arbitrary fan-out and are frequently used in data authentication. In this paper, the term "Merkle hash tree" refers to such a tree with an arbitrary fan-out.

B. Verification Framework

The basic idea is that we build one or more Merkle hash tree(s) for each sensitive database object on the disk (such as indexes, data files, etc.). Then at query execution time, as we read data from the disk, we verify that a minimum subset of "referenced" bytes (Section III) from the sensitive database objects is authentic, thus proving that the source and query result are both correct and complete relative to the initial "good state" from which the Merkle trees were built. We conduct the work in C-Store [32], an open-source column-oriented database system. The same ideas would work on row-oriented systems as well.

Fig. 2 Illustrating the infrastructure for verifying a sensitive database object: each database object page is summarized by hash tags of its sub-blocks, the page tags form the leaves of the Hierarchical Tag Table (HTT), and the root hash h is signed as σ = Sign(h, SK).

More specifically, we divide each page of a sensitive database object into a number of sub-blocks and compute a small hash tag (e.g., 20 bytes using SHA-1) for each sub-block. This constitutes the base level (leaves) of an in-page Merkle tree. The in-page tree has a fan-out which we determine in Section V-C. The root of an in-page tree is called a page tag. Starting from the page tags, we build another Merkle tree with some fan-out f. The tree is built bottom-up, until we get a single root hash value. This is illustrated in Figure 2. As shown in Figure 2, we call the collection of hash tags from the page tags and above the Hierarchical Tag Table (HTT). The HTT is a Merkle tree with some fixed fan-out f and with the page tags as its leaves. The default page size is 64 Kbytes in C-Store. Consider a sensitive column, say students' grades. C-Store densely packs the grades column in a single file. The HTT is quite small in practice. For example, for a large table of 20M records, the integer grade column spans 1280 pages. As each page tag is 20 bytes (we use the SHA-1 hash function in this

work; it can easily be changed to use another hash function), the base level (leaves) of the HTT is only 1280*20 = 25 Kbytes. Thus, with a simple calculation we can see that in this example, the HTT is less than one page, for any fan-out.

We shall discuss the runtime verification algorithm in detail in Section IV. Basically, as a query runs, the server verifies the authenticity of the data it uses from disk by computing hash values in a bottom-up manner, all the way up to the root. There are two ways to verify the root: (1) The server keeps a signature of the root signed by the client (or owner) and verifies against the signature. (2) The server sends the root hash value to the client (or owner), who decides if it is authentic. Note that for (1), it is possible for an adversary to launch a so-called "rollback attack", in which the adversary may observe a signature of an old version of an object and later revert the object to its old version, together with its root signature. To prevent this attack, as in [16], we need a small amount (e.g., 20 bytes) of trusted store. One can also use hardware to enforce tamper-resistance [34]. The root hash values of all the sensitive objects in a database are further hashed in a tree in the same manner until we get a single root hash, whose signature resides in the trusted store. We address the concurrency aspects of this in Section IV.

It is worth stressing that all database objects, including system tables and indexes, pertaining to the sensitive attributes being protected (Section III-A) need to be part of an HTT structure, and any bytes referenced in them must be checked at runtime. For example, a system table may have a field indicating whether a data table is empty. As system tables, like other user tables, also reside on un-trusted storage, an adversary could corrupt such a table and change the field to indicate that a data table is empty when it is not. Without integrity checking on the system table, a select statement on the data table could erroneously return an empty result set even though the data table itself is not corrupted. Similarly, the table size information that the server reads during execution must be authenticated so that the result is complete. Thus, by ensuring the integrity of all the information (data and metadata) that the trusted server reads from the disk (untrusted storage) for any decision it makes, we ensure both the correctness and the completeness of the query results.

C. Organizing HTT

From the previous discussion, we know that an HTT is typically small (less than one page), even for large tables. We also know that each node in the HTT is of fixed length (20 bytes) with a fixed fan-out parameter f. These properties greatly simplify the organization and management of the HTT. This is crucial for performance. In fact, we limit each HTT to one page, thus making operations and their synchronization on the HTT fall within a page in memory. Synchronization is discussed at length in Section IV. In the rare case that a one-page HTT is not big enough for a database object, we simply divide the object into smaller ones and build a Merkle tree (thus an HTT) for each.
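To make the construction above concrete, the following is a minimal sketch of building an HTT bottom-up from page tags. It assumes OpenSSL's SHA-1 and a generic fan-out f, and it omits the in-page trees, the on-page layout, and the synchronization fields discussed later; it is an illustration, not our exact C-Store implementation.

// Minimal sketch: build the Hierarchical Tag Table (HTT) bottom-up from the
// page tags of a sensitive database object. Assumes OpenSSL SHA-1; the
// in-page trees, page layout, and synchronization fields are omitted.
#include <openssl/sha.h>
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

using Tag = std::array<unsigned char, SHA_DIGEST_LENGTH>;   // 20-byte hash tag

// Hash the concatenation of a group of child tags into one parent tag.
static Tag hashConcat(const std::vector<Tag>& children) {
    SHA_CTX ctx;
    SHA1_Init(&ctx);
    for (const Tag& c : children)
        SHA1_Update(&ctx, c.data(), c.size());
    Tag out;
    SHA1_Final(out.data(), &ctx);
    return out;
}

// levels[0] holds the page tags (leaves); levels.back()[0] is the root hash
// that is signed or compared against the trusted reference value.
static std::vector<std::vector<Tag>> buildHTT(std::vector<Tag> pageTags,
                                              std::size_t f) {
    std::vector<std::vector<Tag>> levels;
    levels.push_back(std::move(pageTags));
    while (levels.back().size() > 1) {
        const std::vector<Tag>& cur = levels.back();
        std::vector<Tag> next;
        for (std::size_t i = 0; i < cur.size(); i += f) {
            std::size_t end = std::min(i + f, cur.size());
            next.push_back(hashConcat(std::vector<Tag>(cur.begin() + i,
                                                       cur.begin() + end)));
        }
        levels.push_back(std::move(next));
    }
    return levels;
}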

Let us first determine the fan-out parameter f from a performance perspective. We note that, typically, the throughput (in bytes/sec) of a cryptographic hash function (e.g., SHA-1) is higher when the input consists of bigger message blocks. This can be seen from the "openssl speed" test command.

Theorem 1. Let the output length of the cryptographic hash function be w bytes. Let its throughput (speed) function be T(i) bytes/sec, where i is the input message block size in bytes. Then the hash verification of a leaf page takes the minimum time when the fan-out of the HTT is
$$\arg\min_f \frac{f \cdot w}{T(f \cdot w) \cdot \log f}.$$

Proof: If the number of leaves of the HTT is n, consider a bottom-up computation path of the hash tree to verify a leaf (i.e., a page tag). Each node of the path has f·w bytes, which is also the input block size to the hash function. The total number of bytes to be hashed is $f \cdot w \cdot \log_f n$ and the hash throughput is T(f·w) bytes/sec. With simple algebraic manipulation, we obtain the optimal fan-out as specified by the theorem. ∎

From Theorem 1 we can see that the optimal fan-out parameter depends on the output width of the hash function used and its throughput function. For SHA-1, w = 20 bytes. The hard part is the throughput function, which we can determine empirically. Using the "openssl speed" test and multiple regression (polynomial fitting) [18], we can obtain the throughput function, thus determining the optimal fan-out. Our experiments indicate that the optimal fan-out is between 10 and 20. We shall further look at the choice of this parameter in the experiments in Section V.
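As an illustration of how the optimal fan-out can be computed once a throughput function has been fitted, the small routine below evaluates the objective of Theorem 1 over candidate fan-outs. throughputBytesPerSec() and its coefficients are placeholders standing in for a regression fitted to "openssl speed" measurements; they are not measured values.

// Sketch: choose the HTT fan-out minimizing f*w / (T(f*w) * log f) (Theorem 1).
#include <cmath>
#include <cstdio>

// Placeholder for a throughput function (bytes/sec) fitted by polynomial
// regression to "openssl speed" output; the coefficients are illustrative.
static double throughputBytesPerSec(double inputBytes) {
    return 1.0e8 + 2.0e5 * inputBytes - 50.0 * inputBytes * inputBytes;
}

static int optimalFanout(int w /* hash output size, 20 bytes for SHA-1 */) {
    int bestF = 2;
    double bestCost = 1e300;
    for (int f = 2; f <= 100; ++f) {
        double cost = (f * w) / (throughputBytesPerSec(f * w) * std::log((double)f));
        if (cost < bestCost) { bestCost = cost; bestF = f; }
    }
    return bestF;
}

int main() {
    std::printf("optimal HTT fan-out: %d\n", optimalFanout(20));
    return 0;
}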

So far, we have determined that each HTT takes one page, and that the fan-out parameter f is 20. As each HTT node is of fixed length, we can use packed arrays to organize hash tree nodes within the HTT page. Each leaf node of the HTT is 20 bytes, and each internal node is 24 bytes. The extra 4 bytes store additional information used for synchronization between concurrent write and read operations on the hash tree, which we discuss in detail in Section IV-B. The 64-Kbyte page of C-Store is partitioned as follows for storing the HTT. The first 60 Kbytes are used to store the leaves. Thus the capacity of an HTT is 60K / 20 = 3072 page tags. Accordingly, the 2nd level of the HTT has a capacity of ⌈3072 / 20⌉ = 154 nodes, and takes 154 * 24 = 3696 bytes. In the same manner, the 3rd level contains at most 8 nodes and spans the next 192 bytes. Finally, the top level (root) takes the next 24 bytes. Using fixed-size arrays for each level of the HTT makes the storage compact and the computation of node addresses easy. It simplifies the management of the HTT.
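A minimal sketch of this packed-array layout follows, assuming a 64KB page, 20-byte leaves, 24-byte internal nodes (hash plus a 4-byte ACC), and fan-out 20; the offsets are illustrative rather than the exact C-Store byte layout.

// Sketch of the packed-array layout of a one-page HTT: leaves first, then one
// fixed-size array per internal level. Offsets are illustrative.
#include <cstddef>
#include <vector>

struct HTTLayout {
    std::size_t fanout = 20;
    std::size_t leafSize = 20;            // 20-byte page tag
    std::size_t internalSize = 24;        // 20-byte hash + 4-byte ACC
    std::vector<std::size_t> levelCount;  // nodes per level, leaves first
    std::vector<std::size_t> levelBase;   // byte offset where each level starts

    explicit HTTLayout(std::size_t numLeaves = 3072) {   // 60KB / 20B leaves
        std::size_t n = numLeaves, base = 0;
        levelCount.push_back(n);
        levelBase.push_back(base);
        base += n * leafSize;                      // 3072 * 20 = 60KB of leaves
        while (n > 1) {
            n = (n + fanout - 1) / fanout;         // 154, 8, 1 for 3072 leaves
            levelCount.push_back(n);
            levelBase.push_back(base);
            base += n * internalSize;              // 3696, 192, 24 bytes
        }
    }

    // Byte offset of node i at a given level (level 0 = leaves).
    std::size_t offset(std::size_t level, std::size_t i) const {
        return levelBase[level] + i * (level == 0 ? leafSize : internalSize);
    }

    // Index of the parent of node i, one level up (same rule at every level).
    std::size_t parent(std::size_t i) const { return i / fanout; }
};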

Note that the HTTs in a database are very compact. As described above, a one-page HTT can cover 3072 data pages of sensitive attributes (Sec. III-A), which is 192 MB. When we extend it to the whole database to guard against the "rollback attack" (Sec. II-B), a second-level one-page HTT can cover 3072 first-level HTT pages. Thus, a single HTT page at the second level can already handle a database with sensitive objects of total size 192 MB * 3072 = 576 GB. More importantly, HTTs are just metadata. This means that the complexity of data insertion and deletion only needs to be taken care of by the database server using conventional concurrency control techniques, without considering HTTs. The synchronization techniques described in Section IV are only needed to ensure the integrity and fast update of the HTTs themselves.

III. MINIMIZING HASH OPERATIONS

A. Minimum Attribute Set (MAS)

Consider the relation in Figure 3. Suppose our specific goal is to guard against an attack in which an adversary purposefully changes the grades of one or more students (including possibly swapping the grades of two students). Then we can determine that the minimum attribute set (MAS) to be hashed and verified is {Name, ID, Grade}. Once we protect these attributes, one can reasonably assume that the adversary does not have any rational incentive to change other attributes of the table.

Name    ID   Year       Grade  Instructor  Grader  Comments
Amy     345  sophomore  82     Debra       Dan     Turned in early.
Ben     816  junior     55     Debra       Bill    NULL
Carol   317  junior     79     Debra       Dan     NULL
Dave    522  senior     96     Debra       Bill    Well written
Eileen  106  sophomore  85     Debra       Dan     missed one prob.

Fig. 3 Illustrating the minimum attribute set (Name, ID, Grade) to be protected

Many database systems (e.g., [36]) allow users to choose to encrypt only sensitive attributes for efficiency. Analogous to that, hashing and verifying only the data related to the MAS helps to minimize hash operations, reducing the overhead of verification. Note that data objects on the disk that are related to the MAS, such as indexes on those attributes, need to be protected as well. Also note that any such saving is more pronounced for column-oriented systems than for row stores, where the values to be hashed are scattered across the base-level pages.

B. Minimum Verification Set (MVS)

At first blush, it seems that one has to verify all "referenced" bytes (i.e., bytes read from database objects on the disk) that have an associated hash tree during the query execution to ensure the correctness and completeness of the source and result. However, in many situations, we can do better than that.

Definition 1. An execution of a query reads a set of pages P from disk. For each page p in P, there is a set of bytes R that the server actually examines for the execution of the query. We call R the set of referenced bytes of page p. Suppose there were some "auxiliary input" that leads us directly to a minimum subset V of R, such that by only examining V (i.e.,

we do not know any bytes in page p other than V), we could finish the execution of the query without needing any other information from p. Then we call V the minimum verification set (MVS) of page p.

Example 1. Consider a B+ tree traversal to search for a key value k. At an internal node (page) p, typically one does a binary search to locate a pointer to follow and continues the search at that child node. Then the set of key value bytes used during the binary search is the set (R) of referenced bytes of page p. Suppose, on the other hand, there were some auxiliary input that takes us directly to a set of contiguous bytes (ki, pi, ki+1) in p, where ki and ki+1 are two contiguous key values satisfying ki < k < ki+1, and pi is the pointer to follow; then we could finish the execution without ever needing any other bytes from p. Thus, (ki, pi, ki+1) is the minimum verification set V, and we only need to verify V for p. ∎

There are three cases to consider: (1) the data in page p is not corrupted; (2) the data is corrupted somewhere outside the correct MVS; and (3) the data is corrupted inside the correct MVS. Clearly, in case (1) the verification succeeds and in case (3) it fails. In case (2), if the intermediate steps leading to the final correct MVS fail, we will detect it; if instead the intermediate steps lead to a set of data different from the correct MVS (e.g., some faked data), then the verification will fail. The verification only succeeds when the intermediate steps lead to the correct MVS, namely the minimum set of data that ensures the correctness of the execution step. The key point is to ensure that every step in query processing uses authentic data to make a valid move.

We discuss the MVS for a few types of file organizations used in database systems (and especially in C-Store). As in [29], we consider the following operations (related to queries) on the files: (1) scan (i.e., fetch all records in the file), (2) search with equality selection, and (3) search with range selection. All three operations have a variant (called early-stop) in which the operation stops as soon as a qualifying tuple is found (a scan can be considered as being associated with an optional arbitrary predicate different from an equality/range selection). The early-stop variant occurs in many situations in query processing, for example: (1) when there is a primary key or unique constraint on a column and the predicate is an equality selection; or (2) when evaluating IN or EXISTS subqueries. An example would be: "select … where c1 IN (select c2 from t2 where …)". EXISTS subqueries are similar.

• Heap files. For all three operations described above, when executed on a page, both the referenced bytes R and the minimum verification set V are all bytes of the page. However, for the early-stop variant, if there is a qualifying tuple in the page, R is the bytes in the page up to the point where that qualifying tuple is found and V is only the qualifying tuple itself.
• Sorted files. This file organization is frequently used in C-Store. For the scan operation, it is exactly the same as heap files. For an equality or range search on a page, because tuples are sorted by the search key column, one can first use

a binary search on the first tuple of each page to locate a contiguous set of pages that may have qualifying tuples. The first and last pages of this set may be partially in the range. Hence, another binary search within the first and last pages of the set will locate the exact set of qualified tuples. The combined minimum verification set (MVS) of all the pages contains this contiguous set of tuples, plus one tuple immediately before and one immediately after the set (if any, to show completeness); a code sketch of this case is given at the end of this section. Note that the set of referenced bytes R in general contains extra tuples that are examined (i.e., used for comparisons) during the two binary searches. Just like heap files, for the early-stop variant, V contains only the qualifying tuple itself. In the case of no qualifying tuples, V contains only the tuple immediately before and the tuple immediately after the searched range.
• B+ tree. A complete scan corresponds to a traversal of all the leaves of a B+ tree, and its discussion is no different from the previous ones. Typical searches using B+ trees can be reduced to a primitive operation of searching for a key k at a (node) page p. This has been discussed in Example 1.
• Hashed files. The only interesting operation for hashed files is equality search. For static hashing, one needs to verify the bytes read from the hash directory (i.e., the mapping from a hash value to the page address of the bucket) and then verify the equality search within the page(s) of the corresponding bucket, in a way similar to heap files or sorted files, depending on whether the pages of a bucket are sorted. It is not hard to extend the discussion to dynamic hashing, which we omit.
• Compressed files. One can compress a column in C-Store using one of the supported compression methods (i.e., run-length encoding, bitmap encoding, delta encoding, etc.). Sometimes query processing can proceed on compressed data directly to gain more efficiency [1, 32]. As run-length encoding and delta encoding are applied to sorted data and bitmap encoding to unsorted data, the determination of the MVS is analogous to sorted or unsorted files.

Minimizing hash operations for verification is important for the performance of verifying query source data at runtime. In Section V-D, we verify this experimentally.
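As a concrete illustration of the sorted-file case above, the sketch below computes the MVS of a range probe on a single sorted page, modeled simply as a non-empty sorted array of integer keys; it is a simplified model of the idea, not the C-Store page format.

// Sketch: MVS of a range probe [lo, hi] on one sorted page. The MVS is the
// qualifying run of tuples plus one tuple immediately before and one
// immediately after it (when they exist), which shows completeness even when
// no tuple qualifies. Assumes a non-empty page of fixed-width integer keys.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Mvs { std::size_t first, last; };   // inclusive index range to verify

static Mvs mvsForRange(const std::vector<int>& keys, int lo, int hi) {
    std::size_t begin = std::lower_bound(keys.begin(), keys.end(), lo) - keys.begin();
    std::size_t end   = std::upper_bound(keys.begin(), keys.end(), hi) - keys.begin();
    std::size_t first = (begin > 0) ? begin - 1 : 0;                  // tuple just before
    std::size_t last  = (end < keys.size()) ? end : keys.size() - 1;  // tuple just after
    return {first, last};
}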

IV. ALGORITHMS FOR CONCURRENCY AND SPEED

Typically, multiple update transactions (writers) and read transactions (readers) run concurrently. These transactions would all access the HTT, making it a potential bottleneck. To address this problem, we next present a concurrency control scheme that minimizes contention while guaranteeing no deadlock or starvation.

A. Synchronization Primitives

A change of the hash value of a data page tag (a leaf of the HTT) affects the hash values on the path from the leaf to the root. Since an HTT fits in one page, the synchronization is essentially on an in-memory data structure. We adopt concepts from the theory of distributed computing and design an algorithm that does synchronization without using locks. We use the so-called compare-and-swap (CAS) atomic operation on integers. We prove the correctness of the algorithm and the "wait-free" property (to be defined). We also analyze its performance. Note that the synchronization on the HTT is separate from the locking employed by a DBMS on data pages. In a sense, the HTT is "metadata" and we would like to make the operations on it as lightweight as possible.

A.1. Compare-and-swap (CAS)

Compare-and-swap is a synchronization operation supported by contemporary computer architectures [10]. For example, it is the CMPXCHG instruction in the x86 and Itanium architectures. CAS atomically compares the contents of a memory location to a given value and, if they are the same, modifies the contents of that memory location to a given new value. The result of the CAS operation must indicate whether it performed the substitution; this can be done with a simple boolean response. Typically, the CAS operation is repeated until it succeeds. Note that the true benefit provided by this synchronization primitive is the "atomicity" guaranteed by the hardware. This is because, in general, checking the current value of some object and setting it to a new value are two atomic operations (instead of one), which can have other operations scheduled in between. This atomicity gives us extra power for synchronization. CAS is used to implement semaphores in multiprocessor systems. It is also used to implement lock-free, wait-free, and obstruction-free algorithms in multiprocessor systems, which we explain in Section IV-A.2. In our case, because an HTT fits in a page in memory, we can use the synchronization power of CAS to coordinate multiple writer transactions and avoid the overhead and problems (e.g., fault-tolerance and scalability) associated with locks [10].
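For concreteness, the small example below shows the CAS retry pattern, written with C++11 std::atomic, whose compare_exchange_strong is implemented with hardware CAS instructions such as CMPXCHG; we use the same pattern later on the ACC counters. It is a generic illustration rather than our exact code.

// Illustration of the CAS retry pattern: atomically increment a counter
// without a lock. On failure, compare_exchange_strong refreshes 'old' with
// the value another thread installed, so no thread can block others by
// holding a lock.
#include <atomic>

void casIncrement(std::atomic<int>& counter) {
    int old = counter.load();
    while (!counter.compare_exchange_strong(old, old + 1)) {
        // CAS failed because the value changed; 'old' is updated -- retry.
    }
}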



A.2. Progress Properties

The traditional way to implement shared data structures is to use mutual exclusion (locks) to ensure that concurrent operations do not interfere with one another. Nevertheless, locks are poorly suited for asynchronous, fault-tolerant systems [10]: if a faulty process is halted or delayed in a critical section (holding a lock), non-faulty processes will also be unable to progress. Even in a failure-free system, a process can encounter unexpected delays as a result of, for example, a page fault or a cache miss. Similar problems arise in heterogeneous architectures, where some processors may be inherently faster than others, and some memory locations may be slower to access. In response, researchers in distributed computing have investigated a variety of alternative synchronization techniques that do not employ mutual exclusion [10, 11]:
• A synchronization technique is wait-free if it ensures that every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads.
• It is lock-free if it ensures that some thread always makes progress.
• It is obstruction-free if it guarantees progress for any thread that eventually executes in isolation. Even though other threads may be in the midst of executing operations, a thread is considered to execute in isolation as long as the other threads are not running.

Certainly, the bottom line for all of them is the correctness condition, i.e., linearizability [12], or serializability, which is implied by linearizability: although operations of concurrent processes may overlap, each operation appears to take effect instantaneously at some point between its invocation and response. From their definitions, it is clear that the wait-free, lock-free, and obstruction-free properties (from stronger to weaker, in that order) all provide stronger progress guarantees than locks and avoid the problems associated with locks discussed earlier [11]. In Section IV-D, we shall show that the algorithm we present has the wait-free property, which encompasses the others.

B. Reader's and Writer's Algorithms

We discussed the organization of an HTT page in Section II-C. We mentioned that each internal node is 24 bytes: a 20-byte hash value and a 4-byte Active Children Count (ACC). There is a separate in-memory bitmap VMAP, shared by transactions, that has one bit for each internal node (and thus is very small) indicating whether it has been "verified" by a writer. The reader's algorithm and the writer's algorithm are shown in Figure 5 and Figure 6, respectively. The goal of the reader's algorithm is to verify a set of to-be-verified leaves (data page tags), initially in V. The verification proceeds in a bottom-up manner, one level at a time (one level per iteration of the Repeat loop). The verification is illustrated in Figure 4. Using siblings' hash values stored in the HTT, the algorithm computes the hash values on the paths from V to the root. We stop the traversal along a path when we hit a node marked "verified" (by a concurrent writer). Finally, if the algorithm reaches the root, it verifies the computed root, as discussed in Section II-B.

Fig. 4 Illustrating a bottom-up verification. Black nodes are leaf-to-root paths of to-be-verified pages and grey nodes are their siblings whose stored hash values in HTT are used for the computation.

The writer's algorithm uses the interface provided by database buffer managers [8] to fix the HTT page in the buffer so that it is not replaced during a successful run of the writer's algorithm. This is needed because we cannot give an adversary the chance to change hash values in the HTT (while it is on disk) that the writer has already verified. As we

shall explain in the proof of Theorem 2 in Section IV-C, the verification phase is to ensure correctness.

Reader's Algorithm.
(1) Let V be the set of to-be-verified leaves (and their hash values).
(2) Repeat
(3)   Compute the hash values of the parents of the nodes in V, using their siblings' stored hash values in the HTT if needed. Remove these nodes from V.
(4)   For each computed parent, if it is marked "verified" (in VMAP) but the computed hash is different from what is stored in the node, then return FAILURE; if it is not marked "verified", then add it (with the computed hash value) to V.
(5) Until V is empty or it only has the root.
(6) If V is empty, return SUCCESS; else return the result of verifying the root.

Fig. 5 A reader transaction's algorithm for verification in HTT
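The condensed, single-threaded sketch below mirrors the reader's algorithm on an in-memory model of the HTT (per-level tag arrays plus a VMAP bitmap). It reuses the Tag type and hashConcat() helper from the construction sketch in Section II-B, and it simplifies away concurrency and the details of the signed-root check.

// Condensed model of the Reader's Algorithm (Fig. 5): propagate the freshly
// computed page tags bottom-up, using siblings' stored tags, stopping early
// at nodes that a writer has already marked "verified" in VMAP.
// Reuses Tag and hashConcat() from the HTT construction sketch (Section II-B).
#include <algorithm>
#include <cstddef>
#include <map>
#include <vector>

struct HttModel {
    std::size_t fanout;
    std::vector<std::vector<Tag>> tags;        // tags[0] = page tags, tags.back() = {root}
    std::vector<std::vector<bool>> verified;   // VMAP, same shape as tags
    Tag trustedRoot;                           // reference value (Section II-B)
};

// toVerify maps a leaf index to its freshly computed page tag.
static bool readerVerify(const HttModel& h, std::map<std::size_t, Tag> toVerify) {
    if (toVerify.empty()) return true;
    for (std::size_t level = 0; level + 1 < h.tags.size(); ++level) {
        std::map<std::size_t, Tag> parents;
        for (auto it = toVerify.begin(); it != toVerify.end(); ) {
            std::size_t p = it->first / h.fanout;
            std::size_t first = p * h.fanout;
            std::size_t last = std::min(first + h.fanout, h.tags[level].size());
            // Hash the whole child group, preferring freshly computed tags and
            // falling back to the siblings' tags stored in the HTT.
            std::vector<Tag> group;
            for (std::size_t c = first; c < last; ++c) {
                auto f = toVerify.find(c);
                group.push_back(f != toVerify.end() ? f->second : h.tags[level][c]);
            }
            Tag computed = hashConcat(group);
            if (h.verified[level + 1][p]) {
                if (computed != h.tags[level + 1][p]) return false;   // mismatch
            } else {
                parents[p] = computed;                                // keep climbing
            }
            it = toVerify.lower_bound(last);   // skip the rest of this child group
        }
        if (parents.empty()) return true;      // every path ended at a verified node
        toVerify = std::move(parents);
    }
    return toVerify.begin()->second == h.trustedRoot;   // verify the computed root
}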

Writer's Algorithm.
(1) Fix (i.e., pin) the HTT page in the buffer.
(2) Verification Phase. We first verify both the MVS of the writer transaction and the old page tags of the to-be-updated pages using the Reader's Algorithm. If successful, using CAS, we mark each internal node (whose hash value we computed during verification) as "verified" in VMAP.
(3) Marking Phase. For each leaf in the to-be-updated list, along the path from it to the root, increment (using CAS) every internal node's ACC.
(4) Update Phase. For each leaf in the to-be-updated list, along the path from it to the root, in bottom-up order, decrement (using CAS) every internal node's ACC, unless a node's ACC is 1 but it has a child with a non-zero ACC value, in which case we stop the bottom-up decrement and continue on the next leaf. If we have just changed a node's ACC value from 1 to 0, we compute and update its new hash value.
(5) Helping Phase. Starting from the root, arbitrarily pick a child with non-zero ACC. Traverse downward in depth-first order until we reach a node (with ACC v) whose children all have ACC value 0, or are leaves. Using CAS, change this node's ACC from v to 0. Compute and update its hash value. Go up one level and continue this depth-first traversal and update, until we get back to the root and its ACC is 0.
(6) The hash value at the root is the new reference value.
(7) Unfix the HTT page.

Fig. 6 A writer transaction's algorithm for verification and updates in HTT
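To make the marking and update phases concrete, the simplified sketch below operates on an in-memory model in which each internal node carries a std::atomic<int> ACC. recomputeHash() is a placeholder for hashing a node from its children's stored tags, and the verification and helping phases are omitted; this is a sketch of the idea, not our actual implementation.

// Simplified model of the writer's Marking and Update phases (steps 3-4 of
// Fig. 6). The verification and helping phases, and the real page layout,
// are omitted; recomputeHash() stands in for hashing a node's child group.
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <vector>

struct WNode { std::atomic<int> acc{0}; /* the 20-byte hash tag lives here too */ };

struct WriterHtt {
    std::size_t fanout;
    std::vector<std::vector<WNode>> nodes;   // internal levels only; last level = root
    void recomputeHash(std::size_t level, std::size_t idx) { /* hash children */ }
    bool childHasNonZeroAcc(std::size_t level, std::size_t idx) {
        if (level == 0) return false;        // children are page tags (no ACC)
        std::size_t first = idx * fanout;
        std::size_t last = std::min(first + fanout, nodes[level - 1].size());
        for (std::size_t c = first; c < last; ++c)
            if (nodes[level - 1][c].acc.load() != 0) return true;
        return false;
    }
};

// CAS-based add; returns the value observed just before the update.
static int casAdd(std::atomic<int>& a, int delta) {
    int old = a.load();
    while (!a.compare_exchange_strong(old, old + delta)) { /* retry */ }
    return old;
}

// Marking phase: increment the ACC of every internal node on each leaf-to-root path.
static void markingPhase(WriterHtt& h, const std::vector<std::size_t>& leaves) {
    for (std::size_t leaf : leaves)
        for (std::size_t level = 0, idx = leaf / h.fanout; level < h.nodes.size();
             ++level, idx /= h.fanout)
            casAdd(h.nodes[level][idx].acc, +1);
}

// Update phase for one leaf: decrement ACCs bottom-up; whoever takes an ACC
// from 1 to 0 recomputes that node's hash. If ACC is 1 but a slow sibling
// writer still has a non-zero ACC below, leave it for the helping phase.
static void updatePhaseForLeaf(WriterHtt& h, std::size_t leaf) {
    for (std::size_t level = 0, idx = leaf / h.fanout; level < h.nodes.size();
         ++level, idx /= h.fanout) {
        std::atomic<int>& acc = h.nodes[level][idx].acc;
        if (acc.load() == 1 && h.childHasNonZeroAcc(level, idx))
            return;                              // stop; helping phase finishes this path
        if (casAdd(acc, -1) == 1)                // our decrement took ACC from 1 to 0,
            h.recomputeHash(level, idx);         // so we compute this node's new hash
    }
}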

The marking and update phases are designed to let concurrent writers share the hash computation work in order to avoid redundant computation or even overwriting another writer’s result erroneously. The somewhat non-intuitive case in the update phase is when ACC is one but a child has a nonzero ACC. This might happen due to a concurrent slow writer who is still in the middle of the marking phase. These are also explained in the proof of Theorem 2. The helping phase is to

ensure that an active writer always makes progress, in spite of the failures of other writers. This is detailed in the proofs of Theorem 2 and Theorem 4.

C. Correctness of the Algorithms

We show that the algorithms give the correct result for both writers and readers. In Sections IV-D and IV-E, we shall also show the progress guarantee and analyze the performance.

Theorem 2. The returned result of the writer's algorithm (the root hash value) is always correct: i.e., no faked values can be mixed in, and no real page tag change can be lost (e.g., overwritten).

Proof. We first show that no false changes by an adversary can possibly be mixed in during the update process. This property is guaranteed by two measures:
• Before each hash value update of the nodes on the path from a changed page tag (a leaf of the HTT) to the root, the verification phase of the algorithm (against the old values) ensures that no false changes made to any page (which goes into, say, the sibling hash values along the update path) can be mixed in during the "legal" updates in the update phase and the helping phase.
• "Fixing" the HTT page in step (1) ensures that the hash values of the sibling nodes of an update path stay in memory during the update and are thus secure (they cannot be changed by an adversary who might have access to the disk).

We next show that no real page tag change can be lost. A page tag (i.e., leaf) change is propagated up one level at a time on the path to the root. There are only two possible ways to stop this "propagation": (1) an update to the parent node is subsequently overwritten by a concurrent update initiated by a sibling; (2) a concurrent writer transaction hangs. In case (1), for one to overwrite another, the two writers must both be in their update or helping phase, i.e., they have both been through the marking phase (using CAS), making the parent node's ACC at least two. Thus, it must be the case that the writer who comes in first "deposits" the change, and the last writer among the siblings does the actual hash update of the parent (step 4). This makes case (1) impossible. In case (2), if a concurrent writer hangs at the update or helping phase on a sibling node, then the helping phase initiated by an active transaction ensures that the "propagation" continues. The interesting sub-case is when the concurrent writer hangs at the marking phase. An active writer (in the update phase) comes first to a parent node that they both marked and just "deposits" the new hash value at the child, waiting for the slow writer to compute the hash. But the slow writer hangs there (step 3) and has never marked any node above that point. Because we are missing the increment of ACC from the slow writer, the bottom-up decrement of ACC by the active writer may see an ACC value of 1 and, if it decremented, would have to compute the hash (step 4), losing its own "deposited" value somewhere below. The algorithm takes care of this and leaves the ACC value at 1 in that case (step 4), enabling the helping phase to finish the hash computation. Thus, case (2) is also impossible. ∎

Theorem 3. The result of a reader's verification algorithm is correct.

Proof: The reader's algorithm simply computes the hash values bottom-up. The only interesting case is when it hits a node marked "verified" by a concurrent writer. In that case, the writer has verified all nodes on the path from that one to the root, and the HTT has been kept in the buffer and is secure. Note that a concurrent writer might have updated the hash value of a sibling of the reader's path but not yet updated their shared parent at the moment the reader reaches this parent (causing a mismatch if the reader had to compute the hash). But in that case, the writer must have been through the verification phase, and the parent is already marked "verified". Thus, the reader verifies successfully. ∎

D. Progress Guarantee

We show that the algorithms are wait-free. Recall that a synchronization technique is wait-free if it ensures that every transaction will continue to make progress in the face of arbitrary delay (or even failure) of other transactions.

Theorem 4. The reader's and writer's algorithms are wait-free under the assumption that the incoming page update rate is smaller than the processing rate of hashing a page tag through a leaf-to-root path of the HTT.

Proof. Consider the case in which only one writer transaction is active until it exits the HTT, and there are any number of other writer transactions that arbitrarily stop (hang) at any place within the algorithm. Clearly the "update" phase of the active transaction cannot propagate all its updated leaves out of the root, as some of them will hit an internal node with non-zero ACC, thus waiting for other transactions to finish their hash operations. Nonetheless, the "helping" phase of the active transaction will do the hashing of all updates until they exit the root, as long as there is a non-zero ACC path from them to the root, regardless of which transaction the updates belong to. This certainly includes all the updates of the active transaction, since it has been through the "marking" phase, creating a non-zero ACC path for each of its updates. Furthermore, the helping phase will eventually finish due to the assumption stated in the theorem. ∎

Note that the assumption in Theorem 4 is reasonable, as otherwise the queue of updates waiting to be processed would grow arbitrarily, beyond the processing capability of the system.

E. Performance Analysis

The progress guarantee in Section IV-D only guarantees that the operations will finish. Now we study the question of how long an individual update will spend in the system. The "input parameters" of the system are as follows. We measure time in hash steps. The processor can finish one hash operation in one hash step. We assume:
• An update comes in with probability p at every hash step.
• Each update touches l leaves uniformly at random.
• The "marking" phase is instantaneous.

Given these quantities, we want to compute the duration D of an individual update (i.e., the time interval, in hash steps, for an update to exit the HTT system).

Lemma 1. Let the height of the root of an HTT be h. For an arbitrary set of updated page tags (leaves) L, let the set of internal nodes that are the lowest common ancestor of at least two leaves in L be LCA. For an internal node n, let b(n) be the number of branches of n that are on the path from some leaf in L to the root. Let h(n) denote the height of node n. Then the cost of propagating L to the root using the writer's algorithm is
$$h + \sum_{n \in LCA} h(n) \cdot (b(n) - 1).$$

Proof: We prove this by induction on the number of leaves in L. When |L| = 2, let the LCA of the two leaves be n. Thus, we have b(n) = 2. Clearly the cost of hashing these two leaves is h + h(n) (with shared computation from node n to the root), and the lemma holds. Now suppose the lemma holds when there are l leaves in L, and we consider adding one more leaf. Let the union of paths (edges plus nodes) from the l leaves to the root be B. The new leaf must join B at some node n'. Accordingly, b(n') increases by one due to the addition of the new leaf. If b(n') used to be 1, which means n' was not in LCA before, n' is now in LCA with b(n') = 2. If b(n') used to be more than 1 (thus it used to be in LCA), it stays in LCA. In either case, the lemma's formula dictates that the cost is increased by h(n'). This is indeed the cost of computing the hash values of the new leaf from the leaf level up to n', as required by the algorithm. Therefore the lemma holds with |L| = l + 1 too. This concludes the induction. ∎

Theorem 5. Let the input parameters be as defined earlier. Let the fan-out of the HTT be f, and its height be h. Then the expected value of the duration D of an individual update operation is upper-bounded by
$$u = h + \sum_{i=0}^{h-1} f^i \cdot (h-i) \cdot \sum_{j=2}^{f} \sum_{k=j}^{f} \binom{f}{k} \left(\frac{1}{q f^i}\right)^k \left(1 - \frac{1}{q f^i}\right)^{f-k}$$
where q is a parameter satisfying the condition
$$1 < q \le \frac{1}{1 - \left(1 - \frac{1}{f}\right)^{p\,l\,u}}.$$

Proof. Consider a random variable D indicating the time duration a random input update operation stays in the system before exiting from the root. From Lemma 1 and the linearity of expectation, we have
$$E(D) = h + \sum_{i=0}^{h-1} f^i \cdot (h-i) \cdot E(B_i - 1) \qquad (1)$$
where $B_i$ is a random variable indicating, for a node n at distance i from the root, its b(n) value as defined in Lemma 1, except that it is always 1 when b(n) < 2 (thus we only count nodes in LCA). As $B_i - 1$ is a random variable with non-negative values, we have
$$E(B_i - 1) = \sum_{j=1}^{f-1} \Pr(B_i - 1 \ge j) = \sum_{j=2}^{f} \Pr(B_i \ge j) = \sum_{j=2}^{f} \sum_{k=j}^{f} \Pr(B_i = k) = \sum_{j=2}^{f} \sum_{k=j}^{f} \binom{f}{k} p_{i+1}^{\,k} (1 - p_{i+1})^{f-k} \qquad (2)$$
where $p_i$ is the probability that a node at distance i from the root contains an updated leaf in its subtree (thus it contributes 1 to its parent's b value). The intuition of the first equality in (2) is that, for j from 1 upwards, cumulatively, $\Pr[B_i - 1 \ge j]$ is the probability that we add 1 to the expectation [21]. Thus, for a time interval D (during which about $p\,l\,D$ uniformly random leaf updates arrive),
$$p_{i+1} = 1 - \left(1 - \frac{f^{h-i-1}}{f^h}\right)^{p\,l\,D} = 1 - \left(1 - \frac{1}{f^{i+1}}\right)^{p\,l\,D}.$$
Consider a subtree of the root, and suppose the probability that it contains an updated leaf is no more than 1/q (for some q > 1):
$$p_1 = 1 - \left(1 - \frac{1}{f}\right)^{p\,l\,D} \le \frac{1}{q}, \quad \text{i.e.,} \quad 1 < q \le \frac{1}{1 - \left(1 - \frac{1}{f}\right)^{p\,l\,D}}.$$
Then, for a node at distance i+1 from the root, we have
$$p_{i+1} \le \frac{1}{q \cdot f^i}. \qquad (3)$$
Clearly, the bigger $p_{i+1}$ is, the bigger $B_i$ is, and the bigger D is. Thus, combining (1), (2), and (3) above, we have that E(D) is bounded from above by
$$u = h + \sum_{i=0}^{h-1} f^i \cdot (h-i) \cdot \sum_{j=2}^{f} \sum_{k=j}^{f} \binom{f}{k} \left(\frac{1}{q f^i}\right)^k \left(1 - \frac{1}{q f^i}\right)^{f-k}$$
where q is a parameter satisfying the condition as stated in the theorem. ∎

From Theorem 5, we can write a simple program that iterates over different values of q within its range and finds one that gives the best (tightest) upper bound, given all other parameter values. We shall show this in Figure 12 and Figure 13 of Section V-F.
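A sketch of such a program is given below: it scans candidate values of q, evaluates the bound u of Theorem 5 for each, checks the feasibility condition on q, and keeps the tightest feasible bound. The parameter values are illustrative.

// Sketch: numerically find the tightest upper bound u of Theorem 5 by
// iterating over candidate values of q. Parameter values are illustrative.
#include <cmath>
#include <cstdio>

static double choose(int n, int k) {                 // binomial coefficient C(n,k)
    double c = 1.0;
    for (int i = 1; i <= k; ++i) c = c * (n - k + i) / i;
    return c;
}

// u(q) = h + sum_i f^i (h-i) sum_{j=2..f} sum_{k=j..f} C(f,k) p^k (1-p)^(f-k),
// with p = 1/(q*f^i), the bound on p_{i+1} from inequality (3).
static double boundU(int f, int h, double q) {
    double u = h;
    for (int i = 0; i < h; ++i) {
        double p = 1.0 / (q * std::pow((double)f, i));
        double inner = 0.0;
        for (int j = 2; j <= f; ++j)
            for (int k = j; k <= f; ++k)
                inner += choose(f, k) * std::pow(p, k) * std::pow(1.0 - p, f - k);
        u += std::pow((double)f, i) * (h - i) * inner;
    }
    return u;
}

int main() {
    const int f = 20, h = 3;                         // HTT fan-out and height
    const double p = 0.05, l = 3.0;                  // illustrative workload parameters
    double best = 1e300;
    for (double q = 1.01; q <= 50.0; q += 0.01) {
        double u = boundU(f, h, q);
        double qMax = 1.0 / (1.0 - std::pow(1.0 - 1.0 / f, p * l * u));
        if (q <= qMax && u < best) best = u;         // feasible and tighter
    }
    std::printf("tightest upper bound on E(D): %.2f hash steps\n", best);
    return 0;
}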

V. EMPIRICAL STUDY

A. Setup

We implemented our techniques and algorithms in C-Store. We use a synthetic database that consists of tables of different sizes, the details of which are described in the respective experiments. A synthetic database is sufficient for our purpose since the speed of hashing does not depend on the contents of the tables. The experiments were conducted on a 1.6GHz AMD Turion 64 machine with 1GB physical memory and a TOSHIBA MK8040GSX disk.

B. Choosing HTT Fan-out

The fan-out of the HTT is a parameter that affects the speed of building the HTT from scratch and of doing verification with an established HTT when some data pages are accessed by a query. As per our analysis in Section II-C, using the result of the "openssl speed" test, when the fan-out f is between 10 and 20 the speed of verifying individual data pages is the fastest. In this experiment, we examine the actual speed of (1) building a maximum-size HTT (one page) from scratch, which can cover 3072 data pages; and (2) verifying 30 random pages (out of 3072). The numbers are the averages of 1000 runs each.

Figure 7(a) shows the running time of creating a maximum-size HTT from scratch. We can see that the bigger the fan-out, the less time it takes to build an HTT. The reason is as follows. At the base (leaf) level of the HTT, the same number of bytes is hashed regardless of the fan-out. At upper levels, clearly when the fan-out is bigger, each level has fewer bytes and there are fewer levels. Because the speed is largely determined by the number of bytes hashed, we have the observed result. On the other hand, the result of verifying 30 random pages shows a different trend. This is shown in Figure 7(b). Because we are only hashing a small number of pages (compared with the total of 3072), when the fan-out is smaller (10 or 20), a smaller number of bytes is "summarized" (hashed) at each level, and the number of levels does not differ that much (it is logarithmic). Hence, a smaller fan-out gives faster verification time. In the following experiments, we choose the fan-out to be 20.

Fig. 7 The running time of (a) creating an HTT and (b) verifying 30 pages, vs. different fan-out sizes.

C. Choosing Sub-block Size in Data Pages

Before producing the HTT, each data page first needs to be "summarized" as a "page tag" in the leaf level of the HTT. This is essentially another hash tree (per data page), which we call an in-page tree. In this experiment, we partition a data page into sub-blocks of some size (we experiment with three choices: 1024, 512, and 256 bytes). Thus, at the base level of an in-page tree, each sub-block is hashed (into a 20-byte value using SHA-1). Accordingly, in the upper levels of the in-page tree, we choose a fan-out such that each hash operation is applied to a number of bytes that is close to the sub-block size. We measure the time of both hashing a whole data page up the in-page tree and hashing only a sub-block up the in-page tree, which is often the case during query execution. Note that the hash values of the internal nodes of an in-page tree (as metadata) are stored in the data page as well, taking some space (except for the root, which is the "page tag" stored as a leaf in the HTT). For different sub-block sizes, this metadata size is different. For example, for a 64KB page, the three sub-block sizes we experiment with have in-page metadata sizes around 1.25KB, 2.6KB, and 5.3KB respectively. Thus a smaller sub-block size entails a larger space overhead.

Figure 8(a) shows the result of hashing all sub-blocks up the in-page tree. We can see that while the base level has the same number of bytes, a smaller sub-block size implies more nodes and bytes at upper levels and a deeper in-page tree, with a longer hashing time. However, when one only needs to hash a sub-block in the page, which is arguably typical during query processing, a smaller sub-block size is more efficient, as shown in Figure 8(b). This is because we hash less data at each level. Overall, considering the tradeoff between time and space overhead, we use a sub-block size of 512 bytes by default.

2

1.5

1

0.5

0

50

40

30

20

10

0

1024 512 256 sub-block size (bytes)

1024 512 256 sub-block size (bytes)

(a) (b) Fig. 8 The running time of hashing a whole data page (a) and hashing only a sub-block (b) up an in-page tree, with different sub-block sizes. 60

80

referenced bytes MVS

Verification time (milliseconds)

Verification time (milliseconds)

referenced bytes MVS

70

50

40

30

20

60 50 40 30 20

10 10 0

500

1000 1500 Number of pages

2000

0

0

0.005

(a)

0.01 0.015 0.02 0.025 Range query selectivity

0.03

(b)

E. Comparison with Other Systems We now examine the overhead of our techniques for integrity checking of query sources. We compare the performance of our system with a baseline system without any checking, a system that does encryption instead of integrity checking, and a system that does integrity checking for each accessed file. We experiment on a table student_grades with a schema as shown in Figure 3 in Section III-A: name CHAR(30), id INTEGER, year CHAR(10), grade INTEGER, instructor CHAR(30), grader CHAR(30), comments CHAR(50). The table has 3M records. The minimum attribute set (MAS) contains name, id, and grade, which we verify at runtime. We issue the following two queries: Q1: SELECT instructor, year, AVG(grade) FROM student_grades WHERE id > ? AND id < ? GROUP BY instructor, year Q2: SELECT COUNT(*) FROM student_grades WHERE grade < ? AND comments IS NOT NULL

Q1 asks for some statistics of the grades of the instructors’ courses using samples of a certain size while Q2 asks for a count of the students who may need some attention (i.e., grade below certain number and have comments from the grader). There is a B+ tree index on the id column and another one on the grade column. Figure 11 (a) and (b) show the results of the two queries.

Fig. 9 Verification time of MVS vs. that of all referenced bytes

3

0.4 referenced bytes MVS

0.3 0.25 0.2 0.15 0.1

Baseline system

With runtime verification With encryption

2.5 Execution time (seconds)

Verification time (milliseconds)

0.35

3.5

Baseline system

With runtime verification With encryption

3

File system verif ication

File system verif ication

Execution time (seconds)

Hashing time for in-page tree (milliseconds)

2.5

vary the file size. In (b), we fix the file size to be 1000 pages but vary the selectivity of the range query. We can see that using MVS consistently saves on the verification cost. Figure 10 shows the saving by using MVS on B+ trees that have different numbers of levels. We can see that the saving here is quite significant, due to the considerable decrease of bytes to be verified for each B+ tree node.

2

1.5

1

2.5 2 1.5 1

0.05

0.5 0

2

3 4 Number of levels in B+ tree

0.5

5

0

Fig. 10 Verification time of MVS vs. that of all referenced bytes on the B+ tree data structure

D. Minimum Verification Set In this experiment, we examine how much we save by only verifying MVS, compared to verifying all read or written bytes. We look at two interesting file structures discussed in Section III-B, namely sorted files and B+ trees. Figure 9 shows the experiments on sorted files. We compare the running time of verifying all referenced bytes (read or written) with the running time of only verifying MVS. In (a), we fix the selectivity of the range query to be 0.01 but

0

2

4 6 8 # of qualified records

(a)

10 5

x 10

0

0

2

4 6 8 # of qualified records

(b)

10 5

x 10

Fig. 11 Execution times for queries Q1 (a) and Q2 (b) in different systems

We vary parameters of the queries and hence the number of qualified records. As expected, the baseline system without any integrity checking or any encryption is the fastest. Our runtime verification system comes next. We use SHA-1 on the three protected attributes. For Q1, grade and id columns are checked while instructor and year are not. For Q2, grade is and comments is not. We compare with a system that uses DES in CBC mode for encryption on the same three protected

attributes. It involves decryption and is slower than integrity checking. Moreover, as we discussed in Section I, it cannot guard against attacks that simply swap two encrypted values or rollback a value to its old version. A system that does verification at the file system level checks the integrity of all accessed files (using SHA-1) and is the slowest due to the extra data (attributes other than MAS and data other than MVS – Section III) that it has to check. Note that the performance difference is more pronounced in Q2 (Fig. 11 b) as the attribute that has to be checked by our system (grade) takes up a smaller portion due to the size of the comments attribute. Compared to checking at the file system level, our system observes the semantics of the data in a DBMS and provides fine-grained protection. actual maximum delay result from Theorem 5

F. Maximum Delay of Concurrent Writers
In this experiment, we empirically verify the synchronization mechanism we developed for concurrent writers. By varying two parameters, p (the probability that a writer transaction enters the system at each hash step) and l (the number of pages a writer changes), we verify the progress and performance guarantees analyzed in Section IV. We first fix l at 3 and vary the incoming rate of writer transactions by setting the probability p that a new writer enters at each hash step. We then measure the maximum delay of a writer in the system and compare it with the result calculated from Theorem 5. Figure 12 shows the result. The two bars in each group are roughly consistent (the small differences are likely caused by process scheduling overhead). We can see that the synchronization is smooth: no writer is locked out or starving, and the maximum delay is well bounded. Recall that the HTT has three levels, so the delay should be at least 3 hash steps for a writer to propagate up the tree. Thus, the synchronization is very lightweight because we avoid the overhead of locking. We then fix p at 0.05 and vary l (the number of pages a writer changes). The result is shown in Figure 13. Again we observe that the maximum delay is well bounded and close to the result of the analysis from Theorem 5.

Fig. 12 Maximum delay of a writer transaction and the comparison with the result of Theorem 5.

Fig. 13 Maximum delay of a writer in the system with different numbers of pages a writer changes, and the comparison with the result of Theorem 5.
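The following is a minimal sketch of the compare-and-swap idea behind this lightweight synchronization. It is not the paper's algorithm (which adds mark, update, and helping phases so that stalled or crashed writers do not block others); it only shows how a writer can recompute a parent hash from its children and install it atomically, retrying a hash step if a concurrent writer wins the race. The tree layout and node fields are assumptions for illustration.

```java
import java.security.MessageDigest;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of lock-free hash propagation with compare-and-swap on a small,
// full binary hash tree (hypothetical structure, not the paper's HTT code).
public class CasPropagationSketch {
    static class Node {
        final AtomicReference<byte[]> hash = new AtomicReference<>(new byte[20]);
        Node left, right, parent;
    }

    static byte[] sha1(byte[] a, byte[] b) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(a);
        md.update(b);
        return md.digest();
    }

    // After changing a leaf, walk up and refresh each ancestor's hash.
    static void propagateUp(Node leaf) throws Exception {
        for (Node n = leaf.parent; n != null; n = n.parent) {
            while (true) {
                byte[] expected = n.hash.get();
                byte[] fresh = sha1(n.left.hash.get(), n.right.hash.get());
                // One "hash step": recompute and try to install atomically.
                if (n.hash.compareAndSet(expected, fresh)) break;
                // CAS failed: a concurrent writer updated this node first;
                // retry using the children's latest hashes.
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Node root = new Node(), left = new Node(), right = new Node();
        root.left = left; root.right = right;
        left.parent = root; right.parent = root;
        propagateUp(left);                          // refresh root after changing 'left'
        System.out.println(root.hash.get().length); // 20-byte SHA-1 digest
    }
}
```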

G. Crashes of Writers
In this study, we ran nine experiments, each with 1000 concurrent writers and a different crash probability ranging from 0.1 to 0.9. When a writer crashes, it stops in one of three phases (mark, update, or helping) with equal probability. Figure 14 shows the result. As we described earlier, there is no blocking (deadlock or livelock) when threads arbitrarily stop in different phases. This property is indirectly observable in this set of experiments, since in all cases the worst-case writer delay never exceeds 6 hash steps (a very small number); in other words, no transaction waits long. As the crash probability goes up, the maximum delay even drops slightly, because the number of competing writers drops and the total amount of "real work" drops with it.

Fig. 14 Maximum delay of a successful writer in the system when writers crash and stop with different probabilities in different phases of the algorithm.

VI. RELATED WORK
The invention of the Merkle hash tree [19] inspired a line of research that uses it for secure and efficient authentication of data. Data authentication has been used in data publication over the Internet, e.g., [3, 5, 17, 7, 26, 24, 25, 4]. The introduction of the Database-as-a-Service model [9] gave data authentication a new application. Targeting this model, Narasimha and Tsudik [23] developed a new approach (DSAC) based on signature aggregation and chaining. Mykletun et al. [22] explore the applicability of digital signature schemes (RSA, DSA, and recently proposed schemes) and present performance measurements. Li et al. [13] proposed the Merkle B-tree, a state-of-the-art, disk-based authenticated structure. [14] and [27] are recent works on outsourced data streams, targeting long-running queries and sliding-window queries.

Our work differs from this line of work in important ways. We solve a different problem: our aim is not for the server to provide a proof that the client can use to verify the query results. Instead, we let the server, which has abundant resources, verify the integrity of the data and metadata (coming from the storage system) that are the "ingredients" of a query result. Thus, unlike previous work, our approach is not limited by query type, nor do we require the existence and use of specific indexing structures such as B+ trees. Any arbitrary query type works the same way, and we ensure both correctness and completeness of the result for a trusted server (Section II-B). In contrast, in previous work, authentication applies only to specific query types; one would not be able to verify the result of a more complex query. Nonetheless, servers of outsourced databases still have an incentive to use our techniques, as we discussed in Section I. Maheshwari et al. [16] did pioneering work in dealing with untrusted storage of a database, but their work targets a non-relational database system, TDB, an embedded DBMS built specifically for digital rights management [34]. TDB is tightly integrated with C++: it provides typed storage of C++ objects and uses C++ as the data definition language. [16] concentrates on the architecture and integrates encryption and hashing with TDB's low-level data model, which is log-structured storage. [16] does not handle high throughput or concurrency and does not need sophisticated query processing for its applications. Our work, in contrast, targets relational database systems and provides a framework for selective verification as well as lock-free algorithms for scalable concurrency.

VII. CONCLUSIONS
Storage systems are increasingly under attack due to new security vulnerabilities of Internet servers. Thus, for many applications it becomes required or highly desirable to have full confidence in the authenticity of query results computed from data read from insecure storage systems. In this paper, we provide a framework that uses the Merkle hash tree structure to check the authenticity of the data at query execution time. The verification is minimal and definite. We discuss the minimum verification set for different data structures and use the compare-and-swap primitive, together with concepts from the theory of distributed computing, to ensure that concurrent writers make progress efficiently. We prove the correctness and progress properties and analyze the performance of our algorithm. Finally, we conduct a comprehensive empirical study to validate our proposals and algorithms.

REFERENCES

[1] D. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD, 2006.
[2] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Order preserving encryption for numeric data. In SIGMOD, 2004.
[3] E. Bertino, B. Carminati, E. Ferrari, B. Thuraisingham, and A. Gupta. Selective and authentic third-party distribution of XML documents. TKDE, 16(10), 2004.
[4] W. Cheng, H. Pang, and K. Tan. Authenticating multi-dimensional query results in data publishing. In DBSec, 2006.
[5] P. Devanbu, M. Gertz, C. Martel, and S. Stubblebine. Authentic data publication over the Internet. Journal of Computer Security, 11(3), 2003.
[6] T. Ge and S. Zdonik. Fast, Secure Encryption for Indexing in a Column-Oriented DBMS. In ICDE, 2007.
[7] M. T. Goodrich, R. Tamassia, N. Triandopoulos, and R. Cohen. Authenticated data structures for graph and geometric searching. In CT-RSA, pages 295–313, 2003.
[8] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques. Morgan Kaufmann Publishers, Inc., 1993.
[9] H. Hacıgümüş, B. Iyer, and S. Mehrotra. Providing Databases as a Service. In ICDE, 2002.
[10] M. Herlihy. Wait-Free Synchronization. TOPLAS, 11(1):124–149, January 1991.
[11] M. Herlihy, V. Luchangco, and M. Moir. Obstruction-Free Synchronization: Double-Ended Queues as an Example. In ICDCS, 2003.
[12] M. Herlihy and J. Wing. Axioms for concurrent objects. In ACM Symposium on Principles of Programming Languages, 1987.
[13] F. Li, M. Hadjieleftheriou, G. Kollios, and L. Reyzin. Dynamic authenticated index structures for outsourced databases. In SIGMOD, 2006.
[14] F. Li, K. Yi, M. Hadjieleftheriou, and G. Kollios. Proof-Infused Streams: Enabling Authentication of Sliding Window Queries on Streams. In VLDB, 2007.
[15] J. Li, M. Krohn, D. Mazières, and D. Shasha. Secure Untrusted Data Repository (SUNDR). In OSDI, 2004.
[16] U. Maheshwari, R. Vingralek, and W. Shapiro. How to build a trusted database system on untrusted storage. In OSDI, 2000.
[17] C. Martel, G. Nuckolls, P. Devanbu, M. Gertz, A. Kwong, and S. Stubblebine. A general model for authenticated data structures. Algorithmica, 39(1):21–41, 2004.
[18] W. Mendenhall and T. Sincich. Statistics for Engineering and the Sciences, 4th edition. Prentice-Hall, Inc., 1994.
[19] R. Merkle. A Certified Digital Signature. In CRYPTO, 1989.
[20] G. Miklau. Confidentiality and Integrity in Distributed Data Exchange. PhD thesis, University of Washington, 2005.
[21] M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.
[22] E. Mykletun, M. Narasimha, and G. Tsudik. Authentication and integrity in outsourced databases. In NDSS, 2004.
[23] M. Narasimha and G. Tsudik. DSAC: Integrity of outsourced databases with signature aggregation and chaining. In CIKM, 2005.
[24] G. Nuckolls. Verified Query Results from Hybrid Authentication Trees. In Data and Applications Security, 2005.
[25] H. Pang, A. Jain, K. Ramamritham, and K.-L. Tan. Verifying completeness of relational query results in data publishing. In SIGMOD, 2005.
[26] H. Pang and K.-L. Tan. Authenticating query results in edge computing. In ICDE, 2004.
[27] S. Papadopoulos, Y. Yang, and D. Papadias. CADS: Continuous Authentication on Data Streams. In VLDB, 2007.
[28] M. Rajagopal, E. G. Rodriguez, and R. Weber. Fibre Channel Over TCP/IP (FCIP). RFC 3821, July 2004.
[29] R. Ramakrishnan and J. Gehrke. Database Management Systems, 3rd edition. McGraw-Hill, 2003.
[30] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner. Internet Small Computer Systems Interface (iSCSI), 2004.
[31] S. Smith, E. Palmer, and S. Weingart. Using a high-performance, programmable secure coprocessor. In Financial Cryptography, 1998.
[32] M. Stonebraker, D. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. O'Neil, P. O'Neil, A. Rasin, N. Tran, and S. Zdonik. C-Store: A Column-Oriented DBMS. In VLDB, 2005.
[33] M. Stonebraker, S. Madden, D. Abadi, S. Harizopoulos, P. Helland, and N. Hachem. The End of an Architectural Era (It's Time for a Complete Rewrite). In VLDB, 2007.
[34] R. Vingralek, U. Maheshwari, and W. Shapiro. TDB: A Database System for Digital Rights Management. In EDBT, 2002.
[35] A. Yumerefendi and J. Chase. Strong Accountability for Network Storage. In FAST, 2007.
[36] Oracle Corporation. Database Encryption in Oracle 8i, August 2000.
[37] Amazon Elastic Compute Cloud (EC2). http://aws.amazon.com/ec2.