A metaprogramming approach to data provenance James Cheney University of Edinburgh
[email protected]
Abstract Provenance is information about the history, origin, or derivation of a piece of data in a database. It is useful for debugging, error correction, and data integrity, especially in scientific databases. For automatic provenance tracking techniques to behave consistently and reliably, it is important to prove their correctness. While semantic correctness criteria for provenance have been proposed for simple query languages, these criteria have been difficult to adapt to realistic database query languages including features such as grouping, negation, and aggregation. We take the view that correct provenance must show both how the output depends on the input and how the output was computed from the input. Moreover, since provenance should be user-readable, we believe that it is expedient to use the expressions of the query language itself as provenance. Thus, in our view, provenance tracking is a metaprogramming problem. In this paper, we motivate and describe the data provenance problem, identify correctness criteria, and define and prove correct a dynamic provenance tracking technique for a core database query language with aggregation, grouping, and negation operations. We also discuss possible applications of existing techniques for partial evaluation, metaprogramming and static analysis to provenance tracking.
1. Introduction When a computational process is used to obtain scientific results, it can be difficult to make an informed judgment about the reliability of the results. This is increasingly a problem as scientists make use of more complex computer systems such as distributed grid computing and shared databases, instead of simple numerical computations. Rather than simply trust the results of a complex distributed computation or database query, many scientists now record auxiliary information about the process used to obtain the results, or at least believe such information ought to be recorded. This auxiliary information is often called provenance [10, 27] or lineage [6, 13, 14]. Provenance is an essential ingredient in scientific data, and is useful for detecting errors, propagating corrections, and assuring integrity. A number of systems of increasing levels of generality have been developed for tracking and retaining provenance for scientific
and geospatial data [6, 7, 16, 23, 25]. Many of these systems are designed for specific scientific disciplines, and most record a high-level description of the program or processing steps taken in constructing a result; often the level of granularity is at the level of files, data sets, and macroscopic processing steps such as scripts or high-level functions. The provenance of an output is a tree which shows how the result (root value) was computed by composing processing steps (internal nodes) and input values (leaves). Usually, there is a lot of redundancy in this representation, so an application-specific DAG representation is used to store this information efficiently for all outputs. These representations (like the underlying computations) are often untyped; indeed, detecting and tracking down type and format-conversion errors after the fact is a typical application of this form of provenance. In addition, many researchers have studied adding support for provenance to general-purpose computer systems, such as databases [5, 8, 9, 13, 14, 32, 33] or filesystems [17]. Most of this work has taken place in the database community, where notions of provenance have been defined for queries, views, and updates. Typically, one defines the provenance of an output value (e.g., a tuple) as a minimal fragment (e.g. set of tuples) of the input database on which the output depends. This could mean the part of the input from which the output was copied, or the (larger) part whose values contributed to the output. The first type of provenance is often called where-provenance because it shows where the data came from; the second is often called why-provenance because it explains why the data is present in the output [10]. Most of this work to date has focused on the database tuple as the level of granularity, and none provides a convincing explanation of provenance for queries with negation, aggregation, grouping, or operations on base types. 
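The tree-and-DAG representation just described can be sketched concretely. The following Python fragment is our own illustration (the node encoding and hash-consing cache are not from any of the cited systems); it shows how identical derivation subtrees collapse into shared DAG nodes:

```python
# A provenance tree: leaves are input values, internal nodes are
# processing steps. Hash-consing identical (step, children) pairs
# yields the DAG representation that avoids storing redundant subtrees.
# All names here are ours, introduced for illustration.

cache = {}

def node(step, *children):
    key = (step, children)
    if key not in cache:        # first time we see this derivation
        cache[key] = key
    return cache[key]           # identical derivations share one node

leaf_a = node("input:a")
leaf_b = node("input:b")
t1 = node("merge", leaf_a, leaf_b)
t2 = node("merge", leaf_a, leaf_b)
assert t1 is t2                 # the two derivations are shared
```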
As an example, consider the following query Q:

  SELECT x.a, SUM(x.b)
  FROM x in T UNION U
  WHERE x.c > 500
  GROUP BY x.a

This query takes the union of two tables, selects those tuples whose c-value exceeds 500, groups the resulting rows by a-value, then sums the b-values in each group and returns rows containing the a-values and corresponding sums. Example input and output are shown in Figure 1. We decorate the tables T and U with their names and row identifiers T1, U1, . . . in order to make it possible to refer to parts of the database in the following discussion. Figure 2 shows the classical why- and where-provenance that could be associated with this query. The why-provenance of an output tuple is a set of input tuples containing all the data that contributed to the computation of the output; that is, a subset of the input tuples from which the output can be recomputed. The where-provenance of an output tuple is a set of input tuples that includes all of the data that was copied into the output tuple. For queries including grouping and aggregation such as Q, classical why- and
  T    a   b    c            U    a   b    c
  T1   1   1    150          U1   1   10   550
  T2   2   42   600          U2   2   20   300
  T3   3   3    800          U3   3   30   700

  Q(T, U) = {(1, 10), (2, 42), (3, 33)}

Figure 1. Input data and results for the query Q(T, U)

  Why(1,10) = {T1, T2, T3, U1, U2, U3}     Where(1,10) = {U1}
  Why(2,42) = {T1, T2, T3, U1, U2, U3}     Where(2,42) = {T2}
  Why(3,33) = {T1, T2, T3, U1, U2, U3}     Where(3,33) = {T3} or {U3}

Figure 2. Classical, tuple-based why- and where-provenance

  Q1(x)    = {y | y ∈ T ∪ U, y.c > 500, y.a = x}
  C        = ({x | x ∈ T ∪ U, x.c > 500} = {U1, T2, U3, T3})
  P1       = true ▷ U1.a
  C1       = (Q1(U1.a) = {U1})
  P10      = C1 ▷ U1.b
  P(1,10)  = C1 ▷ (U1.a, U1.b)
  P2       = true ▷ T2.a
  C2       = (Q1(T2.a) = {T2})
  P42      = C2 ▷ T2.b
  P(2,42)  = C2 ▷ (T2.a, T2.b)
  P3       = true ▷ T3.a
  C3       = (Q1(T3.a) = {T3, U3})
  P33      = C3 ▷ T3.b + U3.b
  P(3,33)  = C3 ▷ (T3.a, T3.b + U3.b)
  PQ(T,U)  = C1 ∧ C2 ∧ C3 ∧ C ▷ {(U1.a, U1.b), (T2.a, T2.b), (T3.a, T3.b + U3.b)}

Figure 3. Expression provenance (we write c ▷ p for a provenance annotation with why-component c and where-component p)
where-provenance is unsatisfactory. The why-provenance captures too many dependences: it says that each part of the output depends on each part of the input. The where-provenance is somewhat arbitrary, because the source location from which a value is copied depends on arbitrary choices made during query evaluation (e.g., 3 could have been copied from either T3.a or U3.a), and some parts of the output may depend on the input without being literal copies of any part of the input (e.g., 33 was not present in the original database, so it contributes nothing to where-provenance). Consequently, classical why-provenance captures too many false dependences, while where-provenance misses some true dependences.

Semantic criteria for defining provenance have been studied for simple database query languages, but it appears difficult to extend these criteria to handle more realistic query languages or update languages. In recent work, Buneman, Chapman, Cheney and Vansummeren have introduced definitions of provenance for queries and updates which attempt to capture sensible behavior for richer query and update languages [8, 9]. However, there are many design decisions for which there is no obvious a priori right answer, and it is sometimes difficult to explain why certain design decisions that seem wrong should be avoided. We therefore believe it is crucial to identify formal criteria for the correctness of provenance information and tracking; that is, to make explicit the guarantees or invariants that information should obey in order to be called provenance.

1.1 Overview of our approach
There are many similarities between the motivation and applications for provenance and techniques for analyzing, understanding, and improving programs, including:

• use-def chains in data-flow analyses [19],
• justifications in declarative debugging for functional or logic languages [4, 21],
• intermediate data structures used in adaptive or incremental computation [3, 22],
• program dependence graphs and traces in program slicing (especially for functional or logic programs) [20, 24], and
• type systems for dependency and information flow security [1].
Based on these similarities, we believe that techniques familiar from the above topics may be useful in the study of data provenance. Unfortunately, it does not appear to be the case that these techniques can be used directly for provenance tracking. Provenance differs from all of the above in that it is intended to be read
and understood by ordinary users, to help them understand the results of a computation, and may be meant to be stored permanently alongside the results of the computation and used for many purposes, rather than temporarily within a programming tool for a single purpose. Thus, factors such as succinctness, readability, and application-independence are important design considerations for computing and storing provenance.

In this paper we study provenance for a typed database query language that includes collection (set) types, grouping and aggregation, and operations on base types. We define the provenance of a value v : τ as an annotation c ▷ p which shows how v was computed from the data in the database. Here, c is an expression of type bool which shows the control dependences of v on the input (data that affected the computation of v) and p is an expression of type τ that shows the data dependences of v on the input (how v was computed from the input). Control dependences generalize the classical notion of why-provenance, whereas data dependences generalize where-provenance, so we call c the why-provenance component and p the where-provenance component of c ▷ p.

Figure 3 shows correct provenance records for all of the parts of the result of the example query Q, culminating in the provenance PQ(T,U) of the whole result table. To simplify notation, we take advantage of the fact that all of the values in the output are distinct; thus, P1 is the provenance of the unique copy of 1, P(3,33) is the provenance of (3, 33), and so on. We also make use of several abbreviations Q1(−), C, C1, C2, C3. The key conditions on which the result value depends can be phrased in terms of the results of intermediate queries. For example, condition C3 states that the results depend on the fact that the only tuples in T ∪ U having c-value greater than 500 and a-value equal to T3.a are {T3, U3}. This implies that T3.a = U3.a and that T3.c > 500 and U3.c > 500.
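These provenance records can be checked concretely against the data of Figure 1. The following Python sketch is ours (the row encoding and the helper names eval_Q and Q1 are introduced for illustration); it evaluates Q and verifies condition C3, including the effect of inserting a new row:

```python
# Rows are (a, b, c) triples, keyed by the row identifiers of Figure 1.
T = {"T1": (1, 1, 150), "T2": (2, 42, 600), "T3": (3, 3, 800)}
U = {"U1": (1, 10, 550), "U2": (2, 20, 300), "U3": (3, 30, 700)}

def eval_Q(T, U):
    """SELECT x.a, SUM(x.b) FROM x in T UNION U WHERE x.c > 500 GROUP BY x.a."""
    rows = list(T.values()) + list(U.values())
    selected = [(a, b, c) for (a, b, c) in rows if c > 500]
    return {(a, sum(b2 for (a2, b2, _) in selected if a2 == a))
            for (a, _, _) in selected}

def Q1(x, T, U):
    """Q1(x) = {y | y ∈ T ∪ U, y.c > 500, y.a = x}, returned as row ids."""
    all_rows = {**T, **U}
    return {rid for rid, (a, b, c) in all_rows.items() if c > 500 and a == x}

assert eval_Q(T, U) == {(1, 10), (2, 42), (3, 33)}
# C3: the only selected rows with a-value T3.a are T3 and U3.
assert Q1(T["T3"][0], T, U) == {"T3", "U3"}

# Inserting T4 = (3, 4, 600) changes (3, 33) to (3, 37) and falsifies C3,
# even though the where-provenance of (3, 33) mentions only T3 and U3.
T_new = dict(T, T4=(3, 4, 600))
assert eval_Q(T_new, U) == {(1, 10), (2, 42), (3, 37)}
assert Q1(3, T_new, U) == {"T3", "T4", "U3"}
```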
The right-hand sides of P3 , P33 , and P(3,33) show how the respective values were computed from data in T and U . In particular, the aggregation operator SUM is specialized to an expression T3 .b + U3 .b. Compared to classical why- and where-provenance, this information allows us to answer many more questions about the origin of the data and effects of hypothetical changes to the database. For example, the b-fields only appear in the where-provenance, not the why-provenance. This means that if we change the b-values, we can
predict the effect on the query using the where-provenance, since the control-flow behavior of the query is unaffected. Conversely, the c-field appears in all of the conditions; thus, if a c-value is changed, the where-provenance may no longer be relevant. Also, note that we cannot do without the dependency information in C1, C2, C3, or C. To see why, suppose we updated the database by inserting a row T4 = (3, 4, 600) into T. This would change the output tuple (3, 33) to (3, 37), but not its where-provenance, since the where-provenance component of P(3,33) mentions only rows T3 and U3. However, C3 becomes false, because the query Q1(T3.a) now evaluates to {T3, T4, U3}, which differs from {T3, U3}. Similarly, inserting a tuple T5 = (5, 6, 1000) would affect the overall result of the query, because it would contribute the value (5, 6) to the result, but the other intermediate results would be unchanged.

The goal of this paper is to give a sensible definition of correctness for this form of provenance and to show how it can be computed or approximated in a systematic way. There are two obvious (and relatively easy to satisfy) properties that expression-based provenance should have:

1. Type safety: the provenance expressions are well-formed and their types are consistent with the annotated values.

2. Value correctness: the why-provenance constraints are satisfied, and the where-provenance expression evaluates to the same value as the original expression.

However, these two conditions alone are not enough. Indeed, a trivial approach that defines the "provenance" of a value as the value itself (e.g., P(3,33) = true ▷ (3, 33)) satisfies both criteria, but this approach is clearly unsatisfactory: it does not explain how the value depends on the input. In order to rule out trivial solutions, we identify a third correctness criterion for provenance:

3. Dependency correctness: if a change in the input results in a change to the output, then the change is reflected in the output's provenance. That is, provenance should capture all of the true control and data dependences.

We can also ask whether its converse holds; that is:

4. Dependency accuracy: if a change to the input does not change the output, then it does not invalidate the output's provenance. That is, accurate provenance should avoid false dependences.

Computing accurate provenance is difficult in general, because it may require reasoning about all possible executions of a query, not just the execution that is actually performed. This is impractical for a large database and impossible if we wish to compute provenance statically. Thus, we prefer to view accuracy, along with efficiency and usability, as a desirable property by which we can compare approaches, not an absolute correctness requirement.

The structure of the rest of this paper is as follows. Section 2 defines the core query language we will use. Section 3 formalizes our approach to provenance and defines the type-, value-, and dependency-correctness properties. Section 4 introduces an operational semantics which evaluates an ordinary query to a value annotated with provenance information, and proves its correctness. Section 5 discusses applicable metaprogramming, partial evaluation, and static analysis techniques, and Sections 6 and 7 discuss future work and conclude.
2. Background

2.1 Query languages
Relational algebra (consisting of selection, projection, join, union, set difference, and renaming operations) is typically used in database theory as an abstraction of real query languages such
as SQL [2]. However, SQL contains many features which cannot easily be simulated in pure relational algebra. Like any programming language, it includes basic data types and operations (such as integers, strings, and dates). SQL can also perform grouping (GROUP BY) and aggregation (SUM, AVERAGE) operations which involve nested relations (relations containing relation-valued fields). We will use a more general query language called the nested relational calculus (NRC) [2, 12]. The NRC includes only base types, pairing, and collection (finite set) types; aggregation and grouping queries can be represented directly using more primitive set operations such as comprehensions. NRC's collection types are essentially the same as the list, set, or multiset monads in functional programming languages [31]. The types and expressions of our variant of NRC are as follows:

  τ ::= bool | int | τ1 × τ2 | {τ}
  e ::= x̄ | x | let x = e1 in e2 | (e1, e2) | πi(e)
      | i | e1 + e2 | sum(e) | e1 = e2
      | b | ¬e | e1 ∧ e2 | if e0 then e1 else e2
      | ∅ | {e} | e1 ∪ e2 | e1 − e2 | {e2 | x ∈ e1} | ⋃e
Here, i ∈ Z = {. . . , −1, 0, 1, . . .} represents integer constants and b ∈ B = {true, false} denotes boolean constants. Most of the constructs are standard; we describe the less familiar ones. We distinguish between global variables x̄ ∈ GVar and local variables x ∈ Var. The former are used to refer to the contents of the database, and cannot be introduced within a query, while the latter are used for intermediate values and can be introduced by let and set comprehension. The set operations include ∅, the constant empty set; singletons {e}; set union and difference; set comprehension {e2 | x ∈ e1}; and flattening ⋃e. (The singleton, comprehension, and flattening operations are the same as the "return", "map", and "concat" operations of the monad of finite sets; this observation forms the basis for the use of comprehension syntax for database query languages [11].) By convention, we write {e1, . . . , en} as syntactic sugar for {e1} ∪ · · · ∪ {en}. Finally, we include sum, an example of an aggregation operation, which adds together all of the elements of a set and produces a value; e.g., sum{1, 2, 3} = 6. Although this language bears little resemblance to SQL, it can be used to express SQL queries involving grouping and aggregation as well as basic select-from-where queries [11]. In particular, more general comprehensions such as

  {(x, y) | x ∈ R, y ∈ S, π1(x) = π2(y)}

can be expanded to

  ⋃{⋃{if π1(x) = π2(y) then {(x, y)} else ∅ | x ∈ R} | y ∈ S}
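The expansion can be sanity-checked with ordinary sets. In the following Python sketch (our own transliteration, with sng and flatten standing in for the singleton and ⋃ operations), the direct comprehension and its expansion agree:

```python
# Checking the comprehension expansion:
#   {(x, y) | x ∈ R, y ∈ S, π1(x) = π2(y)}
#     = ⋃{ ⋃{ if π1(x) = π2(y) then {(x, y)} else ∅ | x ∈ R } | y ∈ S }
# sng and flatten play the roles of singleton and ⋃ (the "return" and
# "concat" of the finite-set monad). The encoding is ours.

def sng(v):
    return frozenset({v})

def flatten(ss):
    return frozenset(v for s in ss for v in s)

R = frozenset({(1, "a"), (2, "b")})
S = frozenset({("x", 1), ("y", 3)})

direct = frozenset((x, y) for x in R for y in S if x[0] == y[1])
expanded = flatten(frozenset(
    flatten(frozenset(
        sng((x, y)) if x[0] == y[1] else frozenset()
        for x in R))
    for y in S))

assert direct == expanded
assert direct == {((1, "a"), ("x", 1))}
```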
Pattern-matching syntax in binding lists and explicit field names can also be translated to our core language. The example query Q from the introduction can be implemented as follows in NRC:

  let W = {x | x ∈ T ∪ U, x.c > 500} in
    {(x.a, sum{y.b | y ∈ W, x.a = y.a}) | x ∈ W}

Although it is more expressive than relational algebra, the version of NRC we are using is still a considerable simplification relative to full SQL. This seems necessary in order to keep the formal development to follow manageable. In full SQL, there are many more base types and operations; tables have multiset semantics rather than set semantics by default; some operations have order-sensitive (list) semantics; and values can be missing (NULL). It is easy to add extra monadic collection types such as multisets, lists, and options that can model these phenomena (although translating from SQL to such a core language is nontrivial).
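This NRC query transliterates almost directly into any language with set comprehensions; a Python sketch (the tuple encoding of rows is our own assumption):

```python
# The NRC query
#   let W = {x | x ∈ T ∪ U, x.c > 500} in
#     {(x.a, sum{y.b | y ∈ W, x.a = y.a}) | x ∈ W}
# transliterated with Python sets; rows are (a, b, c) tuples.
from typing import FrozenSet, Tuple

Row = Tuple[int, int, int]

def query(T: FrozenSet[Row], U: FrozenSet[Row]) -> FrozenSet[Tuple[int, int]]:
    W = {x for x in T | U if x[2] > 500}     # let W = {x | x ∈ T ∪ U, x.c > 500}
    return frozenset((x[0], sum(y[1] for y in W if x[0] == y[0]))
                     for x in W)             # {(x.a, sum{...}) | x ∈ W}

T = frozenset({(1, 1, 150), (2, 42, 600), (3, 3, 800)})
U = frozenset({(1, 10, 550), (2, 20, 300), (3, 30, 700)})
assert query(T, U) == {(1, 10), (2, 42), (3, 33)}
```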
  x:τ ∈ ∆  ⟹  Γ; ∆ ⊢ x : τ
  x̄:τ ∈ Γ  ⟹  Γ; ∆ ⊢ x̄ : τ
  Γ; ∆ ⊢ e1 : τ1  and  Γ; ∆, x:τ1 ⊢ e2 : τ2  ⟹  Γ; ∆ ⊢ let x = e1 in e2 : τ2
  i ∈ Z  ⟹  Γ; ∆ ⊢ i : int
  Γ; ∆ ⊢ e1 : int  and  Γ; ∆ ⊢ e2 : int  ⟹  Γ; ∆ ⊢ e1 + e2 : int
  Γ; ∆ ⊢ e : {int}  ⟹  Γ; ∆ ⊢ sum(e) : int
  b ∈ B  ⟹  Γ; ∆ ⊢ b : bool
  Γ; ∆ ⊢ e : bool  ⟹  Γ; ∆ ⊢ ¬e : bool
  Γ; ∆ ⊢ e1 : bool  and  Γ; ∆ ⊢ e2 : bool  ⟹  Γ; ∆ ⊢ e1 ∧ e2 : bool
  Γ; ∆ ⊢ e0 : bool  and  Γ; ∆ ⊢ e1 : τ  and  Γ; ∆ ⊢ e2 : τ  ⟹  Γ; ∆ ⊢ if e0 then e1 else e2 : τ
  Γ; ∆ ⊢ e1 : τ  and  Γ; ∆ ⊢ e2 : τ  ⟹  Γ; ∆ ⊢ e1 = e2 : bool
  Γ; ∆ ⊢ e1 : τ1  and  Γ; ∆ ⊢ e2 : τ2  ⟹  Γ; ∆ ⊢ (e1, e2) : τ1 × τ2
  Γ; ∆ ⊢ e : τ1 × τ2  ⟹  Γ; ∆ ⊢ πi(e) : τi
  Γ; ∆ ⊢ ∅ : {τ}
  Γ; ∆ ⊢ e : τ  ⟹  Γ; ∆ ⊢ {e} : {τ}
  Γ; ∆ ⊢ e1 : {τ}  and  Γ; ∆ ⊢ e2 : {τ}  ⟹  Γ; ∆ ⊢ e1 ∪ e2 : {τ}
  Γ; ∆ ⊢ e1 : {τ}  and  Γ; ∆ ⊢ e2 : {τ}  ⟹  Γ; ∆ ⊢ e1 − e2 : {τ}
  Γ; ∆ ⊢ e1 : {τ1}  and  Γ; ∆, x:τ1 ⊢ e2 : τ2  ⟹  Γ; ∆ ⊢ {e2 | x ∈ e1} : {τ2}
  Γ; ∆ ⊢ e : {{τ}}  ⟹  Γ; ∆ ⊢ ⋃e : {τ}

Figure 4. Well-formed query expressions

2.2 Type system

Query language expressions can be typechecked using standard techniques. The only unusual feature of the type system is that we split the context into a global context Γ and a local context ∆:

  Γ ::= · | Γ, x̄:τ
  ∆ ::= · | ∆, x:τ

The rules for typechecking expressions are shown in Figure 4. We write Γ ⊢ e : τ as an abbreviation for Γ; · ⊢ e : τ.

2.3 Semantics

Because of the absence of recursion, we can give a simple set-theoretic semantics to types. Base types are interpreted as their sets of values, pair types as the Cartesian product of the interpretations of the components, and set types as finite sets of elements of the component type. (This restriction is standard for query languages because databases and query results have to be finite.)

  T[[bool]]    = B = {true, false}
  T[[int]]     = Z = {. . . , −1, 0, 1, . . .}
  T[[τ1 × τ2]] = T[[τ1]] × T[[τ2]]
  T[[{τ}]]     = Pfin(T[[τ]])

We define a local environment δ to be a function from local variables to values; such an environment satisfies a context ∆ (written δ : ∆) if δ(x) ∈ T[[∆(x)]] for each x ∈ dom(∆). We define global environments γ and satisfaction γ : Γ analogously. Figure 5 gives the denotational semantics of queries. Note that we overload notation for pair projection πi and set operations such as ∪ and ⋃; also, if S is a set of integers, then ΣS is the sum of its elements.

  E[[x]]γδ = δ(x)
  E[[x̄]]γδ = γ(x̄)
  E[[let x = e1 in e2]]γδ = E[[e2]]γδ[x ↦ E[[e1]]γδ]
  E[[i]]γδ = i
  E[[e1 + e2]]γδ = E[[e1]]γδ + E[[e2]]γδ
  E[[sum(e)]]γδ = Σ E[[e]]γδ
  E[[b]]γδ = b
  E[[¬e]]γδ = ¬E[[e]]γδ
  E[[e1 ∧ e2]]γδ = E[[e1]]γδ ∧ E[[e2]]γδ
  E[[if e0 then e1 else e2]]γδ = E[[e1]]γδ   (if E[[e0]]γδ = true)
                               = E[[e2]]γδ   (if E[[e0]]γδ = false)
  E[[(e1, e2)]]γδ = (E[[e1]]γδ, E[[e2]]γδ)
  E[[πi(e)]]γδ = πi(E[[e]]γδ)
  E[[∅]]γδ = ∅
  E[[{e}]]γδ = {E[[e]]γδ}
  E[[e1 ∪ e2]]γδ = E[[e1]]γδ ∪ E[[e2]]γδ
  E[[e1 − e2]]γδ = E[[e1]]γδ − E[[e2]]γδ
  E[[{e | x ∈ e0}]]γδ = {E[[e]]γδ[x ↦ v] | v ∈ E[[e0]]γδ}
  E[[⋃e]]γδ = ⋃ E[[e]]γδ

Figure 5. Semantics of query expressions

It is straightforward to show that

LEMMA 1. If Γ; ∆ ⊢ e : τ, γ : Γ, and δ : ∆, then E[[e]]γδ ∈ T[[τ]].
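The semantic equations of Figure 5 are directly executable. A minimal Python interpreter for a fragment of NRC (the tagged-tuple AST encoding is ours; the paper fixes no concrete syntax trees) might look like:

```python
# A minimal interpreter for a fragment of NRC, following the semantic
# equations of Figure 5. Expressions are encoded as tagged tuples
# (an encoding we introduce here for illustration):
#   ("var", x) ("gvar", x) ("int", i) ("plus", e1, e2) ("sum", e)
#   ("empty",) ("sng", e) ("union", e1, e2) ("comp", e2, x, e1) ("flatten", e)

def ev(e, genv, lenv):
    tag = e[0]
    if tag == "var":     return lenv[e[1]]
    if tag == "gvar":    return genv[e[1]]
    if tag == "int":     return e[1]
    if tag == "plus":    return ev(e[1], genv, lenv) + ev(e[2], genv, lenv)
    if tag == "sum":     return sum(ev(e[1], genv, lenv))
    if tag == "empty":   return frozenset()
    if tag == "sng":     return frozenset({ev(e[1], genv, lenv)})
    if tag == "union":   return ev(e[1], genv, lenv) | ev(e[2], genv, lenv)
    if tag == "comp":                       # {e2 | x ∈ e1}
        _, e2, x, e1 = e
        return frozenset(ev(e2, genv, {**lenv, x: v})
                         for v in ev(e1, genv, lenv))
    if tag == "flatten":                    # ⋃e
        return frozenset(v for s in ev(e[1], genv, lenv) for v in s)
    raise ValueError(tag)

# sum{x + x | x ∈ R} over R = {1, 2, 3} evaluates to 12.
R = frozenset({1, 2, 3})
q = ("sum", ("comp", ("plus", ("var", "x"), ("var", "x")), "x", ("gvar", "R")))
assert ev(q, {"R": R}, {}) == 12
```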
3. Defining provenance

The purpose of this section is to define provenance annotations and to define the type-, value-, and dependency-correctness properties. We first need to introduce technical machinery that makes it possible to refer to each part of the input database and to decorate each part of the result of a query with provenance annotations.

3.1 Heaps
Provenance is always defined with respect to a fixed database and needs to be able to refer to all parts of the database. We use special global variables called labels as the addresses of parts of the database, and model the database as a heap mapping labels to values. Typically, we will reserve the notation x̄ for global variables appearing in user-provided queries and use l for labels occurring only in the heap. Formally, a heap h is a mapping from labels to integers, booleans, or pair/set constructor cells:

  h ::= · | h, l ↦ k
  k ::= i | b | (l1, l2) | {l1, . . . , ln}
Since labels are global variables, they may be used in expressions (in particular, in provenance expressions). Similarly, constructors are a syntactic subclass of expressions. We consider only acyclic heaps with no sharing (that is, a heap is a forest). Heaps must also conform to a type discipline. Heap types are simply global contexts Γ containing labels; we generally use H instead of Γ to indicate that a context is a heap type. The rules for well-formedness of heaps and constructors relative to heap types are shown in Figure 6.

3.2 Provenance-annotated values
We consider provenance annotations to be pairs, written c ▷ p, where c and p are expressions which are well-formed relative to some H; we expect c to have type bool and p to have the same type as the associated value. We consider annotated values in which each subexpression has a provenance annotation. Annotated values are defined by the following grammar rules:

  a ::= c ▷ p
  v ::= w^a
  w ::= i | b | (v1, v2) | {v1, . . . , vn}
  · : ·        h : H  and  H ⊢ k : τ  ⟹  (h, l ↦ k) : (H, l:τ)

Figure 6. Well-formed heaps

Also, we lift expression operations to provenance annotations as follows. If a1 = (c1 ▷ p1) and a2 = (c2 ▷ p2), then a1 ⊕̂ a2 = (c1 ∧ c2 ▷ p1 ⊕ p2); for example, (c1 ▷ p1) +̂ (c2 ▷ p2) = (c1 ∧ c2 ▷ p1 + p2). Unary operations are lifted in a similar way, e.g. ⋃̂(c ▷ p) = c ▷ ⋃p. Similarly, constants c ∈ {i, b, ∅} are lifted to provenance annotations by defining ĉ = (true ▷ c). For convenience, we define pair(e1, e2) := (e1, e2) and sng(e) := {e}, so that we can write sng(c ▷ p) for c ▷ {p}, etc.

We define the erasure of an annotated value to be the corresponding ordinary value obtained by erasing the annotations:

  |w^a|              = |w|
  |i|                = i
  |b|                = b
  |(v, v′)|          = (|v|, |v′|)
  |{v1, . . . , vn}| = {|v1|, . . . , |vn|}
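The lifting operations can be prototyped by representing the why- and where-components as expression strings (a stand-in we choose here for illustration; in the paper they are query-language expressions):

```python
# Lifted operations on provenance annotations (c, p): a binary operation
# conjoins the why-components and applies the operation symbolically to
# the where-components; a constant gets the annotation (true, c).
# Representing provenance expressions as strings is our own shorthand.

def lift_binop(op, a1, a2):
    (c1, p1), (c2, p2) = a1, a2
    return (f"({c1}) ∧ ({c2})", f"({p1}) {op} ({p2})")

def lift_const(c):
    return ("true", str(c))

a1 = ("C1", "U1.b")          # the annotation C1 ▷ U1.b
a2 = ("C3", "T3.b + U3.b")   # the annotation C3 ▷ T3.b + U3.b
assert lift_binop("+", a1, a2) == ("(C1) ∧ (C3)", "(U1.b) + (T3.b + U3.b)")
assert lift_const(0) == ("true", "0")
```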
In what follows, we need environments γ̂, δ̂ mapping global or local variables to annotated values; they will play roles analogous to the ordinary environments γ, δ. We also extend erasure to annotated environments. Given a heap h : H, we can define a function ĥ that maps the labels of H to annotated values as follows:

  ĥ(i)                = i
  ĥ(b)                = b
  ĥ((l1, l2))         = (ĥ(l1), ĥ(l2))
  ĥ({l1, . . . , ln}) = {ĥ(l1), . . . , ĥ(ln)}
  ĥ(l)                = (ĥ(h(l)))^l

where we abbreviate w^(true ▷ l) by w^l. Note that if h : H, then H ⊢ ĥ : H ann, considering H as a global context. We define the value of e in heap h as E[[e]]h = E[[e]]|ĥ|; that is, as the value in the global environment resulting from erasing the annotations in ĥ. Given h : H ⊇ Γ, we also define the annotated global environment γ̂h by restricting ĥ to Γ (that is, γ̂h = ĥ|Γ) and define the global environment γh = |γ̂h| as the erasure of γ̂h.
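Constructing ĥ and erasing annotations can be sketched as follows (the dictionary heap encoding and the names annotate and erase are ours):

```python
# Building the annotated value ĥ(l) from a heap and erasing it again.
# A heap maps labels to constructor cells; here a cell is an int, a bool,
# ("pair", l1, l2), or ("set", [labels]) — an encoding we introduce for
# illustration. An annotated value is (w, (c, p)); a label l gets the
# trivial annotation (true, l), i.e. w^l abbreviates w^(true ▷ l).

def annotate(h, l):
    k = h[l]
    if isinstance(k, tuple) and k[0] == "pair":
        w = (annotate(h, k[1]), annotate(h, k[2]))
    elif isinstance(k, tuple) and k[0] == "set":
        w = frozenset(annotate(h, li) for li in k[1])
    else:
        w = k                        # integer or boolean cell
    return (w, ("true", l))          # ĥ(l) = (ĥ(h(l)))^l

def erase(v):
    w, _ = v
    if isinstance(w, tuple):
        return (erase(w[0]), erase(w[1]))
    if isinstance(w, frozenset):
        return frozenset(erase(x) for x in w)
    return w

# Heap: l3 ↦ (l1, l2), l1 ↦ 1, l2 ↦ 10
h = {"l1": 1, "l2": 10, "l3": ("pair", "l1", "l2")}
v = annotate(h, "l3")
assert v == (((1, ("true", "l1")), (10, ("true", "l2"))), ("true", "l3"))
assert erase(v) == (1, 10)
```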
3.3 Correctness
An annotated value may be ill-formed if either its annotations are ill-formed or their types disagree with the values they annotate. We define a well-formedness judgment for annotated values, H ⊢ v : τ ann, as shown in Figure 7, using an auxiliary judgment H ⊢ a : τ prov for well-formedness of provenance annotations. It is straightforward to show that annotated well-formedness subsumes ordinary value well-formedness:

LEMMA 2. If H ⊢ v : τ ann then ⊢ |v| : τ and |v| ∈ T[[τ]].

We extend well-formedness to annotated environments as follows. The judgment H ⊢ γ̂ : Γ ann indicates that H ⊢ γ̂(x̄) : Γ(x̄) ann for each x̄ ∈ Γ; likewise, H ⊢ δ̂ : ∆ ann indicates that H ⊢ δ̂(x) : ∆(x) ann for each x ∈ ∆. We can now define what it means for a technique for computing provenance annotations to be type-correct:
  H ⊢ i : int ann
  H ⊢ b : bool ann
  H ⊢ v1 : τ1 ann  and  H ⊢ v2 : τ2 ann  ⟹  H ⊢ (v1, v2) : τ1 × τ2 ann
  H ⊢ v1 : τ ann · · · H ⊢ vn : τ ann  ⟹  H ⊢ {v1, . . . , vn} : {τ} ann
  H ⊢ w : τ ann  and  H ⊢ a : τ prov  ⟹  H ⊢ w^a : τ ann
  H ⊢ c : bool  and  H ⊢ p : τ  ⟹  H ⊢ c ▷ p : τ prov

Figure 7. Well-formed annotated values

The simulation judgments h ⊢ v ∼ v, h ⊢ w ∼ v, and h ⊢ a ∼ v are defined by:

  h ⊢ i ∼ i
  h ⊢ b ∼ b
  h ⊢ v1 ∼ v1′  and  h ⊢ v2 ∼ v2′  ⟹  h ⊢ (v1, v2) ∼ (v1′, v2′)
  h ⊢ v1 ∼ v1′ · · · h ⊢ vn ∼ vn′  ⟹  h ⊢ {v1, . . . , vn} ∼ {v1′, . . . , vn′}
  h ⊢ w ∼ v  and  h ⊢ a ∼ v  ⟹  h ⊢ w^a ∼ v
  E[[c]]h = true  and  E[[p]]h = v  ⟹  h ⊢ (c ▷ p) ∼ v

Figure 8. Value simulation

DEFINITION 1 (Type correctness). Suppose Γ ⊢ e : τ, H ⊇ Γ, and v is an annotated value. Then v is type-correct provided H ⊢ v : τ ann.

We turn next to the value-correctness property. We have motivated annotations c ▷ p on a value v as meaning "c holds in the current heap, and p evaluates to v". However, the annotations on a value may have nothing to do with the value itself, or may be inconsistent with the information in the heap. For example, 1^(true ▷ 5) erases to 1, but is annotated with 5; likewise, 1^(l . . . 10 and c0 = x > 15). This can be simplified to c0 ▷ {p1 | x ∈ p0}, which is more concise.

5.2 Metaprogramming
Metaprogramming means writing programs that manipulate programs. If we consider queries and provenance expressions to be programs, then both provenance tracking and applications of provenance information are instances of metaprogramming. Thus, it is possible that existing techniques for metaprogramming or staged computation could be used to track provenance. We have experimented with using MetaML [26] to prototype provenance tracking. MetaML provides typed versions of Lisp-like "quote" (called "bracket", written ⟨e⟩), "antiquote" (called "escape", written ~e), and "eval" (called "run", written run(e)). Code values are of type ⟨τ⟩, and the type system ensures that code brackets, escapes, and run expressions are used consistently (to prevent, for example, evaluating a variable whose value is not available yet). One can think of a provenance-annotated value of type τ as a value paired with a pair of quoted expressions of type ⟨bool⟩ and ⟨τ⟩:

  type 'a ann  = 'a * 'a prov
  type 'a prov = <bool> * <'a>

It is straightforward to implement simple combinators that simulate the provenance-tracking semantics for arithmetic and boolean expressions. Here are the types and some implementations of these combinators; the implementations follow Figure 10 closely, except for the addition of appropriate brackets and escapes:

  val const      : 'a -> 'a ann
  val plus       : int ann -> int ann -> int ann
  val eq         : int ann -> int ann -> bool ann
  val ifthenelse : bool ann -> 'a ann -> 'a ann -> 'a ann

  fun const c = (c, (<true>, <c>))
  fun plus (x,(c1,p1)) (y,(c2,p2)) = (x+y, (<~c1 andalso ~c2>, <~p1 + ~p2>))
  fun ifthenelse (true, (c0,p0)) (v1,(c1,p1)) _ = (v1, (<~c0 andalso ~p0 andalso ~c1>, p1))
    | ifthenelse (false,(c0,p0)) _ (v2,(c2,p2)) = (v2, (<~c0 andalso not (~p0) andalso ~c2>, p2))

Extending this simple approach to handle pair and collection types does not appear to be easy. For example, a natural-seeming type for fully-annotated pairs is

  ('a ann * 'b ann) ann

but this expands to

  ('a ann * 'b ann) * (<bool> * <'a ann * 'b ann>)

which contains annotations in the where-provenance component.

Metaprogramming has also been investigated in the context of databases, in a series of papers by Neven, Van den Bussche, Van Gucht, Vansummeren, and Vossens [18, 29, 30]. Most recently, Van den Bussche, Vansummeren and Vossens [30] have developed a system called MetaSQL which provides the ability to construct and manipulate SQL query expressions (represented using XML) during query evaluation. It may be possible to translate ordinary queries to MetaSQL queries that construct provenance expressions as XML trees. MetaSQL provides a great deal of flexibility in querying provenance as XML, because it is untyped.
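The same combinators can also be prototyped without staging by pairing each value with (why, where) expression strings; the following Python transliteration is ours (not MetaML), and mirrors the sketch above, including the way the guard's where-provenance enters the why-provenance in ifthenelse:

```python
# Provenance-tracking combinators: an annotated value is (v, (c, p)),
# where the strings c and p stand in for the quoted MetaML code values.
# This transliteration and its string representation are our own.

def const(c):
    return (c, ("true", str(c)))

def plus(a, b):
    (x, (c1, p1)), (y, (c2, p2)) = a, b
    return (x + y, (f"{c1} ∧ {c2}", f"{p1} + {p2}"))

def ifthenelse(cond, thn, els):
    (v0, (c0, p0)) = cond
    (v, (c, p)) = thn if v0 else els
    # The guard's where-provenance p0 becomes part of the result's
    # why-provenance: it records the control dependence on the test.
    guard = p0 if v0 else f"¬({p0})"
    return (v, (f"{c0} ∧ {guard} ∧ {c}", p))

two = plus(const(1), const(1))
assert two == (2, ("true ∧ true", "1 + 1"))
r = ifthenelse((True, ("true", "x > 500")), two, const(0))
assert r[0] == 2
assert r[1] == ("true ∧ x > 500 ∧ true ∧ true", "1 + 1")
```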
5.3 Static analysis

It may not be feasible to track exact dynamic provenance for databases of realistic size. Therefore, it is of interest to find efficient ways to approximate provenance, using either purely static analysis of the query, or combining static analysis with information about the dynamic behavior of the query that can be obtained efficiently (for example, by looking at the input and output values). A wide variety of techniques for automatic debugging, program slicing, and program analysis may be applicable to this problem. In this section we sketch a simple type-based analysis (following [19, Ch. 5]) which statically approximates the provenance of a query.

To see the idea of the analysis, first consider a naive approach which approximates the provenance of each part of the output type of a query as the set of global variables contributing to that part. This clearly captures all of the true dependences of an output value, but is very coarse-grained. For example, given the query Q in the introduction, every part of the output type {int × int} depends on both T and U. We have no way of telling which part of a table (e.g., which field values) an output value depends on.

Instead, therefore, we consider types annotated with region information, by analogy with type systems for region-based memory management [28]. We define annotations â and annotated types τ̂, ω̂ as follows:

  â ::= {x̄} | â ∪ â′ | α
  τ̂ ::= ω̂^â
  ω̂ ::= int | bool | τ̂₁ × τ̂₂ | {τ̂}

Here, regions represent sets of labels in the dynamic heap; region annotations may include region variables α. We write |τ̂| for the erasure of an annotated type, say that two annotated types are compatible if they erase to the same type, and define the union of two compatible annotated types, τ̂ ∪ τ̂′, as the annotated type resulting from taking the union of all of the corresponding annotations (e.g. int^{α₁} × bool^{β₁} ∪ int^{α₂} × bool^{β₂} = int^{α₁∪α₂} × bool^{β₁∪β₂}).

We write Γ̂ and ∆̂ for annotated contexts mapping global or local variables to annotated types. In particular, given an ordinary global context Γ, we generally assume the annotated context Γ̂ to be of the form x̄₁:ω̂₁^{{x̄₁}}, ..., x̄ₙ:ω̂ₙ^{{x̄ₙ}}, in which all of the annotations in the ω̂ᵢ are distinct region variables. We consider a type-and-effect-style system with judgments Γ̂; ∆̂ ⊢ e : ω̂ & â, meaning that in (annotated) contexts Γ̂, ∆̂, the term e has annotated type ω̂ and its approximate provenance is described by â. Figure 12 lists a few representative inference rules.

  x̄ : ω̂^{{x̄}} ∈ Γ̂               x : ω̂^â ∈ ∆̂
  ─────────────────────         ───────────────────        ───────────────────
  Γ̂; ∆̂ ⊢ x̄ : ω̂ & {x̄}          Γ̂; ∆̂ ⊢ x : ω̂ & â          Γ̂; ∆̂ ⊢ i : int & ∅

  Γ̂; ∆̂ ⊢ e : ω̂₁^{â₁} × ω̂₂^{â₂} & â          Γ̂; ∆̂ ⊢ e : {int^{â₁}} & â₂
  ─────────────────────────────────          ─────────────────────────────
  Γ̂; ∆̂ ⊢ πᵢ(e) : ω̂ᵢ & âᵢ                    Γ̂; ∆̂ ⊢ sum(e) : int & â₁ ∪ â₂

  Γ̂; ∆̂ ⊢ e₁ : ω̂₁ & â₁     Γ̂; ∆̂ ⊢ e₂ : ω̂₂ & â₂     |ω̂₁| = |ω̂₂|
  ──────────────────────────────────────────────────────────────
  Γ̂; ∆̂ ⊢ e₁ = e₂ : bool & â₁ ∪ â₂

  Γ̂; ∆̂ ⊢ e₀ : bool & â₀     Γ̂; ∆̂ ⊢ e₁ : ω̂₁ & â₁     Γ̂; ∆̂ ⊢ e₂ : ω̂₂ & â₂
  ─────────────────────────────────────────────────────────────────────────
  Γ̂; ∆̂ ⊢ if e₀ then e₁ else e₂ : ω̂₁ ∪ ω̂₂ & â₀ ∪ â₁ ∪ â₂

  Γ̂; ∆̂ ⊢ e₁ : {τ̂₁} & â₁     Γ̂; ∆̂ ⊢ e₂ : {τ̂₂} & â₂
  ──────────────────────────────────────────────────
  Γ̂; ∆̂ ⊢ e₁ ∪ e₂ : {τ̂₁ ∪ τ̂₂} & â₁ ∪ â₂

  Γ̂; ∆̂ ⊢ e₀ : {τ̂₀} & â₀     Γ̂; ∆̂, x:τ̂₀ ⊢ e : ω̂ & â
  ────────────────────────────────────────────────────
  Γ̂; ∆̂ ⊢ {e | x ∈ e₀} : {ω̂^â} & â ∪ â₀

Figure 12. Type system for approximate provenance: selected rules

As a simple example, a comprehension e = {π₁(z) | z ∈ x̄ ∪ ȳ} satisfies

  x̄ : {(int^{α₁} × int^{α₂})^α}^{{x̄}}, ȳ : {(int^{β₁} × int^{β₂})^β}^{{ȳ}}; · ⊢ e : {int^{α₁∪β₁}} & {x̄, ȳ}

This approximates {1^{l₁}, 2^{l₄}, 3^{l₇}}^{π₁(z) | z ∈ x̄ ∪ ȳ}, which is correct provenance for e with respect to

  x̄ ↦ {(1^{l₁}, 1^{l₂})^{l₃}, (2^{l₄}, 2^{l₅})^{l₆}}^{x̄}, ȳ ↦ {(3^{l₇}, 3^{l₈})^{l₉}}^{ȳ}

since α₁ resolves to {l₁, l₄} and β₁ to {l₇} in this heap.

More refined static analyses of provenance are also possible; for example, analyses that take control-flow information into account, or that approximate the possible why- or where-provenance expressions for a query. Other interesting possibilities would be to use the results of static analysis to support more efficient dynamic provenance tracking, or to draw inferences about the true provenance based on static analysis and the actual input and output values.

Although we believe this type system provides accurate approximations of provenance, we have not rigorously proved its correctness relative to the dynamic semantics. An alternative may be to prove dependency-correctness directly. It may also be possible to translate this type system into a more general system, such as the Dependency Core Calculus [1], thereby simplifying the proof of dependency correctness. We leave further study of static analysis for provenance and dependency tracking to future work.
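Before regions are introduced, the naive whole-table analysis sketched at the beginning of this section can be prototyped directly. The following OCaml program is an illustrative assumption about the shape of the core language (the constructor names and the deps function are not from the paper); it computes, for each expression, the set of global tables its output may depend on, with comprehension-bound variables inheriting the dependencies of their source:

```ocaml
module S = Set.Make (String)

(* A fragment of the core query language; constructor names are illustrative. *)
type exp =
  | Global of string            (* global table reference, e.g. T or U *)
  | Var of string               (* comprehension-bound variable *)
  | Int of int
  | Pair of exp * exp
  | Proj of int * exp           (* projection pi_i *)
  | Union of exp * exp
  | Comp of string * exp * exp  (* Comp (x, e0, e) represents {e | x in e0} *)

(* deps env e: over-approximates the set of globals that e depends on.
   env maps bound variables to the dependency set of their source. *)
let rec deps (env : (string * S.t) list) (e : exp) : S.t =
  match e with
  | Global x -> S.singleton x
  | Var x -> (match List.assoc_opt x env with Some s -> s | None -> S.empty)
  | Int _ -> S.empty
  | Pair (e1, e2) | Union (e1, e2) -> S.union (deps env e1) (deps env e2)
  | Proj (_, e1) -> deps env e1
  | Comp (x, e0, body) ->
      let a0 = deps env e0 in
      S.union a0 (deps ((x, a0) :: env) body)

(* The query {pi_1(z) | z in T ∪ U} depends on both T and U. *)
let q = Comp ("z", Union (Global "T", Global "U"), Proj (1, Var "z"))
let q_deps = deps [] q
```

On this query the analysis reports {T, U}, illustrating exactly the coarseness the region-annotated system is designed to overcome: a single dependency set per expression cannot say which fields of T and U actually matter.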
6. Future work

This paper describes work in progress. The previous section discussed several open questions and implementation issues. Many theoretical questions concerning expression-based provenance also
remain open. For example, we do not know whether it is decidable whether an annotated value is dependency-correct relative to an expression, or whether its annotations are optimal (minimum size). We are interested in investigating expressiveness properties, for example, generalizing the results of Buneman, Cheney and Vansummeren [9]. We would also like to study the relationship between our work and adaptive functional programming techniques [3] and bidirectional programming [15].
7. Conclusions

We have introduced a new approach to data provenance in which provenance consists of expressions explaining why and how each part of the output was computed from the input database. We have made this approach to provenance precise by formulating type-correctness, value-correctness, and dependency-correctness properties which such expressions should satisfy in order to be considered correct provenance information. Although similar criteria were present in previous approaches to provenance, our approach generalizes dependency-correctness to query languages with negation, aggregation, and grouping, none of which are handled well by previous techniques. We also developed an instrumented operational semantics that constructs dynamic provenance for a rich query language and proved its correctness. However, this technique may not scale: in particular, provenance records for comprehensions may require space proportional to the size of the database. In the last part of the paper, we discussed additional ways in which partial evaluation, metaprogramming, and static analysis techniques could be used to improve provenance tracking.
References

[1] Martín Abadi, Anindya Banerjee, Nevin Heintze, and Jon G. Riecke. A core calculus of dependency. In POPL ’99: Proceedings of the 26th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 147–160, New York, NY, USA, 1999. ACM Press.
[2] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison-Wesley, 1995.
[3] Umut A. Acar, Guy E. Blelloch, and Robert Harper. Adaptive functional programming. In POPL ’02: Proceedings of the 29th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, pages 247–259, New York, NY, USA, 2002. ACM Press.
[4] Tarun Arora, Raghu Ramakrishnan, William G. Roth, Praveen Seshadri, and Divesh Srivastava. Explaining program execution in deductive systems. In Deductive and Object-Oriented Databases, pages 101–119, 1993.
[5] Deepavali Bhagwat, Laura Chiticariu, Wang-Chiew Tan, and Gaurav Vijayvargiya. An annotation management system for relational databases. In VLDB, pages 900–911, 2004.
[6] R. Bose and J. Frew. Composing lineage metadata with XML for custom satellite-derived data products. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management (SSDBM 2004), pages 275–284. IEEE, 2004.
[7] Rajendra Bose and James Frew. Lineage retrieval for scientific data processing: a survey. ACM Comput. Surv., 37(1):1–28, 2005.
[8] Peter Buneman, Adriane P. Chapman, and James Cheney. Provenance management in curated databases. In Proceedings of the 2006 SIGMOD Conference on Management of Data, pages 539–550, Chicago, IL, 2006. ACM Press.
[9] Peter Buneman, James Cheney, and Stijn Vansummeren. On the expressiveness of implicit provenance in query and update languages. In ICDT 2007, 2007. To appear.
[10] Peter Buneman, Sanjeev Khanna, and Wang Chiew Tan. Why and where: A characterization of data provenance. In ICDT 2001, volume 1973 of LNCS, pages 316–330. Springer, 2001.
[11] Peter Buneman, Leonid Libkin, Dan Suciu, Val Tannen, and Limsoon Wong. Comprehension syntax. SIGMOD Record, 23(1):87–96, 1994.
[12] Peter Buneman, Shamim A. Naqvi, Val Tannen, and Limsoon Wong. Principles of programming with complex objects and collection types. Theor. Comp. Sci., 149(1):3–48, 1995.
[13] Yingwei Cui and Jennifer Widom. Lineage tracing for general data warehouse transformations. VLDB J., 12(1):41–58, 2003.
[14] Yingwei Cui, Jennifer Widom, and Janet L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., 25(2):179–227, 2000.
[15] J. Nathan Foster, Michael B. Greenwald, Jonathan T. Moore, Benjamin C. Pierce, and Alan Schmitt. Combinators for bi-directional tree transformations: A linguistic approach to the view update problem. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL), Long Beach, California, 2005.
[16] Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented grids. In OPODIS ’04, 2004.
[17] Kiran-Kumar Muniswamy-Reddy, David A. Holland, Uri Braun, and Margo Seltzer. Provenance-aware storage systems. In Proceedings of the 2006 USENIX Annual Technical Conference, pages 43–56, Boston, MA, June 2006. USENIX.
[18] Frank Neven, Jan Van den Bussche, Dirk Van Gucht, and Gottfried Vossen. Typed query languages for databases containing queries. Information Systems, 24(7):569–595, 1999.
[19] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer, second edition, 2005.
[20] Claudio Ochoa, Josep Silva, and Germán Vidal. Dynamic slicing based on redex trails. In PEPM ’04: Proceedings of the 2004 ACM SIGPLAN Symposium on Partial Evaluation and Semantics-based Program Manipulation, pages 123–134, New York, NY, USA, 2004. ACM Press.
[21] Abhik Roychoudhury, C. R. Ramakrishnan, and I. V. Ramakrishnan. Justifying proofs using memo tables. In Principles and Practice of Declarative Programming, pages 178–189, 2000.
[22] D. Saha and C. R. Ramakrishnan. A local algorithm for incremental evaluation of tabled logic programs. In S. Etalle and M. Truszczyński, editors, International Conference on Logic Programming, number 4079 in LNCS, pages 56–71. Springer, 2006.
[23] S. Bowers, T. McPhillips, B. Ludaescher, S. Cohen, and S. B. Davidson. A model for user-oriented data provenance in pipelined scientific workflows. In International Provenance and Annotation Workshop (IPAW), number 4145 in LNCS, pages 133–147. Springer, 2006.
[24] Josep Silva and Olaf Chitil. Combining algorithmic debugging and program slicing. In PPDP ’06: Proceedings of the 8th ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming, pages 157–166, New York, NY, USA, 2006. ACM Press.
[25] Yogesh Simmhan, Beth Plale, and Dennis Gannon. A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36, 2005.
[26] Walid Taha and Tim Sheard. MetaML and multi-stage programming with explicit annotations. Theoretical Computer Science, 248(1–2):211–242, 2000.
[27] Wang-Chiew Tan. Research problems in data provenance. IEEE Data Engineering Bulletin, 27(4):45–52, 2004.
[28] M. Tofte and J.-P. Talpin. Region-based memory management. Information and Computation, 132(2):109–176, 1997.
[29] Jan Van den Bussche, Dirk Van Gucht, and Gottfried Vossen. Reflective programming in the relational algebra. Journal of Computer and System Sciences, 52(3):537–549, 1996.
[30] Jan Van den Bussche, Stijn Vansummeren, and Gottfried Vossen. Towards practical meta-querying. Information Systems, 30(4):317–332, 2005.
[31] P. Wadler. Comprehending monads. Mathematical Structures in Computer Science, 2:461–493, 1992.
[32] Jennifer Widom. Trio: A system for integrated management of data, accuracy, and lineage. In CIDR, pages 262–276, 2005.
[33] A. Woodruff and M. Stonebraker. Supporting fine-grained data lineage in a database visualization environment. In ICDE 1997, pages 91–102, 1997.