
  

Information flow security for XML transformations

Véronique Benzaken, Marwan Burelle, and Giuseppe Castagna

LRI (CNRS), Université Paris-Sud, Orsay, France
CNRS, Département d'Informatique, École Normale Supérieure, Paris, France

Abstract. We provide a formal definition of information flows in XML transformations and, more generally, in the presence of type driven computations, and describe a sound technique to detect transformations that may leak private or confidential information. We also outline a general framework to check middleware-located information flows.



1 Introduction

XML is becoming the de facto standard document format for data on the Web. Its diffusion, however, is characterized by two correlated paradoxes:

1. Despite the increasing success of XML for exchanging and manipulating information on the Web, little attention has been paid to characterizing and analyzing information flows of XML transformations and, specifically, their security implications.
2. As shown by the standardization process, XML documents are intrinsically typed (cf. the notions of well-formedness and validity). Nevertheless, the "standard" programming languages used to manipulate them are essentially untyped, in the sense that even when they are equipped with a type system the latter does not use the type information of XML documents.

If we consider the self-descriptive nature of XML documents, these paradoxes are less surprising than they may seem: XML documents use tags to delimit some content, and these tags can be considered as type information about the content they delimit. Therefore, XML documents (even those that do not contain a DTD) are in some sense "self-typed" constructions, and this makes the definition of a type system for XML transformers difficult. Since a type system often constitutes a first basic step toward the definition of security analyses of transformations, this may partially explain the absence of formal tools to characterize insecure information flows.

Goal. The aim of this article is to define and characterize information flows in XML transformations in order to single out potentially insecure transformations, that is, transformations that may leak confidential or private information. To that end we study transformations defined in a typed programming language for XML documents. There are several candidates for such a language, since several attempts have been made in the literature to overcome the second of the XML paradoxes (e.g., HaXML [21], JWIG [3], Xtatic [11], XDuce [13], XQuery [8], and YATL [4]).

In this work we characterize and analyze information flows for transformations defined in CDuce [2]. There are several reasons to choose CDuce as target language for our study. First and foremost, unlike other XML-oriented languages, CDuce is general purpose, in that it provides besides XML types several other datatypes, which makes it possible to program general (even XML-unrelated) applications. Second, among the known languages for XML, it possesses the richest type algebra. Finally, its semantic and set theoretic foundations make it a good candidate for defining or hosting a declarative query language (see [2]) and, as such, it nicely fits our scenario of global queries on the Web.

Problems. We said that the difficulty of defining type systems for XML transformations resides in the self-typing nature of the documents. More precisely, this self-typing characteristic induces "type based" (or "type driven") computations: matching on document tags roughly corresponds to matching documents' types; similarly, producing elements with different tags corresponds to outputting results of different types. In some sense, typed XML transformations are akin to the application of typecase constructions where the different cases may return differently typed results (this is accounted for in CDuce by the use of dynamically bound overloaded functions). The presence of type driven computations makes the task of capturing information flows much harder and constitutes the main novelty and challenge of this study. In fact we cannot resort to classical data flow analyses, since they are usually applied to computational frameworks whose dynamic semantics does not strictly depend on run-time types. A second challenge is that information flows (more precisely, their absence) are usually characterized in terms of the so-called non-interference property [12]. Our study demonstrates that in the presence of type driven computations this notion must be modified so as to include static type knowledge, otherwise we end up with very trivial analyses. Finally, the last challenge is to define flow analysis for a pattern-matching based language (as CDuce is), since this raises at least two obstacles:

- pattern matching is a dynamic type-case, therefore we have to propagate type information in the subsequent matches;
- the use of a matching policy (first-match as in CDuce or best match as in XSLT) induces dependencies among the different components of an alternative pattern as well as among different cases of a pattern-matching expression, and this must be taken into account when characterizing information flows (the sole fact of knowing that a pattern did not match may produce a flow of information).

Contributions. The contributions of this article are essentially three:

1. it provides a formal definition and study of information flows in the context of XML transformations and, more generally, in the presence of type driven computations;
2. it describes a sound technique to detect XML document transformations that cause insecure information flows, and formally proves its correctness;
3. by defining security annotations and by relating various kinds of analyses (static/dynamic, sound/complete) to different query scenarios, it proposes a general framework for checking security of middleware-located information flows.

Example. The development of our presentation can be illustrated by an example. Consider the following XML document, which stores names and salaries of the workers of a fictitious company.

    <company>
      <worker>
        <surname>Durand</surname>
        <name>Paul</name>
        <salary>6500</salary>
      </worker>
      <worker>
        <surname>Dupond</surname>
        <name>Jean</name>
        <salary>1800</salary>
      </worker>
      <worker>
        <surname>Martin</surname>
        <name>Jules</name>
        <salary>1200</salary>
      </worker>
    </company>

We imagine that while generic users are allowed to perform queries on this document, the information about salaries must only be accessible to authorized users. Therefore we need a way to detect queries that may reveal information about salaries, in order to reject them when they are performed by unauthorized users. A first naive technique to obtain it would be to mark the salary elements and dynamically reject all queries that contain marks in their result. Unfortunately, this approach is clearly inadequate since the information about salaries can be deduced as follows: perform a query that returns the list of all workers whose salary is greater than n and then iterate the query by varying n until we obtain as many different results as workers.
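The inadequacy of the marking technique can be made concrete with a small sketch. The Python below is illustrative only: the data layout and the names `WORKERS`, `query_above`, and `infer_salary` are hypothetical, and a binary search on n is used here in place of the plain iteration described above. The point is that every query result contains names only, so a check for marked salary elements in the result would never fire.

```python
# Hypothetical stand-in for the example document: (surname, name, salary).
WORKERS = [
    ("Durand", "Paul", 6500),
    ("Dupond", "Jean", 1800),
    ("Martin", "Jules", 1200),
]


def query_above(n):
    """Workers whose salary is greater than n.

    The result holds (surname, name) pairs only: no salary element
    (and hence no mark) ever appears in it.
    """
    return [(s, name) for (s, name, salary) in WORKERS if salary > n]


def infer_salary(worker, low=0, high=1_000_000):
    """Recover a salary from name-only results by varying n.

    Invariant: low < salary <= high (assumes salaries lie in that range).
    """
    while high - low > 1:
        mid = (low + high) // 2
        if worker in query_above(mid):
            low = mid   # the worker is still listed, so salary > mid
        else:
            high = mid  # the worker disappeared, so salary <= mid
    return high
```

Under these assumptions, `infer_salary(("Durand", "Paul"))` reconstructs the confidential value although no query ever returned a salary element.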

A more effective solution is to reject all the queries whose result accesses the value of the salary elements. For example, consider the following two queries:

[Q1] Get the list of all workers
[Q2] Get the list of all workers whose salary is greater than 1600

The first query can always be safely executed, while the second one must be forbidden to unauthorized users. This can be obtained by enforcing an access control policy. For instance, this is done in [7,6] by executing a query on a view (in the database sense) obtained by pruning from the XML documents all data that the owner of the query does not have the right to access.

While enforcing access control is enough for simple policies like the above, it soon becomes inadequate with slightly more complicated policies. For instance, imagine that instead of forbidding access to salaries we want to allow queries owned by generic users to access salaries (e.g., for statistical purposes) but in a way that prevents these queries from associating a specific worker with her/his salary. This corresponds to rejecting all queries whose result depends both on the value of salary elements and on that of name or surname elements (but queries like Q1 or a query that returns all salaries are acceptable). To enforce this constraint we have to switch from an access analysis to a dependency (or information flow) analysis.³

Causal security policies, such as the one above, can be formalized by the notion of non-interference, which can be restated for XML documents as follows: a set of elements does not interfere with the result of a given query if for all possible contents of the elements the query always returns the same result. In our example, consider the set of all documents obtained from our XML document by replacing the content of salary elements by arbitrary numeric values. Query Q1 is interference-free since when it is applied to all these documents it always returns the same result. Query Q2 instead is not interference-free since its results may differ.

A precise definition of non-interference constitutes the first step of our approach since it defines the set of queries that are safe. The following step is to devise one or more techniques to determine the safety/unsafety of queries. To that end we first classify components that store confidential information by annotating data elements by labels of the form $\ell_t$. The $\ell$ intuitively represents a security classification of the information stored in the element (e.g., public or private, but it could be any label from a possibly unordered set) while $t$ is a type that describes the static information publicly available about the data's content (e.g., for salaries it records that the element stores an integer in a given range).⁴ Next we recast the notion of non-interference in terms of labeled elements, namely, we say that a transformation is free of interference from all elements labeled by $\ell_t$ if its result does not change when the content of the $\ell_t$-labeled elements varies over the type indicated in the label.

Our research plan consists of the definition of three different analyses to be used as in the scenario of Figure 1. According to it, an interactive query (that is, a query that was written to be executed just once) will first pass through a complete static analysis (that rejects transformations that are manifestly unsafe) and then through a sound dynamic analysis. Instead, programs that are expected to be used several times will pass through a cycle of sound static analysis (possibly preceded by a complete analysis) before being executed without any further dynamic check. In this paper we concentrate on the sound dynamic analysis for CDuce programs, that is, the grayed part of the figure.

³ While for this specific example it is still possible to resort to access control techniques (execute the query on two different views obtained by stripping in one all salaries and in the other all names and surnames), these techniques alone soon become insufficient, as shown by the example in Section 6.

⁴ We do not require the existence of any order on such labels (in our framework security policies will be expressed in terms of presence/absence rather than relative order of labels), therefore our labels must not be confused with the security labels used in multilevel security systems.
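The restated notion of non-interference can be checked by brute force on the running example. The following Python sketch is only an illustration under stated assumptions: documents are modelled as lists of (surname, name, salary) triples, `q1` and `q2` are hand-written stand-ins for queries Q1 and Q2 (with the threshold read as "at least 1600"), and `interference_free` (a hypothetical helper) enumerates a finite sample of the salary type instead of all its values.

```python
from itertools import product

# Public part of the example document; only salaries are confidential.
NAMES = [("Durand", "Paul"), ("Dupond", "Jean"), ("Martin", "Jules")]


def q1(doc):
    """Q1: the list of all workers."""
    return [(s, n) for (s, n, _salary) in doc]


def q2(doc):
    """Q2: the workers whose salary is at least 1600."""
    return [(s, n) for (s, n, salary) in doc if salary >= 1600]


def interference_free(query, salary_values):
    """True if the query returns the same result on every document obtained
    by letting each salary range over salary_values (a finite sample that
    stands for the type recorded in the label)."""
    results = set()
    for salaries in product(salary_values, repeat=len(NAMES)):
        doc = [(s, n, sal) for (s, n), sal in zip(NAMES, salaries)]
        results.add(tuple(query(doc)))
    return len(results) == 1
```

With salaries ranging over `range(1000, 3000, 500)`, `q1` passes the check and `q2` does not. If the publicly known type already guarantees salaries of at least 1600 (for instance `range(1600, 3000, 500)`), `q2` passes as well, since it then reveals nothing beyond the static type information.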

[Figure 1: boxes "Transformations ("one time")", "Programs/methods ("many times")", "Complete Static analysis", "Sound Static analysis", "Sound Dynamic analysis", and "Evaluation/Execution"; outcomes "error" and "reject".]

Fig. 1. Analysis scenarios

Outline. We start in Section 2 by a brief overview of the functional core of CDuce. In Section 3 we formally define the non-interference property for CDuce programs and introduce CDuce$_{\mathcal{L}}$, a conservative extension⁵ of CDuce in which expressions occurring in a program may appear labeled by security labels. Section 4 is the core of our work. It defines the dynamic analysis that detects interference free programs. The idea is to define an operational semantics for CDuce$_{\mathcal{L}}$ such that (i) it preserves the semantics of unlabeled programs and (ii) it ensures that whenever a label $\ell_t$ is absent from the final result of a program, then the program is free of interference from all expressions labeled by $\ell_t$. Thanks to these properties the analyzer has simply to label and run a transformation and to refuse to return the (unlabeled) result when this contains unauthorized labels. Of course the heart of the problem is label propagation in pattern matching, whose definition is made difficult by the type driven semantics, the presence of subtyping, and the use of a first-match policy. In Section 5 we prove that our analysis satisfies the aforementioned properties; this goes through proving that CDuce$_{\mathcal{L}}$ satisfies the subject-reduction property, that it preserves the semantics of CDuce, and that it constitutes a sound analysis for the non-interference property. The last two points present some technical difficulties (without any practical impact) due to the type system of CDuce, which does not satisfy the minimum typing property and induces in CDuce$_{\mathcal{L}}$ a nondeterministic semantics: thus the two properties must be proved to hold for all possible reductions of a program. In Section 6 we comment on a more significant example that illustrates some security policies that cannot be expressed in terms of access control. We conclude our presentation by sketching in Section 7 some research perspectives.

Related work. Security issues for XML have been addressed by several works but none of them tackles the problem of information flows. They either focus on access control (e.g., [10,7,6]) or on lower level security features such as encryption and digital signatures, for which commercial products are becoming available (e.g., [14]). For instance, Damiani et al. [7,6] detect accesses to confidential data by applying a static marking of the documents and by dynamically stripping off marked elements. In other words, they deal with access control (confidential data is accessed for computing the result) whereas our approach accounts for implicit flows (confidential information can be inferred from the result), like the detection of covert channels. The same holds true for the work of Gabillon and Bruno [10], where access control is performed by running queries on views of the XML documents dynamically generated by stripping off unauthorized data. Other works devise flow analyses for programming languages for XML (e.g., [3]) but these analyses are not developed to verify security properties.

The study presented here draws ideas from several sources. The dynamic propagation of labels was first introduced in Abadi et al. [1], where a dependency analysis for call-by-name $\lambda$-calculus is defined by extending the reduction semantics to labeled $\lambda$-terms. Although their work was not motivated by security reasons (they address optimization issues), what we describe here essentially adapts their technique to type driven reductions. Label propagation was subsequently used for security purposes in later works, for instance [5,15,16,17]. In particular, in [15] Myers and Liskov use labels for the same purposes as we do; however, their security model is defined for and relies on languages that explicitly manipulate labels, while in our approach and in that of Abadi et al., properties are stated for an unlabeled language and labels are introduced on top of it as a technique to identify (unlabeled) programs satisfying these properties. Finally, all the cited label-based approaches fundamentally differ from the study presented here in that they do not account for the type driven semantics (nor for the pattern matching) distinctive of XML transformations.

The presence of type driven computations precludes the use of classical definitions and detection techniques of non-interference (e.g., those of [19,20]), since in this case ignoring static type information would yield a far too weak definition of non-interference.

⁵ The extension is conservative with respect to both the type theory and the equational (reduction) theory.
Actually, our notion of non-interference differs from the classical one in that the latter usually relies on a hierarchical structuring of security levels (high-level inputs do not interfere with low-level outputs) while here we spot non-interference of single pieces of code (the value of some data does not interfere with the result of a query). This difference must be understood as the fact that we want to characterize the flows in a single transformation while classical non-interference rather applies to system-wide flows.

2 The CDuce language

CDuce is a functional programming language tailored to the manipulation of XML documents, for which it uses its own notation. The XML document in the previous section becomes in CDuce the following expression:

    <company>[
      <worker>[ <surname>"Durand" <name>"Paul" <salary>[6500] ]
      <worker>[ <surname>"Dupond" <name>"Jean" <salary>[1800] ]
      <worker>[ <surname>"Martin" <name>"Jules" <salary>[1200] ]
    ]

    type Company = <company>[Worker*]
    type Worker  = <worker>[Sname Name Salary]
    type Sname   = <surname>String
    type Name    = <name>String
    type Salary  = <salary>[Int]

The syntax is mostly self describing: tags are denoted by angle brackets and are followed by sequences. Sequences are delimited by square brackets and in this case are formed by other elements, but in general they may contain expressions of any type (note that some tags are followed by strings, as the latter are encoded in CDuce as sequences of characters). This expression has the type Company defined right below it. The types of sequences are defined by regular expressions on types. For example, the first type declaration states that a company is a sequence tagged by <company> and composed of zero or more worker elements. Had we defined the type of workers as follows:

    Worker2 = <worker …=?Bool>[ Sname Name Salary (Email | Tel)? ]

then worker elements would have an optional (as indicated by =?) boolean attribute and would list a last optional element that is either of type Tel or of type Email. Note also that Worker is a subtype of Worker2, since every value of the former type is also a value of the latter type. The queries defined in the previous section can be expressed as:

    [Q1:] let <…>x = mycompany in
          transform x with <worker>[ y z _ ] -> [ <…>[ y z ] ]

    [Q2:] type MoreThanMe = <salary>[1600--*]
          let <…>x = mycompany in
          transform x with <worker>[ y z MoreThanMe ] -> [ <…>[ y z ] ]

where mycompany is a variable that denotes our previous XML document. In the first query the let expression matches mycompany against a pattern that binds the variable x to the sequence of worker elements of mycompany. The transform expression applies to each element of a sequence a pattern that returns a sequence and then concatenates all the results (non-matching elements are simply discarded). In this case it transforms the sequence of Worker elements into a sequence of tagged elements containing just surname and name elements (y and z respectively capture the surname and name elements, while the _ pattern matches every value). Had we defined the type Company as <company>[(Worker|Client)*], that is, a heterogeneous sequence of workers and clients, then Q1 would still have returned the same result, as transform would discard client elements. The query Q2 is similar, with the only difference that the transform pattern matches only if the salary element has type MoreThanMe, that is, if its content is an integer greater than or equal to 1600.⁶

A detailed presentation of CDuce is out of the scope of this paper. The interested reader can refer to [2] and try the interactive prototype at www.cduce.org. For the purpose of this work it will suffice to say that all CDuce constructions are encoded in the core language defined in [9]. Its types are built from basic types (Int, Char, ...) and from the empty type $\mathbb{0}$ and the top type $\mathbb{1}$ which, together with arrows and products, form the atoms of arbitrary boolean combinations (negation, union, and intersection types) and of recursive types.⁷ The expressions of the language are:







$$e ::= c \mid x \mid \mathtt{fun}\ f^{(s_1 \to t_1; \ldots; s_n \to t_n)}(x).e \mid e_1 e_2 \mid (e_1, e_2) \mid \mathtt{match}\ e\ \mathtt{with}\ p_1 \to e_1\ \texttt{|}\ p_2 \to e_2$$

The syntax is essentially that of a higher-order functional language with constants, pairs and pattern-matching. What distinguishes it from similar languages is the definition of patterns (given later) and of functions. In the latter, note that the name of a function (which may appear in its body, as functions can be recursive) is annotated by a non-empty list of arrow types (this list is called the interface of the function). The presence of more than one arrow type declares the overloaded nature of the function, which has all these types, thus their intersection as well (see the typing rule for functions later on).

⁶ The results of the two queries are the central and right expressions in Figure 4, from which we erase all the label annotations.

⁷ The distinction between atoms and combinations is introduced in order to avoid meaningless recursive types such as [...].

The operational semantics is defined in terms of the set $\mathcal{V}$ of values, ranged over by $v$ and formed by all closed and well-typed CDuce expressions of the form

$$v ::= c \mid (v, v) \mid \mathtt{fun}\ f^{(s_1 \to t_1; \ldots; s_n \to t_n)}(x).e$$

The semantics is given by the reduction relation $\leadsto$, which is defined as follows:

$$e_1\, v \leadsto e[v/x;\ e_1/f] \qquad \text{if } e_1 = \mathtt{fun}\ f^{(\ldots)}(x).e$$
$$\mathtt{match}\ v\ \mathtt{with}\ p_1 \to e_1\ \texttt{|}\ p_2 \to e_2 \leadsto e_1[v/p_1] \qquad \text{if } v/p_1 \neq \Omega$$
$$\mathtt{match}\ v\ \mathtt{with}\ p_1 \to e_1\ \texttt{|}\ p_2 \to e_2 \leadsto e_2[v/p_2] \qquad \text{if } v/p_1 = \Omega$$

where $\Omega$ denotes the matching failure and $e[\sigma]$ is the expression obtained by applying the substitution $\sigma$ to $e$. The first rule is standard $\beta$-reduction for recursive functions. The second rule states that if the first pattern matches, then the expression of the first branch is executed after having applied to it the substitution $v/p_1$ of the variables captured by $p_1$ when matching $v$. The third rule states that the same happens for the second branch when the first pattern fails and the second matches (the static type system of CDuce ensures that the patterns cannot both fail). Reductions can take place in any evaluation context. A context, denoted by $C[\,]$, is an expression with a hole substituted for a sub-expression. An evaluation context, denoted by $E[\,]$, is a context whose hole is not in the scope of an abstraction or of a pattern and which induces a (leftmost) evaluation order on applications and pairs. Formally, $e \leadsto e'$ implies $E[e] \leadsto E[e']$, where $E[\,]$ is defined as

$$E[\,] ::= [\,] \mid E[\,]\, e \mid v\, E[\,] \mid (E[\,], e) \mid (v, E[\,]) \mid \mathtt{match}\ E[\,]\ \mathtt{with}\ p_1 \to e_1\ \texttt{|}\ p_2 \to e_2$$

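As anticipated, key definitions for CDuce are those of pattern and pattern matching. A regular tree built from the following grammar

$$p ::= x \mid t \mid p \wedge p \mid p\ \texttt{|}\ p \mid (p, p) \mid (x := c)$$

is a pattern if (i) on every infinite branch of it there are infinitely many occurrences of the pair constructor⁸, (ii) patterns in an alternative capture the same variables, and (iii) patterns in a conjunction capture pairwise distinct variables. The semantics of patterns is defined by the matching operation: the matching of a value $v$ against a pattern $p$ is denoted by $v/p$ and returns either $\Omega$ (matching failure) or a substitution from the variables of $p$ into $\mathcal{V}$:

$$v/t = \{\} \text{ if } \vdash v : t, \qquad v/t = \Omega \text{ otherwise}$$
$$v/x = \{x \mapsto v\}$$
$$v/(x := c) = \{x \mapsto c\}$$
$$v/(p_1 \wedge p_2) = v/p_1 \otimes v/p_2$$
$$v/(p_1\ \texttt{|}\ p_2) = v/p_1 \text{ if } v/p_1 \neq \Omega, \qquad v/(p_1\ \texttt{|}\ p_2) = v/p_2 \text{ if } v/p_1 = \Omega$$
$$v/(p_1, p_2) = v_1/p_1 \otimes v_2/p_2 \text{ if } v = (v_1, v_2), \qquad v/(p_1, p_2) = \Omega \text{ if } v \text{ is not a pair}$$

where $\gamma_1 \otimes \gamma_2$ is $\Omega$ when $\gamma_1 = \Omega$ or $\gamma_2 = \Omega$, and otherwise is the substitution $\gamma \in \mathcal{V}^{\mathrm{Dom}(\gamma_1) \cup \mathrm{Dom}(\gamma_2)}$ defined as $\gamma(x) = \gamma_1(x)$ when $x \in \mathrm{Dom}(\gamma_1) \setminus \mathrm{Dom}(\gamma_2)$, as $\gamma(x) = \gamma_2(x)$ when $x \in \mathrm{Dom}(\gamma_2) \setminus \mathrm{Dom}(\gamma_1)$, and as $\gamma(x) = (\gamma_1(x), \gamma_2(x))$ when $x \in \mathrm{Dom}(\gamma_1) \cap \mathrm{Dom}(\gamma_2)$.

The semantics of patterns is rather intuitive. There are two possible causes of failure for a pattern matching: a type constraint which is not satisfied by the matched value, or a pair pattern applied to a value that is not a pair. The alternative pattern has a first-match policy: the object is matched against $p_2$ if and only if the matching against $p_1$ fails. When a variable $x$ appears on both sides of a pair pattern, the two captured elements are paired together as expressed by the third case of the definition of $\otimes$. The default-value pattern $(x := c)$ usually appears in the right-hand side of an alternative pattern to return a default value when the left-hand side fails. The combination of multiply-occurring variables and default-value patterns within recursive patterns allows very expressive captures: for instance, the recursive pattern $p = (x \wedge \mathtt{Int},\, p)\ \texttt{|}\ (\_,\, p)\ \texttt{|}\ (x := \mathtt{nil})$ binds $x$ to the sublist of all integers occurring in a heterogeneous list (encoded à la lisp) when this is matched against it. On the whole, all a pattern does is to decompose values in subcomponents that are bound to variables and/or matched against types, whence the type driven computation.

⁸ Infinite trees represent recursive patterns, while the condition on infinite branches endows the set of patterns with a well-founded order that ensures termination for the algorithms of typing and matching.

$$\textit{(var)}\quad \Gamma \vdash x : \Gamma(x) \qquad\qquad \textit{(pair)}\quad \dfrac{\Gamma \vdash e_1 : t_1 \qquad \Gamma \vdash e_2 : t_2}{\Gamma \vdash (e_1, e_2) : t_1 \times t_2}$$

$$\textit{(appl)}\quad \dfrac{\Gamma \vdash e_1 : t \qquad \Gamma \vdash e_2 : s}{\Gamma \vdash e_1 e_2 : t \bullet s} \qquad\qquad \textit{(subsum)}\quad \dfrac{\Gamma \vdash e : s \qquad s \le t}{\Gamma \vdash e : t}$$

$$\textit{(match)}\quad \dfrac{\Gamma \vdash e : t \le \lbag p_1 \rbag \vee \lbag p_2 \rbag \qquad t_1 \equiv t \wedge \lbag p_1 \rbag \qquad t_2 \equiv t \wedge \neg \lbag p_1 \rbag \qquad \Gamma, (t_i/p_i) \vdash e_i : s_i \ \ (\text{for } i = 1, 2)}{\Gamma \vdash \mathtt{match}\ e\ \mathtt{with}\ p_1 \to e_1\ \texttt{|}\ p_2 \to e_2 : \bigvee_{\{i \,\mid\, t_i \not\simeq \mathbb{0}\}} s_i}$$

$$\textit{(abstr)}\quad \dfrac{t \equiv \bigwedge_{i=1..n} (s_i \to t_i) \qquad (\forall i)\ \ \Gamma, (f : t), (x : s_i) \vdash e : t_i}{\Gamma \vdash \mathtt{fun}\ f^{(s_1 \to t_1; \ldots; s_n \to t_n)}(x).e : t}$$

Fig. 2. Typing rules

Values are also at the basis of the definition of subtyping. Indeed, the key characteristic of CDuce (from which all others derive) is that the subtyping relation is defined semantically via a set theoretic interpretation of the types. This interpretation is very easy to define: a type is interpreted as the set of all values that have that type. So for example $[\![t_1 \times t_2]\!]$ is the set of all expressions $(v_1, v_2)$ where $v_i$ is a value of type $t_i$; $[\![s \to t]\!]$ is the set of all closed functional expressions $\mathtt{fun}\ f^{(\ldots)}(x).e$ that when applied to a value of type $s$ return a result in $[\![t]\!]$; union, intersection, and negation of types have the usual set theoretic meaning. This (semantic) interpretation is used to define the (syntactic) subtyping relation: a type $s$ is a subtype of $t$ if the interpretation of $s$ is contained in the interpretation of $t$.

Typing rules for CDuce are summarized in Figure 2. For a more detailed explanation the reader can refer to [9]. Rules (var), (pair), and (subsum) are standard.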
The rule (appl) is less common, since one usually expects the type $t$ of the function to be of the form $s \to u$ and the rule to infer for the application the type $u$ provided that the argument has type $s$. However, because of the presence of boolean combinations, $t$ could be for example of the form [...]. To that end we introduce a partial binary operator on types, $\bullet$. Intuitively, $t \bullet s$ denotes the least type (if it exists) such that $t \le s \to (t \bullet s)$, and the rule (appl) does not fail only if the operator is defined (in practice, we subsume $t$ to the least arrow type with domain $s$). So, for example, if our function of type [...] is applied to an expression of type [...], then the type system will infer that the result of the application has type [...]. Similarly we introduce two projection operators $\pi_1$ and $\pi_2$ such that $\pi_1(t)$ (respectively $\pi_2(t)$) is the least type that makes $t \le \pi_1(t) \times \mathbb{1}$ (respectively $t \le \mathbb{1} \times \pi_2(t)$) hold. It is important to note that $\bullet$, $\pi_1$, and $\pi_2$ are all computable (once more see [9]).

Rule (abstr) is not very standard, either. It states that a (possibly overloaded) function is typed by the intersection of all the types declared in its interface: we repeat the check for each type in the interface and handle the typing of recursive calls by recording for $f$ the expected type. Actually the rule in Figure 2 is a simplification of the CDuce rule in [9]; the latter is very technical and makes the minimum typing property fail (so that there does not exist a canonical type representing the set of all types of a given expression). However, we will see that this does not affect our analysis and, therefore, that the rule defined in Figure 2 is enough.

Finally, the rule (match) states that the type of a matching operation is the union of the types of the branches⁹, and for that it uses several auxiliary notations: $\equiv$ denotes syntactic equivalence of types whereas $\simeq$ denotes the semantic one ($s \simeq t$ only if $s$ and $t$ denote the same set of values); $(t/p)$ denotes the type environment that assigns types to the capture variables of $p$ when this is matched against a value of type $t$; $\lbag p \rbag$ denotes the type accepted by $p$, that is, the type whose interpretation is the set of all values for which $p$ does not fail: $[\![\lbag p \rbag]\!] = \{v \in \mathcal{V} \mid v/p \neq \Omega\}$. Again, $\lbag p \rbag$ is computable.

The type system is sound since it satisfies the subject reduction property: if $\vdash e : t$ and $e \leadsto^* e'$, then $\vdash e' : t$ (where $\leadsto^*$ means "reduces in zero or more steps").
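The matching operation of this section (failure on a type constraint or on a non-pair, first-match alternatives, default values, and pairing of variables captured on both sides of a pair pattern) can be sketched in a few lines. The Python below is an illustration under explicit assumptions and not the CDuce implementation: patterns are encoded as tagged tuples, types are approximated by Python predicates on values, a recursive pattern is encoded as a thunk tagged "rec", and all names (`FAIL`, `combine`, `match`, `INT_SUBLIST`) are hypothetical.

```python
FAIL = None  # stands for the matching failure


def combine(g1, g2):
    """Merge two match results: failure propagates, and a variable bound on
    both sides is bound to the pair of the two captured values."""
    if g1 is FAIL or g2 is FAIL:
        return FAIL
    merged = {}
    for x in g1.keys() | g2.keys():
        if x in g1 and x in g2:
            merged[x] = (g1[x], g2[x])
        else:
            merged[x] = g1[x] if x in g1 else g2[x]
    return merged


def match(v, p):
    """Match value v against pattern p; return a substitution (dict) or FAIL."""
    kind = p[0]
    if kind == "type":      # type constraint, modelled as a predicate
        return {} if p[1](v) else FAIL
    if kind == "var":       # capture variable
        return {p[1]: v}
    if kind == "default":   # default-value pattern (x := c)
        return {p[1]: p[2]}
    if kind == "and":       # conjunction
        return combine(match(v, p[1]), match(v, p[2]))
    if kind == "or":        # alternative with first-match policy
        first = match(v, p[1])
        return first if first is not FAIL else match(v, p[2])
    if kind == "pair":      # pair pattern fails on non-pairs
        if not (isinstance(v, tuple) and len(v) == 2):
            return FAIL
        return combine(match(v[0], p[1]), match(v[1], p[2]))
    if kind == "rec":       # recursive pattern, unfolded lazily
        return match(v, p[1]())
    raise ValueError(f"unknown pattern kind: {kind}")


def is_int(v):
    return isinstance(v, int)


def anything(v):
    return True


# Encoding of the recursive pattern that collects the integers of a list:
# (x & Int, p) | (_, p) | (x := nil)
INT_SUBLIST = ("rec", lambda: (
    "or",
    ("pair", ("and", ("var", "x"), ("type", is_int)), INT_SUBLIST),
    ("or",
     ("pair", ("type", anything), INT_SUBLIST),
     ("default", "x", "nil")),
))
```

For the lisp-style list `(1, ("a", (2, "nil")))`, `match(..., INT_SUBLIST)` binds `x` to `(1, (2, "nil"))`, that is, the sublist of integers. The second alternative is tried only because the first one failed on `"a"`, which is the kind of dependency between branches that a flow analysis has to account for.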





3 Non-interference and labels

The first step toward the definition of our security analysis is the characterization of information flows. In general, these are difficult to define and therefore it is customary to rather characterize their absence via the non-interference property: given an expression $e$, a sub-expression $e'$ occurring in $e$ does not interfere with $e$ if and only if whenever $e$ returns some result, then also every expression obtained from $e$ by replacing $e'$ by a different expression yields the same result. Note that this (informal) definition does not involve types and because of that it is unsuitable to describe non-interference for type driven semantics. A definition of non-interference for CDuce transformations must take types into account. Types have been widely used in previous work on language-based information-flow security (see [18] for a very broad review) and non-interference definitions have been given for typed languages. What is new in our framework is that the essence of the definition of non-interference relies on types. In particular, changing the type of an expression occurrence in a program can change the non-interference properties of that expression occurrence. Therefore our definition of non-interference is stated with respect to a type $t$:

Definition 1 (Occurrence). Let $e$ be a CDuce expression. An occurrence $\Delta$ of $e$ is a (root starting) path of the abstract syntax tree of $e$. If $\Delta$ is an occurrence of $e$, we use $e_{|\Delta}$ to denote the sub-expression of $e$ occurring at $\Delta$, and $e[\,]_\Delta$ to denote the context obtained from $e$ by inserting a hole at $\Delta$. Thus $e = e[e_{|\Delta}]_\Delta$.

Definition 2 (Non-interference w.r.t. $t$). Let $e$ be a CDuce expression, $\Delta$ an occurrence of $e$, and $t$ a type. The occurrence $\Delta$ does not interfere in $e$ with respect to $t$ if and only if for all CDuce values $v$ and $v'$, if $\vdash v : t$ and $e \leadsto^* v'$ then $e[v]_\Delta \leadsto^* v'$.

⁹ Precisely, of the branches that have a chance to match, that is, those for which $t_i \not\simeq \mathbb{0}$. This distinction matters when typing overloaded functions: see [9].

Let $e$ be the application of a query to an XML document that stores some information of type $t$ in an element located at $\Delta$. We want to define when the query allows one to infer information about $e_{|\Delta}$ more precise than its type. The definition above states that the information stored in $\Delta$ is not disclosed if storing any value $v$ of type $t$ in $\Delta$ does not change the result of the query. The definition is reasonable as it simply encompasses the static knowledge about the type of an occurrence. For instance, in our example it states that a query is safe (i.e., interference free) if and only if it cannot distinguish two documents that contain two different integers in corresponding salary elements. Similarly, a query that distinguishes documents with integer salaries from documents in which salary elements contain boolean values is safe, too¹⁰: this is reasonable to the extent that such a test cannot in practice be performed, as it would contradict the static knowledge we (or an attacker) have of the document. This new definition also induces a very nice interpretation of security, according to which a transformation is safe if it does not reveal (about confidential data) any information that is not already statically known.

A last, more technical point. In Definition 2 we tested non-interferent occurrences against values rather than against generic expressions. This simplifies the definition inasmuch as we do not have to take into account the typing and the possible capture of variables that are free in the expressions inserted in the context. However, this does not undermine the generality of our definition, as shown by Proposition 1:

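Definition 3 ($(\Gamma', t' \triangleright \Gamma, t)$-contexts). Let $\Gamma$ and $\Gamma'$ be type environments, $t$ and $t'$ types, and $C[\,]$ a context. $C[\,]$ is a $(\Gamma', t' \triangleright \Gamma, t)$-context if $\Gamma \vdash C[e] : t$ is provable under the hypothesis that $\Gamma'' \vdash e : t'$ holds for every extension $\Gamma''$ of $\Gamma'$.

Proposition 1 (Generality). Let $e$ be a CDuce expression such that $\Gamma \vdash e : t$. Let $\Delta$ be an occurrence that does not interfere in $e$ w.r.t. $t'$. Then for all $e'$, $\Gamma'$, and $v$, if $\Gamma' \vdash e' : t'$, $e \to^* v$, and $e[\,]_\Delta$ is a $(\Gamma', t' \triangleright \Gamma, t)$-context, then $e[e']_\Delta \to^* v$.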





Our next step consists in defining a way to identify occurrences. This can be easily obtained by marking CDuce expressions by security labels. Thus we define CDuce$_\mathcal{L}$, obtained from CDuce by adding expressions of the form “$\ell_t : e$”, where $t$ is a type and $\ell$ is a metavariable ranging over a set $\mathcal{L}$ of labels. As anticipated, labels are indexed by types that record the static knowledge of the expression at issue. The type is used to verify non-interference of the expression: the expression is considered not to interfere with a given computation if making the labeled occurrence vary over the values of the type specified by the label does not affect the final result of the computation. Of course, this is sensible only if making the occurrence vary over such a type yields well-typed expressions. In other terms, as suggested by Proposition 1, the type indexing a label must be a viable type for the expression marked by the label. In order to formalize this property we endow CDuce$_\mathcal{L}$ with a type system formed by all the typing rules of CDuce, summarized in Figure 2, plus the following one:

$$\frac{\Gamma \vdash e : t}{\Gamma \vdash (\ell_t : e) : t}$$

By definition a CDuce$_\mathcal{L}$ expression is a CDuce expression where some subexpressions are marked by a list of labels. We call strip the function that transforms the former into the latter by erasing all the lists of labels, and we write $\mathit{strip}(e)$ for its result on $e$. Technically, we consider the syntax of CDuce$_\mathcal{L}$ as a description of a decoration of the syntax tree of $e$, in the sense that the (lists of) labels are not nodes of the syntax trees but tags marking these nodes (see footnote 11). The reason for this choice is that in this way the same path $\Delta$ denotes an occurrence of a CDuce$_\mathcal{L}$ expression and the corresponding occurrence in the stripped expression. We thus avoid defining complex mappings between occurrences of expressions in CDuce$_\mathcal{L}$ and the corresponding ones in the stripped version. In particular this makes the following property hold:

Proposition 2. Let $e$ be a CDuce$_\mathcal{L}$ term with an occurrence $\Delta$; then $\mathit{strip}(e|_\Delta) = \mathit{strip}(e)|_\Delta$ and $\mathit{strip}(e[\,]_\Delta) = \mathit{strip}(e)[\,]_\Delta$,

which yields a simpler statement of the non-interference theorem (Theorem 2).

10 More interestingly, had we defined salaries to be of type [1600--*], then also the query Q2 would be safe (i.e. interference free).
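To make Definition 2 concrete, the following is a minimal sketch in Python rather than CDuce. It assumes that a document is a flat dictionary, that an occurrence is a dictionary key, and that a type is approximated by a finite sample of its values; `q1` and `q2` are hypothetical stand-ins for the queries of Section 1, and the threshold 1500 is an assumption made only for illustration. Because the definition quantifies over all values of the type, a finite sample can refute non-interference but cannot prove it unless the type itself is finite.

```python
from typing import Any, Callable, Dict, Iterable

Document = Dict[str, Any]


def does_not_interfere(query: Callable[[Document], Any],
                       document: Document,
                       occurrence: str,
                       type_values: Iterable[Any]) -> bool:
    """Check Definition 2 on a finite sample of the type.

    The occurrence does not interfere if replacing its content by any
    value of the given type leaves the result of the query unchanged.
    """
    baseline = query(document)
    return all(query({**document, occurrence: v}) == baseline
               for v in type_values)


def q1(d: Document):
    # Hypothetical stand-in: returns surname and name, never reads the salary.
    return (d["surname"], d["name"])


def q2(d: Document):
    # Hypothetical stand-in: selects the entry only above an assumed threshold.
    return (d["surname"], d["name"]) if d["salary"] > 1500 else None


doc = {"surname": "Durand", "name": "Paul", "salary": 6500}
INT_SAMPLE = range(0, 10001, 100)      # sample of the type Int
HIGH_SAMPLE = range(1600, 10001, 100)  # sample of the type 1600--*

print(does_not_interfere(q1, doc, "salary", INT_SAMPLE))
print(does_not_interfere(q2, doc, "salary", INT_SAMPLE))
print(does_not_interfere(q2, doc, "salary", HIGH_SAMPLE))
```

The last call illustrates the role of the type in the definition: with respect to the narrower type, the stand-in for Q2 can no longer distinguish any two admissible salaries, which mirrors the remark of footnote 10.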

4 Security analysis

Our analysis is defined by endowing CDuce$_\mathcal{L}$ with an operational semantics defined so that (i) it preserves CDuce semantics on stripped terms (if $e \to e'$ then $\mathit{strip}(e) \to^* \mathit{strip}(e')$) and (ii) label propagation provides a sound non-interference analysis for CDuce expressions (i.e., if a label is not propagated, then the labeled expression does not interfere with the result). It is not difficult to define a trivially sound analysis: just propagate all labels (see footnote 12). The hard task is to define a fine-grained analysis that propagates as few labels as possible. The core of the definition is label propagation in pattern matching, where several kinds of information flows are possible:

- Direct (or “explicit”) information flows, due to the binding of labeled expressions to variables, such as in

    match e with [ x y [z] ] ➔ ... z ... | ...

  where the value of the salary is bound to z and flows through it into the first branch expression.

- Indirect (or “implicit”) information flows due to patterns: the information flows by the deconstruction of the matched value and/or the satisfaction of type constraints, such as in

    match e with [ x y [1--900] ] ➔ e_1 | ...

  where the information of the range of the salary flows into e_1.

- Indirect information flows due to the use of a first match policy: information can be acquired from the failure of the previous branch:

    match e with [ x y [1--900] ] ➔ e_1 | _ ➔ e_2

  where the information that the salary value is not in (1--900) flows into e_2.

The last example also shows that non-interference is undecidable, as the non-interference of the value of the salary element is equivalent to deciding the equivalence of e_1 and e_2 (CDuce is Turing-complete).

Once we have established whether a label must be propagated or not, we face the problem of deciding which type to decorate it with so as not to infringe subject reduction. Consider the example of the application of a labeled function value to another value: $(\ell_{t \to s} : v_1)\, v_2$. Surely the function, seen as a piece of data, interferes with the computation. Therefore its label must be propagated. A sound solution is to reduce the application to $\ell_{s'} : (v_1 v_2)$, for a suitable choice of $s'$. Here the choice for $s'$ is easy: the type system ensures that $v_1 v_2$ is of type $s$, therefore it suffices to take $s' = s$. When the type of the label is not an arrow but a generic type $t$, then we resort to the $\bullet$ operator and set $s'$ equal to $t \bullet t_2$ for some $t_2$ such that $\vdash v_2 : t_2$.

11 More formally we consider that a CDuce$_\mathcal{L}$ term denotes a pair formed by a CDuce term and a map from the (root starting) paths of the abstract syntax tree of the term to possibly empty lists of labels.

12 The analogous trivially complete analysis is the one that does not propagate any label.

[Figure 3: the defining equations of the label-propagation operator, typeset as a multi-column array, are not recoverable from this copy; each case is described in the text that follows.]

Fig. 3. Propagating labels with patterns

Formally, we start by (i) defining CDuce$_\mathcal{L}$ values, which are obtained by adding to the definition of CDuce values the production $v ::= \ell_t : v$; (ii) defining the semantics of $v/p$ for the new values in the following case: $(\ell_t : v)/p = v/p$; and (iii) adding to evaluation contexts the production $\ell_t : [\,]$. Note however that we do not define a pattern for labels, as we want labels to describe information flow, rather than to affect it.

Next we define the operator “$/\!\!/$” that calculates the set of labels that must be propagated at pattern matching. More precisely, $v /\!\!/ p$ denotes the list of all labels that must be propagated when matching expression $v$ against pattern $p$, and is defined in Figure 3 (where “@” and “::” are the “append” and “cons” list operators, respectively). This definition forms the core of our analysis, therefore it is worth explaining it in some detail. Let $\mathit{lab}(v)$ denote the set of labels occurring in $v$; then all the labels in $v /\!\!/ p$ are contained in $\mathit{lab}(v)$. The idea is to collect in $v /\!\!/ p$ all the labels that mark expressions whose value may affect the result of the matching. Intuitively, the definition of “$/\!\!/$” must ensure that if a label $\ell$ in $\mathit{lab}(v)$ is not in $v /\!\!/ p$, then for all $v'$ obtained by substituting in $v$ some $\ell$-labeled occurrences by “admissible” values (i.e. values that respect the type constraints indexing $\ell$), the match $v'/p$ either always succeeds or always fails (see footnote 13).

The first two cases in Figure 3 are the easiest ones insofar as variables and default-value patterns always succeed; therefore no label needs to be propagated. Matching a constant is also simple as there is no label to propagate. Note, however, that we use an index to distinguish two different cases of non-propagation: the case in which labels are not propagated because the pattern always fails ($[\,]_{\mathsf{fail}}$) and the case in which they are not propagated because the pattern always succeeds ($[\,]_{\mathsf{ok}}$). When we match a labeled value $\ell_{t'} : v$ against a pattern $p$ and the type $t'$ of the label does not intersect the type accepted by $p$, then we know that making $v$ vary over $t'$ always fails. The case for a type constraint pattern $t$ is the key one, as CDuce's pattern matching is nothing but a highly sophisticated type-case with capture variables: if $t \wedge t' \simeq \mathbb{0}$, then we are in the previous case, so the match always fails; if $t' \leq t$, then for every $v'$ of type $t'$ the match $v'/t$ always succeeds, so no label is propagated; otherwise, for all $v'$ in the intersection of $t$ and $t'$ the match $v'/t$ succeeds, while for $v'$ in their difference it fails, thus we propagate $\ell$ and check for other labels.

When the matched value is a pair, we propagate the union of the labels propagated by matching each sub-value against the corresponding projection of $t$, provided that (i) the projections are defined, because otherwise we are matching a pair against a type with no product among its super-types and the pattern always fails, and (ii) neither of the two sub-matchings fails, because the whole pattern would then fail and it would be pointless to propagate the labels of the other component. Note that this case uses the distinction between $[\,]_{\mathsf{ok}}$ and $[\,]_{\mathsf{fail}}$, the other uses being in conjunction and pair patterns (which fail if any sub-check fails) and in disjunction patterns (which fail if both sub-checks fail).

Since there is no pattern that deconstructs functions, these can be soundly matched only against types (or captured). Note however that the type of a function value is fixed by its interface. Therefore even if we make labeled expressions occurring in the function vary, this does not affect the type of the function, and hence the result of pattern matching. For this reason $\mathtt{fun}\ f^{(t_1 \to s_1; \ldots; t_n \to s_n)}(x).e \mathbin{/\!\!/} t$ is dealt with in the same way as constant values. The cases of conjunction patterns as well as those for pair patterns need no particular insight.

13 However, this is not enough to ensure non-interference: for instance consider a match on a Bool-labeled value with one branch for true and one for false, which always succeeds. (The exact expression of this example is not legible in this copy.)
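The three outcomes described above for matching a labeled value against a type constraint pattern can be sketched as follows. This is an illustrative Python model, not the definition of Figure 3: types are approximated by finite sets of values, only the outermost label is considered (the search for further labels inside the value is omitted), and the names `propagate_on_type`, `INT`, `LOW_SALARY` and `STRING` are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

# Result of the propagation: the labels to propagate, plus an index that is
# only meaningful for an empty list ("ok" = the match always succeeds,
# "fail" = the match always fails).
Propagation = Tuple[List[str], Optional[str]]


def propagate_on_type(label: str,
                      label_type: FrozenSet,
                      pattern_type: FrozenSet) -> Propagation:
    """Decide whether a label must be propagated when its value is matched
    against a type constraint pattern (types modeled as finite sets)."""
    if not (label_type & pattern_type):
        # Disjoint types: every admissible value fails the match.
        return ([], "fail")
    if label_type <= pattern_type:
        # Label type included in pattern type: every admissible value matches.
        return ([], "ok")
    # Some admissible values match and some do not: the outcome depends on
    # the labeled value, so the label is propagated.
    return ([label], None)


INT = frozenset(range(0, 10001))        # finite stand-in for Int
LOW_SALARY = frozenset(range(1, 901))   # the interval 1--900
STRING = frozenset({"Durand", "Paul"})  # finite stand-in for String

print(propagate_on_type("salary", INT, LOW_SALARY))
print(propagate_on_type("salary", INT, INT))
print(propagate_on_type("salary", INT, STRING))
```

Keeping the "ok"/"fail" index alongside the empty list is what later lets pair, conjunction and disjunction cases decide whether the whole match is already determined.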
It may be worth noticing that in the first case of pair patterns $\ell$ is propagated also when the pattern always succeeds, as the simple fact of deconstructing the pair may yield interference (see footnote 14). The cases for alternative patterns are more interesting as they take into account the use of the first match policy. In particular in the first two equations for alternative patterns, propagation is calculated only on $p_1$ or $p_2$ according to whether the first match always succeeds or always fails. If instead the result cannot be determined with certainty, then the label is propagated and the search for other labels is continued on $v$. All remaining rules are straightforward.

Now that we have determined the labels to propagate we have to choose the type constraints to decorate them with. Let $L$ be a list $[\ell^1; \ldots; \ell^n]$ of labels, $t$ a type, and $e$ an expression; we use the notation $L_t : e$ to denote the expression $\ell^1_t : \ldots : \ell^n_t : e$ obtained by prefixing $e$ by the $t$-indexed labels in $L$. This notation is used to index the labels deduced by “$/\!\!/$” and define the operational semantics of CDuce$_\mathcal{L}$ as follows (a better definition can be found in the on-line extended version available at www.cduce.org):

[Reduction rules: the display is not recoverable from this copy. It contains the rules for applications, with a side condition requiring the result type to be defined, and two rules for match $v$ with $p_1$ ➔ $e_1$ | $p_2$ ➔ $e_2$, one per selected branch; they are described in the next paragraph.]

The rule for “unlabeled” applications does not change, while the one for applications of a labeled function changes as we explained earlier in this section. Matching clearly is the key rule. Branch selection is performed as in CDuce, but the labels determined by “$/\!\!/$” and indexed by the type of the reductum are prepended to the result. In particular if the first branch is selected, then we just propagate the labels in $v /\!\!/ p_1$, as the second pattern is not even checked. But if the second branch is selected then the result is prepended by both the labels of $v /\!\!/ p_2$ and those of $v /\!\!/ p_1$: while the first set of labels highlights the first of the two kinds of indirect flows present in CDuce, namely, those due to pattern matching, the second set of labels points at the other kind of flows possible in CDuce, that is, those generated by the branch dependency induced by the first match policy. Since we selected the second branch, $e_2$ knows that $v/p_1$ failed and, therefore, that the matched value $v$ does not belong to (the semantic interpretation of) the type accepted by $p_1$. Therefore, we must also propagate all those labels in $v$ whose content can be (even partially) deduced from the failure of $v/p_1$. By definition these are the labels obtained by matching $v$ against the negation of the type accepted by $p_1$ or, equivalently (see Lemma 2), against the type accepted by $p_1$ itself.

The reduction semantics we obtain is non-deterministic, since in the case of reduction of a pattern matching or of a labeled function we resort to the type system to determine the type decoration of the reductum's labels (and we know that the CDuce type system does not enjoy the minimum typing property). As a matter of fact, we have not described a single analysis but a whole family of analyses. This is slightly annoying from a theoretical viewpoint since we have to prove that all of them are sound so that, in practice, we can use any of them.

14 For example, consider a match of a Bool-labeled value against patterns built from true, false and the wildcard _. (The exact expression of this example is not legible in this copy.)

[ [ String:"Durand" String:"Paul" [ Int:6500 ] ]
  [ String:"Dupond" String:"Jean" [ Int:1800 ] ]
  [ String:"Martin" String:"Jules" [ Int:1200 ] ] ]

[ [ String:"Durand" String:"Paul" ]
  [ String:"Dupond" String:"Jean" ]
  [ String:"Martin" String:"Jules" ] ]

[ ([Surname Name] : [ String:"Durand" String:"Paul" ])
  ([Surname Name] : [ String:"Dupond" String:"Jean" ]) ]

Fig. 4. Labeled XML document and the results of queries Q1 and Q2. (The label names preceding each type index are not legible in this copy.)
However this has no consequence in practice: first, since we deal with a finite number of labels and types it is always possible to find the best analysis; second, since this problem already resides in the CDuce type system, a choice was already made by the implementation. So we follow the implementation of CDuce and use in Figure 3 and in the operational semantics of CDuce$_\mathcal{L}$ the types inferred by CDuce's type-checker: we end up with the best analysis (in the family described above) that is sound with the current implementation of CDuce.

Let us finally apply the analysis to the queries of Section 1. In order to trace how the information of each datum flows in the queries, we label the content of each sub-element of the elements by a different label, as shown in the left expression of Figure 4. The results of the analysis of Q1 and Q2 respectively are the central and right expressions in Figure 4 (see footnote 15). Note that Q1 propagates only the labels of name and surname (the propagation is a consequence of an explicit flow), thus the analysis has correctly indicated that it is safe. The analysis of Q2 instead propagates also the salary label, and therefore Q2 is rejected as insecure. The propagation of the label indexed by [Surname Name] is caused by the “$/\!\!/$” in the first match rule in CDuce$_\mathcal{L}$'s operational semantics. In particular when matching the pattern of transform against an element, the third rule for pairs in Figure 3 decomposes the test over each projection, one of which calculates, say in the first loop, the propagation for the Int-labeled value 6500 against the type MoreThanMe, which by the second rule for types in Figure 3 is equal to the list containing the salary label. Note also how the position of the label denotes different kinds of properties: for example we placed the label on the content of the salary element (the integer 1200) since we wanted to capture only transformations that depended on the content of the salary element, while had we placed it on the whole element [1200] we would have captured also transformations that test the presence of this element (in case it were optional).

We want to stress that our analysis can check very complex security entailments. As explained in the introduction, the independence of the result from salary can also be ensured by access control techniques or by stripping salary values from the source document. However, our technique allows one to check that even if a query can access both salaries and names it cannot correlate them. To that end it suffices to verify that the presence of a surname or of a name label in the result implies the absence of the salary label, and vice versa. According to this policy query Q1 would be accepted, since the salary label does not occur in the result, while query Q2 would be rejected, since all three labels are in the result. A query that plainly returned the list of all salaries (without any name or surname), or some statistic about them, would be considered safe, too. More generally, by using label propagation and some logic (e.g. propositional logic) on labels we can define complex security policies whose verification, trivial in our technique, would be very hard (if not impossible) by standard access control techniques, as we show in Section 6.

Finally, we can recast the analysis scenarios we outlined in the introduction (Figure 1) to the present setting. A sound static analysis for our system is an analysis that computes labels that will surely be absent from the result of the dynamic analysis, while a complete static analysis will determine labels that will surely be present in the same result. Therefore, here completeness is stated with respect to the dynamic analysis rather than with respect to the non-interference property. In that respect the work presented here constitutes the cornerstone of the outlined architecture.

15 Sequences are encoded by right associative nested pairs ended by `nil, XML elements become triples (`tag, (attributes, content)), while transform is obtained by iterating matching expressions on the elements, and encapsulating the results into a sequence.
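The non-correlation policy discussed above (name or surname labels may not appear together with the salary label) can be sketched as a check on the labels that survive in a result. This is an illustrative Python model under stated assumptions: the class `Labeled`, the functions `labels_of`, `strip` and `no_correlation`, and the label names "surname", "name" and "salary" are all hypothetical, and the two sample results only imitate the shape of the central and right expressions of Figure 4.

```python
from dataclasses import dataclass
from typing import Any, Set


@dataclass(frozen=True)
class Labeled:
    """A value marked by a security label indexed by a type (kept as text)."""
    label: str
    type_index: str
    value: Any


def labels_of(value: Any) -> Set[str]:
    """Collect every label occurring anywhere in a (nested) result."""
    if isinstance(value, Labeled):
        return {value.label} | labels_of(value.value)
    if isinstance(value, (list, tuple)):
        found: Set[str] = set()
        for item in value:
            found |= labels_of(item)
        return found
    return set()


def strip(value: Any) -> Any:
    """Erase all labels, keeping the underlying data."""
    if isinstance(value, Labeled):
        return strip(value.value)
    if isinstance(value, list):
        return [strip(item) for item in value]
    if isinstance(value, tuple):
        return tuple(strip(item) for item in value)
    return value


def no_correlation(labels: Set[str]) -> bool:
    """Names and salaries may each appear, but never in the same result."""
    has_identity = bool({"surname", "name"} & labels)
    return not (has_identity and "salary" in labels)


q1_result = [[Labeled("surname", "String", "Durand"),
              Labeled("name", "String", "Paul")]]
q2_result = [Labeled("salary", "[Surname Name]",
                     [Labeled("surname", "String", "Durand"),
                      Labeled("name", "String", "Paul")])]

print(no_correlation(labels_of(q1_result)))
print(no_correlation(labels_of(q2_result)))
print(strip(q2_result))
```

Checking the policy on the set of propagated labels, rather than on the data, is what makes the verification independent of how the query obtained its result.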

5 Properties

In this section we briefly enumerate the various properties of our approach. For space reasons proofs and less important lemmas are omitted or just sketched. They are all reported in the extended version of the article available on-line. In what follows, when we state that $e$ is a CDuce or CDuce$_\mathcal{L}$ expression we implicitly mean that $e$ is a well-typed (CDuce or CDuce$_\mathcal{L}$) expression.

Lemma 1 (Strip). Let $e$ be a CDuce$_\mathcal{L}$ expression. If $e \to e'$, then $\mathit{strip}(e) \to^* \mathit{strip}(e')$.

This lemma has two important consequences, namely (i) that CDuce$_\mathcal{L}$ is a conservative extension of CDuce with respect to the reduction theory, and (ii) that despite being non-deterministic the reduction of CDuce$_\mathcal{L}$ preserves the semantics of CDuce programs:

Corollary 1. Let $e$ be a CDuce$_\mathcal{L}$ expression. If $e \to^* v_1$ and $e \to^* v_2$, then $\mathit{strip}(v_1) = \mathit{strip}(v_2)$.

The soundness of the CDuce$_\mathcal{L}$ type system is proved by subject reduction and is instrumental to the soundness of the analysis.

Theorem 1 (Subject reduction). Let $e$ be a CDuce$_\mathcal{L}$ expression. If $\Gamma \vdash e : t$ and $e \to e'$, then $\Gamma \vdash e' : t$.

The next lemma is useful to understand the match reduction rules of CDuce$_\mathcal{L}$.

Lemma 2. For every value $v$ and type $t$, $v /\!\!/ t = v /\!\!/ \neg t$ (modulo the indexes of $[\,]$).

In order to prove non-interference we need two lemmas, one that characterizes the propagation operator and a second that is the non-interference counterpart of the standard substitution lemma in typed $\lambda$-calculi:

Lemma 3. Let $p$ be a pattern, $v$ a CDuce$_\mathcal{L}$ value with an occurrence $\Delta$ such that $v|_\Delta = \ell_t : v'$. If $\ell \notin v /\!\!/ p$, then for all $v''$ such that $\vdash v'' : t$, $(v/p = \Omega) \iff (v[v'']_\Delta / p = \Omega)$ holds true.

Lemma 4 (Substitution). Let $p$ be a pattern, $v$ a CDuce$_\mathcal{L}$ value whose occurrence $\Delta$ is such that $v|_\Delta = \ell_t : v'$. If $\ell$ does not occur in the image of $v/p$ and $\ell \notin v /\!\!/ p$, then for all $v''$ such that $\vdash v'' : t$, we have that $v[v'']_\Delta / p$ is pointwise equal to $v/p$.

We can now state the non-interference theorem: note that the non-determinism of CDuce$_\mathcal{L}$'s reduction is accounted for by quantifying over all results $v_r$ of the expression $e$.

Theorem 2 (Non-interference). Let $e$ be a CDuce$_\mathcal{L}$ expression, with an occurrence $\Delta$ such that $e|_\Delta = \ell_t : e'$. For every value $v_r$ such that $e \to^* v_r$, if $\ell \notin \mathit{lab}(v_r)$, then $\Delta$ is non-interfering in $\mathit{strip}(e)$ with respect to $t$, i.e., for all CDuce values $v$ such that $\vdash v : t$, we have $\mathit{strip}(e)[v]_\Delta \to^* \mathit{strip}(v_r)$.

By using Proposition 2 it is easy to see that the conclusion of the theorem implies Definition 2 of non-interference, thus justifying the “i.e.” we used in the statement of the theorem.

6 A last example

We end our presentation by commenting on a more articulated example to illustrate the use of our technique to define and verify complex security policies that cannot be expressed in terms of access control. We suppose that XML documents store information about persons who have to pass some examination. The form of the documents is described by the following type declarations:

type ExamBase = [Person*]
type Person = [Name Birth Grade?]
type Name = String
type Birth = [Year Month Day]
type Year = [Int]
type Month = MName
type MName = "Jan"|"Feb"|"Mar"|"Apr"| ...
type Day = [1--31]
type Grade = [Int]

As we see, every document records a list of names, with personal information, and with an optional element that stores the numerical result of the examination. The absence of such an element denotes that the person has not passed the examination yet (that is, the examination was either not taken, or taken and failed). An XML document that verifies this schema is shown in the left column of Figure 5, while the central column reports the result of importing the same document in CDuce.

Imagine that examination documents can be accessed by three different categories of users: academic staff, administrative staff, and normal users. We want academic staff to have unconstrained access to the information stored in the examination documents, while we may wish to constrain the accesses for administrative and normal users. As an example of security requirements we may wish to enforce, we have:

1. Only academic users can have information both on names and grades or on names and birthdays simultaneously.
2. The administrative users can check whether a person passed the examination (that is, they can check for the presence of a grade element) but cannot access the result.
3. Every user can ask for statistical results on grades upon criteria limited to year of birth and gender (so that they cannot select sufficiently restrictive sets to infer personal data).

Durand 1970 Aug 10 110
Dupond 1953 Apr 22
Dubois 1965 Sep 2 120

let eb : ExamBase =
  [ [ "Durand" [ [1970] "Aug" [10] ] [110] ]
    [ "Dupond" [ [1953] "Apr" [22] ] ]
    [ "Dubois" [ [1965] "Sep" [2] ] [120] ] ]

let eb : ExamBase =
  [ [ name_String : "Durand" [ [ stat_Int : 1970 ] private_MName : "Aug" [ private_(1--31) : 10 ] ] passed_Grade : [ result_Int : 110 ] ]
    [ name_String : "Dupond" [ [ stat_Int : 1953 ] private_MName : "Apr" [ private_(1--31) : 22 ] ] ]
    [ name_String : "Dubois" [ [ stat_Int : 1965 ] private_MName : "Sep" [ private_(1--31) : 2 ] ] passed_Grade : [ result_Int : 120 ] ] ]

Fig. 5. A database of examinations: in XML, in CDuce, and in CDuce$_\mathcal{L}$. (The XML tags of the first column are not legible in this copy.)

To dynamically verify these constraints we introduce five labels that we use to classify the information stored in documents: private (that classifies the month and the day of birth), stat (that classifies the year of birth and the gender attribute), name (that classifies names), passed (that classifies grade elements), and result (that classifies the content of grade elements). Rather than document-wise, this classification is described directly on types, as shown by the following definitions:

type Person = [ Name Birth (passed:Grade)? ]
type Name = (name:String)
type Year = [ stat:Int ]
type Month = (private:MName)
type Day = [ private:(1--31) ]
type Grade = [ result:Int ]

Note that in these definitions labels have no indexes (in documents they will be indexed by the types they are labeling). Before executing a query the system uses this specification to generate a labeled version of the document, as shown in the last column of Figure 5. The query is then executed on the labeled document (according to the semantics of CDuce$_\mathcal{L}$) and the following constraints (see footnote 16) are checked in the result. If the owner of the query is a normal user, then the result must satisfy:

$$(\mathsf{name} \Rightarrow \neg\mathsf{private} \wedge \neg\mathsf{stat} \wedge \neg\mathsf{result} \wedge \neg\mathsf{passed}) \;\wedge\; (\mathsf{private} \Rightarrow \neg\mathsf{stat} \wedge \neg\mathsf{result})$$

If the owner of the query is an administrative user, then the result must satisfy:

$$(\mathsf{name} \Rightarrow \neg\mathsf{private} \wedge \neg\mathsf{stat} \wedge \neg\mathsf{result}) \;\wedge\; (\mathsf{private} \Rightarrow \neg\mathsf{stat} \wedge \neg\mathsf{result})$$

where a propositional label is satisfied if and only if the label is present in the result. Thus, for instance, the second constraint must be read: if the label name occurs in the result then private, stat, and result cannot occur in it, and if private occurs in the result then stat and result do not. The difference with respect to the constraint for normal users is that the second constraint allows name and passed to occur simultaneously in the same result. Therefore a query that just tested the presence of a grade element without checking its content would satisfy this second constraint. If the result of a query satisfies the corresponding constraint, then its stripped version is returned to the owner of the query.

16 For the sake of the example we expressed these constraints in a propositional logic where the labels are atoms, but different languages are possible. The definition of such languages is out of the scope of the paper and a matter of future research: see Section 7.
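The two propositional constraints of this section can be evaluated directly on the set of labels found in a result. The sketch below, in Python, is only one possible encoding: the function names `implies`, `normal_user_ok` and `administrative_user_ok` are hypothetical, and each label atom is taken to be true exactly when the label occurs in the result, as stated above.

```python
from typing import Set


def implies(premise: bool, conclusion: bool) -> bool:
    """Material implication of propositional logic."""
    return (not premise) or conclusion


def normal_user_ok(labels: Set[str]) -> bool:
    """name => no private, stat, result, passed; private => no stat, result."""
    has = labels.__contains__
    return (implies(has("name"),
                    not (has("private") or has("stat")
                         or has("result") or has("passed")))
            and implies(has("private"),
                        not (has("stat") or has("result"))))


def administrative_user_ok(labels: Set[str]) -> bool:
    """Same as for normal users, except that name and passed may co-occur."""
    has = labels.__contains__
    return (implies(has("name"),
                    not (has("private") or has("stat") or has("result")))
            and implies(has("private"),
                        not (has("stat") or has("result"))))


# A query that only tests the presence of a grade element next to a name.
presence_test = {"name", "passed"}
# A query that returns grade statistics grouped by year of birth.
statistics = {"stat", "result"}

print(normal_user_ok(presence_test), administrative_user_ok(presence_test))
print(normal_user_ok(statistics), administrative_user_ok(statistics))
```

Academic users need no check in this encoding, since their access is unconstrained.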

7 Perspectives

This paper contains exploratory work toward the definition of information flow security in XML transformations. As such it opens several perspectives, both practical and theoretical.

First and foremost the cost of the dynamic analysis must be checked against an implementation (we are currently working on it). We expect this cost to be reasonably low: CDuce's pattern matching fully relies on dynamic type checking, therefore if we embed the dynamic generation of typed security labels in CDuce's runtime the resulting overhead should be small.

Also, to fill the gap with practice we must devise expressive and user-friendly ways to describe the labeling of XML documents in the meta-data (that is, XML Schemas and DTDs) and to express the associated security policies (for instance, in Section 6 we expressed them in propositional logic). These security properties should be defined by sets of constraints (i.e., formulæ of an appropriate logic) that are automatically generated from a specification expressed in an “ad hoc” language (e.g., like the authorizations defined in [6]).

The precision of the dynamic analysis must be enhanced by program rewriting techniques. To exemplify, an obvious “optimization” consists in rewriting all closed patterns (i.e., patterns without capture variables) into an equivalent type constraint pattern, by replacing $\wedge$ for &, $\vee$ for |, and $\times$ for pairing, so as to transform each closed pattern into the type with the same structure. Indeed, the analysis on types is more precise than that on equivalent patterns, as the latter is recursively applied to subcomponents, forgetting, in doing so, many interdependencies. But other subtler rewritings could improve our analysis: for instance a pattern that treats the constants true and false separately can be equivalent to one that tests their union, yet the analysis of the former on a Bool-labeled value propagates the label while the analysis of the latter yields $[\,]_{\mathsf{ok}}$. This discrepancy can be identified with the fact that the analyses of pair, intersection, and union patterns are performed independently on the two subpatterns. Thus a possible way to tackle this problem is by transferring some information from one pattern to the other, for example by mimicking the automaton-based technique used by CDuce for the just-in-time compilation of patterns.

Finally, note that one of the main technical novelties of this work is to endow security labels with constraints. The constraints at issue are quite simple, since they just express the static knowledge of the type of the labeled expression. It is then natural to think of much more expressive constraints. For example we can think of endowing labels with integrity constraints and define non-interference just in terms of consistent databases. In this perspective a program that checked the integrity of a base would always be interference-free even if it accessed private information. Of course checking non-interference in this case would be even more challenging, but it could pave the way to (security) proof carrying code for XML.

Acknowledgments: We are very grateful to Nicole Bidoit and Nevin Heintze for their careful reading and invaluable suggestions, and to Dario Colazzo, Alain Frisch, Alban Gabillon, and Massimo Marchiori for their feedback.

References

1. M. Abadi, B. Lampson, and J.-J. Levy. Analysis and caching of dependencies. In ICFP ’96, 1st ACM Conference on Functional Programming, pages 83–91, 1996.
2. V. Benzaken, G. Castagna, and A. Frisch. CDuce: An XML-Centric General Purpose Language. In ICFP ’03, 8th ACM Conference on Functional Programming, pages 51–63, 2003.
3. A. Christensen, A. Møller, and M. Schwartzbach. Extending Java for high-level web service construction. ACM TOPLAS, 2003. To appear.
4. S. Cluet and J. Siméon. Yatl: a functional and declarative language for XML, 2000. Draft manuscript.
5. S. Conchon. Modular information flow analysis for process calculi. In FCS 2002, Proceedings of the Foundations of Computer Security Workshop, Copenhagen, Denmark, 2002.
6. E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, and P. Samarati. A fine-grained access control system for XML documents. ACM TOISS, 5(2):169–202, 2002.
7. E. Damiani, S. De Capitani di Vimercati, S. Paraboschi, and P. Samarati. Design and implementation of an access control processor for XML documents. Computer Networks, 33(1–6):59–75, 2000.
8. M. Fernández, J. Siméon, and P. Wadler. An algebra for XML query. In FST&TCS, number 1974 in LNCS, pages 11–45, 2000.
9. A. Frisch, G. Castagna, and V. Benzaken. Semantic Subtyping. In LICS ’02, Seventeenth Annual IEEE Symposium on Logic in Computer Science, pages 137–146, 2002.
10. A. Gabillon and E. Bruno. Regulating access to XML documents. In 15th Annual IFIP WG 11.3 Working Conference on Database Security, July 15-18 2001.
11. V. Gapayev and B. Pierce. Regular object types. In Proc. of ECOOP ’03, Lecture Notes in Computer Science. Springer, 2003.
12. J.A. Goguen and J. Meseguer. Security policy and security models. In Proceedings of Symposium on Secrecy and Privacy, pages 11–20. IEEE Computer Society, April 1982.
13. H. Hosoya and B. Pierce. XDuce: A typed XML processing language. ACM TOIT, 2003. To appear.
14. IBM AlphaWorks. XML Security Suite. http://www.alphaworks.ibm.com/tech/xmlsecuritysuite.
15. A. Myers and B. Liskov. A decentralized model for information flow control. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP), pages 129–142, 1997.
16. F. Pottier and S. Conchon. Information flow inference for free. In ICFP ’00, 5th ACM Conference on Functional Programming, pages 46–57, September 2000.
17. F. Pottier and V. Simonet. Information flow inference for ML. ACM SIGPLAN Notices, 31(1):319–330, January 2002.
18. A. Sabelfeld and A. Myers. Language-based information-flow security. IEEE Journal on Selected Areas in Communications, 21(1):5–19, 2003.
19. D. Volpano and G. Smith. A type-based approach to program security. In TAPSOFT ’97, number 1214 in Lecture Notes in Computer Science, pages 607–621. Springer, 1997.
20. D. Volpano, G. Smith, and C. Irvine. A sound type system for secure flow analysis. Journal of Computer Security, 4(3):167–187, 1996.
21. C. Wallace and C. Runciman. Haskell and XML: Generic combinators or type based translation? In ICFP ’99, 4th ACM Conference on Functional Programming, pages 148–159, 1999.