Detecting and Debugging Insecure Information Flows

Wes Masri, Andy Podgurski, David Leon
Electrical Engineering & Computer Science Department
Case Western Reserve University
10900 Euclid Avenue, Cleveland, OH 44106 USA
+1 216 368 4231, +1 216 368 6884, +1 216 368 4231

Abstract

A new approach to dynamic information flow analysis is presented that can be used to detect and debug insecure flows in programs. It can be applied offline to validate and debug a program against an information flow policy, or, when fast response is not critical, it can be applied online to prevent illegal flows in deployed programs. Since dynamic analysis alone is inherently unable to detect implicit information flows, our approach incorporates a static preprocessing phase that permits detection of most implicit flows at runtime, in addition to explicit ones. To support interactive debugging of insecure flows, it also incorporates a new forward computing algorithm for dynamic slicing, which is more precise than previous forward computing algorithms and is not restricted to programs with structured control flow. A prototype tool implementing the proposed approach has been developed for Java byte code programs. Case studies in which this tool was applied to several subject programs are described.

1. Introduction

Insecure information flows are a classical type of security violation [5][6]. Such flows can permit either leakage of confidential information or corruption of data with high integrity requirements by data from an untrustworthy source. Information flows are also critical to the occurrence of ordinary software failures involving interactions between program components.

Information flow occurs from object x (source) to object y (target or sink) whenever information stored in x is propagated directly or indirectly to object y [6]. Information flows can be either explicit or implicit. Explicit flows can result from either data or control dependences between program statements. For example, there is an explicit flow from a variable used in an assignment statement to the variable defined by the statement. There is also an explicit flow from a variable used in the branch predicate of a conditional statement to a variable defined in the conditional statement’s scope. A conditional statement may also cause an implicit flow. For example, consider the statements

x = 1; if y = 0 then x = 0;

where y is either 0 or 1. Whether or not the then clause is executed, information flow occurs from y to x, because the final value of x depends on the value of y. The flow is explicit if the then clause is executed and implicit if it is not.¹

¹ Note that this characterization of implicit and explicit flows is somewhat different from that presented in [6].

Information flow analysis in computer security is closely related to program dependence analysis [10], which has many applications in software engineering and in code optimization/parallelization. Podgurski and Clarke have shown that for one statement in a program to affect the execution behavior of another, there must be a (static) chain of data and control dependences between the two statements [22]. Information flow analysis and program dependence analysis are both closely related to program slicing [30], which is a debugging technique that seeks to identify the set of program statements, called a slice, that could be responsible for an erroneous program state that occurred at a particular location in a program. There are both static and dynamic variants of program slicing. Dynamic information flow analysis (DIFA), which seeks to identify information flows between objects at runtime, is most closely related to dynamic slicing [16], which extracts a slice from an execution trace. Both dynamic information flow analysis and dynamic slicing are potentially much more precise than their static counterparts, because the outcomes of conditional branches become known at runtime. Moreover, though dynamic slicing is intended for general debugging, it is also useful for determining the cause(s) of unsafe information flows discovered by dynamic information flow analysis.

Given the importance of information flows in many attacks and software failures, the capability of detecting illegal or erroneous information flows potentially has great utility in intrusion detection and software testing. Surprisingly, the use of dynamic information flow analysis in these applications has received little attention in the research literature. To make DIFA more practical, we present the first precise forward computing algorithm for intra- and inter-procedural DIFA of both structured and unstructured programs. Forward computing algorithms for DIFA (and dynamic slicing) operate in tandem with program execution and have the advantage that they do not require a previously stored execution trace. Since dynamic analysis alone is inherently unable to detect implicit information flows, our approach incorporates an optional static preprocessing phase that permits detection of most implicit flows at runtime, in addition to explicit ones. Our approach is apparently the first to address the detection of implicit flows in the dynamic setting.

We have implemented our DIFA algorithm in a tool for monitoring information flow in Java byte code programs.² Our information flow monitoring tool can be used to detect unsafe information flows offline with either synthetic test cases or captured operational inputs, and it can also be used online during deployment, provided the software being monitored is not time or performance critical. This last capability is possible only with a forward computing algorithm. Recognizing that security requirements often change after deployment, our tool is designed to support configurable information flow policies, a feature that static systems lack. Given the dynamic nature of our tool, it easily handles language features that are generally hard to handle in a static context, such as pointers, arrays, and dynamic binding.
In checking for illegal information flows between objects, the tool considers the dynamic types of objects. It handles both intra-procedural and inter-procedural flows, though it does not currently address flows resulting from exceptions and exit statements, and it handles data flows between threads in multi-threaded programs.

² Note that use of constructs such as the Java language’s break, continue, and return statements leads to programs that are, strictly speaking, unstructured.

To support debugging of unsafe information flows (as well as other kinds of debugging), we also present a variant of our dynamic information flow analysis algorithm for use in dynamic slicing. This forward computing algorithm is applicable to both structured and unstructured programs. Forward computing algorithms for dynamic slicing are advantageous because they can be used for interactive debugging, since they do not require a previously stored execution trace. To make our slicing algorithm precise and reasonably efficient, it was necessary to define a variant of standard control dependence called direct dynamic control dependence and to devise an algorithm to compute it efficiently (see Section 3). Finally, our slicing algorithm can optionally compute relevant slices, and since it can handle multiple procedures, unstructured constructs, and control statements involving object and array references, it is more general than the relevant slice algorithm in [11].

The next section surveys related work on both information flow analysis and dynamic slicing. Section 3 presents the definitions and equations needed to describe our algorithms. Section 4 presents our algorithms. Section 5 describes our approach to detecting implicit flows. Section 6 briefly discusses our implementation for Java byte code programs. Section 7 describes the results of evaluating our tool empirically. Section 8 is the Conclusion.

2. Related work

Lampson motivated research on information flow analysis by describing the problem and listing a number of possible information leaks [18]. Fenton proposed an abstract machine called the Data Mark Machine to support dynamic checking of information flows. In this machine, each variable, including the program counter (PC), is tagged with a security level [9]. A write to a variable is inhibited if its level is lower than the PC level. (Note that the instruction set of the machine contains an instruction to reset the PC level.) Jones and Lipton proposed a similar mechanism in which only the output of a program is checked for illegal flows [13].

The Perl scripting language provides a special execution mode called taint mode, in which all user supplied input is treated as suspicious unless the programmer explicitly approves it. Taint mode is very effective at preventing flows from user supplied input data to potentially dangerous system calls, but is unable to detect or prevent any other type of flows, including ones resulting from control dependence.

Static analysis techniques for validating information flows are usually safer but less precise than dynamic mechanisms, because they treat all control flow paths as executable. Denning and Denning proposed a technique based on control flow and data flow analysis for verifying a program’s compliance with an information flow policy [5][6]. A number of language-based type checking systems have also been proposed [24]. In such systems every program expression is assigned a security type in addition to its ordinary type. In type checking a program, the compiler ensures that the program cannot exhibit illegal information flows at run-time.

Tip [28] provides a survey of both static and dynamic program slicing techniques. Dynamic program slicing was first introduced by Korel and Laski [16]; their approach computes executable slices for structured programs. Korel later proposed another algorithm that computes executable slices for possibly unstructured programs [14]. Agrawal and Horgan used Dynamic Dependence Graphs (DDG), whose size is unbounded, and a reduced version of them to compute dynamic slices [2]. All of the aforementioned algorithms are based on “backward” analysis; forward computing algorithms for dynamic slicing and information flow analysis are the focus of this paper.

Korel and Yalamanchili were the first to propose a forward computing algorithm for dynamic slicing [17]; their algorithm is applicable only to structured programs. In [4][8], another forward computing slicing algorithm was proposed for handling possibly unstructured C programs. However, this algorithm employs very imprecise rules for capturing control dependences involving goto, break, and continue statements. For example, when a goto statement is executed, the slices of the target statement and of all statements that execute following it contain the goto and the statements it is dependent on. The algorithm’s precision is further reduced by the fact that it determines the variables used by a given statement statically. [19] presents an example that contrasts the slices computed using the algorithm in [8] and the slices computed using our algorithm.

More recently, Zhang et al [32] analyzed the characteristics of dynamic slices, such as reoccurrence and overlapping, and identified properties that enable their space efficient representation in forward computing algorithms. They show that by using reduced ordered binary decision diagrams to represent a set of dynamic slices, the space and time requirements of maintaining dynamic slices are greatly reduced.

[Figure 1 – CFG for an unstructured method, with vertices vI, v1 to v8, and vF]
Korel and Laski [15] defined a relation called potential influence which relates a control statement to other statements it might affect had it been evaluated differently. Agrawal and Horgan [3] used the potential influence relation to define what they call relevant slices, i.e., dynamic program slices that reflect potential influences. They presented a backward computing algorithm for computing relevant slices that is based on extending the DDG. Finally, they proposed that coverage of relevant slices involving certain (limited) code changes be used as a basis for test suite reduction in regression testing. Gyimothy et al [11] presented an earlier version of the algorithm in [4] that also computes relevant slices; it is based on extending Program Dependence Graphs (PDGs). This forward computing algorithm is strictly intra-procedural and handles only scalar variables and structured code. Wang et al [29] presented a backward computing algorithm for computing relevant slices that is also based on augmenting the PDG, and they discussed the use of relevant slices in the detection of what they call omission errors in Java byte code programs.

There is a clear parallel between the potential influence relation and the implicit flow relation we described earlier (surprisingly, there is no reference in [15] to the work presented in [6]). Thus, our approach to detecting implicit flows (Section 5) is related to the aforementioned work on relevant slices. Our approach uses program transformations and, as in [11], it uses a forward computing algorithm, but it can additionally handle inter-procedural flows, unstructured constructs, and control statements involving object and array references.

3. Definitions

This section, in conjunction with Appendix A, presents a number of definitions and equations that form the basis of our algorithms. First, dynamic control dependence and data dependence are defined. Next, these and other relations between program statements are used to define the direct and indirect “influence” relations. Finally, the latter relations are used to define information flows and dynamic slices.

3.1. Dynamic control dependence

In program dependence analysis, the concept of control dependence is used to model the dependence of program statements on branch predicates that potentially control their execution. To achieve precision, our dynamic information flow analysis and dynamic slicing algorithms require a new definition of dynamic control dependence that is based on the classical definition of static control dependence. “Classical” control dependence, as defined in Appendix A, is a conservative, static approximation to the runtime dependences that occur between program statements and the branch predicates that control their execution. Control dependence can be characterized more precisely at runtime, as the outcomes of conditional branches become known.

In defining dynamic control dependence, it is useful to speak in terms of execution traces. Formally, an execution trace is a sequence of elements called actions [17]. An action represents the execution of a statement (or basic block). The kth action in a trace T is denoted by T(k) or by s^k, where s is the corresponding statement (or basic block). For a given statement s occurring in a trace, s^λ denotes the most recent action involving s, i.e., the most recent execution of s. A predicate action is an action s^k for which s is a branch predicate. A subtrace of T, denoted T(k, m), is the subsequence of actions starting at T(k) and ending at T(m). There is an obvious correspondence between execution traces of programs and initial walks (see Appendix A) in the programs’ control flow graphs, and we will often equate traces with initial walks and actions with vertex occurrences.

The static control dependence relation is due to Ferrante et al [10]. In Appendix A, we define it and rename it direct control dependence or DCD. We now provide a formal definition of dynamic control dependence that is based on the DCD relation:

Definition. Let s^k and t^m be two actions in an execution trace T, where k < m and s^k is a predicate action. Then t^m is directly dynamically control dependent on s^k, denoted t^m DDynCD s^k, iff the subtrace T(k, m) demonstrates that t DCD s. The unique predicate action (if any) that t^m is directly dynamically control dependent on is denoted DDynCD(t^m).

Intuitively, DDynCD(t^m) is the most recent predicate action s^k to occur prior to action t^m such that t DCD s. This is the key to precisely capturing dynamic control dependences in both information flows and slices, because t^m may be in the dynamic scope of many predicate actions simultaneously, due to loops, nested control constructs, and unstructured flow of control. Note that the DDynCD relation is applicable to both structured and unstructured programs. The DDynCD relationships for the control flow graph in Figure 1 and the execution trace T = ⟨vI, v1, v3, v4, v1, v2, v4, v5, v7, v2, v4, v5, v6, v8, vF⟩ are shown in the second column of Table 1.

As defined above, DDynCD models only intra-procedural control dependences. Inter-procedural control dependences may occur when an execution trace spans different functions.

Table 1 - Sample trace in the CFG of Figure 1 with corresponding DDynCD relationships. Column 3 shows the steps taken by the algorithm in Figure 2. Column 4 shows the state of CDSTACK with top of stack displayed to the right. An entry of 0 denotes null (no dependence, or an empty stack).

T  | DDynCD | Algorithm actions (tagged by line number) | CDSTACK
vI | 0      | 7: DDynCD(vI) = 0 | 0
v1 | 0      | 7: DDynCD(v1) = 0 → 13: pushes v1 | v1
v3 | v1     | 5: DDynCD(v3) = v1 | v1
v4 | 0      | 2: pops v1 → 7: DDynCD(v4) = 0 → 13: pushes v4 | v4
v1 | v4     | 5: DDynCD(v1) = v4 → 13: pushes v1 | v4 → v1
v2 | v1     | 5: DDynCD(v2) = v1 | v4 → v1
v4 | v4     | 2: pops v1 → 5: DDynCD(v4) = v4 → 11: pops v4 → 13: pushes v4 | v4
v5 | 0      | 2: pops v4 → 7: DDynCD(v5) = 0 → 13: pushes v5 | v5
v7 | v5     | 5: DDynCD(v7) = v5 → 11: pops v5 → 13: pushes v7 | v7
v2 | v7     | 5: DDynCD(v2) = v7 | v7
v4 | v7     | 5: DDynCD(v4) = v7 → 13: pushes v4 | v7 → v4
v5 | v7     | 2: pops v4 → 5: DDynCD(v5) = v7 → 11: pops v7 → 13: pushes v5 | v5
v6 | v5     | 5: DDynCD(v6) = v5 → 11: pops v5 → 13: pushes v6 | v6
v8 | 0      | 2: pops v6 → 7: DDynCD(v8) = 0 | 0
vF | 0      | 7: DDynCD(vF) = 0 | 0

3.2. Dynamic data dependence

Modeling dynamic data dependences between actions requires associating two sets of variables or objects with each action: the set of variables or objects D(s^k) defined (assigned a value) at s^k, and the set of variables or objects U(s^k) used (referenced) at s^k. (Note that different objects may be defined or used at different actions involving the same statement.) We also apply this notation to traces and sets of actions in the obvious way.

Definition. Let s^k and t^m be two actions in an execution trace T, where k < m. Then t^m is directly dynamically data dependent on s^k, denoted t^m DDynDD s^k, iff

$$(D(s^k) \cap U(t^m)) - D(T(k+1,\, m-1)) \neq \emptyset$$

The set of actions that t^m is directly dynamically data dependent on is denoted DDynDD(t^m).

Informally, t^m DDynDD s^k iff t^m uses a variable or object that was last defined by s^k. The DDynDD relation models both intra-procedural and inter-procedural data dependences. The latter occur when an execution trace spans different functions and data defined in one function is used in another.
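The informal reading of DDynDD suggests a simple forward computation: remember, for every variable, the index of the action that last defined it. The following Java sketch illustrates this under assumptions of our own. The class name DataDependenceTracker, its step method, and the use of strings as variable names are hypothetical and are not part of the tool described in this paper.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper: computes DDynDD for each action as the trace grows. */
final class DataDependenceTracker {
    // Maps each variable to the index k of the action that last defined it.
    private final Map<String, Integer> lastDef = new HashMap<>();
    private int next = 1; // trace positions are 1-based, as in T(k)

    /**
     * Processes the next action, given its use set U and definition set D,
     * and returns the indices of the actions it is directly data dependent on.
     */
    Set<Integer> step(Set<String> uses, Set<String> defs) {
        Set<Integer> deps = new TreeSet<>();
        // Uses are resolved first, so that "x = x + 1" depends on the old x.
        for (String v : uses) {
            Integer k = lastDef.get(v);
            if (k != null) {
                deps.add(k); // v was defined at k and not redefined since
            }
        }
        for (String v : defs) {
            lastDef.put(v, next); // later uses of v will see this action
        }
        next++;
        return deps;
    }

    public static void main(String[] args) {
        DataDependenceTracker t = new DataDependenceTracker();
        t.step(Set.of(), Set.of("x"));                            // 1: x = 1
        t.step(Set.of(), Set.of("y"));                            // 2: y = 2
        t.step(Set.of("y"), Set.of("x"));                         // 3: x = y
        System.out.println(t.step(Set.of("x", "y"), Set.of("z"))); // 4: z = x + y
    }
}
```

In the small trace in main, action 4 depends on actions 2 and 3 but not on action 1, because the definition of x at action 1 is overwritten at action 3. This corresponds to subtracting D(T(k + 1, m − 1)) in the definition.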

3.3. Direct influence and influence

In addition to the DDynCD and DDynDD relations, we identify three other kinds of dynamic dependences between actions, each of which is inter-procedural (they are defined formally in [19]):

1) Use of a value returned by a return statement
2) Use of a value passed by a formal parameter
3) Control dependence on a calling method’s invoke instruction (this is similar to the static entry-dependence effect described in [25])

The combination of the aforementioned five types of dependences comprises what we call “direct influence”. Given two actions s^k and t^m with k < m, s^k directly influences t^m, denoted s^k DInfluence t^m, iff t^m exhibits any of these five types of dependences upon s^k. (Note the reversal of order between t^m and s^k.) The set of actions that t^m is directly influenced by is denoted DInfluence(t^m).

The Influence relation represents the different ways, both direct and indirect, that one statement can influence the execution of another. Formally, it is simply the transitive closure of the DInfluence relation.

Definition. Let T be an execution trace and let s^k and t^m be actions in T with k < m. Then s^k influences t^m, denoted s^k Influence t^m, iff there is a sequence a_1, a_2, ..., a_n of actions in T such that a_1 = s^k, a_n = t^m, and for i = 1, ..., n – 1, a_i DInfluence a_{i+1}. The set of actions that influence t^m is denoted Influence(t^m). It follows that

$$\mathit{Influence}(t^m) = \mathit{DInfluence}(t^m) \cup \bigcup_{s^k \in \mathit{DInfluence}(t^m)} \mathit{Influence}(s^k)$$

3.4. Dynamic information flow analysis and dynamic slicing

Dynamic information flow analysis is closely related to dynamic slicing in that they both examine the execution trace to identify dynamic control and data dependence relationships. Whereas dynamic information flow analysis is concerned with identifying the set of all variables or objects that influence a variable or object at an action s^k, dynamic slicing is concerned with identifying the set of all statements that influence s^k or that influence a variable or object at s^k.

Definition. Let T be an execution trace, let s^k and t^m be actions in T with k < m, and let x and y be variables such that x is used at s^k and y is defined at t^m. Then information flows from x at s^k to y at t^m iff s^k Influence t^m. The set of variables or objects from which information flows to y at t^m is the set

$$\mathit{InfoFlow}(t^m) = U(t^m) \cup U(\mathit{Influence}(t^m))$$

where U(t^m) is the set of variables or objects used at t^m and U(Influence(t^m)) is the set of variables or objects used at the actions that influence t^m. Our dynamic information flow analysis algorithm is based on the following inductive characterization of the set InfoFlow(t^m):

$$\mathit{InfoFlow}(t^m) = U(t^m) \cup \bigcup_{s^k \in \mathit{DInfluence}(t^m)} \mathit{InfoFlow}(s^k) \qquad \text{(I)}$$

Note that to compute the variables or objects from which information flows to a variable or object y at an action t^m, where t^m uses but does not define y, it suffices to compute InfoFlow(d^n) as above for the last definition d^n of y before t^m.

The dynamic slice for an action t^m is the set of statements executed in actions that influence t^m, plus t itself.

Definition. Let T be an execution trace and let t^m be an action in T. The dynamic slice for t^m is the set of statements

$$\mathit{DynSlice}(t^m) = \{t\} \cup \{s : \exists k\,[\,s^k\ \mathit{Influence}\ t^m\,]\}$$

Our dynamic slicing algorithm is based on the following inductive characterization of DynSlice(t^m):

$$\mathit{DynSlice}(t^m) = \{t\} \cup \bigcup_{s^k \in \mathit{DInfluence}(t^m)} \mathit{DynSlice}(s^k) \qquad \text{(II)}$$

It is often useful to consider the dynamic slice for a variable just before or after an action is executed [30]. The dynamic slice of a variable or object x just before execution of action t^m is simply the slice of the last action before t^m that defined x.

4. Algorithms

In this section, we first describe our algorithm for computing the DDynCD relation and then describe our forward-computing algorithms for dynamic information flow analysis and dynamic slicing, which build on the DDynCD algorithm.

4.1. Direct dynamic control dependence

In this section we present an algorithm for computing the direct dynamic control dependence relation DDynCD. This algorithm, which is shown in Figure 2, is crucial to computing dynamic information flows and dynamic slices precisely (i.e., avoiding spurious dependences), and it is applicable to both structured and unstructured programs. The algorithm applies to an individual method m and requires that its control flow graph G(m) and immediate postdominators (see Appendix A) have been computed beforehand, which can be done statically or when a class is loaded. Another important data structure used by the algorithm is a stack, denoted CDSTACK(m), which stores references to predicate actions whose “dynamic scope” has not yet been exited. The dynamic scope of a predicate action s^k is exited when ipd(s) is reached, at which time s^k is popped from CDSTACK(m). Note that because of loops an arbitrary number of dynamic scopes may be “open” at a given point during execution of m.

Summary of Steps:
1. Push decision node on CDSTACK(m) when reached
2. Pop decision node off CDSTACK(m) when its ipd is reached
3. Current node is directly control dependent on TOS(m)
4. Pop control node off CDSTACK(m) when a node with the same ipd is reached (this step ensures that stack size is bounded)

Compute Direct Dynamic Control Dependence( )
Input: Executing statement s, CDSTACK(m)
Output: DDynCD(s^λ) or null
ipd(s): immediate postdominator of a statement s
TOS(m): top of CDSTACK(m)

 1  if ¬Empty(CDSTACK(m)) and s = ipd(TOS(m)) then
 2      pop CDSTACK(m)
 3  endif
 4  if ¬Empty(CDSTACK(m)) then
 5      DDynCD(s^λ) = TOS(m)
 6  else
 7      DDynCD(s^λ) = null
 8  endif
 9  if s is a decision statement then
10      if ¬Empty(CDSTACK(m)) and ipd(s) = ipd(TOS(m)) then
11          pop CDSTACK(m)
12      endif
13      push s onto CDSTACK(m)
14  endif

Figure 2: DDynCD algorithm

To ensure that the size of CDSTACK(m) is bounded by the number of decision vertices in G(m), only the latest action in a sequence of predicate actions involving predicates with the same immediate postdominator is kept on CDSTACK(m). This is permissible because of the following results: (1) any trace can be decomposed into a sequence of n ≥ 1 segments, the last n – 1 of which demonstrate a chain of DDynCD relationships between statements and (2) the sequence of vertices involved in these relationships can be decomposed into blocks of vertices with the same immediate postdominator. These results are stated formally and proved in the extended version of this paper [19]. Also, the presentation found in [20] graphically illustrates the steps taken by the DDynCD algorithm. Table 1 shows how computing the DDynCD relationships using the algorithm in Figure 2 leads to the same results as directly applying the definition of the DDynCD relation.

Note that the slicing algorithm described in [32] also computes the DDynCD relation (there termed dynamic control dependence). For each statement t, the set CD(t) of predicate statements on which t is statically control dependent is computed. Action t^m is determined to be dynamically control dependent on action s^k if s ∈ CD(t) and k is the latest time stamp of any executed predicate from CD(t). This algorithm requires Ω(lg n) worst-case time per action processed, where n is the number of predicate statements in a method. Our DDynCD algorithm, on the other hand, requires only constant time per action processed, assuming that immediate postdominators are precomputed. Also, the dynamic control dependence chain captured by CDSTACK is potentially useful during debugging.

Compute InfoFlow and DynSlice( )
Input: Action t^m

Compute DInfluence(t^m) (see Section 3.3)
DynSlice(t^m) = {t}
InfoFlow(t^m) = U(t^m)
for all s^k ∈ DInfluence(t^m)
    InfoFlow(t^m) = InfoFlow(t^m) ∪ InfoFlow(s^k)
    DynSlice(t^m) = DynSlice(t^m) ∪ DynSlice(s^k)
endfor
Store InfoFlow(t^m) for subsequent use
Store DynSlice(t^m) for subsequent use
if t^m is a sink defined in policy and InfoFlow(t^m) contains a sensitive object then
    Stop execution
    Log InfoFlow(t^m) and DynSlice(t^m)
endif

Figure 3 - InfoFlow and DynSlice algorithms
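As an illustration of the DDynCD algorithm of Figure 2, the following Java sketch replays the trace of Table 1. It is a minimal sketch under stated assumptions, not the tool's implementation. The class name DDynCDSketch is hypothetical, the stack holds vertex names rather than references to predicate actions, and the immediate postdominators in IPD are not given in the text: they were inferred from the pop actions recorded in Table 1.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

final class DDynCDSketch {
    // Assumed ipd of each decision vertex of Figure 1 (inferred from Table 1).
    static final Map<String, String> IPD = Map.of(
            "v1", "v4", "v4", "v5", "v5", "v8", "v6", "v8", "v7", "v8");
    // The execution trace T listed in the first column of Table 1.
    static final List<String> TRACE = List.of(
            "vI", "v1", "v3", "v4", "v1", "v2", "v4", "v5",
            "v7", "v2", "v4", "v5", "v6", "v8", "vF");

    private final Map<String, String> ipd; // keys are exactly the decision vertices
    private final Deque<String> cdStack = new ArrayDeque<>(); // CDSTACK(m)

    DDynCDSketch(Map<String, String> ipd) {
        this.ipd = ipd;
    }

    /** Processes executing statement s and returns DDynCD of this action, or null. */
    String step(String s) {
        // Lines 1-3: s closes the dynamic scope of the predicate on top of the stack.
        if (!cdStack.isEmpty() && s.equals(ipd.get(cdStack.peek()))) {
            cdStack.pop();
        }
        // Lines 4-8: the current action depends on the top of the stack, if any.
        String dependsOn = cdStack.isEmpty() ? null : cdStack.peek();
        // Lines 9-14: keep only the latest predicate among those sharing an ipd.
        if (ipd.containsKey(s)) {
            if (!cdStack.isEmpty() && ipd.get(s).equals(ipd.get(cdStack.peek()))) {
                cdStack.pop();
            }
            cdStack.push(s);
        }
        return dependsOn;
    }

    /** Runs the algorithm over a whole trace; null entries mean no dependence. */
    static List<String> run(Map<String, String> ipd, List<String> trace) {
        DDynCDSketch algorithm = new DDynCDSketch(ipd);
        List<String> result = new ArrayList<>();
        for (String s : trace) {
            result.add(algorithm.step(s));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(run(IPD, TRACE));
    }
}
```

With the assumed postdominators, the list returned by run agrees with the DDynCD column of Table 1, reading null for 0. Each call to step performs at most two pops and one push, which is consistent with the constant time per action claimed above.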

4.2. Dynamic information flow analysis and dynamic slicing

Figure 3 shows a high-level description of our dynamic information flow analysis and dynamic slicing algorithms (integrated together). The algorithms apply equations (I) and (II) sequentially at every action a_i of an executing program and then store the results for subsequent use. Since the algorithms are forward computing, when the equations are applied at a_i, all the values they depend on have already been computed and are available. In this respect, our algorithms are similar to the dynamic slicing algorithm presented in [4][8]. However, our slicing algorithm computes more accurate slices because of the precision of its control dependence sub-algorithm (DDynCD) and because data dependences (DDynDD) are computed dynamically, as object and array references are resolved.

Let m be the number of statements and n be the number of active objects in a program. As discussed in [19], the time complexities at an action t^k of InfoFlow and DynSlice are O(n^2) and O(n⋅m), respectively. Also, the space complexities at an action t^k of InfoFlow and DynSlice are O(n^2) and O(n⋅m), respectively.

a)
1.  x1 = x2 = 0;
2.  if (y1 == 1) {
3.      x1 = 1;
4.      if (y2 == 1) {
5.          x2 = 1;
        }
6.      print(x2);
    }
7.  print(x1);
8.  print(x2);

b)
1.  x1 = x2 = 0;
2.  if (y1 == 1) {
3.      x1 = 1;
4.      if (y2 == 1) {
5.          x2 = 1;
        }
t1.     addFlow(4, x2_id);
6.      print(x2);
    }
t2. addFlow(2, x1_id);
t3. addFlow(2, x2_id);
7.  print(x1);
8.  print(x2);

Figure 4 – a) Original code, b) Transformed code
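Equations (I) and (II) can be applied forward with very little machinery once DInfluence is known for each action. The Java sketch below is only illustrative. The class ForwardFlowSketch and its methods are hypothetical names, the direct influencers are supplied by the caller rather than computed, and the sample trace is a hand-derived run of the original code of Figure 4-a under the assumption y1 == 1 and y2 == 1. In that run every statement executes once, so statement numbers and action indices coincide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

final class ForwardFlowSketch {
    private final List<Set<String>> infoFlow = new ArrayList<>();  // InfoFlow per action
    private final List<Set<Integer>> dynSlice = new ArrayList<>(); // DynSlice per action

    /** Processes one action and returns its 1-based index in the trace. */
    int step(int statement, Set<String> uses, Set<Integer> directInfluencers) {
        Set<String> flow = new TreeSet<>(uses); // InfoFlow(t^m) = U(t^m)
        Set<Integer> slice = new TreeSet<>();
        slice.add(statement);                   // DynSlice(t^m) = {t}
        for (int k : directInfluencers) {       // stored results of earlier actions
            flow.addAll(infoFlow.get(k - 1));
            slice.addAll(dynSlice.get(k - 1));
        }
        infoFlow.add(flow);
        dynSlice.add(slice);
        return infoFlow.size();
    }

    Set<String> infoFlow(int m) { return infoFlow.get(m - 1); }

    Set<Integer> dynSlice(int m) { return dynSlice.get(m - 1); }

    /** Hand-derived run of Figure 4-a with y1 == 1 and y2 == 1 (an assumption). */
    static ForwardFlowSketch figure4a() {
        ForwardFlowSketch f = new ForwardFlowSketch();
        f.step(1, Set.of(), Set.of());         // x1 = x2 = 0
        f.step(2, Set.of("y1"), Set.of());     // if (y1 == 1)
        f.step(3, Set.of(), Set.of(2));        // x1 = 1, controlled by action 2
        f.step(4, Set.of("y2"), Set.of(2));    // if (y2 == 1), controlled by action 2
        f.step(5, Set.of(), Set.of(4));        // x2 = 1, controlled by action 4
        f.step(6, Set.of("x2"), Set.of(2, 5)); // print(x2): control on 2, data on 5
        f.step(7, Set.of("x1"), Set.of(3));    // print(x1): data on 3
        f.step(8, Set.of("x2"), Set.of(5));    // print(x2): data on 5
        return f;
    }

    public static void main(String[] args) {
        ForwardFlowSketch f = figure4a();
        System.out.println(f.infoFlow(6)); // prints [x2, y1, y2]
        System.out.println(f.dynSlice(6)); // prints [2, 4, 5, 6]
    }
}
```

Because each action only merges sets that were stored for earlier actions, no execution trace has to be kept. A policy check in the style of Figure 3 would test whether the InfoFlow set of a sink action intersects the set of sensitive objects before the action is allowed to proceed.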

5. Implicit flows

The DIFA algorithm presented in Section 4.2 is capable of detecting all explicit information flows as described in Section 1, but its dynamic nature impedes it from detecting any implicit flows. Note that in [6] all flows resulting from control constructs are considered implicit. Our characterization of implicit flow is somewhat different in that when a statement defining the target of a flow actually executes, the flow is considered explicit. Such a flow is detectable by our DIFA algorithm.

An innovative aspect of our approach is that it includes an optional preliminary static analysis phase that identifies implicit flows and transforms them into explicit ones. The resulting hybrid mechanism is therefore capable of detecting both explicit and implicit information flows. A clear advantage of our hybrid mechanism over purely static ones is that it can accurately determine the sources of the flows even if they involve object or array references. On the other hand, false positive instances of illegal flows are more likely to arise when implicit flows are considered; hence analysis of implicit flows is an optional feature of our tool.

For the purpose of the aforementioned transformation, our tool’s API provides a method declared as ‘void addFlow(s_id, x_id)’ where s_id identifies a decision statement s and x_id identifies a variable x. addFlow records the influence of s on x by adding the InfoFlow and DynSlice sets last computed for s to those for x. Suppose addFlow(s_id, x_id) is invoked after a branch from s that does not define x but realizes an implicit flow to x from a variable y used at s. This call has the same effect on the analysis as would executing the statement x = x along the branch. In effect, it converts an implicit flow into an explicit one. The code transformation process involves statically identifying implicit flows (or implicit influences) and then inserting calls to addFlow at appropriate locations.

Next we discuss a few example transformations and then present a transformation algorithm and discuss its limitations. Note that in this section we assume that an intruder trying to access sensitive information may have knowledge of the code whose information flows are being analyzed. We also assume, for the sake of discussion, that information flows are being monitored online.
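The effect of addFlow can be illustrated on the two statements from Section 1, x = 1; if y = 0 then x = 0. The Java sketch below is a deliberately simplified model and not the tool's API. The class AddFlowSketch and all of its methods are hypothetical, only InfoFlow sets are tracked (DynSlice sets would be merged in the same way), and control dependences of the predicate itself are ignored.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

final class AddFlowSketch {
    // InfoFlow set last computed for each statement id.
    private final Map<Integer, Set<String>> stmtFlow = new HashMap<>();
    // InfoFlow set of the latest definition of each variable.
    private final Map<String, Set<String>> varFlow = new HashMap<>();

    /** Records the execution of decision statement stmtId, which uses the given variables. */
    void predicate(int stmtId, String... uses) {
        Set<String> flow = new TreeSet<>();
        for (String u : uses) {
            flow.add(u);
            flow.addAll(varFlow.getOrDefault(u, Set.of()));
        }
        stmtFlow.put(stmtId, flow);
    }

    /** Records "x = constant" executed under the control of ctrlId (0 if uncontrolled). */
    void assignConstant(String x, int ctrlId) {
        Set<String> control = stmtFlow.getOrDefault(ctrlId, Set.of());
        varFlow.put(x, new TreeSet<>(control));
    }

    /** Analogue of addFlow(s_id, x_id): adds the sets last computed for s to those for x. */
    void addFlow(int stmtId, String x) {
        Set<String> control = stmtFlow.getOrDefault(stmtId, Set.of());
        varFlow.computeIfAbsent(x, key -> new TreeSet<>()).addAll(control);
    }

    Set<String> flowTo(String x) {
        return varFlow.getOrDefault(x, Set.of());
    }

    /** Runs the example for a given y, with or without the inserted addFlow call. */
    static Set<String> run(int y, boolean transformed) {
        AddFlowSketch a = new AddFlowSketch();
        a.assignConstant("x", 0);     // 1: x = 1
        a.predicate(2, "y");          // 2: if (y == 0)
        if (y == 0) {
            a.assignConstant("x", 2); // 3: x = 0, control dependent on statement 2
        } else if (transformed) {
            a.addFlow(2, "x");        // inserted on the branch that does not define x
        }
        return a.flowTo("x");
    }

    public static void main(String[] args) {
        System.out.println(run(1, false)); // flow from y to x is missed
        System.out.println(run(1, true));  // flow from y to x is recorded
    }
}
```

When y is 1, the untransformed run reports no flow into x even though an observer can infer y from the final value of x. With the inserted call, y appears in the set recorded for x on both branches.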

5.1. Examples

Consider the snippet of code in Figure 4-a where y1 and y2 are binary variables. If line 3 executes (i.e. (y1 == 1) holds) our basic DIFA algorithm will detect the explicit information flow from y1 at line 2 to x1 at line 3 and then to the print statement at line 7. On the other hand, if line 3 does not execute (i.e. (y1 == 0) holds) the basic algorithm will not detect the implicit flow from y1 at line 2 to x1 at line 7, by which an intruder can infer that y1 == 0 after observing that x1 == 0. Other implicit flows will also be missed by our algorithm unless the transformation shown in Figure 4-b is performed. In the transformed code, each inserted addFlow call records an implicit influence from a statement to a variable, allowing our DIFA algorithm to track the resulting implicit flows. Note that an intruder familiar with the code will realize that: for x1 to stay 0, the condition (y1 == 0) must hold; and for x2 to stay 0, the condition ((y1 == 0) or (y2 == 0)) must hold. The following cases illustrate why the transformation is needed:

1) When y2 is sensitive and line 5 did not execute, the addFlow call at t1 prevents line 6 from executing and disclosing that x2 == 0, from which an intruder could infer that y2 == 0. Note that at line 6, InfoFlow of x2 contains y2 as a result of t1 recording the implicit influence of line 4 on x2.
2) When y1 is sensitive and line 3 did not execute, the addFlow call at t2 prevents line 7 from executing and disclosing that x1 == 0, from which an intruder could infer that y1 == 0.
3) When y1 is sensitive, y2 is known to be 1, and line 5 did not execute, the addFlow call at t3 prevents line 8 from executing and disclosing that x2 == 0, from which an intruder could infer that y1 == 0.

Note that when y2 is sensitive and y1 is not, line 8 executes only if y1 == 0, disclosing that x2 == 0 and consequently ((y1 == 0) or (y2 == 0)). Nothing is actually disclosed about the value of y2 since the conditional expression is true no matter what y2’s value is. In addition, the only predicate that uses y2 (at line 4) does not actually execute if y1 == 0. Therefore it cannot

    1. if (y1 == 1) {
           addFlow(1, x2_id);
    2.     x1 = 1;
       } else {
           addFlow(1, x1_id);
    3.     x2 = 1;
       }

    1. switch (y1) {
       case 0:
           addFlow(1, x2_id);
    2.     x1 = 0;
           break;
       case 1:
           addFlow(1, x1_id);
    3.     x2 = 1;
       [...]

    proc transformImplicitToExplicit(G)
    begin
        for each decision vertex p in G
            for each s ∈ successors(p)
                for each s’ ∈ successors(p) – {s}
                    insertCalls(s, p, Defs(reachable(s’, ipd(p), G – {p})))
    end

    a) 1. for (y1 = 0; y1 [...]

            [...] maxRank) {
                thePayingEmployee = (Employee)vList.elementAt(v);
                maxRank = thePayingEmployee.getRank();
            }
        }
        thePayingAccountNumber = thePayingEmployee.getAccountNumber();
        String faxText = "Order size: " + vList.size() + "\n" +
                         "Bill to account#: " + thePayingAccountNumber;
        // outlet to the outside world
        FaxComponent.faxFoodOrder(faxText);
    }

    public static void main(String[] args) {
        Vector vAttendeeList = new Vector();
        // Populate list with objects of type Employee
        UIComponent.askForEmployeeList(vAttendeeList);
        // Get meeting time information
        Time startTime = UIComponent.askForTime("start time?");
        Time endTime = UIComponent.askForTime("end time?");
        // Reserve a room large enough for all attendees
        reserveRoom(vAttendeeList.size());
        if (endTime.exceeds("20:00")) // meeting to last past 8pm?
            orderFood(vAttendeeList);
    }
}

Figure 7 – Insecure Java code implementing the meeting scheduler

are object or array references. There are tools that can perform points-to analysis on Java programs, such as Soot [26], but these employ their own intermediate representation rather than Java byte code, making their integration with our tool infeasible.

The current transformation algorithm does not handle inter-procedural implicit flows, i.e., flows whose targets are defined in called functions. Providing such support is straightforward for static functions but not for non-static ones. Given a statement calling a non-static function, the algorithm must determine with reasonable accuracy the functions that could actually be called. Type inference [1], which is related to points-to analysis, could be used to increase the accuracy of this process. Finally, array region analysis [31], which summarizes the subregion of an array that is affected by statements or loops, could enable our algorithm to identify more accurately implicit flow targets that are array references.
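To make the transformation of this section concrete, the following is a minimal Java sketch of the triple loop in transformImplicitToExplicit, applied to a two-branch if/else. It is an illustration under stated assumptions, not the tool's implementation: the class and method names, the integer vertex numbering, the precomputed ipd map, and the choice to return a list of planned insertions (rather than rewrite byte code) are all hypothetical. It also assumes that reachable excludes ipd(p) itself, and it merges the targets contributed by all other successors s′ of p into one set per successor s.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class ImplicitFlowTransform {
    // Vertices reachable from 'start' without entering 'stop' (ipd(p)) or 'removed' (p itself).
    public static Set<Integer> reachable(int start, int stop, int removed,
                                         Map<Integer, List<Integer>> succ) {
        Set<Integer> seen = new TreeSet<>();
        Deque<Integer> work = new ArrayDeque<>(List.of(start));
        while (!work.isEmpty()) {
            int x = work.pop();
            if (x == stop || x == removed || !seen.add(x)) continue;
            work.addAll(succ.getOrDefault(x, List.of()));
        }
        return seen;
    }

    // Returns planned insertions as "s: addFlow(p, x_id)" strings (hypothetical output format).
    public static List<String> transform(Map<Integer, List<Integer>> succ,
                                         Map<Integer, Integer> ipd,
                                         Map<Integer, Set<String>> defs) {
        List<String> plan = new ArrayList<>();
        for (Map.Entry<Integer, List<Integer>> e : succ.entrySet()) {
            int p = e.getKey();
            List<Integer> successors = e.getValue();
            if (successors.size() < 2) continue; // only decision vertices matter
            for (int s : successors) {
                // Variables that could have been defined had another branch been taken.
                Set<String> targets = new TreeSet<>();
                for (int other : successors) {
                    if (other == s) continue;
                    for (int r : reachable(other, ipd.get(p), p, succ)) {
                        targets.addAll(defs.getOrDefault(r, Set.of()));
                    }
                }
                for (String x : targets) {
                    plan.add(s + ": addFlow(" + p + ", " + x + "_id)");
                }
            }
        }
        return plan;
    }

    // Hypothetical CFG for: 1. if (y1 == 1) { 2. x1 = 1; } else { 3. x2 = 1; }  4. join
    public static List<String> demoPlan() {
        Map<Integer, List<Integer>> succ = new TreeMap<>();
        succ.put(1, List.of(2, 3));
        succ.put(2, List.of(4));
        succ.put(3, List.of(4));
        succ.put(4, List.of());
        Map<Integer, Integer> ipd = Map.of(1, 4);
        Map<Integer, Set<String>> defs = Map.of(2, Set.of("x1"), 3, Set.of("x2"));
        return transform(succ, ipd, defs);
    }

    public static void main(String[] args) {
        demoPlan().forEach(System.out::println);
    }
}
```

For this small graph the plan places addFlow(1, x2_id) on the branch that assigns x1 and addFlow(1, x1_id) on the branch that assigns x2, so whichever branch runs, the variable left untouched by the other branch is still recorded as influenced by the predicate.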

6. Implementation

We have implemented our information flow analysis and dynamic slicing algorithms in a prototype tool comprising 12,100 lines of Java code. In the current version of the tool, the user can define desired information flow policies by programmatically describing the following three entities:

1) The source objects to monitor, referred to as Sources.
2) The sensitive objects among Sources that might be involved in an illegal flow, referred to as SensitiveSources.
3) The sink methods, at which illegal flows are checked, referred to as Sinks.

In illegal flow detection, Sources and SensitiveSources are typically the same. In debugging, SensitiveSources are typically a subset of Sources. The Sinks are typically methods that expose internal application data to the outside world, such as print and send.

The tool comprises two main components: the Instrumenter and the Profiler. The preliminary step in applying our tool is to instrument the target byte code classes and/or jar files using the Instrumenter, which was implemented using the Byte Code Engineering Library, BCEL [27]. The Instrumenter inserts a number of method calls to the Profiler at given points of interest. At runtime, the instrumented application invokes the Profiler, passing it the information that enables it to monitor information flows and build program slices. Note that when applying our tool to a program, one can instrument any subset of program classes (no source code required) and Java libraries. Generally, the more classes that are instrumented, the more accurate but costly the analysis is.
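The three policy entities can be pictured with a small Java sketch. The paper does not show the tool's policy API, so everything here is hypothetical: the FlowPolicy class, its constructor, the Secret stand-in class, and the use of fully qualified method names as sink identifiers are assumptions made only to illustrate how Sources, SensitiveSources, and Sinks might combine into a check at a sink.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

public class FlowPolicy {
    // Hypothetical stand-in for an application object carrying confidential data.
    public static final class Secret {
        final String value;
        public Secret(String value) { this.value = value; }
    }

    private final Predicate<Object> sources;          // objects to monitor
    private final Predicate<Object> sensitiveSources; // subset that must not reach a sink
    private final Set<String> sinks;                  // sink method names (assumed format)

    public FlowPolicy(Predicate<Object> sources,
                      Predicate<Object> sensitiveSources,
                      Set<String> sinks) {
        this.sources = sources;
        this.sensitiveSources = sensitiveSources;
        this.sinks = sinks;
    }

    // True when the call targets a sink and its InfoFlow set holds a sensitive source.
    public boolean isIllegal(String sinkMethod, Collection<?> infoFlow) {
        if (!sinks.contains(sinkMethod)) return false;
        for (Object o : infoFlow) {
            if (sources.test(o) && sensitiveSources.test(o)) return true;
        }
        return false;
    }

    // Detection-style policy: Sources and SensitiveSources coincide.
    public static FlowPolicy detectionPolicy() {
        Predicate<Object> isSecret = o -> o instanceof Secret;
        return new FlowPolicy(isSecret, isSecret,
                Set.of("java.io.PrintWriter.print", "java.io.PrintWriter.write"));
    }

    public static void main(String[] args) {
        FlowPolicy policy = detectionPolicy();
        List<Object> infoFlow = List.of("greeting", new Secret("s3cr3t"));
        System.out.println(policy.isIllegal("java.io.PrintWriter.print", infoFlow));
        System.out.println(policy.isIllegal("java.lang.String.length", infoFlow));
    }
}
```

A debugging-style policy would instead pass a broader predicate for Sources (for example, one accepting every object) while keeping SensitiveSources narrow, so that the logged InfoFlow sets carry more context.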

7. Experimental results

To verify the utility of our dynamic information flow analysis and dynamic slicing tool, we applied it in several case studies presented in [19], two of which are presented here. At the end of this section we briefly discuss the performance impact of our tool.

7.1. Case study 1: DefaultServlet

We applied our tool to the DefaultServlet component of the open source Java application server Apache Tomcat, which exhibits the file disclosure vulnerability described by McClure et al [21]. This vulnerability has been found in versions of several other widely used Java application servers, including BEA Weblogic and IBM WebSphere. McClure et al show how this vulnerability in Weblogic’s FileServlet component, which is a servlet designed to serve publicly available text files to clients, can be exploited to reveal Java Server Pages (JSP) source code of another application, which may contain sensitive data. McClure et al describe how what is learned from the disclosed JSP source code can be used to execute malicious remote commands. Tomcat’s DefaultServlet, which is also intended to serve publicly available text files, can be exploited similarly to reveal the contents of any file within the servlet root directory, simply by passing it the relative path of the file.

Initially, in this case study the capability of our tool to detect illegal flows is not utilized; instead we log information flows for later analysis by a security analyst. Note that a servlet may send response data to a client through a PrintWriter object or through a ServletOutputStream object. Therefore, the write and print methods of these two objects are information sinks that are natural to monitor for illegal information flows. Accordingly, we instrumented DefaultServlet (with the implicit flow detection option turned off) and defined an information flow policy such that Sources = SensitiveSources = {all objects} and Sinks = {write and print methods of PrintWriter and ServletOutputStream}. We issued requests to DefaultServlet to serve various files, including ones located in directories not intended for public access, e.g., directories containing JSP source code. In these tests, the only information flows into a sink object involved a single write method call (with three parameters) associated with a ServletOutputStream. The log recorded the information flows and slices for the ServletOutputStream and each of the write parameters, computed at the time when they were last defined. For each request, there were flows from 17 different source objects into the sink, and there were 45 byte code instructions that affected it. The logged state of 4 of the 17 source objects included the full pathname of the requested file, which should be sufficient for a security analyst to determine whether an illegal request took place. Note that if we had opted to instrument all of Tomcat’s libraries or even the Java API library, much more information would have been logged.

Once it is learned that illegal requests are occurring, it makes sense to redefine the information flow policy as follows: Sources = SensitiveSources = {objects of type java.io.File or java.io.FileInputStream⁴ associated with files not belonging to public directories} and Sinks = {write and print methods of PrintWriter and ServletOutputStream}. This policy would prevent any leaks such as the ones described above.

⁴ The FileInputStream class in the Java API does not provide access to the name of the file it is associated with. Since some information flow policies require such information, our tool loads a modified version of FileInputStream that does not have this limitation.
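The refined policy hinges on deciding whether a file belongs to a public directory. A minimal Java sketch of such a test is given below; the class name, the method name, and the example directory layout are hypothetical and are not taken from Tomcat or from the tool. The sketch normalizes the path first, so that a relative path containing ".." segments cannot pass as public.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class PublicDirectoryCheck {
    // A file is treated as sensitive unless its normalized absolute path
    // lies under one of the directories declared public.
    public static boolean isSensitive(Path file, List<Path> publicDirs) {
        Path normalized = file.toAbsolutePath().normalize();
        for (Path dir : publicDirs) {
            if (normalized.startsWith(dir.toAbsolutePath().normalize())) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical layout: only /srv/webapp/public is meant to be served.
        List<Path> publicDirs = List.of(Paths.get("/srv/webapp/public"));
        System.out.println(isSensitive(Paths.get("/srv/webapp/public/index.html"), publicDirs));
        System.out.println(isSensitive(Paths.get("/srv/webapp/public/../WEB-INF/login.jsp"), publicDirs));
    }
}
```

A predicate of this kind could serve as the SensitiveSources test for objects of type java.io.File, assuming the file name is available to the monitor.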

7.2. Case study 2: MeetingScheduler

We applied our tool to a Java program implementing a meeting scheduler for Pentagon employees (an application we built based on fictitious requirements). Suppose that at the Pentagon there is a class of employees whose activities must be concealed from the outside world and that throughout the Pentagon’s information systems this type of employee is represented by class HSEmployee (high security), whereas LSEmployee (low security) represents other employees. Both HSEmployee and LSEmployee derive from the abstract class Employee, which contains information such as employeeId, name, rank, and expense accountNumber. Suppose that one of the Pentagon’s systems is a meeting scheduler. In order to set up a meeting, an office administrator enters the start time, the expected end time, and the identification numbers of the attendees. When the meeting is expected to last past 8 PM, the system automatically orders food from an outside caterer via a fax server. The fax includes the number of attendees and the expense account number of the employee paying the bill, who is always the highest-ranking attendee. If the recipient of the fax is aware of a pattern exhibited by the expense account numbers of employees of type HSEmployee, in some cases he or she will be able to infer that a high-ranking, high security employee is attending a meeting late that night and that perhaps some special operations are being planned.

Figure 7 shows the Java code for an insecure implementation of the described application (only relevant code is shown), which exhibits a potential information leak from an HSEmployee to an object representing a fax machine. Such an insecure implementation might result if: a) the developer was not aware of the security requirement, b) the security requirement did not exist at the time when the application was developed, or c) the developer failed to check the type of thePayingEmployee and the bug remained despite subsequent visual code inspection.

In order to test this implementation using our tool, we instrumented it (with the implicit flow detection option turned off) and defined an information flow policy such that Sources = {all objects}, SensitiveSources = {attributes identifying a high security employee, i.e., name, employeeId or accountNumber of an HSEmployee} and Sinks = {methods of the FaxComponent API}. When executing test cases in which the highest-ranking employee was represented by an HSEmployee object and the meeting ended after 8 PM, the tool detected insecure information flows from an accountNumber attribute to the faxFoodOrder() method. The logged InfoFlow set comprised the following objects: thePayingEmployee, thePayingAccountNumber, maxRank, v, vAttendeeList, endTime, one instance of accountNumber belonging to an HSEmployee object, several instances of rank belonging to LSEmployee and HSEmployee objects, and two local variables declared in the Employee constructor, which is not shown in Figure 7. The accountNumber instance was determined to be sensitive since it was an attribute identifying an HSEmployee object. That object corresponded to the highest-ranking attendee, whose expense account number would have been printed on the fax order. Note that if we had defined Sources to be the same as SensitiveSources above, InfoFlow would have comprised a single object, i.e., the sensitive accountNumber. Although not necessary for the detection of the illegal leaks, the additional objects in InfoFlow can in some cases be valuable to the analysis.

In [19] we applied our tool to a number of Java programs in order to assess its time and storage impact. As expected, in most cases there was a considerable slowdown and storage impact. The results made it clear that for processing-intensive applications (those having relatively large execution traces per request) such as JarScan, Jtidy, Diff, javac and javap, with an average slowdown of 3000%, it is not feasible to apply our tool online. On the other hand, for interactive and/or non-processing-intensive applications such as JConsole, MeetingScheduler, DefaultServlet and JDictionary, with an average slowdown of 97%, it seems feasible to apply our tool either offline or online. Since web and interactive business applications are typically not processing intensive, we believe that they are good candidates for deployment with our tool.

8. Conclusions

New algorithms for dynamic information flow analysis and dynamic slicing were presented. The DIFA algorithm is the first precise forward-computing algorithm to be proposed that applies to both structured and unstructured programs. The slicing algorithm incorporates an innovative and efficient way of computing dynamic control dependences. We have described an implementation of our algorithms that works with Java byte code programs. To our knowledge, it is the first information flow analysis tool to support configurable information flow policies and implicit flows in the dynamic setting.

Currently our algorithms do not address intra-procedural and inter-procedural control dependences resulting from halt/exit instructions [25] or from exceptions. We plan to augment our algorithms and tool to provide this capability, e.g., by modifying the DDynCD algorithm to employ an extended form of control flow graph similar to that used for static inter-procedural control dependence analysis in [25]. We also plan to address the limitations of our implicit flow detection algorithm as described in Section 5.

9. Appendix A: Background definitions

As is customary in control flow, data flow, and dependence analysis of computer programs, we model programs and subprograms by control flow graphs, which are a type of directed graph (digraph). The vertices of a control flow graph G represent simple program statements or basic blocks, while the edges represent possible transfers of control between them. The initial vertex vI and final vertex vF represent the entry point and exit point, respectively, of G. A vertex of G with two successors is called a decision vertex; it represents a conditional branch predicate. A walk in G is a sequence of vertices v1, v2, ..., vn such that n ≥ 0 and (vi, vi+1) is an edge in G for i = 1, 2, ..., n – 1. (We call a walk beginning with vI an initial walk.)

The concept of postdominance plays an important role in characterizing control dependence between program statements. A vertex u in G postdominates a vertex v in G iff (if and only if) every v-vF walk in G contains u. The first postdominator of v to occur on every v-vF walk is called the immediate postdominator of v and is denoted ipd(v). (See Table 2.)

The classical graph-theoretic definition of static control dependence is due to Ferrante et al [10], who used it in code optimization research. This form of control dependence, which we call “direct control dependence”, is applicable to both structured and unstructured programs.

Definition. Let u, v be vertices of a control flow graph G. Then u is directly control dependent on v, denoted u DCD v, iff v has successors v′ and v′′ such that u postdominates v′ but not v′′. The set of vertices that u is directly control dependent on is denoted DCD(u). (See Table 2.)

Table 2 - ipd(v), CD(v) and DCD(v) relationships of the control flow graph in Figure 1

    v     ipd(v)   CD(v)                 DCD(v)
    vI    v1       ∅                     ∅
    v1    v4       v4, v5, v6, v7        v4
    v2    v4       v1, v4, v5, v6, v7    v1, v7
    v3    v4       v1, v4, v5, v6, v7    v1
    v4    v5       v4, v5, v6, v7        v7, v4
    v5    v8       v5, v6, v7            v7
    v6    v8       v5, v6, v7            v5
    v7    v8       v5, v6, v7            v5, v6
    v8    vF       ∅                     ∅
    vF    ∅        ∅                     ∅

In her seminal work on certifying secure information flow in programs, Denning presented a graph-theoretic characterization of the conditions for implicit information flow to occur [7]. The relation Denning defined is actually the transitive closure of the DCD relation [22]. We call the former relation “control dependence”, without the qualifier “direct”. Definition. Let u, v be vertices of a control flow graph G. Then u is control dependent on v, denoted u CD v, if there is a v-u walk in G that does not contain ipd(v); such a walk is said to demonstrate that u CD v. The set of vertices that u is control dependent upon is denoted CD(u). (See Table 2.) It is proven in [23] that u DCD v iff there is a walk vWu that demonstrates u CD v and for which there is no vertex x in W such that u DCD x; vWu is said to demonstrate that u DCD v. It is easily seen that in such a walk u strictly postdominates each vertex in W but not v.
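The definitions of this appendix can be exercised on a small graph. The Java sketch below computes postdominator sets by fixed-point iteration, derives ipd(v), and then evaluates DCD(u) and CD(u) directly from the definitions above. It is only an illustration: the class and method names are hypothetical, the graph is a four-statement diamond chosen for brevity (not the graph of Figure 1), every vertex is assumed to reach vF, and a walk is assumed to contain at least one edge.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ControlDependence {
    // Hypothetical diamond: vI -> p; p -> a | b; a -> j; b -> j; j -> vF
    public static Map<String, List<String>> cfg() {
        Map<String, List<String>> g = new LinkedHashMap<>();
        g.put("vI", List.of("p"));
        g.put("p", List.of("a", "b"));
        g.put("a", List.of("j"));
        g.put("b", List.of("j"));
        g.put("j", List.of("vF"));
        g.put("vF", List.of());
        return g;
    }

    // pdom(v) = {v} united with the intersection of pdom(s) over successors s of v.
    public static Map<String, Set<String>> postdominators(Map<String, List<String>> g, String exit) {
        Map<String, Set<String>> pdom = new HashMap<>();
        for (String v : g.keySet()) {
            pdom.put(v, v.equals(exit) ? new HashSet<>(Set.of(exit)) : new HashSet<>(g.keySet()));
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String v : g.keySet()) {
                if (v.equals(exit)) continue;
                Set<String> next = new HashSet<>(g.keySet());
                for (String s : g.get(v)) next.retainAll(pdom.get(s));
                next.add(v);
                if (!next.equals(pdom.get(v))) { pdom.put(v, next); changed = true; }
            }
        }
        return pdom;
    }

    // The strict postdominator of v whose own postdominator set is largest, i.e. closest to v.
    public static String ipd(String v, Map<String, Set<String>> pdom) {
        for (String d : pdom.get(v)) {
            if (!d.equals(v) && pdom.get(d).size() == pdom.get(v).size() - 1) return d;
        }
        return null; // the exit vertex has no immediate postdominator
    }

    // u DCD v iff u postdominates some successor of v but not every successor.
    public static Set<String> dcd(String u, Map<String, List<String>> g, Map<String, Set<String>> pdom) {
        Set<String> result = new TreeSet<>();
        for (String v : g.keySet()) {
            boolean some = false, all = true;
            for (String s : g.get(v)) {
                if (pdom.get(s).contains(u)) some = true; else all = false;
            }
            if (some && !all) result.add(v);
        }
        return result;
    }

    // u CD v iff some v-u walk avoids ipd(v): search forward from v, never entering ipd(v).
    public static Set<String> cd(String u, Map<String, List<String>> g, Map<String, Set<String>> pdom) {
        Set<String> result = new TreeSet<>();
        for (String v : g.keySet()) {
            String stop = ipd(v, pdom);
            Deque<String> work = new ArrayDeque<>(g.get(v));
            Set<String> seen = new HashSet<>();
            while (!work.isEmpty()) {
                String x = work.pop();
                if (x.equals(stop) || !seen.add(x)) continue;
                work.addAll(g.get(x));
            }
            if (seen.contains(u)) result.add(v);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = cfg();
        Map<String, Set<String>> pdom = postdominators(g, "vF");
        for (String u : g.keySet()) {
            System.out.println(u + ": ipd=" + ipd(u, pdom) + " DCD=" + dcd(u, g, pdom) + " CD=" + cd(u, g, pdom));
        }
    }
}
```

In the diamond, the branch vertices a and b are both directly control dependent and control dependent on p, while the join vertex j is control dependent on nothing, because it postdominates both successors of p and is itself ipd(p).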

10. References

[1] Agesen O. Constraint-Based Type Inference and Parametric Polymorphism. First International Static Analysis Symposium, SAS’94.
[2] Agrawal H. and Horgan J. Dynamic Program Slicing. SIGPLAN Notices, 25(6), pp. 246-256, June 1990.
[3] Agrawal H., Horgan J., Krauser E., London S. Incremental Regression Testing. Proceedings of the IEEE Conference on Software Maintenance (Montreal, Canada, 1993).
[4] Beszedes, A., Gergely, T., Szabó, Z.M., Csirik, J., and Gyimothy, T. Dynamic Slicing Method for Maintenance of Large C Programs. 5th European Conference on Software Maintenance and Re-engineering (Lisbon, Portugal, March 2001).
[5] Denning D. E. A Lattice Model of Secure Information Flow. Comm. ACM, Vol. 19, No. 5 (May 1976), 236-242.
[6] Denning D.E. and Denning P.J. Certification of programs for secure information flow. Communications of the ACM 20, 7 (1977), 504-513.
[7] Denning, D.E. Secure information flow in computer systems. Ph.D. Thesis, Computer Science Dept., Purdue U., W. Lafayette, Ind., May 1975.
[8] Faragó C. and Gergely T. Handling Pointers and Unstructured Statements in the Forward Computed Dynamic Slice Algorithm. Acta Cybernetica 15 (2002), 489-508.
[9] Fenton, J.S. Memoryless Subsystems. The Computer Journal 17, 2 (1974), 143-147.
[10] Ferrante J., Ottenstein K.J., and Warren J.D. The Program Dependence Graph and its Use in Optimization. ACM Transactions on Programming Languages and Systems 9, 3 (October 1987), 319-349.
[11] Gyimothy, T., Beszedes, A., Farago C. An Efficient Relevant Slicing Method for Debugging. In Proceedings of the 7th European Software Engineering Conference, pages 303-321 (Toulouse, France, September 1999).
[12] Hind M. and Pioli A. Which Pointer Analysis Should I Use? International Symposium on Software Testing and Analysis 2000.
[13] Jones, A. and Lipton, R. The Enforcement of Security Policies for Computation. Journal of Computer and Systems Sciences (1978).
[14] Korel B. Computation of Dynamic Program Slices for Unstructured Programs. IEEE TSE, 23 (1), (1997), 17-34.
[15] Korel B. and Laski J. Algorithmic Software Fault Localization. Proceedings of the 24th Annual Hawaii International Conference on System Sciences, Volume II, pages 246-251 (1991).
[16] Korel B. and Laski J. Dynamic Program Slicing. Information Processing Letters 29 (October 1988), 155-163.
[17] Korel B. and Yalamanchili S. Forward Computation of Dynamic Program Slices. ISSTA (1994), 66-79.
[18] Lampson, B.W. A Note on the Confinement Problem. Communications of the ACM 16, 10 (Oct. 1973), 613-615.
[19] Masri, W., Podgurski, A., and Leon, D. Detecting and Debugging Insecure Information Flows (extended version). Tech. Rep., http://softlab4.cwru.edu/if/index.html.
[20] Masri, W. Dynamic information flow analysis, slicing, and profiling. http://softlab4.cwru.edu/if/index.html.
[21] McClure S., Shah S. and Shah S. Web Hacking: Attacks and Defense. Addison-Wesley Pub. Co. (2003).
[22] Podgurski A. and Clarke L. A Formal Model of Program Dependencies and its Implications for Software Testing, Debugging, and Maintenance. IEEE TSE, 16(9):965-979, September 1990.
[23] Podgurski A. Significance of program dependences for software testing, debugging and maintenance. Ph.D. Thesis, Computer Sc. Dept., U. of Mass. (Sept. 1989).
[24] Sabelfeld A. and Myers A.C. Language-Based Information-Flow Security. IEEE Journal on Selected Areas in Communications, 21(1), Jan. 2003.
[25] Sinha S., Harrold M. J. and Rothermel G. Interprocedural Control Dependence. ACM Transactions on Software Engineering and Methodology (2000).
[26] Soot: a Java Optimization Framework. http://www.sable.mcgill.ca/soot.
[27] The Byte Code Engineering Library (BCEL), The Apache Jakarta Project, http://jakarta.apache.org/bcel. Apache Software Foundation 2003.
[28] Tip F. A survey of program slicing techniques. Journal of Programming Languages 3(3), (1995), 121-189.
[29] Wang T. and Roychoudhury A. Using Compressed Bytecode Traces for Slicing Java Programs. Intl. Conf. on Software Engineering (Edinburgh, UK, May 2004).
[30] Weiser M. Program Slicing. IEEE Transactions On Software Engineering 10, 4 (1984), 352-357.
[31] Wolfe M. High Performance Compilers for Parallel Computing. Addison-Wesley Pub. Co. (1996).
[32] Zhang X., Gupta R. and Zhang Y. Efficient Forward Computation of Dynamic Slices Using Reduced Ordered Binary Decision Diagrams. Intl. Conf. on Software Engineering (Edinburgh, UK, May 2004).
