A General Dynamic Information Flow Tracking Framework for Building Security Applications Lap Chung Lam
Tzi-cker Chiueh
[email protected]
[email protected]
Rether Networks, Inc.
Rether Networks, Inc.
99 Mark Tree RD Suite 301
99 Mark Tree RD Suite 301
Centereach, NY 11720, USA
Centereach, NY 11720, USA
http://www.rether.com
http://www.rether.com
Abstract Many security applications require tracking of control/data dependencies among information objects that flow through programs. While many static analysis tools have been developed to derive these dependencies at compile time, they are not very effective in practice, especially for programs written in C. This paper presents a general dynamic information-flow tracking framework (GDIF) for C programs that allows an application developer to associate application-specific tags to input data, instruments the application to propagate these tags to all the other data that are control/data-dependent on the input data, and invokes application-specific processing on output data with particular tag values. To use GDIF, an application developer only needs to implement input and output proxy functions to tag input data and to apply tag-dependent processing on output data, respectively. GDIF features a novel tagging technique that incurs minimal performance overhead while maintaining compatibility with legacy library functions. To demonstrate the usefulness of GDIF, we implement a complete GDIF application that enables selective sandboxing of application programs based on whether their inputs are ”tainted” or not, and show that its performance overhead ranges from 20% to 170% for the programs we tested. Keywords: information flow control, sandboxing, malicious software Word count: 9,090
1
Introduction
Conventional computer system security relies on access control mechanism to protect data. Once a program is granted the privilege to access a piece of data, the program can use the data in any way it wants. Therefore, access control mechanism cannot prevent a program from leaking confidential information. In contrast, 1
information flow control aims to exert fine-grained control over how data is used by authorized programs. Information flow control requires objects (e.g. users, files, network connections, etc.) to be annotated with a security class [5], and ensures that no information can flow from objects with a higher security class to objects with a lower security class. An information flow control policy can be dynamically enforced at run time, such as Data Mark Machine [6], or statically enforced at compile time, such as the languages Jif [11] and Flow Caml [18]. Dynamic information flow control allows dynamic binding of objects to security classes, and can support a wider range of information flow control policies. For example, it is possible to permit an unclassified officer in a military organization to access confidential information temporally. However, dynamic information flow control cannot handle implicit flows [5] efficiently. Consider the code {H=H%2; L= 0; if(H) L = 1;}. The value of H is indirectly assigned to L no matter whether H is 1 or 0. Assuming H has a high security class and L has a low security class, it is impossible to detect the illegal flow from H to L when H is 0, since the path ”L=1” is not executed. Although there is a large body of research work on static information flow control [17], none of these have been used in practice, because of the management complexity of flow control policies. A developer must not only understand the algorithm she is implementing, but also the desired security policy and how to formalize it using annotation [22]. Furthermore, static information flow control does not allow dynamic binding of objects to security classes, and thus are less flexible than what practical applications need. In this paper we present a General Dynamic Information-flow Tracking Framework (GDIF) for applications written in the C programming language, which provides an API for developers to implement their own information flow control policy and enforcement. This framework can be applied to protect both the confidentiality and integrity of a computer system, and can be used for other purposes, which require tracking of control/data dependencies among information objects that flow through programs. To demonstrate the usefulness of GDIF, we build a file source marking system on top of GDIF called Automatic Sandbox Support Module (Aussum), which tracks the provenance of files so as to selectively enable sandboxing on some applications when they operate on files with suspicious source. Behavior blocking or sandboxing technology is generally considered as one of the best solutions to zero-day attacks or exploits. However, because behavior-blocking systems execute programs in a restricted execution environment, they may limit the functionality of legitimate programs when they need full privilege to run. Therefore, to be least intrusive, a sandboxing system should sandbox only application instances that are truly suspicious. Aussum leverages GDIF to instrument applications programs to mark contents downloaded from the network, and instructs a sandboxing system to sandbox only (1) system calls that use downloaded data as arguments, (2) execution of downloaded applications, and (3) execution of applications
2
that operate on downloaded files.
2
General Dynamic Information-Flow Tracking Framework (GDIF)
GDIF allows an application developer to associate application-specific tags to input data, instruments the application to propagate these tags to all the other data that are control/data-dependent on the input data, and invokes application-specific processing on output data with particular tag values. In particular, GDIF associates each data variable in an application with a 4-byte tag, which the application developer uses to store some attribute of the data, such as the packet id, owner, file name, network IP, and security class. If more space is required, a tag can contain a pointer to a more complex data structure. that provides a mechanism for users to update and look up tag values in the following ways: GDIF is implemented as a compiler extension that provides the following services: • Intercepting input channels to initialize the tags associated with input variables: The developer can specify a set of input functions in an application that the GDIF compiler should intercept and redirect to the corresponding developer-provided proxy functions. For example, the developer can ask GDIF to redirect all calls to read to the function read proxy(int fd, void *buf, int size), which can tag the variable pointed by buf, and then call the original read function. • Intercepting assignment statements to propagate tags according to developer-propagating methods:
After each assignment statement, GDIF inserts a call to a developer-provided callback function gdif do set tag(vo *ltag, int number, void *rtags), that takes as arguments the assignment statement’s right-hand side and their tags, and propagates tags from the right-hand side of an assignment statement to its left-hand side. • Intercept output channels to examine output data and their associated tags. The developer can specify a set of output functions in an application that the GDIF compiler should intercept and redirect to the corresponding developer-provided proxy functions. For example, a user can intercept write system calls using a proxy function write proxy(int fd, void *buf, int side), which can examine the tag associated with the variable pointed by buf and decide to reject a write call that tries to write a password file to a socket descriptor. In addition, GDIF provides the following APIs for the developer-provided functions to access the tags of the arguments: • void * gdif lookup tag(void *address). If the input parameter is a pointer to a memory block, this function looks up the address of the tag of the memory block pointed by address. 3
• void * gdif lookup parameter tag(int index). If the input parameter is an integer, this function returns the address of the tag of the argument pointed by index. • void gdif save return tag(void *return address, void *tag). If a proxy function needs to return a data value, this function is called to save a copy of its tag into the shadow stack. GDIF is derived from GCC 3.3.3. The user does not need to modify their existing applications to use GDIF. To compile a program using GDIF, the user needs to implement all proxy functions and gdif set tag in a object file and link the original program with the object file. For example, if a developer wants to compile a file called myprogram.c, she should invoke GCC using the format “gcc -fgdif -fgproxy=myproxy.pro myprogram.c myproxy.o”. The -fgdif option enables the GDIF framework. The names of the intercepted functions and their corresponding proxy functions are specified in the file called myproxy.pro. The file myproxy.o contains the developer-provided proxy functions and the tag propagation function.
3 3.1
Design and Implementation of The GDIF Framework Tag management
GDIF associates a tag with each data memory block, which could be a local variable, a global variable, or a memory area returned by a malloc() call. Because a data memory block can be manipulated through pointers, it is essential to associate the tag of a data memory block and each pointer to it. There are three ways to associate a pointer with some metadata about what it points to. The fat pointer approach [2] represents a pointer with the address of a data item and its metadata. This approach cannot interoperate with legacy code and it could break applications because of change in pointer size. The shadow variable approach [21] allocates a separate shadow variable for each pointer to store the metadata. This approach cannot deal with pointers within a structure satisfactorily. Each time malloc() allocates a structure, a shadow variable needs to be created for each pointer in the structure. However, the type information of a structure is not always available when it is allocated. For example, the memory block returned by the statement void *ptr = malloc(40) may be a data structure that contains pointers. Jones [9] uses a separate search data structure to associate pointers with metadata, and completely does away with modification to pointer representation. For each data memory block, Jones’s implementation creates a node in a splay tree, which stores the base address and the size of the memory block, and the memory block’s metadata. Given a pointer, this scheme first looks it up in the splay tree to identify the corresponding metadata. A pointer matches a splay tree node if it falls within the node’s extent as defined by the base and size.
4
Originally we implemented GDIF using the splay tree scheme, where each tree node contains the base, extent, and tag of a data memory block. Our initial experiment shows that the splay tree scheme incurs a substantial performance penalty for computation intensive applications due to tree lookups. To avoid this performance penalty, we use a shadow variable to store the tag of every global/local memory block instead of putting the tags in the tree node. The key idea is that if a memory block is accessed through its name, the GDIF compiler directly looks up its tag in its shadow variable without looking up the splay tree. However, if a data variable is accessed through a pointer, GDIF creates a splay tree node for it, which contains a pointer called int *tag ptr that points to the shadow variable holding the data variable’s tag. When a data variable is accessed through a pointer, GDIF looks up the splay tree to locate its shadow variable. If a memory block is returned by malloc, GDIF stores its tag in its corresponding splay tree node, and the tag ptr pointer of its splay tree node points to the node’s int tag field. All variables whose name ends with the taint info suffix in Figure 1(B) are shadow variables inserted by the GDIF compiler. Line 12 and 14 in Figure 1(B) show that the tags of b and c, i.e. b tag info and c tag info, are directly accessed because these two variables are accessed through names. The GDIF compiler inserts the following functions into a program to manage tags: • int gdif add locals to tree(int number, [void *address, int size, void *tag addr]). This function is inserted in each function’s prologue to create splay tree nodes for all local data variables and input parameters,as illustrated by the call at Line 8 of Figure 1(B). This function takes a variable number of arguments. The first argument number indicates how many pairs of [void *address, int size, void *tag addr] are passed to the function, where address is the base address of a data variable, size is the size of the data variable, and tag addr is the address of the associated shadow variable. The return value of this function is a transaction ID, which is used to remove the tree nodes when the function returns. • void gdif remove locals from tree(int fun index). This function is inserted in each function’s epilogue to remove all tree nodes allocated for the current function. The parameter fun index is the transaction ID returned by gdif add locals to tree. • void gdif add globals to tree (int number, [void *address, int size, void *tag addr]). This function is used to allocate tree nodes for all global and static data variables in a source file. GDIF creates a global constructor function for each source file as indicated by the function at Line 37 of Figure 1(B). • void gidf add heap to tree(void address, int size) is called in the malloc proxy functions to create tree nodes for heap memory block.
5
• void gidf remove heap from tree(void *address) is called in the free proxy function to remove the tree node for the heap memory block. 1 1 int buffer[10]; 2 2 3 3 void work(void) 4 { 4 5 5 int a, b, c, d; 6 6 b = 10; 7 c = 20; 7 8 read(socket_fd, &a, 4); 8 9 d = decode(&a, b, c); 9 10 10 write(output_fd, &d, 4) 11 11 } 12 12 13 int decode(int *a, int b, int c) 13 14 14 { 15 15 int r; 16 16 r = a+*b+30; 17 return r; 17 18 } 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
int buffer[10]; void work() { int a, b, c, d; int a_tag_info=0, b_tag_info=0, c_tag_info=0, d_tag_info=0; int fun_index; fun_index = gdif_add_locals_to_tree(2, &a, 4, &a_tag_info, &d, 4, &d_tag_info); b = 10; b_tag_info = 0; c = 20; c_tag_info = 0; gdif_save_argument_tag(aussum_read, 1, &socket_fd_tag_info) aussum_read(socket_fd, &a, 4); gdif_save_argument_tag(decode, 2, &b_tag_info, 1, &c_tag_info, 2); d = decode(&a, b, c); gdif_copy_return_tag(decode_return_address, &d_tag_info, 0); gdif_save_argument_tag(aussum_write, 1, &output_fd_tag_info) aussum_write(output_fd, &d, 4) gdif_remove_locals_from_tree(fun_index); } int decode(int *a, int b, int c) { int r; int r_tag_info=0, b_tag_info, c_tag_info; gdif_init_parameter_tag(decode, 2, &b_tag_info, 1, &c_tag_info, 2); r = *a+b+30; gdif_set_tag(&r_tag_info, 0, 2,
a, 1, &b_tag_info, 0);
gdif_save_return_tag(return_address, &r_tag_info); return r; } __GLOBAL_FileName() { gdif_add_globals_to_tree(1, buffer, 40); }
(A) Original
(B) Transformed
Figure 1: This code segment illustrates how GDIF optimizes away splay tree look-ups by accessing the memory blocks’ tag, which are stored in shadow variables directly.
3.2
Dynamic tag tracking
For each non-pointer assignment statement, GDIF inserts a call to gdif set tag to set the tag of the left-hand-side memory block based on those of the right-hand-side blocks, as demonstrated by the call at Line 31 in Figure 1(B). The address of r’s shadow variable, the address of the memory block pointer by a, and the address of b’s shadow variable of the expression at Line 30 are passed to gdif set tag. In this example, gdif set tag only needs to lookup the shadow variable address in the splay tree for the memory block pointed by a. Because GDIF does not understand how a tag is used, after the addresses of all shadow variables are resolved, gdif set tag calls a user-supplied function gdif do set tag(void *ltag, int number, void *tags) to do the actual work of propagating the tags. The argument “void *tags” of this function are the addresses of the shadow variables of the right-hand-side memory blocks. When an input argument of a function call is not a pointer, GDIF needs to propagate its tag to the
6
callee. Instead of passing the tag through the standard stack, which could cause compatibility problems with legacy code, GDIF allocates a shadow stack, one per thread, to pass the tags of non-pointer function call arguments. Specifically, the caller calls gdif save argument tag, e.g., Line 17 of Figure 1(B), to place in the shadow stack the entry point of the callee and the tags of non-pointer arguments before the call and deletes them after the call. If a function call argument is an expression, GDIF creates a temporary variable to hold the expression’s result and its tag. The callee first compares its entry point with the first argument in the shadow stack, and calls gdif init parameter tag if they match, to initialize the tags of its input parameters with the tag values passed. If a function returns a non-pointer value back to the caller, it also needs to propagate the tag of the return value to the caller. It uses the same shadow stack mechanism as in the case of function call. But the arguments pushed to the shadow stack are the return address and the address of the return value’s tag, as shown by the call to gdif save return tag at Line 33 of Figure 1(B). The caller compares the return address on the shadow stack with the call site, and propagates the tag of the return value accordingly if they match, as shown by the function call to gdif copy return tag at Line 19 of Figure 1(B). When a legacy function calls a GDIF function, the comparison between the callee’s entry point and the first argument fails, and the GDIF function simply initializes the tag of all its input arguments to 0. When the GDIF function returns, the information that the GDIF function puts on the shadow stack is ignored by the legacy function. When a GDIF function calls a legacy function, the information that the GDIF function puts on the shadow stack is ignored. When the legacy function returns, the return address comparison fails, and the GDIF function simply marks the return value as non-tagged. In either case, the program continues working without any disruption. Consequently, the above tag propagation scheme solves the compatibility problem associated with the shadow variable approach. To avoid complete re-compilation of the LIBC library, we implemented a set of wrapper functions for memory and string copying library functions in LIBC, such as memcpy and strcpy. The GDIF compiler redirects all calls to memory/string copying functions to their corresponding wrapper functions, which set the tag of the destination memory block using the tag of the source memory block.
3.3
Implicit flows
Tracking implicit flows is not only essential for preventing information leaking, but also important to protect a system’s integrity. Consider the code of Figure 2(A), in which the value of euid is control-dependent on uid and gid. This code is not secure if implicit flows are not tracked. An attacker may purposely assign 0 to uid and gid to control the value of euid. Even if a policy says that no data read from network connections can be used as arguments to seteuid, we cannot detect such attack. 7
1 euid = 0; 2 if(uid > 0 || gid>0) 3 euid = uid; 4 seteuid(euid);
1 euid = 0; 2 { 3 int temp_tag_info; 4 gdif_set_tag(&temp_tag_info, 0, 2, &uid_tag_info, 0, &gid_tag_info, 0); 6 if(uid > 0 || gid >0) 7 { 8 euid = uid; 9 gdif_set_tag(&euid_tag_info, 0, 2, &uid_tag_info, 0, &gid_tag_info, 0); 10 } 11 gdif_set_tag(&euid_tag_info, 0, 1, &temp_tag_info, 0); 12 } 13 seteuid(euid);
(A) Original Code
(B) Transformed Code
Figure 2: The inserted code in bold font demonstrates how GDIF addresses the implicit flow problem. To trace the implicit flows, the GDIF compiler discovers all variables that are modified in a if statement. GDIF inserts code before the if statement to save the tags of the variables in the condition expression. Then it inserts code after the if statement to set the tags of the variable that are modified in the if statement, for example, the inserted call at Line 11 of Figure 2B. This duplicated gdif set tag call is to insure that the tag of seteuid is properly set even if the statement at Line 9 is not executed.
3.4
Array, union, and structure
GDIF treats each array, union, and structure variable as a single memory block, and allocates to it a single tag shadow variable. This means that the tag of an array/union/structure should be used to store all annotations associated with all its data fields, or it can be used to store the most important annotations. For example, in case of protecting information leaking, the tag may be used to store the highest security class. Because of this simplification, false positives may occur. If fine-grained tagging is really needed, the developer can use the shadow variable as a pointer to a more complex tag data structure. For example, for the SQL query “SELECT low1 FROM table1”, SELECT and FROM are hard-coded by the developers, and low1 and able1 are the inputs from the user. To detect SQL injection attacks, we must check if the user input contains SQL commands. Therefore, for this example, we can allocate a tag data structure that contains four sub-tags to store the tag information associated with the four different string segments.
3.5
Shared memory
Information can also flow in and out an application address space through shared memory. Therefore, intercepting the shared memory operation is also required. Although the current version GDIF does not handle the shared memory, intercepting shared memory operations is relatively straightforward. A shared memory block is created through shmget, which allocates a shared memory segment, and shmat, which attaches the shared memory segment to a pointer variable. We can intercept these two functions in the same way to malloc, and create a splay tree node for the shared memory block. In addition, we should extend 8
the splay tree node to include a field called int shared to indicate a memory block is a shared memory block or not, and should intercept the assignment statements that write to a shared memory to allow users to decide if the operation is permitted.
3.6
Slicing optimization
To focus only on those program statements that handle data that are read from input channels, GDIF uses the program slicing result from a commercial tool called Codesurfer [8]. Users specify a set of input and output functions, and GDIF performs the forward slicing on input functions and backward slicing on output functions. The intersection between these two slicing results is a set of statements that GDIF instruments because they are affected by input data and their results may be used by output functions. While conceptually promising, the performance gain from this optimization depends on the pointer usage in the applications. If an application uses pointers and/or function pointers extensively, the pointer analysis in Codesurfer is not very effective. Even when Codesurfer just performs flow insensitive pointer analysis, it requires an inordinate amount of memory resource and could run very slow if one turns on the most accurate pointer analysis option. In the case of less accurate pointer analysis option, the result that Codesurfer produces is not much different from the original program, and as a result slicing does not result in any noticeable performance improvement. Application Binary
Original Code Application Source Code
GDIF Compiler
Sandbox Tool
Kernel
Inserted Code
network connections
Input Monitor
Dynamic Taint Tracker
System Call Monitor
Security Policy Module
hook functions
Figure 3: The Aussum compiler instruments an application to mark the downloaded data or files according to a configurable security policy. The Input Monitor intercepts data read from a network connection and marks it according to this policy. The Dynamic Taint Tracker propagates the taint mark throughout the program. Finally the System Call Monitor marks files and system call arguments that are derived from network inputs.
4 4.1
Automatic Sandboxing Support Module Using GDIF Overview
Many computers are infected by malicious programs because the users explicitly download network objects containing malicious programs by reading email attachments, file transfer, web browsing, and peer-to-peer 9
file sharing. Since anti-virus tools cannot deal with the zero-day malicious software attacks, behaviorblocking technology is considered as the alternative defense mechanism against the malicious software attacks. Behavior blocking system sandboxes a program by monitoring and restricting its system calls according to a pre-defined security policy, which may break many applications that need full privileges. Therefore, to effectively stop zero-day exploits while providing users a non-intrusive computing environment, a behavior blocking system needs to address three fundamental issues: which application should be sandboxed, what policy should be used to sandbox a given application, and how to sandbox it according to a chosen policy. How to sandbox an application and what sandboxing policy to use for a given application are well understood and implemented by all existing behavior blocking systems. However, none of the existing behavior blocking systems satisfactorily solves the problem of automatically determining which application to sandbox. Traditional sandboxing systems such as Tiny Firewall [20] rely on the user to specify which applications to sandbox. This strategy is annoying and inconvenient. Eventually the user is likely to choose convenience over security and turn off the protection. Some other systems such as SurfinGuard [7] and SEES [3] attempts to address the problems by intercepting the download mechanism of web browsers and email clients and marking each downloaded file. These systems sandbox a program if it is marked. However, this approach is limited because the way they marks downloaded files or programs is tailored to individual applications and cannot be generalized to arbitrary client applications. This section presents a compiler solution called AUSSUM (Automatic Sandboxing SUpport Module) to the problem of determining when to sandbox which application. Aussum uses GDIF to instrument an application to automatically track and mark contents downloaded from the Internet. When marked contents are written to a file, the instrumented code marks the file. If a downloading application uses marked contents as arguments to some dangerous system calls such as exec, open, and unlink, the application is automatically sandboxed. Consequently, Aussum allows a legitimate application such as a web browser to run with full privilege when it operates on local files but is properly sandboxed when it operates on downloaded objects. Moreover, Aussum relieves application developers of the burden of identifying downloaded objects and determining when to trigger the sandboxing mechanism. Aussum itself does not perform sandboxing. It properly marks data items so as to trigger the underlying sandboxing system. Currently, Aussum assumes that the operating system will invoke the sandboxing system (1) when a marked file is being used as an executable binary or as an operand, and (2) when a system call argument is marked as ”network-related.” Figure 3 shows the system architecture of Aussum. The Input Monitor marks as tainted inputs from network connections that are considered suspicious according to the Security Policy module. The Dynamic Taint Tracker propagates the taint attribute of data items throughout
10
a program’s computation. The System Call Monitor checks the arguments of sensitive system calls such as unlink, open, and exec, and invokes the sandboxing system when some arguments are tainted. In addition, when tainted data is written to a file, the System Call Monitor marks the file as tainted subject to the security policy. The Security Policy Module provides a security policy in the form of callback functions, which determines if a network connection should be considered suspicious and if a tainted file should be marked as tainted.
4.2
Input monitor
Because both network connections and files are accessed through descriptors using functions such as read and fread, the Input Monitor first needs to identify suspicious descriptors. For each application, Aussum maintains a descriptor table to store information about each descriptor used in the application. To mark the descriptors’ suspicious flags, Aussum uses GDIF to redirect function calls to accept, connect, dup, and dup2 with calls to their proxy functions that create entries in the descriptor table. If a callback function int mark connection(sockaddr *addr) is provided by a sandbox system, the proxy functions call this function to mark network connections; otherwise, all network connections are marked as supicious. To mark the data read from a network connection, the Input Monitor intercepts the system calls and LIBC function calls such as read, fread, fgets, recv, and recvfrom using proxy functions, such as the aussusm read call at Line 16 of Figure 1(B). The proxy function first calls the original function, and then checks if the socket/file descriptor is marked as suspicious by consulting with the descriptor table. If data is read from a suspicious descriptor, Aussum locates the corresponding node in the splay tree, and sets its tag to true.
4.3
Dynamic taint tracker
GDIF requires users to implement their own gdif do set tag to propagate tag values in their own way. A tag value in Aussum’s implementation is either 1 or 0, which 1 means the corresponding memory block is tainted. Aussum’s way to propagate taint attributes is to perform a bitwise OR of all tag values of the memory blocks on the right-hand-side of a assignment statement. This means that if any memory block on the right-hand-side is tainted, the left-hand-side memory block should be tainted. The following code shows the Aussum’s implementation of gdif do set tag:
void gdif_do_set_tag(void *lhs_tag, int num, void *rhs_tags) { int i, temp = 0; int **rt = (int **)rhs_tags; for(i = 0; i < num; i++)
11
tmp = tmp|(*rt[i]); *(int *)lhs_tag = tmp; }
4.4
System call monitor
The System Call Monitor marks as tainted output files that contain tainted data. Ideally the taint attribute of a file should be an inherent part of the file so that when it is copied to a new file, its taint attribute is copied automatically. Under an UNIX-like operating system, there is no unused file attribute that can be used for this purpose. One way to implement the taint attribute on such a system is to create a group called TAINTED on the system. If a file is considered tainted, its group owner attribute is set to TAINTED. To support this general file marking mechanism, Aussum intercepts file output functions such as write and fwrite using proxy functions. Malicious mobile code can be executable scripts embedded in a document file such as an HTML file. After a legitimate application such as IE or Netscape downloads an embedded script, it parses and executes the script in its own address space. Although existing applications try to limit the execution privilege of embedded scripts, some malicious embedded scripts can still bypass their security check and cause damages to the local systems by exploiting design or implementation flaws of the applications. One such example is the Internet Explorer vulnerability [1], which allows a remote embedded script to open a local html file in a dialog frame using the Local Machine Zone privilege. Unfortunately, there are some local html files that take a remote file as the argument from the dialog frame and render the remote file using the Local Machine Zone privilege. As a result, remote embedded scripts can read and write arbitrary local files. Embedded scripts present a technical challenge because we don’t want to sandbox the application that downloads them, such as IE or Firefox, at all time. Ideally, we want the downloading application to be sandboxed when it is executing embedded scripts, but to run at full privilege when it operates on local files. Aussum solves this problem by selectively turning on and off sandboxing for a given application process based on whether the operands it is currently operating on is tainted or not. It makes the following three assumptions. First, a malicious script cannot cause damage if it cannot make system calls. Second, a malicious script can only make system calls through the application that interprets it. Finally, if a malicious script causes damage through system calls, some of the system call arguments must be tainted. Based on these assumptions, Aussum intercepts sensitive system calls such as open and exec to check if any of the input arguments is tainted. By default, Aussum terminates any applications containing system call invocations that use tainted arguments, except those that use file names read from the network to open or create non-existing files. Aussum allows applications to open the non-existing files because applications such as Netscape and IE need to save to cache or temp directory files whose names are derived from an html 12
Application
Description
Lines of code
tnftp yafc gnut cg wget lynx
FTP client FTP client Gnutella client newsgroup binary downloader mirroring tool text web browser
35,573 26,587 22,298 10,974 35,018 161,605
gcc (bytes) 183,820 204,444 190,804 54,260 164,904 1,038,340
Space Overhead splay shadow/splay 64.11% 61.49% 49.93% 44.14% 58.11% 46.44% 79.28% 76.81% 50.15% 48.54% 36.12% 29.81%
shadow/splay/slice 50.71% 34.33% 31.65% 61.21% 43.57% 28.85%
Table 1: Description of the client applications used in the performance evaluation study and the increase in binary size for each of three versions of the Aussum compiler. document file. Although this default policy may open doors to deny of service attack, at least it prevents malicious scripts from modifying or deleting the existing files. This policy is also consistent with the taint data policy used in Perl. This policy may also break certain client applications such as peer-to-peer applications, which allow a remote application to request a local application to perform sensitive system calls. To address this issue, Aussum again provides a callback function called aussum check argument to allow the underlying sandboxing system to override the default policy by specifying, for example, allowed read and write accesses to certain directories or execution of certain existing binaries.
4.5
Security policy module
The Security Policy Module provides the following three callback functions for the underlying sandboxing system to determine when a network connection is suspicious, to mark a tainted file, and to decide whether a sensitive system call invocation using tainted arguments should be rejected:
int aussum_mark_connection(sockaddr *addr); /* if this function return true, the connection to addr should be marked as suspicious.*/ int aussum_mark_file(int descriptor, char *file_name); /* this function actually marks the output file as tainted. */ int aussum_check_argument(char *function_name, char *argument); /* if the return value is 0, terminate the application. if the return value is 1, simply return without calling system call. if the return value is 2, let the system call go through. */
The above three functions need to be implemented in a loadable object file called Aussum hook functions.so. Aussum’s initialization code will load this object file and initialize the function pointers before a program runs. Since these three functions are very simple, it will not take much effort to add these functions into an existing sandbox system.
13
4.6
Performance Evaluation
4.6.1
Methodology
We applied Aussum on a set of network client applications listed in table 1 to evaluate the performance overhead of Aussum’s taint attribute propagation mechanism. Because every assignment statement in a program may need to be instrumented, we purposely chose CG, WGET, and LYNX, which are computation intensive, to stress-test the effectiveness of the GDIF compiler. CG is a newsgroup binary downloader, which needs to parse every mail in a group, download and decode all the attachments. WGET parses every downloaded html to find new files to download. LYNX parses and displays a web page on a terminal. We modified LYNX to exit immediate after the page is displayed, so that we could test the performance of LYNX programmatically. For the interactive programs such as CG, we also made similar modifications to them so that they could perform the required operations and exit immediately. Our server machine is a 2.8GHz P4 machine with 256MB RAM, and it runs Fedora core 3.0. The client machine is a 2.8GHz P4 machine with 1.5GB RAM, and it runs Redhat 7.3. To test the ftp client, we fetched a 6MB file. To test the CG program, we set up a newsgroup with a 40KB image and two 700KB binary programs, and used CG to read all the information from the newsgroup and download the image and the binaries. Gnut is peer-to-peer application. To test Gnut, we fetched a 500KB file and a 10KB file. We used WGET to download all Apache manual pages from the Apache server, and used the LYNX to download and display the main page of the Apache manual. We implemented a very simple Sandbox system to test Aussum, which provides callback functions to mark tainted data and files. If an application uses tainted data to call sensitive functions such as fopen and open to open an existing file for write, the Sandbox system copies the file to some temp location and opens it there. It is a simplified version of Alcatraz [10]. We do not use Alcatraz because Alcatraz is written in Java, whereas we need a sandbox system that can be loaded into Linux as a Linux module. To evaluate the performance impacts of individual techniques used in Aussum, we tested three versions of the Aussum compiler: Aussum with splay tree (splay), Aussum with both splay tree and shadow variable (shadow/splay), and Aussum with splay tree, shadow variable and program slicing (shadow/splay/slice). 4.6.2
Performance penalty
We measured the performance overhead using both elapsed time and CPU time. Since Aussum aims for network client programs, elapsed time overhead can demonstrate how the users feel about the slowdown due to the program transformation. The elapsed time overhead can be a couple magnitudes smaller than the CPU time overhead because the I/O time can amortize the CPU time overhead. However, if a user do not
14
Application tnftp yafc gnut cg wget lynx
Splay Tree Elapsed Time CPU Time Overhead Overhead 6.52% 24.14% 6.41% 78.18% 89.12% 257.14% 308.97% 1120% 77.06% 534.76% 152.86% 400%
Shadow/Splay Elapsed Time CPU Time Overhead Overhead 0.70% 19.54% 2.92% 15.45% 5.44% 28.57% 8.97% 180% 14.29% 127.2% 30.00% 166.67%
Shadow/Splay/Slice Elapsed Time CPU Time Overhead Overhead 0.90% 16.54% 1.32% 10.00% 1.16% 20.00% 6.41% 160.00% 9.75% 117.13% 31.43% 166.67%
Table 2: Performance overhead of the three Aussum compiler versions when compared with the vanilla GCC. feel the slowdown, she is likely to use our transformation technique. Table 2 shows that the CPU time overhead ranges from 24.14% to 1120% for the pure splay tree configuration. As expected, the pure splay tree version incurs the highest performance penalty for computation intensive programs. For example, the runtime overhead of CG is 1120% (or 12 times as slow), and this high overhead is attribute to the decode computation in CG, where taint attributes are propagated across each decoding computation step. WGET and LYNX also have very high overhead (534.76% and 400% respectively), which is due to the parse of each html tag. However, the pure splay tree version only has minor impacts on the performance of ftp clients because ftp clients do not perform any computation. They just fetch files and write them to disk. The shadow variable/spray tree hybrid tag management scheme is very effective. With the shadow/splay version, the CPU time overhead is from 15.45% to 180%. The CPU time overhead of CG drops from 1120% to 180%. Table 3, which records the number of splay tree lookup in each run, shows the reason that the hybrid scheme is so effective is because it eliminates the majority of the splay tree look-up. Under the pure splay tree version, the CG program needs to lookup the splay tree 2,975,687 times, but only 100,178 times under the hybrid splay tree/shadow variable version. The elapsed time overhead of Aussum’s taint attribute propagation when compared with GCC is from 6.41% to 308.97% for the splay version, from 0.70% to 30.00% for the shadow/splay version, and from 0.90% to 31.43% for the shadow/splay/slice version. LYNX only incurs 30% elapsed time overhead while its CPU time overhead is 166.67%. LYNX reads an html file from a network connection and displays it on the screen. Therefore, for the network client programs, even if our transformation technique can incur a very high CPU time overhead, the users still do not notice the difference because the network I/O time and the screen output time could significantly amortize the actual CPU time overhead. Program slicing can further reduce the performance overhead of taint attribute propagation for some programs but not the others such as TNFTP and LYNX. The main reason is that Codesurfer’s pointer analysis is not very effective against TNFTP and LYNX. When we applied the most accurate pointer analysis option, Codesurfer ran out the 1.5GM memory quickly. Therefore, we had no choice but to use the less
15
Application tnftp yafc gnut cg wget lynx
Splay Tree 23,032 51,369 72,136 2,975,687 55,537,361 193,818
Shadow/Splay 4,879 14,699 24,265 100,178 22,285,332 47,894
Shadow/Splay/Slice 4,751 14,433 20,796 83,138 22,180,508 46,392
Table 3: Number of tree lookups of each configuration. accurate pointer analysis option. However, in this case, Codesurfer’s slicing result is almost the same as the original program. From Table 3, although program slicing slightly decreases the number of splay tree look-ups for TNFTP and LYNX, the performance overhead of TNFTP and LYNX actually increases from the Shadow/Splay version to the Shadow/Splay/Slice version.
4.7
Effectiveness of Aussum
GDIF is designed to track information objects that enter a program through its official input channels. If a piece of data is injected into a program through a vulnerability such as a buffer overflow, GDIF cannot track such data. Therefore, our current GDIF implementation cannot be used to detect injection attacks. Experiments on the Aussum prototype show that all the client programs we tested can correctly mark all files that are downloaded from the suspicious network addresses specified in a security policy. If a file is downloaded from a trusted host or it is generated by the programs themselves, Aussum does not mark the file. Aussum also intercepts system calls to check system call arguments. However, if a system call argument is entered through injection attacks, Aussum cannot identify the origin of the argument, either. But Aussum still is useful in detecting attacks that exploit security weakness. For example, a web browser may allow local scripts to access the local file system while deny such access to downloaded scripts. However, the web browser may make a mistake when checking the privilege of a downloaded script and incorrectly give it access to the local file system. Because Aussum assumes that the argument of the system calls made by a downloaded script either come with the script, or are calculated from data that come with the script, it will invoke the sandboxing system to stop the script’s accesses to the local file system.
5
Repairable LAMP Using GDIF
LAMP is the abbreviation for Linux, Apache, Mysql, and PHP/Perl/Python, which is an increasingly popular open source platform for the Internet web services assuming a three-tier architecture, which consists of a front-end web server, a mid-tier application server, and a back-end DBMS. The architecture of LAMP is shown in Figure ??. When a request comes into Apache web server, Apache passes the request to the PHP 16
mysql_query
Apache
Disk write setenv − − − getenv write − − − read
CGI Program
Mysql Database mysql_query
CGI Module
Network Input
PHP Module
requests
write
Figure 4: The architecture of LAMP. module or the CGI module for processing. If the request is handed down to the PHP module, the PHP module interprets the PHP script and directly updates the file system or the Mysql database. If the request is passed to the CGI module, the CGI module invokes the corresponding CGI program and passes the request to it. Then the CGI program processes the request and the updates the file system or the Mysql database. We maintain two logs for the Repairable LAMP. For each incoming request, we generates a request ID and write {request ID, client host IP, request string} to our FrontEndLog. Then we propagate the request ID and the client host IP to the mid-tier program. If the mid-tier program updates the Mysql database, we save {request ID, client host IP, query string} to the BackEndLog. If the mid-tier program writes to the file system, then the {request ID, client host IP, file name, and the contents} are written to the BackEndLog. When an intrusion detection system detects an attack, it can use the attack packet signature to search our FrontEndLog to find the request ID, whose request string matches the signature, and uses the request ID to search the BackEndLog for the side effects caused by the request. Once the side effects are identified, it can undo the side effects. The actual file system and DBMS undo operations are outside the scope of this paper, and are being addressed in the Repairable File Service (RFS) [?] and Repairable Database (RDB) [?] systems. To log the coming requests, we intercept the input functions such as read and recv of Apache using proxy functions. Inside the proxy functions, we log the request information and tag the data variable used to store the request string using the following tag data structure: struct tag_info{ unsigned long RequestID; sockaddr * ClientAddr; struct tag_info *next; };
For this scheme, the shadow variables that GDIF inserted into Apache are used to store the addresses of the struct tag info data structures. The data structure is a linked list because a single database update operation may come from more than one request. For example, for the statement a = b + c, b and c 17
come from two different requests, and a is used in a query string to update the database. Internally GDIF enables Apache to propagate the tag information to all other variables that depend on the incoming request string. If the request is passed to PHP module for processing, we intercept the output functions such as write to log tags and the corresponding file system operations. We also intercept the Mysql functions such as mysql query and mysql real query to log the tags and the corresponding query string. However, if the request is passed to the CGI module, the CGI module invokes the corresponding CGI program to process the request. Therefore, we have to propagate the tags from Apache to the CGI program. Apache uses setenv or write family functions to pass the query string to the CGI program depending on if the GET method or the POST method is used. To propagate the tags to the CGI program, we intercept setenv and write family functions of Apache using proxy functions to save the tags in the environment variable called GDIF QUERY TAGS. We intercept the getenv and read family functions of the CGI program to retrieve the tags when getenv is used to get the environment variable QUERY STRING or when read is used to read from stdin. In the proxy functions, we call getenv(‘‘GDIF QUERY TAGS’’) to retrieve the tags and use the tags to initialize the shadow variable of the data block used to store the query string. Finally, we intercept the Mysql functions and output functions of the CGI program to log the file system or the Mysql database updates.
6
Related Work
Flow Caml [16, 18] is an extension of the Objective Caml language with a type system tracing information flow. Each variable in a Flow Caml program is annotated with a principal, which can be any entity such as a user, a file, stdin, and stdout. Information owned by one principal cannot flow to another principal unless the programmer specifically permits the operation using the keyword flow. The example flow !bob < !alice;; means that the programmer specifically allows Bob to send information to Alice. To correctly implement a program in Flow Caml, a programmer must first list all principals appeared in the program and carefully layout who can read whose information. Jif [11, 12] is a java extension, which enables programmers to annotate variables with security labels, as in the declaration int{o1:
r1, r2; o2:
r2, r3} x;. The security policy in this declaration
specifies x is own by o1 and o2. O1 allows r1 and r2 to read the data, and o2 allows r2, and r3 to read the data. In this example, only r2 can read the data x since both owners grant the permission to it. Jif’s type checking system verifies that no information from one variable can flow to another if the policies prohibit the transaction. Unlike Flow Caml, which only uses static analysis, security label in Jif can be first class
18
value, which can be assigned and checked at run time. The weakness of Flow Caml and Jif is that they require programmers to layout the information flow security policies at programming time. Although Jif enables users to set the labels of some variables at runtime, the labels of most of the variables are set at the programming time. Therefore, unlike our GDIF mechanism, Flow Caml and Jif have to modify existing applications to use their information control mechanism. Furthermore, GDIF can be used for the purposes other than security, such as logging application file system and database access requests and operations to enable fast intrusion recovery. Fenton’s Data Mark Machine [6] is an early abstract machine that implements dynamic information flow control. The data mark machine associates each variable and the program counter (PC) with a security class. The value of a variable must be computed from the variable with the same security class. Data Mark Machine solves the implicit flows problem by associating the PC with the class of the variable in the condition expression of an if-statement. Consider the code {H=H%2; L=0; if(H) L = 1;}. Data Mark Machine sets the security class of the PC using the security class of H. After that the machine does not allow any information flow that has lower security class than the PC. In this example, if H has high security class, and L has low security class, then the assignment statement L = 1 is illegal. However, if H is 0, then L is 0, where the illegal implicit flow still occurs and the machine cannot detect it. In contrast, GDIF can solve the implicit problems by passing the tags of the variables used in the condition expression to all variables modified inside the if statements no matter if the statements are executed or not. Recently, dynamic information flow technique is mainly use to detect control hijacking attacks, such as in the implementations of [19, 4, 14]. The dynamic information tracking system implemented by Sub et al.
[19] is a hardware implementation. Every memory byte and register has a one-bit hardware tag
to tag the data. All tags are initialized to zero and the operating system tags the data with one if they are from a potentially malicious input channel. Instruction sets are augmented to propagate the tags. The processor ensures that no tagged data can be used as execution control transfer. Minos [4] is also a hardware implementation that is very similar to [19]. Newsome and Song [14] implemented the similar mechanism in software. They implemented their TaintCheck system using Valgrind [13]. Valgrind is an x86 emulator that can instrument a program as it is run. Each byte of memory, including the registers, stack, heap, etc., has a four-byte shadow memory to store a pointer to a Taint data structure. Input data that come from untrusted source are marked as tainted, and the TaintCheck system instruments a program at runtime to propagate the taint. All control transfer instructions are intercepted to make sure no tainted data can be used as the transfer destination. The disadvantage of this scheme is its performance overhead, which can be more than 20 times slower. Another weakness is that this scheme can use 4 times more memory in the worse case.
19
Perl also implements taint check [15] to lock out the security bugs existed in perl scripts. When the taint check is turned on, Perl marks all user input data as tainted. The tainted data may not be used in a call to the functions such as system, eval and open. If the interpreter detects an operation that uses tainted data in a manner it considers unsafe, it stops the execution with an error.
7
Conclusion
This paper presents the design and implementation of a general dynamic information-flow tracking framework for C programs called GDIF, and a complete application of GDIF called Aussum. GDIF provides an interface for developers to specify their own tag initialization, propagation, and processing functions. Then GDIF automatically propagate these tags through an application program along data dependencies and control dependencies that actually take place at run time. The flexibility of GDIF allows it to support a wide range of applications. Aussum is an example GDIF application that is designed to minimize the disruption of sandboxing or behavior blocking systems to end users by turning on sandboxing only when absolutely necessary. Aussum is not a sandboxing system itself. Rather, it transforms network applications to mark downloaded data, including in-memory data variables and persistent files, so that the sandboxing system only sandboxes tainted executables or applications that operate on tainted files, and selectively sandboxes system call invocations that use tainted data as input arguments. Moreover, Aussum is designed to support flexible sandboxing policies that the underlying sandboxing system can customize by providing callback functions. Although Aussum minimizes the disruption of the sandboxing technology by file marking, there is a modest performance overhead associated with Aussum’s file marking operation. Measurements on the first Aussum prototype show that the CPU overhead is less than 170% for computation-intensive applications, and is less than 20% for non-computation intensive applications. The elapsed time overhead is less than 35% even for the computation-intensive applications. We are currently working on another application of GDIF: a repairable LAMP system, which uses the GDIF technology to track a request’s trajectory across processes in a three-tier architecture. In this application, GDIF instruments web/application servers to log incoming request, and correlate input requests with persistent operations against the file system and the backend DBMS server. When an error or intrusion is detected, one can use this log to identify the side effects due to the error or intrusion, and quickly repairs the compromised system by undoing these and only these side effects.
20
References [1] CERT. Microsoft internet explorer does not properly validate source of redirected frame. http://www.kb.cert.org/vuls/id/713878, 2004. [2] L. chung Lam and T. cker Chiueh. Checking array bound violation using segmentation hardware. In Proceedings of 2005 International Conference on Dependable Systems and Networks (DSN 2005), June 2005. [3] T. cker Chiueh, L.-C. Lam, Y. Yu, P.-Y. Cheng, and C.-C. Chang. Secure mobile code execution service. In Proceedings of Virus Bulletin Conference (VB), Chicago, August 2004. [4] J. R. Crandall and F. T. Chong. Minos: Control data attack prevention orthogonal to memory model. In 37th Annual International Symposium on Microarchitecture, pages 221–232, December 2004. [5] D. E. Denning. A lattice model of secure information flow. Commun. ACM, 19(5):236–243, May 1976. [6] J. S. Fenton. Memoryless subsystems. Computing Journal, 17(2):143–147, May 1974. [7] Finjan Software. Surfinguard Pro 5.7. http://www.finjan.com/products/ HomeUsersSurfinGuardPro/. [8] GrammaTech, Inc. Codesurfer. http://www.grammatech.com/products/codesurfer/. [9] R. W. M. Jones and P. H. J. Kelly. Backwards-compatible bounds checking for arrays and pointers in c programs. In Proceedings of Automated and Algorithmic Debugging Workshop, pages 13–26, 1997. [10] Z. Liang, V. Venkatakrishnan, and R. Sekar. Isolated program execution: An application transparent approach for executing untrusted programs. 19th Annual Computer Security Applications Conference, December 8-12 2003. [11] A. C. Myers. JFlow: Practical mostly-static information flow control. In Symposium on Principles of Programming Languages, pages 228–241, 1999. [12] A. C. Myers and B. Liskov. A decentralized model for information flow control. In Symposium on Operating Systems Principles, pages 129–142, 1997. [13] N. Nethercote and J. Seward. Valgrind: A program supervision framework. In Proceedings of the Third Workshop on Runtime Verification (RV‘o3), July 2003. [14] J. Newsome and D. Song. Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proceedings of the 12th Annual Network and Distributed System Security Symposium (NDSS ?05), February 2005. [15] Perl. Perlsec. http://perldoc.perl.org/perlsec.html. [16] F. Pottier and V. Simonet. Information flow inference for ML. In Proceedings of the 29th ACM Symposium on Principles of Programming Languages (POPL’02), pages 319–330, Portland, Oregon, Jan. 2002. [17] A. Sabelfeld and A. C. Myers. Language-based information-flow security. IEEE Journal on Selected Areas in Communications, 21(1):5–19, January 2003. [18] V. Simonet. Flow Caml in a nutshell. In G. Hutton, editor, Proceedings of the first APPSEM-II workshop, pages 152–165, Nottingham, United Kingdom, March 2003. 21
[19] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program execution via dynamic information flow tracking. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 85–96, October 2004. [20] Tiny Software. Tiny firewall 6. http://www.tinysoftware.com/home/tiny2?s= 7807136686411155049A1 &&pg=content05&an=tf6 home. [21] W. Xu, D. C. Duvarney, and R. Sekar. An efficient and backwards-compatible transformation to ensure memory safety of c programs. In 12th ACM SIGSOFT International Symposium on the Foundations of Software Engineering, November 2004. [22] S. Zdancewic. Challenges for information-flow security. In Proceedings of the 1st International Workshop on the Programming Language Interference and Dependence (PLID’04), 2004.
22