Dynamic Multi-Process Information Flow Tracking for Web Application Security
Susanta Nanda, Lap-Chung Lam†, and Tzi-cker Chiueh
Computer Science Department, Stony Brook University, Stony Brook, NY 11794-4400
† Rether Networks Inc., Stony Brook, NY 11790
Abstract. Although there is a large body of research on the detection and prevention of memory corruption attacks such as buffer overflow, integer overflow, and format string attacks, web application security has received comparatively little attention from the research community. The majority of web application security problems originate from the fact that web applications fail to perform sanity checks on inputs from the network that are eventually used as operands of security-sensitive operations. Therefore, a promising approach to this problem is to apply proper checks on tainted portions of the operands used in security-sensitive operations, where a byte is tainted if it is data/control dependent on some network packet(s). This paper presents the design, implementation and evaluation of a dynamic checking compiler called WASC, which automatically adds checks into web applications used in three-tier internet services to protect them from the two most common types of web application attacks: SQL injection and script injection. In addition to including a taint analysis infrastructure for multi-process and multi-language applications, WASC uses SQL and HTML parsers to defeat evasion techniques that exploit interpretation differences between attack detection engines and target applications. Experiments with a fully operational WASC prototype show that it stops all the SQL/script injection attacks that we tested. Moreover, the end-to-end latency penalty associated with the checks inserted by WASC is less than 30% for the test web applications used in our performance study.
Key words: web application security, dynamic checking compiler, SQL injection, Cross-site scripting, taint analysis, information flow tracking
1 Introduction
Although there is a large body of research on detection and prevention of memory corruption attacks such as buffer overflow attacks [1], integer overflow attacks [2] and format string attacks [3], the web application security problem receives much less attention from the research community by comparison. The majority of web
application security problems originate from the fact that web applications fail to perform sanity checks on inputs from the network that are eventually used as operands of security-sensitive operations. For example, the SQL injection attack [4] takes advantage of this lack of sanity checks to alter the semantics of SQL queries submitted to the back-end DBMS server; the cross-site script injection (XSS) attack [5] exploits the same lack of sanity checks to send a script-containing HTML page to a victim host via an unknowing third-party site. For the SQL injection attack, the security-sensitive operation is the submission of an SQL query; for the script injection attack, the security-sensitive operation is either the return of an HTTP response or the storing of information to a file or DBMS server.

An effective way to solve the web application security problem is to apply proper checks on "tainted" portions of the operands used in security-sensitive operations, where a byte is tainted if it is data/control dependent on some network packet(s). Compared with checking network packets directly, checking only tainted operands of security-sensitive operations has several advantages. First, the checking logic can bypass most of the encoding, compression and encryption issues associated with network packets by leveraging the underlying web application's ability to handle these issues. Second, the checking logic can pinpoint those bytes that actually affect security rather than process all network packet bytes indiscriminately, thus reducing the number of false positives without increasing the number of false negatives. Finally, checking just before a security-sensitive operation is performed allows the checking logic to take into account the context in which a tainted operand is used.
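The third advantage, context-aware checking of tainted bytes at the point of use, can be illustrated with a minimal C sketch. It is not WASC's implementation: the `tainted_query` structure, the `tq_append` and `tq_is_suspicious` helpers, and the fixed buffer size are all hypothetical, and the deliberately crude quote-tracking scan stands in for a real SQL parser (for instance, it would misjudge escaped quotes).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define QUERY_MAX 256

/* Hypothetical query buffer: one taint flag per byte of query text. */
struct tainted_query {
    char   text[QUERY_MAX];
    bool   taint[QUERY_MAX];
    size_t len;
};

/* Append a string, marking every appended byte with the given taint flag.
 * Returns false (and appends nothing) if the buffer would overflow. */
bool tq_append(struct tainted_query *q, const char *s, bool tainted)
{
    assert(q != NULL && s != NULL);
    size_t n = strlen(s);
    if (q->len + n >= QUERY_MAX)
        return false;
    memcpy(q->text + q->len, s, n);
    for (size_t i = 0; i < n; i++)
        q->taint[q->len + i] = tainted;
    q->len += n;
    q->text[q->len] = '\0';
    return true;
}

/* Returns true if a tainted byte changes the structure of the query:
 * a tainted quote that opens or closes a string literal, or a tainted ';'
 * that appears outside a string literal. A tainted ';' inside a literal
 * is treated as plain data. */
bool tq_is_suspicious(const struct tainted_query *q)
{
    assert(q != NULL);
    bool in_string = false;
    for (size_t i = 0; i < q->len; i++) {
        char c = q->text[i];
        if (c == '\'') {
            if (q->taint[i])
                return true;
            in_string = !in_string;
        } else if (c == ';' && !in_string && q->taint[i]) {
            return true;
        }
    }
    return false;
}
```

In this sketch the application appends its own query fragments with `tainted` set to false and network-derived fragments with it set to true, then calls `tq_is_suspicious` immediately before submitting the query. The same tainted byte is judged differently depending on whether it falls inside a string literal, which a packet-level filter cannot know.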
Despite these advantages, efficiently identifying the tainted portions of the operands used in security-sensitive operations is a technical challenge, especially in an internet service that spans multiple processes and/or multiple machines. Even when one can successfully identify the tainted portions of security-sensitive operands, it is not always possible to apply attack type-specific checks on them without generating false positives or false negatives. A well-known technique for evading signature matching-based attack detection is to exploit subtle differences in byte interpretation between the attack detection engine and the target program. For example, although ";" is an SQL operator, it could be a legitimate part of a piece of text that is to be stored in a character string field. Therefore, an SQL attack detection engine cannot simply check for the existence of SQL operators in tainted operands. As another example, some web browsers such as Microsoft's Internet Explorer (IE) can accept a syntactically incorrect HTML tag and treat it as the corresponding well-formed tag. Therefore, the script injection attack detection engine must behave exactly as IE does when parsing HTML pages in order to catch all embedded scripts.

This paper describes the design, implementation and evaluation of a dynamic checking compiler called WASC, which automatically adds checks into web applications used in three-tier internet services to protect them from SQL injection and script injection attacks. WASC is built on GIFT [6], which provides a customizable dynamic information flow tracking service for a single
process. However, to support information flow tracking in enterprise-grade web services, we first augment GIFT with several new features: support for (1) implicit flows (see Section 3.2), (2) information flow tracking across processes/machines through traditional IPCs as well as persistent storage such as files and databases, and (3) information flow tracking for scripting languages. WASC then tailors this augmented infrastructure to solve the taint analysis problem for multi-process, multi-language web applications. In addition, to defeat evasion techniques that exploit byte interpretation differences, WASC uses a full-blown SQL parser to detect SQL operators in tainted portions of a submitted SQL query, and IE's HTML parser to detect scripts in tainted HTML pages. As a result, WASC is able to minimize the number of false negatives without increasing the number of false positives.

The rest of this paper is organized as follows. Section 2 reviews previous work on web application attack detection and prevention. Section 3 describes the new features added to the original GIFT. Section 4 details how WASC applies this new GIFT framework to the web application security problem, focusing on SQL/script injection attack prevention. Section 5 presents and analyzes the evaluation results of the WASC prototype. Section 6 concludes this paper with a summary of its main research contributions.
2 Related Work
Researchers have tried to address web application security issues from various angles. One approach is to stop the attack traffic before it reaches the web applications. Web application firewalls (WAFs) [7] were designed along this line by extending the concept of standard firewalls to look deep into the packets that are sent/received by the HTTP/HTTPS/SOAP/XML-RPC/Web Service layers. The packets are then inspected for either application-specific attack signatures or abnormal traffic patterns. Such firewalls can be either software based or hardware appliance based; ModSecurity [8] is one example. Despite the technologies behind them, WAFs do not always work and are often susceptible to evasion. In most cases they look only for specific attack signatures that were identified earlier, and cannot do much against zero-day attacks. Although detection techniques have been developed for automated attacks launched by tools such as Nikto [9], Whisker [10], and Nessus [11], smarter attacks such as HTTP header attacks [12] and SQL injection [4] can easily bypass WAFs.

Another approach to detecting and preventing attacks on web applications is to allow the traffic to reach the server application, but to monitor the packets thoroughly as they travel within the application. The attacker's packets are then detected and stopped just before they are about to inflict damage. Recent research efforts such as [13–16] use taint analysis to identify inputs from the network. However, all of these systems can only propagate the taint information within a single process. As a result, they cannot be directly applied to distributed applications in general, and three-tier internet services in particular, where a web server process may receive network
inputs and use them to construct strings, which are eventually passed to different processes for further processing. WASC, in contrast, supports a general tag propagation framework that can effectively propagate taint information both across processes as well as across persistent storage. The system developed by Xu et al. [13] is a comprehensive taint analysis system that detects and prevents a wide range of attacks, such as buffer overflow, format string, cross-site scripting, and SQL injection attacks. However, we provide a more general information flow tracking framework that can be customized to propagate additional information such as packet ID, client IP, time-stamp, and so on. For example, when an attack is detected, it should be possible to repair/remove the exact disk blocks that were affected due to the detected attack. Our general information flow tracking framework enables the discovery of persistent side effects corrupted by an attack. Moreover, it is capable of propagating tag information across processes and machines. For example, a web server can accept connections from both the local network and the internet. It can be configured to mark all requests from the Internet as tainted, while leaving requests from the local network as non-tainted. After receiving a request, the web server passes the request to a CGI program, which in turn issues an SQL query to an SQL server. If the CGI program receives a request that has come from the Internet, it needs to limit the SQL commands that the request can perform. Therefore, the web server needs to pass the tag information of the request to the CGI program, so that it can distinguish between packets from the intranet and those from the Internet. The implementations of [15] and [14] also focus on detecting and preventing SQL injection attacks and cross-site scripting attacks. 
Both implementations manually modify the PHP interpreter to mark the network input data and propagate the taint information throughout the PHP program. Compared with WASC, these two implementations are limited to PHP programs. In contrast, WASC can work with any program or interpreter written in C. Furthermore, WASC can automatically instrument C programs, or interpreters written in C, to dynamically propagate tag information across applications.

Unlike other taint analysis systems, which instrument applications to tag and propagate taints, Su et al. [16] use a different approach to identify user inputs. They track the user's input by using '(|' and '|)' to mark the beginning and end of each input string. They assume this marking will be preserved through assignments, concatenations, etc., so that when a query is ready to be sent to the database, it has matching pairs of markers identifying the sub-strings that came from the input. However, adding the two markers to an input string may break many applications. For example, some applications may check the length of an input string, and adding these two markers can make the string longer than the buffer that holds it. Furthermore, an application may extract only part of the string to create an SQL command, such as taking the sub-string "user" from "(|user:password|)". In this case, either the application breaks or one of the markers is lost. TightLip [17] attempts to address the issue by preventing
applications from leaking sensitive information. However, it suffers from a high false-positive rate due to its stringent assumptions on sensitivity.

Dynamic information flow tracking has also been used to detect control hijacking attacks, as in the implementations of [18–20]. The dynamic information flow tracking system implemented by Suh et al. [18] is a hardware implementation. Every memory byte and register has a one-bit hardware tag. All tags are initialized to zero, and the operating system sets the tag of a piece of data to one if the data comes from a potentially malicious input channel. The instruction set is augmented to propagate the tags, and the processor ensures that no tagged data can be used as a control transfer target. Newsome and Song [20] implemented a similar mechanism in software. The disadvantage of this scheme is its performance overhead: an instrumented program can run more than 20 times slower.

The Asbestos operating system [21] implements information flow control at the process level. It uses labels to track the flow of information between processes with different privileges. Each process in Asbestos has two labels: a send label that represents the process's current level of contamination, and a receive label that represents the maximum contamination level the process is able to accept from others. A process P can send messages to a process Q only if Q is able and willing to accept contamination at P's current level. After Q receives the data from P, Q's send label is contaminated with P's send label. This mechanism allows the contamination level to propagate from one process to another. Since this mechanism is implemented at the process level, it cannot label each individual data object and is therefore bound to produce more false positives/negatives than WASC.
In contrast, WASC instruments applications to tag each individual data object in the applications and to propagate the tag information from one process to another. This implementation allows WASC to effectively identify the inputs that come from users, and to verify whether those inputs contain SQL injection or XSS attack vectors.
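The process-level label check described above for Asbestos can be sketched in C as follows. This is a simplified illustration under a strong assumption: each label is collapsed to a single integer contamination level, whereas the actual Asbestos labels are richer. The `proc` structure and the `can_send` and `deliver` functions are hypothetical names introduced only for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified process with one integer contamination level per label.
 * (A single level per label is an assumption made for illustration.) */
struct proc {
    int send_label;  /* current contamination level of the process */
    int recv_label;  /* maximum contamination the process accepts */
};

/* P may send to Q only if Q accepts P's current contamination level. */
bool can_send(const struct proc *p, const struct proc *q)
{
    assert(p != NULL && q != NULL);
    return p->send_label <= q->recv_label;
}

/* Deliver a message from P to Q. On success, Q's send label absorbs
 * P's contamination; on failure, Q is left unchanged. */
bool deliver(const struct proc *p, struct proc *q)
{
    if (!can_send(p, q))
        return false;
    if (p->send_label > q->send_label)
        q->send_label = p->send_label;
    return true;
}
```

Because the label belongs to the whole process, a single contaminated message raises the level of everything the receiver subsequently sends, whether or not that output depends on the contaminated data. This is the coarseness that per-object tagging is meant to avoid.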
3 The Extended Information Flow Tracking Framework
WASC needs to determine whether any byte in the operands used in security-sensitive operations is tainted, i.e., data-dependent or control-dependent on some input network packets. Taint analysis is a special case of information flow tracking, and can be done either statically or dynamically. WASC takes the dynamic approach by leveraging GIFT [6], a general compiler framework that provides a customizable dynamic information flow tracking service. Because the original version of GIFT supports only intra-process information flow tracking, we first extend it by adding several new features and then tailor it to our needs. Throughout this paper, GIFT refers to this extended version, while the old version is referred to as "original GIFT" or "base GIFT". WASC builds on top of the extended GIFT to solve its problem, as discussed in the next section. This section first gives a brief overview of the original GIFT and then discusses the new features added to it.
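As a rough illustration of what dynamic tracking of data dependence involves, the C sketch below pairs values with tags and combines tags on every operation. It is not GIFT's instrumentation: the `tag_t` type, the `TAG_NETWORK` constant, and all function names are hypothetical, and a real system would keep tags in separate shadow storage and insert the propagation code automatically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t tag_t;           /* hypothetical tag type; 0 means untainted */
#define TAG_NETWORK ((tag_t)1u)   /* hypothetical "came from the network" bit */

/* A value stored together with its tag, for illustration only. */
struct tagged_int {
    int   value;
    tag_t tag;
};

/* Tag initialization: a value read from the network starts out tainted. */
struct tagged_int from_network(int v)
{
    struct tagged_int t = { v, TAG_NETWORK };
    return t;
}

/* A program constant carries no taint. */
struct tagged_int constant(int v)
{
    struct tagged_int t = { v, 0 };
    return t;
}

/* Tag combination: the result depends on both operands, so it inherits
 * the union of their tags. */
struct tagged_int tagged_add(struct tagged_int a, struct tagged_int b)
{
    struct tagged_int r = { a.value + b.value, a.tag | b.tag };
    return r;
}

/* Tag propagation for a copy: each tag moves with its data byte. */
void tagged_copy(char *dst, tag_t *dst_tags,
                 const char *src, const tag_t *src_tags, size_t n)
{
    assert(dst != NULL && dst_tags != NULL);
    assert(src != NULL && src_tags != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = src[i];
        dst_tags[i] = src_tags[i];
    }
}
```

Here a sum of a network value and a constant is tainted, while a sum of two constants is not. Control dependence is not covered by this sketch, since it requires tracking the tags of the variables that govern each branch.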
3.1 Basic Framework
Tracking information flow within a program requires associating tags with data variables, and propagating tags from dependee variables to dependent variables. GIFT is a compiler framework [6] that takes programmer-specified, application-specific rules for tag initialization, propagation and combination, and automatically instruments application programs written in C or C++ with the corresponding tag manipulation logic. GIFT not only significantly improves the accuracy of information flow tracking through dynamic tag propagation, but also automates the process of embedding information flow tracking logic into individual applications.

GIFT associates each data variable in an application with a four-byte tag that an application developer can use to store some attribute of the data variable, or a pointer to a more complex tag allocated on the heap. Being an application-independent information flow tracking framework, GIFT does not interpret the tags, and leaves their interpretation to the application programmer. GIFT allows an application programmer to specify a set of interception points and the corresponding functions (a.k.a. proxies) that need to be invoked at each interception point. The proxy functions can then be used to initialize, combine, and act on tags, and/or to call the proxied functions. The instrumentation that the GIFT compiler adds invokes these proxy functions at the interception points. In addition, the GIFT compiler is responsible for propagating tags throughout the entire program, including across function calls and returns, and when data moves into and out of an address space.

Currently, GIFT supports three types of interception points: (1) input channel functions that bring external data into a process's address space (e.g. read()), (2) output channel functions that move data from a process's address space to the outside world (e.g. write()), and (3) assignment statements that propagate information from one or more data variables to another within the process address space. In all such cases, GIFT inserts calls to the proxies supplied by the programmer. To add application-specific information flow tracking logic to an application program using GIFT, the programmer composes proxy functions for input/output channel functions and for assignment statements in an object file, and links the original program with that object file. Because dynamic tag propagation incurs substantial performance overhead, GIFT directly proxies a set of commonly used library functions such as memcpy(), bcopy(), strcpy(), and strcat() by summarizing the data dependency relationship between their input and output arguments, and avoids instrumenting them completely. Readers should refer to [6] for the details.

3.2 Support for Implicit Flows
The original GIFT prototype [6] did not support implicit flows, i.e., information that flows through control dependencies, which can be a serious limitation. In our current prototype, we add support for implicit flows by maintaining a global context variable, called gift context, which at any execution point is a list of tags associated with the control variables that affect the
program’s control to reach that point. For example, in the line marked (5) in the program below, gift context contains the list: {t_i(1), t_j(1), t_m(0), t_idx(0), t_k(1)}.

    #include <stdio.h>
    int result;
    void main() {
        int i, j, m;
        scanf("%d %d", &i, &j); // {} (1)
        if(i