International Conference on Computer Systems and Technologies - CompSysTech’13
Identification of Vulnerable Parts of Web Applications Based on Anomaly Detection in HTTP

Rastislav Szabó, Ladislav Hudec

Abstract: This paper describes a concept of an algorithm for the automated identification of vulnerable parts of web applications, based on an increased occurrence of anomalies detected by an anomaly-based IDS (Intrusion Detection System). The output of our anomaly evaluation algorithm can direct security engineers and application developers to those modules of a web application which are "attractive" to attackers, or even point to security vulnerabilities in particular modules of the application. The methods of anomaly detection in HTTP traffic and the process of building a model of the application's structure by analysing requested URLs are also discussed in this paper.

Key words: Computer Security, HTTP Attacks, Intrusion Detection Systems, Anomaly Detection.
INTRODUCTION

Intrusion Detection Systems (IDS) based on the detection of anomalies often achieve higher detection rates (especially for zero-day attacks) than systems based on matching signatures of particular attacks (so-called signature-based IDS). One of the disadvantages of anomaly-based IDS is lower flexibility, resulting from the need for a learning phase that must precede the productive deployment of the detection system. These systems also tend to produce a greater number of false alarms (false positives) than signature-based IDS. Moreover, the less similar the traffic of the learning phase is to real operation, the more false alarms the system produces. Frequent false alarms can lead the security personnel who should review such incidents to increasingly ignore them. In extreme cases, true and serious incidents are not evaluated at all, because they are lost in the mass of false alarms.

Our work suggests dealing with this problem in a quantitative way. It focuses on identifying those application modules whose use produces an increased rate of anomalies in comparison to other modules. Knowledge of such modules can be useful for security engineers and application developers: it can direct them to the modules of the application which are "attractive" to attackers, or even point to security vulnerabilities in particular modules. To allow such an evaluation, it is necessary to match each HTTP request to the particular web application module which processes it. Given that the proposed system should be universal and applicable to multiple web applications without knowledge of their internal structure, we achieve this only by analysing the content of HTTP requests.
In the early days of the Internet, the connection between a particular HTTP request and the web server module that processes it could be found quite easily: the requested URL directly contained the name of the script responsible for processing the request. In today's complex applications, which are most likely based on design patterns (mostly MVC, or Model-View-Controller), it is not so simple; in such applications a single script (sometimes called the "bootstrapper") is responsible for processing all HTTP requests. Fortunately, the identification of the modules and their individual actions triggered by the "bootstrapper" script usually tends to be encoded in the requested URLs. This way we can reuse the learning phase of the IDS (which is necessary for the anomaly detection algorithm) for the construction of a simple model of the web application's structure by analysis of

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. CompSysTech'13, June 28-29, 2013, Ruse, Bulgaria. Copyright ©2013 ACM 978-1-4503-2021-4/13/06...$10.00.
requested URLs.

RELATED WORK

Kruegel and Vigna [4] presented an Intrusion Detection System (IDS) that used several anomaly detection algorithms to detect attacks against web applications. It is based on the analysis of parameters in HTTP queries from clients; we discuss these algorithms in the next section. Ingham [3] and Barroso [1] proposed generalizations of these algorithms and added heuristics to their detection mechanisms. Both used a combination of known anomaly detection algorithms which have to agree in labelling an HTTP query as an attack or a legitimate query. Chrun, Cukier and Sneeringer [2] defined practical metrics based on imperfect data collected by an intrusion detection system (such as the number of alerts per target or the number of alerts per attacker). Using those, they were able to detect some security issues that had not been previously identified.

ANOMALY DETECTION ALGORITHMS

Before being put into production, an HTTP anomaly-based IDS requires a learning phase, during which the detection system tries to find the basic characteristics of legitimate HTTP queries. It focuses in particular on the presence, content and structure of the variables (GET and POST parameters) in HTTP requests. In the detection phase, the system observes deviations from the learned characteristics. For this purpose, we used a combination of the algorithms described in [3] and [4], outlined below.

Attribute Length

The values of a given parameter often have the same or similar length (e.g. session identifiers, identifiers of individual articles), or are shorter strings (e.g. data entered by the user through the login or password input of a login form). Some attacks try to insert malicious code into parameter values (e.g. XSS, SQL injection, or buffer overflow attacks), which can extend the length of the parameter value. The detection of anomalies in attribute length is based on the mean and variance of the lengths of parameter values observed during learning.
The probability p that an attribute of mean length μ and length variance σ² would have the length l is bounded by the following formula (derived from the Chebyshev inequality) [3]:

p(l) ≤ σ² / (l − μ)²    (1)
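As a sketch, the length model can be implemented as follows. This is our own minimal illustration in Python; the class and method names are not from the paper, and the bound is capped at 1.0 so it can be read as a probability:

```python
class LengthModel:
    """Learns mean and variance of attribute lengths, then applies
    the Chebyshev bound of equation (1). Illustrative sketch only."""

    def __init__(self):
        self.lengths = []

    def learn(self, value: str):
        self.lengths.append(len(value))

    def finalize(self):
        n = len(self.lengths)
        self.mu = sum(self.lengths) / n
        self.var = sum((l - self.mu) ** 2 for l in self.lengths) / n

    def probability(self, value: str) -> float:
        """Upper bound on P(length at least as extreme as len(value));
        1.0 means the length is unsuspicious."""
        d = (len(value) - self.mu) ** 2
        if d == 0:
            return 1.0
        return min(1.0, self.var / d)
```

A value far longer than anything seen in training (e.g. an injected script) yields a probability close to zero, which the reporting stage can turn into an alert.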
Attribute Character Distribution

This detection method is based on the assumption that the values of a parameter are similar to one another in terms of the characters used. Many parameters consist only of numerical digits, or of characters of the English alphabet. If we compute the frequency of each character during the learning phase, we can compare it with the parameter's value during normal operation. If the parameter contains unusual characters, the request can be considered an attack. In the training phase, the relative frequencies of the characters of the attribute's value are calculated. The characters are then sorted by their relative frequency. Kruegel and Vigna [4] noted that the sorted relative frequencies decrease slowly for legitimate requests, but have a much steeper decline for attacks, and no decline for random data (see Figure 1).
Figure 1: Sorted relative frequencies of characters in HTTP parameters.

The Order of Attributes

Given that most web server requests are generated by clicking through links in HTML documents, it can be assumed that the order in which parameters are listed in an HTTP query is clearly defined in the referencing HTML documents. Thus, if the order of parameters in a particular HTTP request does not fit the learned model, we can assume that the attacker may be "guessing" the parameters in some way.

Presence or Absence of Attributes

Some parameters are usually found only in combination with other specific parameters. If, in the detection phase, the combination of parameters used does not fit the learned model, the request can be evaluated as an anomaly.

Enumerated or Random Attribute Values

Many parameters can contain only a finite number of values, such as valid usernames, integer numbers, and so on. If the learning process finds that the values of some parameter are enumerated, a value outside that enumeration captured in the detection phase can be evaluated as an anomaly.

REPORTING OF ALARMS

When the IDS detects an anomaly in a particular HTTP request, it should report an alert to the user. When using multiple detection algorithms (described above), the partial results from each algorithm need to be combined into the resulting anomaly score of the HTTP request. Kruegel and Vigna [4], for example, calculate it as a weighted sum of the partial results (each detection model m is assigned a weight w applied to its partial probability p):
anomaly_score = Σ_{m ∈ Models} w_m · (1 − p_m)    (2)
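The structural checks above and the weighted combination of equation (2) can be sketched together as follows. All class, method and model names here are our own illustration; a parameter list is represented as (name, value) pairs:

```python
class StructuralModel:
    """Illustrative sketch of the order / presence / enumeration checks."""

    def __init__(self):
        self.orders = set()        # observed parameter-name orderings
        self.combinations = set()  # observed parameter-name sets
        self.values = {}           # parameter name -> set of observed values

    def learn(self, params):
        names = [n for n, _ in params]
        self.orders.add(tuple(names))
        self.combinations.add(frozenset(names))
        for name, value in params:
            self.values.setdefault(name, set()).add(value)

    def alerts(self, params, enum_threshold=10):
        names = [n for n, _ in params]
        found = []
        if tuple(names) not in self.orders:
            found.append("unknown parameter order")
        if frozenset(names) not in self.combinations:
            found.append("unknown parameter combination")
        for name, value in params:
            known = self.values.get(name, set())
            # treat a parameter as enumerated if few distinct values were learned
            if len(known) <= enum_threshold and value not in known:
                found.append("unexpected value for " + name)
        return found

def anomaly_score(probabilities, weights):
    """Equation (2): weighted sum over models of w_m * (1 - p_m)."""
    return sum(weights[m] * (1.0 - p) for m, p in probabilities.items())
```

Note that our approach described below deliberately avoids the weights of equation (2) and instead counts each model's alerts equally.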
If the resulting score is above a defined threshold, an alert is reported. In current works we identified a problem in the process of defining the weights and the threshold, which is rather subjective. If the threshold is set too high, the detection capability of the IDS is lower; on the other hand, if it is set too low, the IDS produces too many false-positive alarms. Our work deals with this problem in a quantitative way. We suggest setting the threshold to lower values and evaluating the average number of alerts produced by each module of a web application, rather than reporting all alerts. Moreover, we suggest skipping the calculation of the resulting anomaly score (thus eliminating the subjective weights for
each anomaly detection algorithm); an alert from each detection algorithm is treated as equivalent. As a result, one HTTP query can produce multiple alerts (from multiple detection algorithms).

IDENTIFICATION OF VULNERABLE PARTS OF WEB APPLICATIONS

The aim of this work is the identification of those application modules whose usage produces an increased rate of anomalies compared to other application modules. To allow such an evaluation, it is necessary to match each HTTP request to the particular web application module which processes it, or in other words, to construct a simple model of the web application's structure by analysing requested URLs.

Web Application's Structure Learning

For building a simple model of the web application's structure we can use the learning phase, which is necessary for the anomaly detection algorithm anyway. We presume that the application may be created using one of the commonly used design patterns (e.g. MVC), which causes all HTTP queries to be processed by a single "bootstrapper" script. The identification of the web application module responsible for processing a particular HTTP query is usually carried in the URL of the query. The format of this information can take many forms. Consider the following example of a very common type of such query:

http://modern-application.com/controller/action/parameter1/parameter2/
This request includes the identification of the application's module (controller), the action performed by this module (action) and several parameters (parameter1, parameter2). Unfortunately, we cannot predict which character (or group of characters) is used by an application as the separator of these identifiers (in this case the character /), nor where exactly these identifiers are located. However, a simple rule that holds for current web applications is that the importance of these identifiers decreases from left to right in the URL. To build the model of an application by analysing requested URLs, we use the radix tree data structure (see Figure 2). A radix tree, or compact prefix tree (in which each node with only one child is merged with its child), allows retrieving all leaves of the tree with a common (specified) prefix.
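As a simplified sketch of this structure, the following Python code uses a per-segment trie rather than a character-level radix tree (the paper's compact tree merges single-child chains, but the per-URL counters and prefix queries behave the same way; all names are illustrative):

```python
class UrlTreeNode:
    def __init__(self):
        self.children = {}   # path segment -> UrlTreeNode
        self.requests = 0    # total queries seen for this exact URL
        self.alerts = 0      # anomaly alerts reported for this exact URL

class UrlTree:
    """Per-segment trie over URL paths, storing per-URL counters
    that can be aggregated over any subtree (application module)."""

    def __init__(self):
        self.root = UrlTreeNode()

    def add(self, path: str, alerts: int = 0):
        """Record one request (and its alert count) for the given URL path."""
        node = self.root
        for segment in filter(None, path.split("/")):
            node = node.children.setdefault(segment, UrlTreeNode())
        node.requests += 1
        node.alerts += alerts

    def subtree_totals(self, prefix: str):
        """(requests, alerts) aggregated over the whole subtree under prefix."""
        node = self.root
        for segment in filter(None, prefix.split("/")):
            node = node.children.get(segment)
            if node is None:
                return (0, 0)
        stack, requests, alerts = [node], 0, 0
        while stack:
            n = stack.pop()
            requests += n.requests
            alerts += n.alerts
            stack.extend(n.children.values())
        return (requests, alerts)
```

Querying `subtree_totals("/forum/")` then yields the counters for the whole forum module, regardless of how many actions and parameters appear below it.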
Figure 2: Example of a radix tree with the structure of a simple web application.

If we store all requested URLs in this radix tree, with the total number of queries to each URL stored in its leaves, we can later obtain information about the frequency of use of each of the application's modules during operation.

Identification of Vulnerable Parts

In the previous section, we described the concept of building the web application's structure model based on analysis of the HTTP requests during the application's operation. If we also store the number of reported anomalies for each URL in the leaves of the radix tree (in addition to the total number of queries), we can express the ratio of anomalous HTTP requests to the total number of HTTP requests for each URL. Furthermore, given that the data is stored in the radix tree, we can examine this indicator not only for the leaves of the tree, but also for any subtree. Based on this information we can easily evaluate which parts of an application (subtrees of the tree) have an increased occurrence of anomalies relative to the total number of requests, and deliver this information to security engineers and application developers.

EXPECTED RESULTS

For now, we have tested our approach only on a simple custom-made application, with some attacks simulated manually, so we can present only the partial results outlined below. Given that some parts of applications are naturally more interesting for attackers (e.g. the processing of authentication or login forms), we can expect there an increased rate of anomalies relative to the total number of queries processed by that particular part of the application (see Figure 3).
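The per-module indicator itself is a simple ratio. A hypothetical helper (names and data are ours, not the paper's) that computes it from stored per-URL counters could look like this:

```python
def anomaly_ratios(counters: dict, prefixes: list) -> dict:
    """Ratio of anomaly alerts to total requests for each module prefix.
    `counters` maps a full URL path to a (requests, alerts) pair."""
    ratios = {}
    for prefix in prefixes:
        requests = sum(r for url, (r, _) in counters.items() if url.startswith(prefix))
        alerts = sum(a for url, (_, a) in counters.items() if url.startswith(prefix))
        ratios[prefix] = alerts / requests if requests else 0.0
    return ratios

# Hypothetical counters gathered during operation:
counters = {"/login/": (100, 5), "/article/1": (200, 2), "/article/2": (200, 2)}
ratios = anomaly_ratios(counters, ["/login/", "/article/"])
```

Modules whose ratio stands out from the rest of the application are the candidates reported to security engineers.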
Figure 3: Rate of anomalies to the total number of queries processed by particular modules. (Y axis: number of alerts / number of requests.)
If we intentionally omit these parts of the application from our results, we can focus more closely on the rate of anomalies in the other parts of the web application (see Figure 4, which covers the modules /article/, /forum/, /news/, /calendar/, /other/ and /old/).
Figure 4: Rate of anomalies to the total number of queries processed by particular modules (some results intentionally omitted).

This view of the results can point to those parts of the web application that should be additionally checked for security vulnerabilities (in this case: module/ or calendar/). An increased rate of anomalies in a particular module may indicate either the existence of security vulnerabilities, or an increased interest of attackers in this module.

CONCLUSIONS AND FUTURE WORK

In this paper we have presented the concept of an algorithm which aims to automatically identify vulnerable parts of web applications based on an increased number of anomalies during the processing of HTTP queries in production use. We used several well-known anomaly detection algorithms for the HTTP protocol and introduced the concept of automatically generating a model of a web application's structure by analysing requested URLs and storing them in a radix tree. We then used this model for a flexible assessment of the anomalies occurring in individual application modules.
Such an approach to the evaluation of anomalies could be helpful especially if the IDS reports so many anomalies and false-positive alarms that true and serious incidents are lost in the mass of false alarms. The output of our anomaly evaluation algorithm can direct security engineers and application developers to those modules of a web application which are "attractive" to attackers, or even point to security vulnerabilities in particular modules of the application.

REFERENCES

[1] Barroso, G.: Anomaly Detection of Web-based Attacks. PhD thesis, Universidade de Lisboa, (2010).
[2] Chrun, D., Cukier, M., Sneeringer, G.: On the Use of Security Metrics Based on Intrusion Prevention System Event Data: An Empirical Analysis. In: Proceedings of the 11th IEEE High Assurance Systems Engineering Symposium, IEEE, (2008), pp. 49–58.
[3] Ingham, K.: Anomaly Detection for HTTP Intrusion Detection: Algorithm Comparisons and the Effect of Generalization on Accuracy. PhD thesis, The University of New Mexico, Albuquerque, (2007).
[4] Kruegel, Ch., Vigna, G.: Anomaly detection of web-based attacks. In: Proceedings of the 10th ACM Conference on Computer and Communications Security, ACM Press, (2003), pp. 251–261.

ABOUT THE AUTHOR

Ing. Rastislav Szabó, Institute of Applied Informatics, Faculty of Informatics and Information Technologies STU in Bratislava, Phone: +421 949 183 126, E-mail:
[email protected] Assoc. Prof. Ladislav Hudec, PhD, Institute of Applied Informatics, Faculty of Informatics and Information Technologies STU in Bratislava, Phone: +421 (2) 60 291 243, E-mail:
[email protected]