Application of divide-and-conquer algorithm paradigm to improve the detection speed of high interaction client honeypots

Christian Seifert
Ian Welch
Peter Komisarczuk
Victoria University of Wellington, P. O. Box 600, Wellington 6140, New Zealand
Victoria University of Wellington, P. O. Box 600, Wellington 6140, New Zealand
Victoria University of Wellington, P. O. Box 600, Wellington 6140, New Zealand
[email protected]
[email protected]
[email protected]
ABSTRACT

We present the design and analysis of a new algorithm for high interaction client honeypots for finding malicious servers on a network. The algorithm uses the divide-and-conquer paradigm and results in a considerable performance gain over the existing sequential algorithm. The performance gain not only allows the client honeypot to inspect more servers with a given set of identical resources, but it also allows researchers to increase the classification delay to investigate false negatives incurred by the use of artificial time delays in current solutions.

Keywords: Security, Client Honeypots, Intrusion Detection, Performance Improvement, Algorithms
1. INTRODUCTION

Broadband connectivity and the great variety of services offered over the Internet have contributed to the success of the World Wide Web. In 2007, we are more connected than ever. The Internet has become an important source for information, entertainment, and a major means of communication at home and at work. Full-length movies, voice over IP, and large web-based department stores are some examples of services found on the Internet. With connectivity to the Internet, however, come certain security threats. Unauthorized access, denial of service, or complete control of machines by malicious users are all examples of security threats encountered on the Internet. Since the Internet is a global network, an attack can be delivered remotely from any location in the world and with great anonymity. Security professionals have responded to these threats and are offering a wide range of mitigation and prevention strategies. Machines connected to a network are usually equipped with at least antivirus software and a firewall [2]. These measures are usually very successful in fending off many of these attacks.
As attack vectors are barred by defenses, malicious users seek out unprotected paths to attack. One of these new major attack types is the client-side attack [1], which targets client applications. As the client accesses a malicious server, the server delivers the attack to the client as part of the server's response to a client request. Common examples of these attacks are web servers that attack web browsers. As the web browser requests content from a web server, the server returns a malicious page that attacks the browser. If successful, the web server could, for example, install arbitrary programs on the client machine. Traditional defenses, such as firewalls and antivirus software, are ineffective against these new threats.

To develop new defense mechanisms against these threats, it is necessary to study malicious servers. Prior to studying these servers, one needs to be able to find malicious servers on a network, which is the primary concern of our work. Fortunately, malicious servers need to be accessible to be successful, which enables us to search for them. Client honeypots are an emerging technology that can perform such searches. Once a malicious server has been identified, access to it can be denied, or law enforcement can be involved to aid in shutting down this bad element on the network. In addition, it is possible to study the way malicious servers operate, which helps security researchers to devise more targeted and effective defense mechanisms.

Client honeypots, however, are faced with some challenges of their own. First, they are faced with crawling the Internet with its millions of servers. Finding a malicious server might be similar to finding a needle in a haystack. Speed is crucial to identifying malicious servers quickly and protecting against them. Current client honeypot technology is slow: detection algorithms that are based on monitoring unauthorized state changes of a client honeypot are expensive, requiring time and consuming resources. According to the lead researcher Wang of Honeymonkey (a client honeypot developed at Microsoft Research), many malicious websites install the first malicious file within 30 seconds [8]. Faced with the quantities of servers on the Internet, a duration of 30 seconds to classify a server^1 as malicious makes comprehensive searches of the Internet infeasible.

^1 More precisely, we mean server response. For readability, we use the term server, because a malicious server response implies that the server is malicious. Client honeypots, however, always deal with server responses.
Our research is targeted at increasing the performance of client honeypots by applying a divide-and-conquer strategy in the way a client honeypot interacts with and classifies potentially malicious servers. The remainder of this paper is organized as follows. In section 2, we describe existing client honeypots and review the existing algorithms. In section 3, we introduce a new algorithm to client honeypot technology that applies a divide-and-conquer strategy in interacting with potentially malicious servers. In section 4, we compare our new algorithm to the existing algorithm that guarantees identical detection accuracy. We review related work in section 5, conclude in section 6 and provide an outlook on our future research in section 7.
2. CLIENT HONEYPOTS
Figure 1: Client Honeypot Architecture

Client honeypots find malicious servers on a network. They do so by generating a queue of server requests, issuing these requests to the servers one by one, and consuming the servers' responses, as shown in Figure 1. After a response is consumed, the client honeypot can perform an analysis that determines whether the server is malicious or benign. This classification is based on monitoring for unauthorized state changes occurring on the system after the client honeypot has interacted with a server. Client honeypots are dedicated machines, and since no other activity is occurring on them, unauthorized state changes such as new processes, newly installed files, etc., can be detected by the client honeypot. Once state changes are detected and the classification has been made, the machine needs to be reset into a clean state before it can interact with another server. Client honeypots that make use of this approach are also referred to as high interaction client honeypots. To find malicious web servers, for example, one would retrieve web pages with a browser. This causes various state changes to occur on the system, such as files being written to the cache. These are authorized events that we ignore when classifying whether the web page was malicious or benign. However, if a new executable file appears in the start-up folder, we classify the web page as malicious, because only an attack that originated from that web page could have caused the placement of this file in the start-up folder.

A few high interaction client honeypot implementations exist today: HoneyClient [7]; Honeymonkey [9]; the client honeypot of the University of Washington (UW) [3]; the CHP System [12]; and Capture-HPC [6]. These client honeypots focus on malicious web servers, which they interact with by driving a web browser on the dedicated honeypot system. HoneyClient detects successful attacks by monitoring changes to a list of files, directories, and system configuration after the HoneyClient has interacted with a server. Honeymonkey also detects intrusions by monitoring changes to a list of files and registry entries, but it goes a step further by adding monitoring of child processes to its repertoire to detect client-side attacks. Besides the ability to detect additional state changes, Honeymonkey is also event based, which allows it to detect state changes as they occur. The UW client honeypot and Capture-HPC use event triggers of file system activity, process creation, registry activity, and browser crashes to identify client-side attacks. The CHP system, which uses the CWSandbox [10] as the underlying mechanism to detect state changes, is able to monitor additional types of state changes on event triggers, such as access to virtual memory areas, protected storage areas, and initiated network connections.

The malicious server detection speed of these client honeypots is influenced by various factors. First, there is the underlying technology by which state changes are detected. HoneyClient, for example, utilizes snapshots that take a long time to create, whereas the other implementations utilize event triggers that permit detection of state changes as they occur. Independent of the implementation, there are additional factors that influence the total time t_t to inspect a set of n servers: the network bandwidth b and the average size of the request/response s, which together determine the time t_i to retrieve a server response; the overhead t_r of resetting the client honeypot into a clean state after a malicious server has been encountered, whose overall impact depends on the percentage p_m of malicious servers that exist on a network; and lastly the classification delay t_w. The classification delay is a purposefully introduced waiting period after a response of a server has been received before a classification is made. It is introduced because some time passes before many exploits trigger; this might be due to the nature of the exploit or be intentionally introduced by the attacker to avoid detection. In a setting in which web servers are inspected, the classification delay consumes most of the time when inspecting servers. In addition to the time to retrieve and analyze server responses comes the cost T_q of creating the queue of server requests to issue, which is usually constant. T_q, t_i, t_w, and t_r are the four factors that we take into account for determining the computational complexity of the various algorithms.

To reliably attribute a malicious server response to a request, a client honeypot instance needs to interact with servers sequentially and make a classification after each server has been visited. If server responses were retrieved in parallel, which would have to be supported by the available bandwidth to the client honeypot, one would be unable to reliably attribute the occurring state changes to one response or the other. Figure 2 contains the pseudo code of this algorithm. After a queue of server requests has been created, each server is visited.
After each visitation, the client honeypot waits before checking for state changes on the system to classify the server as malicious or benign. If the server was indeed malicious, the state of the system is reset. The computational time complexity is O(n), as the time to inspect servers increases by a constant C_{Seq} with each additional server, as shown in Figure 3. The total duration to inspect a set of n servers is given by the following equation:

t_t = T_q + C_{Seq}\, n = T_q + \big( p_m (t_i + t_w + t_r) + (1 - p_m)(t_i + t_w) \big)\, n    (1)

where t_i = s/b. This algorithm is implemented by the majority of high interaction client honeypot implementations, such as Honeymonkey, Capture-HPC, and the client honeypot of the University of Washington (UW).

Figure 2: Sequential Algorithm

Figure 3: Sequential Algorithm Duration Example

Alternatively, one can design systems to interact with servers in bulk. Figure 4 contains the pseudo code of this algorithm. The server requests are divided into buffers of size k, and state changes are only checked after one classification delay per buffer. Server requests contained in the buffers can be visited sequentially or in parallel, providing a performance gain. This approach, however, does not provide the same information as the previous algorithm. It can only determine whether a malicious server exists in a particular buffer, but not which particular server it is. The server responses contained in the buffer would have to be manually inspected to determine this information. However, if requests of a server are grouped in a buffer, the fine-grained information on which response actually caused the malicious classification might not be important. The advantage of this approach is a great reduction in the time needed to inspect a set of servers. The computational time complexity remains O(n), as the time to inspect servers increases by a constant C_{Bulk} with each additional server, as shown in Figure 5. However, the constant C_{Bulk} is smaller than the C_{Seq} of the sequential algorithm, providing the performance gain. The total duration to inspect a set of n servers is given by the following equation:

t_t = T_q + C_{Bulk}\, n = T_q + \Big( \big(1 - (1 - p_m)^k\big)(t_i k + t_w + t_r) + (1 - p_m)^k (t_i k + t_w) \Big)\, \frac{n}{k}    (2)

where t_i = s/b. This algorithm is used by the first open-source client honeypot, HoneyClient, and by CHP.

Figure 4: Bulk Algorithm

Figure 5: Bulk Algorithm Duration Example
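The pseudo code of Figures 2 and 4 is not reproduced in this text, so the following Python sketch illustrates the two interaction strategies described above. The functions visit(), check_state_changes(), and reset() are hypothetical stubs standing in for a honeypot's retrieve, classify, and revert operations; they are not the API of any of the implementations named above.

```python
# Illustrative sketch (not the paper's Figure 2/4 pseudo code) of the
# sequential and bulk interaction strategies. The stubs below are
# hypothetical placeholders for a real honeypot's operations.
import time

def visit(request):
    """Issue the request and consume the server response (stub)."""
    pass

def check_state_changes():
    """Return True if an unauthorized state change was observed (stub)."""
    return False

def reset():
    """Revert the honeypot to a clean state (stub)."""
    pass

def sequential_inspect(requests, t_w):
    """Visit servers one by one; classify after every visitation."""
    malicious = []
    for request in requests:
        visit(request)                 # retrieve the server response (time t_i)
        time.sleep(t_w)                # classification delay for slow-triggering exploits
        if check_state_changes():
            malicious.append(request)  # the specific malicious server is known
            reset()                    # reset overhead t_r
    return malicious

def bulk_inspect(requests, t_w, k):
    """Visit servers in buffers of size k; one classification delay per buffer.
    Identifies only which buffer contains a malicious server, not which server."""
    suspicious_buffers = []
    for i in range(0, len(requests), k):
        buffer = requests[i:i + k]
        for request in buffer:
            visit(request)
        time.sleep(t_w)
        if check_state_changes():
            suspicious_buffers.append(buffer)
            reset()
    return suspicious_buffers
```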
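The cost models of Equations (1) and (2) can also be written directly in code. The sketch below is a transcription of those two equations; the parameter values in the example at the bottom are arbitrary assumptions chosen only to illustrate the relative magnitudes, not measurements reported here.

```python
def t_sequential(n, T_q, t_i, t_w, t_r, p_m):
    """Equation (1): total time of the sequential algorithm for n servers."""
    C_seq = p_m * (t_i + t_w + t_r) + (1 - p_m) * (t_i + t_w)
    return T_q + C_seq * n

def t_bulk(n, T_q, t_i, t_w, t_r, p_m, k):
    """Equation (2): total time of the bulk algorithm with buffer size k."""
    p_buffer_hit = 1 - (1 - p_m) ** k   # probability a buffer holds at least one malicious server
    C_bulk_per_buffer = (p_buffer_hit * (t_i * k + t_w + t_r)
                         + (1 - p_buffer_hit) * (t_i * k + t_w))
    return T_q + C_bulk_per_buffer * (n / k)

if __name__ == "__main__":
    # Arbitrary illustrative parameters (assumptions, not values from the paper):
    # 1000 servers, 1 s retrieval time, 30 s classification delay,
    # 60 s reset overhead, 1% malicious servers, buffers of 32 requests.
    params = dict(n=1000, T_q=10.0, t_i=1.0, t_w=30.0, t_r=60.0, p_m=0.01)
    print("sequential:", t_sequential(**params))
    print("bulk, k=32:", t_bulk(**params, k=32))
```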
3. DIVIDE-AND-CONQUER ALGORITHM
In this section, we introduce a new algorithm for client honeypot interaction and classification of potentially malicious servers. Like the sequential algorithm, this algorithm is able to identify the specific malicious server response that causes the unauthorized state change, but it is designed to be faster. It is based on the divide-and-conquer algorithm paradigm. Its algorithmic complexity is much better than that of a linear approach: instead of O(n), it has a complexity of O(log(n)).

We apply the divide-and-conquer design paradigm to our problem as shown in Figure 6. Similarly to the bulk algorithm, we divide the set of server requests into buffers of size k and issue the server requests of a buffer. State changes are only checked after all the server responses have been retrieved. Should a malicious server response be detected, the buffer is divided into two halves and the algorithm is applied recursively to each half. If the buffer shrinks to a size of one and a malicious classification has been made, the algorithm has identified the server response that caused the malicious classification. The overhead of retrieving server responses repeatedly over the network can be overcome by caching the server responses on a proxy server.

Figure 6: Divide-and-conquer Algorithm

Dividing the set of servers into groups of size k will select malicious servers according to the binomial distribution. Depending on how many malicious servers have been selected, the number of operations to identify each malicious server using the algorithm above differs. With the selection of zero malicious servers, the algorithm operates once on the buffer and exits. If one malicious server appears in the group, the algorithm operates 2\log_2(k) + 1 times on the buffer to identify the malicious server. If two or more malicious servers m appear in the group, the algorithm traverses down the binary tree for each malicious server response, so there would be m(2\log_2(k) + 1) operations. However, some operations at the top of the binary tree are shared, as the branching occurs lower down the tree. The shared number of operations has to be subtracted, so the worst-case total number of operations to identify servers using this approach is:

op(k, m) =
  \begin{cases}
    1 & \text{if } m = 0, \\
    m(2\log_2(k) + 1) - (2m\log_2(m) - m + 1) & \text{if } m > 0.
  \end{cases}    (3)

The number of operations executed that do not identify a malicious server is:

op_b(k, m) =
  \begin{cases}
    1 & \text{if } m = 0, \\
    \big((\log_2(k) + 1) - (\log_2(m) + 1)\big)\, m & \text{if } m > 0.
  \end{cases}    (4)

and the number of operations executed that do identify malicious servers is:

op_m(k, m) = m(\log_2(k) - \log_2(m) + 2) - 1    (5)

where m > 0.

Figure 7 shows an example. A buffer of 32 servers is inspected by the client honeypot using the divide-and-conquer algorithm described above. First, servers 1-32 are inspected. A malicious server is detected in this buffer, so the buffer is divided in two and servers 1-16 and 17-32 are inspected. Since both halves indicate that malicious servers exist, the buffers are further divided and inspected (servers 1-8, 9-16, 17-24, and 25-32). In the buffers with servers 1-8 and 17-24, no malicious servers are identified and, as such, no further investigation is made into those buffers. In the remaining buffers, however, malicious servers are once again identified and the algorithm is applied recursively until malicious servers 10 and 26 are identified. The tree is traversed twice to identify each server, but branching took place only after the buffers with servers 1-16 and 17-32 had each identified a malicious server. A total of 19 operations are counted, of which 11 operations require the state of the client honeypot to be reset.

As described above, the number of operations to identify a malicious server in a buffer is determined by the actual number of malicious servers in the buffer. The number of malicious servers m that are selected within a buffer of size k, with a likelihood p_m of selecting malicious servers, is given by the binomial distribution:

f(m; k; p_m) = \binom{k}{m} p_m^m (1 - p_m)^{k-m}    (6)

The binomial distribution needs to be taken into account when calculating the total time to inspect servers:

t_t = T_q + C_{DAC}\, n = T_q + t_i n + \frac{n}{k} \left( f(0; k; p_m)\, t_w + \sum_{m=1}^{k} f(m; k; p_m) \big( op_b(k, m)\, t_w + op_m(k, m)\, (t_w + t_r) \big) \right)
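Figure 6's pseudo code is likewise not reproduced in this text. The following Python sketch is our illustration of the recursive divide-and-conquer identification step described above; is_malicious() is a hypothetical stand-in for visiting a (sub-)buffer (for example through a caching proxy) and checking for unauthorized state changes. The demo at the bottom replays the Figure 7 example, with servers 10 and 26 malicious in a buffer of 32.

```python
def dac_identify(buffer, is_malicious, counters):
    """Recursively identify the malicious requests within one buffer.

    is_malicious(request) is a hypothetical stand-in for visiting the
    (sub-)buffer and observing an unauthorized state change; counters
    records buffer inspections ("operations") and honeypot resets.
    """
    counters["operations"] += 1                  # one visit-and-check of this (sub-)buffer
    if not any(is_malicious(r) for r in buffer):
        return []                                # benign: no further investigation
    counters["resets"] += 1                      # state change observed: honeypot must be reset
    if len(buffer) == 1:
        return list(buffer)                      # the malicious response is identified
    mid = len(buffer) // 2                       # divide the buffer into two halves
    return (dac_identify(buffer[:mid], is_malicious, counters)
            + dac_identify(buffer[mid:], is_malicious, counters))

if __name__ == "__main__":
    # Figure 7 example: servers 1..32, with servers 10 and 26 malicious.
    counters = {"operations": 0, "resets": 0}
    found = dac_identify(list(range(1, 33)), lambda s: s in (10, 26), counters)
    print(found)     # [10, 26]
    print(counters)  # 19 operations, 11 of which require a reset
```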
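The operation counts and the binomial-weighted total time can also be checked numerically against the Figure 7 example. The sketch below transcribes Equations (3) through (6) and the total-time expression above, under the assumption that the sum runs over m = 1, ..., k, with the m = 0 case covered by the separate f(0; k; p_m) t_w term.

```python
import math

def op_total(k, m):
    """Equation (3): worst-case operations to identify m malicious servers in a buffer of size k."""
    if m == 0:
        return 1
    return m * (2 * math.log2(k) + 1) - (2 * m * math.log2(m) - m + 1)

def op_benign(k, m):
    """Equation (4): operations that do not identify a malicious server."""
    if m == 0:
        return 1
    return ((math.log2(k) + 1) - (math.log2(m) + 1)) * m

def op_malicious(k, m):
    """Equation (5): operations that do identify a malicious server (m > 0)."""
    return m * (math.log2(k) - math.log2(m) + 2) - 1

def binom(m, k, p_m):
    """Equation (6): probability that a buffer of size k contains m malicious servers."""
    return math.comb(k, m) * p_m ** m * (1 - p_m) ** (k - m)

def t_dac(n, T_q, t_i, t_w, t_r, p_m, k):
    """Total time of the divide-and-conquer algorithm (assumed sum over m = 1..k)."""
    expected_buffer_cost = binom(0, k, p_m) * t_w
    for m in range(1, k + 1):
        expected_buffer_cost += binom(m, k, p_m) * (
            op_benign(k, m) * t_w + op_malicious(k, m) * (t_w + t_r))
    return T_q + t_i * n + (n / k) * expected_buffer_cost

if __name__ == "__main__":
    # Figure 7 example (k = 32, m = 2): 19 total operations,
    # 8 that find nothing and 11 that detect a malicious response.
    print(op_total(32, 2), op_benign(32, 2), op_malicious(32, 2))
```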