Working Notes for the 2010 AAAI Workshop on Intelligent Security (SecArt)

Mark Boddy
Adventium Labs, Minneapolis, MN, USA
[email protected]

Robert Goldman
SIFT, LLC, Minneapolis, MN, USA
[email protected]

Stefan Edelkamp
TZI, Universität Bremen, Germany
[email protected]

12 July 2010

Contents

Preface  1

Coordinated Management of Large-Scale Networks Using Constraint Satisfaction
Martin Michalowski, Mark Boddy, and Todd Carpenter  2

Attack Planning in the Real World
Jorge Lucangeli Obes, Carlos Sarraute, and Gerardo Richarte  10

Toward the Semantic Interoperability of the Security Information and Event Management Lifecycle
Mary C. Parmelee  18

Unbelievable Agents for Large Scale Security Simulation
Jerry Lin, Jim Blythe, Skyler Clark, Nima Davarpanah, Roger Hughston, and Mike Zyda  20

Typed Linear Chain Conditional Random Fields And Their Application To Intrusion Detection
Carsten Elfers, Mirko Horstmann, and Karsten Sohr  26

A Dynamic Knowledge Base for Intrusion Detection
Mirko Horstmann, Carsten Elfers, and Karsten Sohr  31

Identifying Malware Behaviour in Statistical Network Data
Sascha Bastke, Mathias Deml, Sebastian Schmidt, and Norbert Pohlmann  39

Efficient Automated Generation of Attack Trees from Vulnerability Databases
Henk Birkholz, Stefan Edelkamp, Florian Junge, and Karsten Sohr  47

Efficient Text Discrimination
Gary Coen  56

Lexical Ambiguity and its Impact on Plan Recognition for Intrusion Detection
Christopher W. Geib  65

Preface

This is the second in what we hope will be a series of workshops exploring issues at the intersection of Computer Security and Artificial Intelligence. This is a fertile area for research, and has been attracting an increasing amount of interest in both communities. Prior to this workshop there was the ICAPS-09 workshop on Intelligent Security, as well as two workshops held in conjunction with the ACM Conference on Computer and Communications Security (CCS), and so organized primarily from within the Computer Security community.

AI and security is a large and growing area, both for research and for applications. Our increasingly networked world continues to provide new opportunities for security breaches that have severe consequences at the personal level (identity theft, and the resulting financial losses), for businesses (theft of intellectual property or business plans, or costly responses to the theft of customer data), and for governments. Computing and the internet have become crucial parts of the infrastructure of almost every significant commercial or governmental enterprise. Turning off the computers or disconnecting from the network has become tantamount to turning off the power.

The use of techniques drawn from AI becomes increasingly relevant as the scale of the problem increases: in the size and complexity of the networks being protected, in the variety of applications and services provided on that infrastructure, and in the sophistication of the attacks being made. Filtering the faint signals of an intrusion from a flood of data related to normal operations can be viewed as data mining. Learning methods can be applied to generate classifiers for this process, or to detect the presence of new means of attack. AI planning methods can be used to generate compact representations of possible attacks, which can then be used to deploy counter-measures. Plan and intent recognition are important areas of research as well, and are the focus of a growing number of researchers. The detection of anomalous operations or network traffic can be viewed as a component of many security functions, including both intrusion detection and plan recognition. Another recent topic is improving anomaly detection using the ubiquitous and increasingly powerful graphics processors in our computers. Because of the distributed nature of computer networks, they are susceptible to attacks that come from multiple directions, even when mounted by an individual in a single location. The issue of information fusion (combining indications drawn from separate data streams) is therefore an important tool as well.

With this workshop, we hope to encourage dialogue and collaboration, both between the AI and Security communities, and among researchers on both sides working on similar problems. Further, we hope that this will foster a continuing interaction, rather than a single (or even an occasional) gathering.


Coordinated Management of Large-Scale Networks Using Constraint Satisfaction

Martin Michalowski, Mark Boddy, and Todd Carpenter
Adventium Enterprises, LLC
111 Third Avenue South, Suite 100, Minneapolis, MN 55401 USA
{first.lastname}@adventiumenterprises.com

Abstract

In this paper, we describe a toolset for the configuration and management of large-scale networks. In particular, we focus on managing limited processing and communication resources for coordinated network cyber-defense applications. Our implementation encompasses the complete cycle, from initial network modeling and extraction of the relevant constraints, through translation into a formal constraint model, and finally the application of a Linear Programming solver to determine feasibility. This system has been demonstrated on realistic cyber-defense network models provided by domain experts, as well as on automatically-generated models used to explore the scaling behavior of the system.

Introduction

Due to the scale and diversity of modern network architectures and the increasing range of missions being supported by those networks, current means for designing, fielding, controlling, and maintaining network-wide cyber-defense applications do not scale to real-world applications. For example, the United States Air Force (USAF) must protect a large and diverse set of interconnected networks, spanning from unstable, low-bandwidth, weakly connected ad hoc tactical components (e.g., ground sensors), to more stable operational components (e.g., aircraft, ground control, ISR platforms), through to very large, stable strategic components (e.g., SATCOM). These networks support a broad range of missions with real-time, mission-critical, and life-critical requirements. Misconfigured defensive deployments that accidentally consume more resources than expected, make unplanned system modifications, or otherwise unfavorably affect mission performance can have severe consequences. On a small scale, coordinated cyber-defense operations can be managed using carefully constructed deployment rules and controls. For networks consisting of hundreds of thousands of nodes, manual oversight and configuration is infeasible.

In this paper, we describe a toolset for the configuration and management of large-scale networks. In particular, we focus on managing limited processing and communication resources for coordinated network cyber-defense applications. Our implementation encompasses the complete cycle, from initial network modeling and extraction of the relevant constraints, through translation into a formal constraint model, and finally the application of a Linear Programming solver to determine feasibility. This system has been demonstrated on realistic cyber-defense network models provided by domain experts, as well as on automatically-generated models used to explore the scaling behavior of the system.

Figure 1: Cyber Architecture Reasoner Inferring Network and Application Environments (CARINAE)

Given a network model and information regarding current and planned network operations in support of both missions and network defense, the Cyber Architecture Reasoner Inferring Network and Application Environments (CARINAE) system shown in Figure 1 provides Cyber-Defense System developers and operators with the means to detect and resolve resource conflicts in network cyber-defense operations. Focused on large, service-oriented net-centric enterprise systems, CARINAE leverages constraint-based reasoning and open source, industry standard tools to create a robust analytical architecture that can analyze the interactions between network configurations and mission requirements for large-scale defensive cyber applications. CARINAE provides bandwidth, memory, and computational performance guarantees for large networks supporting diverse operational missions and defensive applications. CARINAE has been employed to analyze networks consisting of up to 1,000,000 nodes.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


Motivating Example

Figure 2 shows a simple network model supporting a four-way video conference, where users on four different networks are using a video conference server a2. This video conferencing task places access, bandwidth, and quality of service (QoS) obligations on the satellite link S1, the firewalls protecting the individual networks, and the routers on the WAN. It also levies memory and processing obligations on the endpoint hosts, most notably a2. Whether this task is feasible depends on the current state of the network. If S1 or any one of the routers becomes inoperable, or if one of the firewalls or routers is configured to deny access on the ports used for the video conference traffic, one or more of the participants will be unavailable. Similarly, it is possible that the satellite link, or some combination of links on the WAN, will be bandwidth-limited to the point where they cannot support the required throughput.
The issue of bandwidth limitations becomes more of a concern when we introduce a second task. In this task, users on networks B, C, and D collaborate in performing a data fusion task, which levies its own connectivity and bandwidth obligations on the network infrastructure. If both this task and the previous video conference task are performed simultaneously, there may be infeasibility resulting from bandwidth constraints on one or more of the network links.

Figure 2: Video Conference Task

Constructing the Feasibility Model

Determining mission feasibility starts with the extraction of a mathematical statement of the feasibility problem from the mission and network models. More specifically, this process starts with a set of tasks to be supported and the network configuration that will be in place at the time those tasks are active. Each one of these tasks has an associated set of resource requirements, specifying the need for system resources for the duration of that task. In the video conferencing example shown in Figure 2, these resources might consist of bandwidth requirements from each of the remote locations to the video server, as well as processing time and memory requirements on both the server and the remote nodes. These bandwidth requirements may specify a particular route through the network (for example, an encrypted channel), a set of routes, or make no specification at all beyond the need for a certain level of throughput, however it is to be satisfied, in which case CARINAE may allocate that bandwidth across multiple paths through the network. For example, the bandwidth requirement associated with b2 can be realized by distributing the bandwidth across the direct route R3 to R1, and the less direct route through R2.

Figure 3: Example Mission Tasks and Network Configuration Timeline

As a result of this extraction process, we have a set of resource requirements, or demands, derived from the tasks associated with one or more planned missions, and a set of resource availabilities, or capacities, derived from the current network configuration. The tasks associated with each mission must be scheduled: each will have a specified starting and ending time. There may also be planned network configuration changes, which also take place at specified times. As shown in Figure 3, these times partition the timeline into periods over which both the set of tasks and the current configuration are unchanging. We refer to this as a multi-period model. Multi-period models involving continuous allocations of bandwidth, CPU, and memory resources can be represented by sets of linear inequalities, and solved using linear programming (LP) solvers, which are capable of handling problem sizes involving hundreds of thousands of variables and millions of constraints.
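As a small, illustrative sketch of the period construction (ours, not part of the CARINAE toolset; the task names and times below are invented), the following Python fragment collects task start/end times and configuration-change times and produces the periods over which both the task set and the configuration are constant; each period would then get its own set of linear inequalities.

# Sketch: partition the timeline into periods over which the set of
# active tasks and the network configuration do not change.
from typing import Dict, List, Tuple

def build_periods(tasks: Dict[str, Tuple[float, float]],
                  config_changes: List[float]) -> List[Tuple[float, float]]:
    # tasks maps a task name to its (start, end) times.
    events = set(config_changes)
    for start, end in tasks.values():
        events.update((start, end))
    times = sorted(events)
    return list(zip(times, times[1:]))  # consecutive event times bound each period

def active_tasks(tasks, period):
    start, end = period
    # A task is active in a period if it spans the whole interval.
    return [name for name, (s, e) in tasks.items() if s <= start and e >= end]

tasks = {"video_conference": (0.0, 60.0), "data_fusion": (30.0, 90.0)}
for period in build_periods(tasks, config_changes=[45.0]):
    print(period, active_tasks(tasks, period))
# Each printed period would be checked for feasibility with its own LP.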

Implementation

The CARINAE implementation has several components, as well as a set of well-defined interfaces between them. The task and network models are maintained in the Architecture Analysis and Design Language (AADL), using the Eclipse-based Open-Source AADL Tool Environment (OSATE). From these models, an Eclipse plug-in is used to extract the resource demands associated with tasks, and the resource capacities as determined by the projected network configuration. The output of this plug-in is formatted in the MINIZINC (Nethercote et al. 2007) constraint modeling language. The resulting constraint problem can then be submitted to the accompanying MINIZINC solver or to other CSP solvers that accept the MINIZINC language.



However, both the MINIZINC solver and other solvers capable of accepting or translating MINIZINC input, such as GeCode (www.gecode.org), are problematic for this application, based on their expressive limitations (for example, regarding the use of floating-point numbers) or their failure to scale to very large problem instances. Consequently, our use of CSP solvers accepting MINIZINC directly was limited to small test cases, used to debug the constraint extraction and modeling process.
The scalability desired for CARINAE was a primary motivation for starting with a multi-period model, simple forms of which can be represented as linear programs, and more complex versions of which can be addressed either through repeated solutions of an LP, or using Mixed-Integer Linear Programming. Therefore we translate the MINIZINC representation of the problem into MPS, an input language accepted by a broad array of linear programming systems (see lpsolve.sourceforge.net/5.5/mps-format.htm). This model is then solved using CLP, an open-source C++-based linear programming solver (projects.coin-or.org/Clp). CLP has proved remarkably efficient and robust, scaling effectively to network models involving over one million nodes.
This architecture has several advantages. First, it isolates the extraction of the feasibility model from the network and mission models, separating that process completely from the choice of solver and solver input format. Second, using MINIZINC as an intermediate representation provides considerable expressive flexibility. In addition to the LP models currently being employed, MINIZINC can represent finite-domain constraints, with built-in constructs supporting high-level specification of constraints. Thus, we can use the same intermediate representation while varying either or both of the network and mission models, or the solver and solver input language. This provides an ideal basis for exploring alternative modeling formalisms, languages, or techniques.

Constraint Model

Once extracted from the network and mission models, the feasibility model is represented as a set of constraints among a set of variables, corresponding to decisions about network configuration, mission schedules, and possibly choices among different means of achieving a particular mission. Additional variables represent the network's state, comprising its current (fixed) physical and logical structure, the set of active or scheduled missions, and the resulting network demands. For example, modeling the video conference task requires representing all of the hardware and bandwidth requirements specified, then solving to find an assignment of those demands to nodes or links that have available the required capacity. In the CARINAE model, we represent computing nodes as processing elements, which may be organized hierarchically. Communication between nodes is via a set of links among ports. Task resource requirements are represented as a set of demands on the available resources.

Definition of a CSP Model

We start by defining an instance of a CSP problem C as a tuple

C = ⟨E, P, L, D⟩

comprising
• a set E of processing elements,
• a set P of ports,
• a set L of links, and
• a set D of demands.

For the work reported in this paper, we employ a static CSP model, which describes the state of the system at a single instant (or over a single period) of time. Consequently all resource limits and demands are expressed in time-free terms. CPU demand is expressed as a MIPS requirement. Communications demand is expressed as a requirement for a specified data-rate. Memory demand is expressed as an amount of memory that must be allocated, out of a finite store.

Processing Elements
A processing element e ∈ E has the following attributes (R+ denotes the set of non-negative real numbers):
• CPU capacity (in MIPS): mips(e) ↦ R+
• memory capacity (in MB): mem(e) ↦ R+
• a set of sub-elements: sub(e) ↦ E
• CPU-aggregate-utilization: aggmips(e) ↦ R+
• memory-aggregate-utilization: aggmem(e) ↦ R+

Ports
Ports collect communication flows into and out of a particular processing element. A port p ∈ P has these attributes:
• an associated processing element: pe(p) ↦ E
• throughput capacity, defined in Mbps: Mbps(p) ↦ R+

Links
Communication connectivity is provided by links. Links are directional, with the following attributes for a link l ∈ L:
• a source port: src(l) ↦ P
• a destination port: dest(l) ↦ P
• throughput capacity, defined in Mbps: Mbps(l) ↦ R+
We've chosen to use directed rather than undirected edges, and do not explicitly model hyper-edges (connections among larger sets of nodes than pairs).

Demands
We have three resources, and so three kinds of demand. CPU and memory demands d ∈ D have the following attributes:
• a processing element with which the demand is associated: orig(d) ↦ E
• a demand level: demand(d) ↦ R+
A communication demand d ∈ D has
• a source port: src(d) ↦ P
• a destination port: dest(d) ↦ P
• a demand level: demand(d) ↦ R+
• a set of allowed links: allowed(d) ↦ 2^L (2^L denotes the power set of L)
To differentiate among the different types of demands, we define subsets of D: Dcpu, Dmem, and Dcomm, each comprising all of the CPU, memory, and communication demands, respectively.

Processing Element Constraints

Processing elements are defined in a part/whole hierarchy of elements and sub-elements. For processing elements ei, ej, where i ≠ j:
• ei ∈ sub(ej) ⇒ ej ∉ sub(ei)
• ei ∈ sub(ej) ⇒ ei ∉ sub(ek), ∀k ≠ j
In AADL terminology, we may have a node, which has multiple CPUs, which support multiple virtual machines (VMs), which support multiple platforms. The term processing element can be applied to either hardware or software.
There are two possible views of the processing element hierarchy. In the aggregate model, processing elements at any level in the hierarchy impose constraints corresponding to usage attributes, interpreted as resource limits to be compared to their aggregate-utilization attributes. For processing elements having no sub-elements, the aggregate-utilization attributes are set directly (see the Demand Constraints section). For processing elements with sub-elements, the aggregate-utilization attributes are computed from the aggregate-utilization attributes of their sub-elements:

∀e ∈ E, aggmips(e) ≤ mips(e)   (1)

∀e ∈ E : sub(e) ≠ ∅, aggmips(e) = Σ_{e′ ∈ sub(e)} aggmips(e′)   (2)

Similarly for memory:

∀e ∈ E, aggmem(e) ≤ mem(e)   (3)

∀e ∈ E : sub(e) ≠ ∅, aggmem(e) = Σ_{e′ ∈ sub(e)} aggmem(e′)   (4)

This is a good model for aggregated global resources such as power or communication bandwidth, where at any level of the hierarchy the sum of the budgets for the next level down may be more than the capacity limit imposed (we assume not everyone will draw their maximum budget concurrently).
In the budget model, processing element capacities impose constraints both down the hierarchy (resource limits) and up the hierarchy (resource demands). In this case, we replace constraints 2 and 4 above with

∀e ∈ E : sub(e) ≠ ∅, aggmips(e) = Σ_{e′ ∈ sub(e)} mips(e′)   (5)

∀e ∈ E : sub(e) ≠ ∅, aggmem(e) = Σ_{e′ ∈ sub(e)} mem(e′)   (6)

This is more appropriate for something like weight, or power budgets for sub-assemblies that don't get switched on and off. There is no requirement that either the aggregate or budget models be uniformly applied in a given processing element hierarchy; both may be needed, in different places.

Demand Constraints

We model two different types of demand constraints.

Memory and CPU Demands
Memory and CPU demands are imposed by constraining the corresponding aggregate utilization for orig(d). For memory:

∀d ∈ Dmem, aggmem(orig(d)) = demand(d)   (7)

and for CPU:

∀d ∈ Dcpu, aggmips(orig(d)) = demand(d)   (8)

Additionally, we constrain sub(orig(d)) to equal ∅, restricting the imposition of memory and CPU demands to leaf elements in the processing element hierarchy. This does not restrict our ability to model demands at non-leaf nodes, because those demands can be assigned to a dummy leaf node which is then added as a sub-element of the appropriate processing element.

Communication Demands
Communication demands are imposed by adding the required throughput to the specified ports. See constraints 12 and 13, below.

Link Constraints

We can view the set of ports P and any set of links L ⊆ L in a given CSP model as a directed graph G = ⟨P, L⟩, with vertices P and edges L, where each edge l ∈ L is labeled with Mbps(l). Then G fits the definition of a flow network (see en.wikipedia.org/wiki/Flow_network). Because different communication demands are represented as different flows, with distinct sources and sinks, we need to represent this as a multi-commodity flow problem, with each demand d corresponding to a different commodity. Consequently, we define network flow on a specific link with respect to a given set of links L and demand d:

flow_L(l, d) ↦ R+

Then we add the following constraint:

∀l ∈ L, flow_L(l) = Σ_{d ∈ Dcomm} flow_L(l, d)   (9)

The total flow on a given edge (link) must be less than the capacity:

∀l ∈ L, flow_L(l) ≤ Mbps(l)   (10)

Demands can only flow on allowed links:

∀l ∈ L, ∀d ∈ Dcomm : l ∉ allowed(d), flow_L(l, d) = 0   (11)

We define flow at a vertex (i.e., port) p, which also enforces conservation of flow in the network:

∀d ∈ Dcomm, ∀p ∈ P, flow_L(p, d) = Σ_{l ∈ L | src(l) = p} flow_L(l, d) + Σ_{d ∈ Dcomm | dest(d) = p} demand(d)   (12)

∀d ∈ Dcomm, ∀p ∈ P, flow_L(p, d) = Σ_{l ∈ L | dest(l) = p} flow_L(l, d) + Σ_{d ∈ Dcomm | src(d) = p} demand(d)   (13)


These constraints add the communication demand, as well. Note that according to this definition, there is no requirement that a given communication flow use a single path from one port to another. The throughput required may be spread over any or all of the possible paths between the two points. Finally, there are constraints on flows through ports, due to the capacities we allow to be specified on ports:

∀p ∈ P, flow_L(p) = Σ_{d ∈ Dcomm} flow_L(p, d)   (14)

flow_L(p) ≤ Mbps(p)   (15)
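To make the link and port constraints above more concrete, here is a rough sketch (ours, not the CARINAE implementation, which goes through MINIZINC, MPS, and CLP; it assumes the PuLP Python library and invented Link/Demand records) of how constraints (9)-(11) and (14)-(15) could be emitted mechanically from the model. The conservation constraints (12) and (13) would be added in the same style.

# Sketch: per-link, per-demand flow variables with capacity, allowed-link,
# and port-throughput constraints, in the spirit of constraints (9)-(15).
import pulp
from collections import namedtuple

Link = namedtuple("Link", "name src dest mbps")        # directed link
Demand = namedtuple("Demand", "name src dest rate allowed")

links = [Link("l1", "p1", "p2", 100.0), Link("l2", "p2", "p3", 100.0)]
demands = [Demand("d1", "p1", "p3", 40.0, {"l1", "l2"})]
port_mbps = {"p1": 100.0, "p2": 200.0, "p3": 100.0}

prob = pulp.LpProblem("link_and_port_constraints", pulp.LpMinimize)
flow = {(l.name, d.name): pulp.LpVariable(f"f_{l.name}_{d.name}", lowBound=0)
        for l in links for d in demands}

for l in links:
    # (9)/(10): total flow on a link is bounded by its capacity.
    prob += pulp.lpSum(flow[l.name, d.name] for d in demands) <= l.mbps
    for d in demands:
        if l.name not in d.allowed:
            prob += flow[l.name, d.name] == 0    # (11): allowed links only

for p, cap in port_mbps.items():
    out_flow = pulp.lpSum(flow[l.name, d.name]
                          for l in links if l.src == p for d in demands)
    in_flow = pulp.lpSum(flow[l.name, d.name]
                         for l in links if l.dest == p for d in demands)
    # (14)/(15), simplified: traffic entering or leaving a port fits its capacity.
    prob += out_flow <= cap
    prob += in_flow <= cap

prob += pulp.lpSum(flow.values())   # placeholder objective; feasibility is what matters
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[prob.status])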

Experimental Evaluation

For CARINAE, we demonstrated two somewhat separate properties. First, we demonstrated the end-to-end process, starting with AADL representations of an assortment of network management problems related to cyber-defense. These problems were designed to reflect properties of real-world networks, and were either generated by or validated by domain experts. The largest of these networks consisted of up to a million nodes, though most of our testing focused on considerably smaller networks.
For the purposes of testing for scalability, we used an automated generator, so as to be able to more systematically control different properties of the networks being evaluated. The instances produced by our instance generator take the shape of a balanced tree, where the user can control the depth and width of the tree. Additionally, the user can adjust the capacities of ports and the procedure used to generate the communication demands. Node connections in the tree are modeled using two communication links in opposite directions. Consequently, some path exists from any network node to any other node.
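A toy version of such a generator might look like the sketch below (ours, not the authors' generator; the demand-placement and capacity rules are simplified). It builds a balanced tree of a given depth and branching factor, with a pair of opposite-direction links for every parent/child connection.

# Sketch: generate a balanced-tree network instance with paired
# directional links and random leaf-to-leaf communication demands.
import itertools
import random

def make_tree(depth, branching):
    nodes, links = ["n0"], []
    frontier, counter = ["n0"], itertools.count(1)
    for _ in range(depth):
        next_frontier = []
        for parent in frontier:
            for _ in range(branching):
                child = f"n{next(counter)}"
                nodes.append(child)
                next_frontier.append(child)
                links.append((parent, child))   # two communication links
                links.append((child, parent))   # in opposite directions
        frontier = next_frontier
    return nodes, links, frontier               # frontier now holds the leaves

def make_demands(leaves, count, rate=1.0):
    # Simplified demand placement: random leaf-to-leaf messages.
    return [(random.choice(leaves), random.choice(leaves), rate)
            for _ in range(count)]

nodes, links, leaves = make_tree(depth=3, branching=3)
demands = make_demands(leaves, count=20)
print(len(nodes), "nodes,", len(links), "links,", len(demands), "demands")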

Linear Programming Solver Performance

In the graphs in this section, the x-axis shows the number of nodes in the problem instance, and the y-axis is the time required to generate a feasible solution.

Figure 4: Branching Factor = 3, functional demands

Figure 4 shows the results obtained when using a constant branching factor of three for each problem instance and varying the depth of the tree from two through seven. In all cases, demands were automatically generated, so that each internal node in the network has at least one demand passing through it from one child node to another child node, in addition to any demands that require communication up or down the tree through that node. The number of demands in the network thus grows slightly faster than linearly. These results demonstrate how the solving time increases as the size of the network grows by increasing the depth of the tree. In this configuration, the system is permitted to look for communication paths along any link in the subtree containing both the source and destination node. For nodes whose only common ancestor is the root, the demands can potentially use any link in the entire network.

Figure 5: Branching Factor = 3, 20 demands

Figure 5 shows the growth in solution time with increasing network size for a fixed number of demands. Again, the branching factor is constant at three and the depth of the tree varies. A comparison to Figure 4 demonstrates that increasing the number of demands as a function of network size makes the problem significantly larger and more difficult.

Figure 6: Branching Factor = 3, functional demands, path only nodes

In Figure 6, the instances have the same structure and size as those in Figures 4 and 5, but now the system will only consider paths using links either up or down the tree, imposing a unique path from one node to another. The performance improvement for this more restrictive model is minor, supporting the argument that the more flexible routing scheme does not lead to a significant extra cost for pairwise demands.



Figure 7: Depth = 3, functional demands

Figure 7 shows how solution time grows for a fixed-depth network with an increasing branching factor. In these sets of experiments we fixed the tree depth to three and varied the branching factor from three through 100. Again, the number of demands was a function of the number of nodes, growing slightly faster than linearly, and we used the same characterization of allowed paths as in the models in Figure 4. When compared to the results presented in Figure 4, we see how longer paths (deeper trees) versus more paths (wider trees) through a given router affect performance in this model. The solve time grows slightly faster in instances with more paths (wider trees).

To test the growth rate of the solution time as a function of the number of demands, we generated instances with a fixed depth of three and a branching factor of 15, varying the number of demands from 10 through 4000. Figure 8(a) shows very close to linear growth in solution time with respect to the number of communication demands (messages) for a fixed-size network. Comparing Figure 8(a) to Figure 8(b) shows that the solution time is not significantly affected by restricting demands to be between leaf nodes with a common immediate parent, despite the fact that this strongly restricts the possible paths between the two nodes.

Figure 8: Varying Demands in a Fixed Network
(a) Depth = 3, Branching Factor = 15
(b) Depth = 3, Branching Factor = 15, localized demands

Figure 9(a) and Figure 9(b) show performance information for a broadcast message. This is specifically NOT multicast: each network host is sent its own message. The performance in Figure 9(b) is significantly improved over Figure 9(a), by virtue of the routing guidance provided by restricting possible paths to network links up and down the tree. This is in dramatic contrast to the pairwise communication modeled in Figure 4 and Figure 6, where this restriction had only a minimal effect. Furthermore, this restriction puts the performance for a broadcast message in the same general area as for pairwise messages, with or without the path restriction.

Figure 9: Broadcast Demands with a Constant Branching Factor of 3
(a) Branching Factor = 3, broadcast demands
(b) Branching Factor = 3, broadcast demands, path only nodes

Related Work

The solution techniques we employ are drawn from the very large body of work on methods for Constraint Satisfaction and Constraint Optimization for combinatorial, continuous, and hybrid problems. See, for example, (Dechter 1989; Nadel 1990; Boddy and Johnson 2002; Michalowski and Knoblock 2005; Hentenryck and Michel 2006). The work reported here uses purely continuous models, but in using MINIZINC, we have deliberately opted for a significantly more expressive constraint language. As discussed in the next section, we plan to extend CARINAE to handle more general problems, for which this expressive power may be needed. There is as well a considerable body of work on building network models, including the use of tools such as NMAP to construct these models largely or completely automatically. Deriving a CSP model of the form we employ here directly from a network model is not something we have previously seen discussed.
Tools for large-scale network configuration management are not widely available. The need for these tools shows clearly in the increase in propagation speed of network worms, such as Code Red and Sapphire/Slammer, early in this decade. Code Red I infected 359,000 Internet hosts in July 2001 in less than 14 hours (Moore, Shannon, and Brown 2002). Its infection rate peaked at 2,000 hosts per minute, and the number of infected hosts doubled approximately every 37 minutes. The following year, the Sapphire/Slammer worm was two orders of magnitude faster, doubling every 8.5 seconds, and achieved its maximum infection rate of 55 million scans per second after only three minutes (Moore et al. 2003). Effective defense of large networks from attacks of this nature requires a coordinated response.


Discussion and Future Work

We have demonstrated that a multi-period feasibility model can be solved efficiently for very large instances. There are several directions in which we plan to extend this work. The first one is to support successively more flexible feasibility problems. For example, a simple scheduling problem can be supported by a minor extension to the multi-period model, in which missions are added one at a time, generating a small number of additional periods requiring solution. Scheduling problems that involve moving tasks around, or deciding whether or not to schedule missions, are potentially significantly more difficult to solve, because they contain a mix of both discrete and continuous variables.
In the work reported here, we have modeled mission requirements as independent demands on processing and communication resources. This puts the onus on the user to track two things. First, the user must coordinate demands (for example, the need to allocate both CPU and communication resources at the same time for the same task). Second, the user may need to specify sequential phases of the same mission. Business process modeling formalisms such as YAWL offer a way to capture a mission's tasks and the associated demands. The mission can then be analyzed, monitored or automated, with multiple demands thus being derived from a single representation (ter Hofstede et al. 2009).
Extending this model to consider latency as well as throughput is another direction for further work. For example, an important difference between Code Red and Sapphire/Slammer is that Code Red, which relied on a TCP connection to propagate, was limited by network latency, while Sapphire/Slammer, which consisted of only a single UDP packet, was limited by network bandwidth (Moore et al. 2003). In fact, Sapphire/Slammer's scanning quickly interfered with its own growth.
Finally, it is clear that defensive approaches will need to be more efficient than the malware that they combat. Theoretical analysis suggests that optimizations, such as deploying a list of target addresses and partitioning that list as deployment progresses, or implementing well-known servers to distribute scan lists upon request, can improve performance (Staniford, Paxson, and Weaver 2002), but the fastest possible deployments will rely on pre-determined "spread trees" to defend only known vulnerable hosts (Staniford et al. 2004). For example, the time to defensively inoculate N hosts with a K-way spread tree is projected as O(log_K N). Further scaling experiments modeling these more structured approaches to cyber-defense can provide useful information regarding the best areas for further work on either expressiveness or scalability for CARINAE.
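As a rough sanity check of that projection (our arithmetic, not from the paper): with a K-way spread tree, each propagation round multiplies the number of defended hosts by roughly a factor of K, so reaching N hosts takes on the order of log_K N rounds; for K = 10 and N = 10^6, that is ⌈log_10 10^6⌉ = 6 rounds.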

Acknowledgments


This material is based upon work supported by the Air Force Research Laboratory under Contract Number FA8750-08-C.

References

Boddy, M., and Johnson, D. 2002. A new method for the solution of large systems of continuous constraints. In Notes of the 1st International Workshop on Global Constrained Optimization and Constraint Satisfaction.
Dechter, R. 1989. Constraint Processing. The MIT Press.
Hentenryck, P. V., and Michel, L. 2006. Nondeterministic control for hybrid search. Constraints 11(4):353–373.
Michalowski, M., and Knoblock, C. A. 2005. A Constraint Satisfaction Approach to Geospatial Reasoning. In Proceedings of AAAI-05, 423–429.
Moore, D.; Paxon, V.; Savage, S.; Shannon, C.; Staniford, S.; and Weaver, N. 2003. The spread of the sapphire/slammer worm.
Moore, D.; Shannon, C.; and Brown, J. 2002. Code-red: a case study on the spread and victims of an internet worm. In Proc. of the 2nd ACM SIGCOMM Workshop on Internet measurement, 273–284.



Nadel, B. A. 1990. Representation selection for constraint satisfaction: A case study using n-queens. IEEE Expert: Intelligent Systems and Their Applications 5(3):16–23.
Nethercote, N.; Stuckey, P. J.; Becket, R.; Brand, S.; Duck, G. J.; and Tack, G. 2007. MiniZinc: Towards a standard CP modelling language. In Proceedings of the 13th International Conference on Principles and Practice of Constraint Programming (CP2007), 529–543.
Staniford, S.; Moore, D.; Paxson, V.; and Weaver, N. 2004. The top speed of flash worms. In Proc. of the 2004 ACM workshop on Rapid malcode, 33–42.
Staniford, S.; Paxson, V.; and Weaver, N. 2002. How to own the internet in your spare time. In Proc. of the 11th USENIX Security Symposium, 149–167.
ter Hofstede, A. H. M.; van der Aalst, W. M. P.; Adams, M.; and Russell, N., eds. 2009. Modern Business Process Automation: YAWL and its Support Environment. Springer.


Attack Planning in the Real World

Jorge Lucangeli Obes
Core Security Technologies
[email protected]

Carlos Sarraute and Gerardo Richarte
Core Security Technologies and Ph.D. program, ITBA (Instituto Tecnologico Buenos Aires)
{carlos, gera}@coresecurity.com

Abstract

Assessing network security is a complex and difficult task. Attack graphs have been proposed as a tool to help network administrators understand the potential weaknesses of their networks. However, one problem has not yet been addressed by previous work on this subject; namely, how to actually execute and validate the attack paths resulting from the analysis of the attack graph. In this paper we present a complete PDDL representation of an attack model, and an implementation that integrates a planner into a penetration testing tool. This allows us to automatically generate attack paths for penetration testing scenarios, and to validate these attacks by executing the corresponding actions, including exploits, against the real target network. We present an algorithm for transforming the information present in the penetration testing tool to the planning domain, and we show how the scalability issues of attack graphs can be solved using current planners. We include an analysis of the performance of our solution, showing how our model scales to medium-sized networks and the number of actions available in current penetration testing tools.

Copyright © 2010, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1. Introduction

The last 10 years have witnessed the development of a new kind of information security tool: the penetration testing framework. These tools facilitate the work of network penetration testers, and make the assessment of network security more accessible to non-experts. The main tools available are the open source project Metasploit, and the commercial products Immunity Canvas and Core Impact (Burns et al. 2007). The main difference between these tools and network security scanners such as Nessus or Retina is that pentesting frameworks have the ability to launch real exploits for vulnerabilities, helping to expose risk by conducting an attack in the same way a real external attacker would (Arce and McGraw 2004).
Penetration tests involve successive phases of information gathering, where the pentesting tool helps the user gather information about the network under attack (available hosts, their operating systems and open ports, and the services running in them); and exploiting, where the user actively tries to compromise specific hosts in the network. When an exploit launched against a vulnerable machine is successful, the machine becomes compromised and can be used to perform further information gathering, or to launch subsequent attacks. This shift in the source of the attacker's actions is called pivoting. Newly compromised machines can serve as the source for posterior information gathering, and this new information might reveal previously unknown vulnerabilities, so the phases of information gathering and exploiting usually succeed one another.
As pentesting tools have evolved and become more complex, covering new attack vectors and shipping increasing numbers of exploits and information gathering modules, successfully controlling the pentesting framework has become an important problem. A computer-generated plan for an attack would isolate the user from the complexity of selecting suitable exploits for the hosts in the target network. In addition, a suitable model to represent these attacks would help to systematize the knowledge gained during manual penetration tests performed by expert users, making pentesting frameworks more accessible to non-experts. Finally, the possibility of incorporating the attack planning phase into the pentesting framework would allow for optimizations based on exploit running time, reliability, or impact on Intrusion Detection Systems.
Our work on the attack planning problem applied to pentesting began in 2003 with the construction of a conceptual model of an attack, distinguishing assets, actions and goals (Futoransky et al. 2003; Richarte 2003; Arce and Richarte 2003). In this attack model, the assets represent both information and the modifications in the network that an attacker may need to obtain during an intrusion, whereas the actions are the basic steps of an attack, such as running a particular exploit against a target host. This model was designed to be realistic from an attacker's point of view, and contemplates the fact that the attacker has an initial incomplete knowledge of the network, and therefore information gathering should be considered as part of the attack.
Since the actions have requirements (preconditions) and results, given a goal, a graph of the actions/assets that lead to this goal can be constructed. This graph is related to the attack graphs studied in (Phillips and Swiler 1998; Jajodia, Noel, and OBerry 2005; Noel et al. 2009) and many others, where nodes identify a stage of the attack and edges represent individual steps in the attack. In (Lippmann and Ingols 2005) the authors reviewed past papers on attack graphs, and observed that the "first major limitation of these studies is that most attack graph algorithms have only been able to generate attack graphs on small networks with fewer than 20 hosts". In medium-sized networks, building complete attack graphs quickly becomes unfeasible (their size increases exponentially with the number of machines and available actions).
To deal with the attack planning problem, a proposed approach (Sarraute and Weil 2008; Sarraute 2009) is to translate the model into a PDDL representation and use classical planning algorithms to find attack paths. Planning algorithms manage to find paths in the attack graph without constructing it completely, thus helping to avoid the combinatorial explosion (Blum and Furst 1997). A similar approach was presented at SecArt'09 (Ghosh and Ghosh 2009), but the authors' model is less expressive than the one used in this work, as their objective was to use the attack paths to build a minimal attack graph, and not to carry out these attacks against real networks.
In this paper we present an implementation of these ideas. We have developed modules that integrate a pentesting framework with an external planner, and execute the resulting plans back in the pentesting framework, against a real network. We believe our implementation proves the feasibility of automating the attack phases of a penetration test, and allows us to think about moving up to automate the whole process. We show how our model, and its PDDL representation, scale to hundreds of target nodes and available exploits, numbers that can be found when assessing medium-sized networks with current pentesting frameworks.
The paper is structured as follows: in Section 2 we present a high-level description of our solution, describing the steps needed to integrate a planner with a penetration testing framework. Section 3 describes our PDDL representation in detail, explaining how the "real world" view that we take forces a particular implementation of the attack planning problem in PDDL. Section 4 presents the results of our scalability testing, showing how our model manages medium-sized networks using current planners. Section 5 reviews related work, and Section 6 closes the paper and outlines future work.

2. Architecture of our Solution

In this section we describe the components of our solution, and how they fit together to automate an attack. Figure 1 shows the relationship between these different components.
The penetration testing framework is a tool that allows the user/attacker to execute exploits and other pre/post exploitation modules against the target network. Our implementation is based on Core Impact (as mentioned above, Metasploit is an open-source alternative). The planner is a tool that takes as input the description of a domain and a scenario, in PDDL (refer to (Fox and Long 2003) for a description of the PDDL planning language). The domain contains the definition of the available actions in the model, and the scenario contains the definition of the objects (networks, hosts, and their characteristics), and the goal which has to be solved.
The attack workspace contains the information about the current attack or penetration test: in particular, the discovered networks and hosts, information about their operating systems, open/closed ports, running services, and compromised machines. In the current version of our solution we assume that the workspace has this network information available, and that no network information gathering is needed to generate a solvable plan. We will address this limitation in Section 6 when we discuss future work.

Figure 1: Architecture of our solution.

Transform algorithm

The transform algorithm generates the PDDL representation of the attack planning problem, including the initial conditions, the operators (PDDL actions), and the goal. From the pentesting framework we extract the description of the operators, in particular the requirements and results of the exploits, which will make up most of the available actions in the PDDL representation. This is encoded in the domain.pddl file, along with the predicates and types (which only depend on the details of our model). From the attack workspace we extract the information that constitutes the initial conditions for the planner: networks, machines, operating systems, ports and running services. This is encoded in the problem.pddl file, together with the goal of the attack, which will usually be the compromise of a particular machine.
A common characteristic of pentesting frameworks is that they provide an incomplete view of the network under attack. The pentester has to infer the structure of the network using the information that he sees from each compromised machine. The transform algorithm takes this into account, receiving extra information regarding host connectivity.
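A minimal sketch of what the problem.pddl side of this transform could look like is shown below (ours; the real transform is internal to the pentesting framework, so the workspace structure and helper names here are hypothetical). It emits objects and initial conditions using the predicates described in the next section.

# Sketch: write problem.pddl from a (hypothetical) attack-workspace dump.
def write_problem(workspace, goal_host, path="problem.pddl"):
    objects, init = [], []
    for host in workspace["hosts"]:
        objects.append(f"{host['name']} - host")
        init.append(f"(connected_to_network {host['name']} {host['network']})")
        init.append(f"(has_OS {host['name']} {host['os']})")
        for port in host["open_ports"]:
            init.append(f"(TCP_listen_port {host['name']} port{port})")
    with open(path, "w") as out:
        out.write("(define (problem attack)\n  (:domain attack-domain)\n")
        out.write("  (:objects " + " ".join(objects) + ")\n")
        out.write("  (:init " + " ".join(init) + ")\n")
        out.write(f"  (:goal (compromised {goal_host})))\n")

workspace = {"hosts": [
    {"name": "h_10_0_1_1", "network": "net1", "os": "Windows",
     "open_ports": [80, 445]},
]}
write_problem(workspace, goal_host="h_10_0_1_1")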

Planner

The PDDL description is given as input to the planner. The advantage of using the PDDL language is that we can experiment with different planners and determine which best fits our particular problem. We have evaluated our model using both SGPlan (Chen, Wah, and Hsu 2006) and Metric-FF (Hoffmann 2002). The planner is run from inside the pentesting framework, as a pluggable module of the framework that we call PlannerRunner.
The output of the planner is a plan: a sequence of actions that lead to the completion of the goal, if all the actions are successful. We make this distinction because even with well-tested exploit code, not all exploits launched are successful. The plan is given as feedback to the pentesting framework, and executed against the real target network.
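The planner step can be pictured roughly as follows (our sketch; the PlannerRunner module itself is not published, and the command line assumes an FF-style planner that accepts -o/-f flags, which may differ for other planners).

# Sketch: run an external planner on the generated PDDL files and
# collect the plan steps for execution by the pentesting framework.
import subprocess

def run_planner(planner_bin, domain="domain.pddl", problem="problem.pddl"):
    out = subprocess.run([planner_bin, "-o", domain, "-f", problem],
                         capture_output=True, text=True).stdout
    steps = []
    for line in out.splitlines():
        head, _, rest = line.strip().partition(":")
        # FF-family planners print plan steps roughly as "N: ACTION ARGS".
        if head.strip().isdigit() and rest.strip():
            steps.append(rest.strip())
    return steps

for step in run_planner("./ff"):
    print(step)  # each step would be mapped back to a framework module and executed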

3. The PDDL Representation in Detail

The PDDL description language serves as the bridge between the pentesting tool and the planner. Since exploits have strict platform and connectivity requirements, failing to accurately express those requirements in the PDDL model would result in plans that cannot be executed against real networks. This forces our PDDL representation of the attack planning problem to be quite verbose. On top of that, we take advantage of the optimization abilities of planners that understand numerical effects (numerical effects allow the actions in the PDDL representation to increase the value of different metrics defined in the PDDL scenario; the planner can then be told to find a plan that minimizes a linear function of these metrics), and have the PDDL actions affect different metrics commonly associated with penetration testing such as running time, probability of success, or possibility of detection (stealth).
We will focus on the description of the domain.pddl file, which contains the PDDL representation of the attack model. We will not delve into the details of the problem.pddl file, since it consists of a list of networks and machines, described using the predicates presented in this section. The PDDL requirements of the representation are :typing, so that predicates can have types, and :fluents, to use numerical effects. We will first describe the types available in the model, and then list the predicates that use these types. We will continue by describing the model-related actions that make these predicates true, and then we will show an example of an action representing an exploit. We close this section with an example PDDL plan for a simple scenario.

Types

Table 1 shows a list of the types that we used. Half of the object types are dedicated to describing in detail the operating systems of the hosts, since the successful execution of an exploit depends on being able to detect the specifics of the OS.

network        operating system
host           OS version
port           OS edition
port set       OS build
application    OS servicepack
agent          OS distro
privileges     kernel version

Table 1: List of object types

Predicates

The following are the predicates used in our model of attacks. Since exploits also have non-trivial connectivity requirements, we chose to have a detailed representation of network connectivity in PDDL. We need to be able to express how hosts are connected to networks, and the fact that exploits may need both IP and TCP or UDP connectivity between the source and target hosts, usually on a particular TCP or UDP port. These predicates express the different forms of connectivity:

(connected_to_network ?s - host ?n - network)
(IP_connectivity ?s - host ?t - host)
(TCP_connectivity ?s - host ?t - host ?p - port)
(TCP_listen_port ?h - host ?p - port)
(UDP_listen_port ?h - host ?p - port)

These predicates describe the operating system and services of a host:

(has_OS ?h - host ?os - operating_system)
(has_OS_version ?h - host ?osv - OS_version)
(has_OS_edition ?h - host ?ose - OS_edition)
(has_OS_build ?h - host ?osb - OS_build)
(has_OS_servicepack ?h - host ?ossp - OS_servicepack)
(has_OS_distro ?h - host ?osd - OS_distro)
(has_kernel_version ?h - host ?kv - kernel_version)
(has_architecture ?h - host ?a - OS_architecture)
(has_application ?h - host ?p - application)

Actions

We require some "model-related" actions that make the aforementioned predicates true in the right cases.

(:action IP_connect
  :parameters (?s - host ?t - host)
  :precondition (and (compromised ?s)
                     (exists (?n - network)
                             (and (connected_to_network ?s ?n)
                                  (connected_to_network ?t ?n))))
  :effect (IP_connectivity ?s ?t))

(:action TCP_connect
  :parameters (?s - host ?t - host ?p - port)
  :precondition (and (compromised ?s)
                     (IP_connectivity ?s ?t)
                     (TCP_listen_port ?t ?p))
  :effect (TCP_connectivity ?s ?t ?p))

(:action Mark_as_compromised
  :parameters (?a - agent ?h - host)
  :precondition (installed ?a ?h)
  :effect (compromised ?h))

Two hosts on the same network possess IP connectivity, and two hosts have TCP (or UDP) connectivity if they have IP connectivity and the target host has the correct TCP (or UDP) port open. Moreover, when an exploit is successful an agent is installed on the target machine, which allows control over that machine.


An installed agent is hard evidence that the machine is vulnerable, so it marks the machine as compromised. (Depending on the exploit used, the agent might have regular user privileges or superuser (root) privileges; certain local exploits allow a low-level (user) agent to be upgraded to a high-level agent, so we model this by having two different privileges PDDL objects.) The penetration testing framework we used has an extensive test suite that collects information regarding running time for many exploit modules. We obtained average running times from this data and used that information as the numeric effect of exploit actions in PDDL. The metric to minimize in our PDDL scenarios is therefore the total running time of the complete attack.
Finally, this is an example of an action: an exploit that will attempt to install an agent on target host t from an agent previously installed on the source host s. To be successful, this exploit requires that the target runs a specific OS, has the service ovtrcd running, and is listening on port 5053.

(:action HP_OpenView_Remote_Buffer_Overflow_Exploit
  :parameters (?s - host ?t - host)
  :precondition (and
    (compromised ?s)
    (and (has_OS ?t Windows)
         (has_OS_edition ?t Professional)
         (has_OS_servicepack ?t Sp2)
         (has_OS_version ?t WinXp)
         (has_architecture ?t I386))
    (has_service ?t ovtrcd)
    (TCP_connectivity ?s ?t port5053))
  :effect (and
    (installed_agent ?t high_privileges)
    (increase (time) 10)))

In our PDDL representation there are several versions of this exploit, one for each specific operating system supported by the exploit. For example, another supported system for this exploit looks like this:

(and (has_OS ?t Solaris)
     (has_OS_version ?t V_10)
     (has_architecture ?t Sun4U))

The main part of the domain.pddl file is devoted to the description of the actions. In our sample scenarios, this file has up to 28,000 lines and includes up to 1,800 actions. The last part of the domain.pddl file is the list of constants that appear in the scenario, including the names of the applications, the list of port numbers, and operating system version details.

An attack plan

We end this section with an example plan obtained by running Metric-FF on a scenario generated with this model. The goal of the scenario is to compromise host 10.0.5.12 in the target network. This network is similar to the test network that we will describe in detail in Section 4. The plan requires four pivoting steps and executes five different exploits in total, though we only show the first and last ones for space reasons. (The localagent object represents the pentesting framework running on the machine of the user/attacker.) The exploits shown are real-world exploits currently present in the pentesting framework.

0: Mark_as_compromised localagent localhost
1: IP_connect localhost 10.0.1.1
2: TCP_connect localhost 10.0.1.1 port80
3: Phpmyadmin Server_databases Remote Code Execution localhost 10.0.1.1
4: Mark_as_compromised 10.0.1.1 high_privileges
...
14: Mark_as_compromised 10.0.4.2 high_privileges
15: IP_connect 10.0.4.2 10.0.5.12
16: TCP_connect 10.0.4.2 10.0.5.12 port445
17: Novell Client NetIdentity Agent Buffer Overflow 10.0.4.2 10.0.5.12
18: Mark_as_compromised 10.0.5.12 high_privileges

4. Performance and Scalability Evaluation

This model, and its representation in PDDL, are intended to be used to plan attacks against real networks, and to execute them using a pentesting framework. To verify that our proposed solution scales up to the domains and scenarios we need to address in real-world cases, we carried out extensive performance and scalability testing, to see how far we could take the attack model and PDDL representation with current planners. We focused our performance evaluation on four metrics:

• Number of available exploits in the pentesting suite
• Number of machines in the attacked network
• Number of pivoting steps in the attack
• Number of individual predicates that must be fulfilled to accomplish the goal

The rationale behind using these metrics is that we needed our solution to scale up reasonably with regard to all of them. For example, a promising use of planning algorithms for attack planning lies in scenarios where there are a considerable number of machines to take into account, which could be time-consuming for a human attacker. Moreover, many times a successful penetration test needs to reach the innermost levels of a network, sequentially exploiting many machines in order to reach one which might hold sensitive information. We need our planning solution to be able to handle these cases where many pivoting steps are needed.
Pentesting suites are constantly updated with exploits for new vulnerabilities, so that users can test their systems against the latest risks. The pentesting tool that we used currently (as of March 2010) has about 700 exploits, of which almost 300 are the remote exploits that get included in the PDDL domain. Each remote exploit is represented as a different operator for each target operating system, so our PDDL domains usually have about 1,800 operators, and our solution needs to cope with that input. Finally, another promising use of planning algorithms for attack planning is the continuous monitoring of a network by means of a constant pentest. In this case we need to be able to scale to goals that involve compromising many machines.

We decided to use the planners Metric-FF (Hoffmann 2002) (the latest available version, with additional improvements) and SGPlan (version 5.22) (Chen, Wah, and Hsu 2006), since we consider them to be representative of the state of the art in classical planners. The original FF planner was the baseline planner for IPC'08 (the 2008 International Planning Competition). Metric-FF adds numerical effects to the FF planner. We modified the reachability analysis in Metric-FF to use type information, as in FF, to obtain better memory usage. SGPlan combines Metric-FF as a base planner with a constraint partitioning scheme which allows it to divide the main planning problem into subproblems; these subproblems are solved with a modified version of Metric-FF, and the individual solutions are combined to obtain a plan for the original problem. This method, according to the authors, has the potential to significantly reduce the complexity of the original problem (Chen, Wah, and Hsu 2006). It was successfully used in (Ghosh and Ghosh 2009).


Generating the test scenarios

We tested both real and simulated networks, generating the test scenarios using the same pentesting framework we would later use to attack them. For the large-scale testing, we made use of a network simulator (Futoransky et al. 2009). This simulator makes it possible to build sizable networks [11] while still viewing each machine independently and, for example, executing distinct system calls in each of them. The simulator integrates tightly with the pentesting framework, to the point where the framework is oblivious to the fact that the network under attack is simulated and not real. This allowed us to use the pentesting tool to carry out all the steps of the test, including the information gathering stage of the attack. Once the information gathering was complete, we converted the attack workspace to PDDL using our transform tool. We generated two types of networks for the performance evaluation. To evaluate the scalability in terms of number of machines, number of operators, and number of goals, the network consists of five subnets with varying numbers of machines, all joined to one main network to which the user/attacker initially has access. Figure 2 shows the high-level structure of this simulated network. To evaluate the scalability in terms of the number of pivoting steps needed to reach the goal, we constructed a test network where the attacker and the target machine are separated by an increasing number of routers, and each subnetwork in between has a small number of machines. The network simulator allows us to specify many details about the simulated machines, so in both networks the subnetworks attached to the innermost routers contain four types of machines: Linux desktops and servers, and Windows desktops and servers. Table 2 shows the configuration for each of the four machine types and the share of each type in the network. For the server cases, each machine randomly removes one open port from the canonical list shown in Table 2, so that all machines are not equal and thus not equally exploitable.

Figure 2: Test network for scalability evaluation.

Machine type      OS version                Share   Open ports
Windows desktop   Windows XP SP3            50%     139, 445
Windows server    Windows 2003 Server SP2   14%     25, 80, 110, 139, 443, 445, 3389
Linux desktop     Ubuntu 8.04 or 8.10       27%     22
Linux server      Debian 4.0                9%      21, 22, 23, 25, 80, 110, 443

Table 2: List of machine types for the test networks.

Results

As we expected, both planners generated the same plans in all cases, not taking into account plans in which goals were composite and the same actions could be executed in different orders. This is reasonable given that SGPlan uses Metric-FF as its base planner. We believe that the performance and scalability results are more interesting, since a valid plan for an attack path is a satisfactory result in itself. Figures 3 to 10 show how running time and memory consumption scale for both planners, with respect to the four metrics considered [12]. Recall that, as explained in Section 3, each exploit maps to many PDDL actions. As illustrated by Figures 3 and 4, both running time and memory consumption increase superlinearly with the number of machines in the network. We were not able to find exact specifications for the time and memory complexities of Metric-FF or SGPlan, though we believe this is because heuristics make it difficult to calculate a complexity that holds in normal cases. Nonetheless, our model, coupled

[8] Latest version available (with additional improvements).
[9] SGPlan version 5.22.
[10] The International Planning Competition, 2008.
[11] We tested up to 1000 nodes in the simulator.

[12] Testing was performed on a Core i5 750 2.67 GHz machine with 8 GB of RAM, running 64-bit Ubuntu Linux; the planners were 32-bit programs.


Figure 3: Running time, increasing number of machines. (Fixed values: 1600 actions, 1 pivoting step).

Figure 5: Running time, increasing number of pivoting steps. (Fixed values: 1600 actions, 120 machines).

Figure 4: Memory usage, increasing number of machines.

Figure 6: Memory usage, increasing number of pivoting steps.

with the SGPlan planner, makes it possible to plan an attack in a network with 480 nodes in 25 seconds, using less than 4 GB of RAM. This makes attack planning practical for pentests in medium-sized networks. Moving on to the scalability with regard to the depth of the attack (Figures 5 and 6), it was surprising to verify that memory consumption is constant even as we increase the depth of the attack to twenty pivoting steps, which generates a plan of more than sixty steps. Running time increases slowly, although with a small number of steps the behaviour is less predictable. The model is therefore not constrained in terms of the number of pivoting steps. With regard to the number of operators (i.e. exploits) (Figures 7 and 8), both running time and memory consumption increase almost linearly; however, running time spikes in the largest cases. Doubling the number of operators, from 720 to 1440 (from 120 to 240 available exploits), increases running time by 163% for Metric-FF and by 124% for SGPlan. Memory consumption, however, increases by only 46% for Metric-FF, and by 87% for SGPlan. In this context, the number of available exploits is not a limiting factor for the model. Interestingly, these three tests also verify many of the

claims made by the authors of SGPlan. We see that the constraint partitioning used by their planner manages to reduce both running time and memory consumption, in some cases by significant amounts (as in Figure 6). The results for the individual number of predicates in the overall goal (Figures 9 and 10) are much more surprising. While SGPlan runs faster than Metric-FF in most of the cases, Metric-FF consumes significantly less memory in almost half of them. We believe that as the goal gets more complex (the largest case we tested requests the individual compromise of 100 machines), SGPlan's constraint partitioning strategy turns into a liability, not allowing a clean separation of the problem into subproblems. By falling back to Metric-FF, our model can compute, in under 6 seconds and using slightly more than 1 GB of RAM, attack plans in which half of the machines of a 200-machine network are to be compromised.
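As a rough illustration of how such composite goals grow with the number of target machines, the following hedged sketch builds a conjunctive PDDL goal with one predicate per machine; the predicate name and host naming are invented for illustration.

```python
# Hedged sketch: building a conjunctive goal that requests the individual
# compromise of many machines, as in the experiments of Figures 9 and 10.
# The predicate name "compromised" and the host names are illustrative.

def make_goal(hosts):
    """Return a PDDL goal conjunct with one predicate per target machine."""
    conjuncts = "\n    ".join(f"(compromised {h})" for h in hosts)
    return f"(:goal (and\n    {conjuncts}))"


if __name__ == "__main__":
    # e.g. half of a 200-machine network, cf. the results discussed above
    targets = [f"host_{i}" for i in range(100)]
    print(make_goal(targets))
```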

5. Related work

Work on attack modeling applied to penetration testing had its origin in the possibility of programmatically controlling pentesting tools such as Metasploit or Core Impact.


Figure 7: Running time, increasing number of actions. (Fixed values: 200 machines, 1 pivoting step).

Figure 9: Running time, increasing number of predicates in the goal. (Fixed values: 200 machines, 1 pivoting step for each compromised machine, 1600 actions).

Figure 8: Memory usage, increasing number of actions.

Figure 10: Memory usage, increasing number of predicates in the goal.

This model led to the use of attack graphs. Earlier work on attack graphs, such as (Phillips and Swiler 1998; Ritchey and Ammann 2000; Sheyner et al. 2002), was based on the complete enumeration of attack states, which grows exponentially with the number of actions and machines. As we mentioned in Section 1, the survey of (Lippmann and Ingols 2005) shows that the major limitation of past studies of attack graphs is their lack of scalability to medium-sized networks. One notable exception is the Topological Vulnerability Analysis (TVA) project conducted at George Mason University, described in (Jajodia, Noel, and O'Berry 2005; Noel and Jajodia 2005; Noel et al. 2009) and other papers, which has been designed to work in real-size networks. The main differences between our approach and TVA are the following:

• Input. In TVA the model is populated with information from third-party vulnerability scanners such as Nessus, Retina and FoundScan, from databases of vulnerabilities such as CVE and OSVDB, and other software. All this information has to be integrated, and will suffer from the drawbacks of each information source, in particular from the false positives generated by the vulnerability scanners about potential vulnerabilities. In our approach the conceptual model and the information about the target network are extracted from a consistent source: the pentesting framework exploits and workspace. The vulnerability information of an exploit is very precise: the attacker can execute it in the real network to actually compromise systems.

• Monotonicity. TVA assumes that the attacker's control over the network is monotonic (Ammann, Wijesekera, and Kaushik 2002). In particular, this implies that TVA cannot model Denial-of-Service (DoS) attacks, or the fact that an unsuccessful exploit may crash the target service or machine. It is interesting to remark that the monotonicity assumption is the same used by FF (Hoffmann 2001) to create a relaxed version of the planning problem, and use it as a heuristic to guide the search through the attack graph. By relying on the planner to do the search, we do not need to make this restrictive assumption.


6. Summary and Future Work

(Futoransky et al. 2003) proposed a model of computer network attacks which was designed to be realistic from an attacker's point of view. We have shown in this paper that this model scales up to medium-sized networks: it can be used to automate attacks (and penetration tests) against networks with hundreds of machines. The solution presented shows that it is not necessary to build the complete attack graph (one of the major limitations of earlier attack graph studies). Instead we rely on planners such as Metric-FF and SGPlan to selectively explore the state space in order to find attack paths. We have successfully integrated these planners with a pentesting framework, which allowed us to execute and validate the resulting plans against a test bench of scenarios. We presented the details of how to transform the information contained in the pentesting tool to the planning domain [13]. One important question that remains as future work on this subject is how to deal with incomplete knowledge of the target network. The architecture that we presented supports running non-classical planners, so one possible approach is to use probabilistic planning techniques, where actions have different outcomes with associated probabilities. For example, a step of the attack plan could be to discover the operating system details of a particular host, so the outcome of this action would be modeled as a discrete probability distribution. Another approach would be to build a "metaplanner" that generates hypotheses with respect to the missing bits of information about the network, and uses the planner to test those hypotheses. Continuing the previous example, the metaplanner would assume that the operating system of the host was Windows and request the planner to compromise it as such. The metaplanner would then test the resulting plan in the real network, and verify or discard the hypothesis.

References

Ammann, P.; Wijesekera, D.; and Kaushik, S. 2002. Scalable, graph-based network vulnerability analysis. In Proceedings of the 9th ACM Conference on Computer and Communications Security, 217–224. ACM.
Arce, I., and McGraw, G. 2004. Why attacking systems is a good idea. IEEE Security & Privacy Magazine 2(4).
Arce, I., and Richarte, G. 2003. State of the art security from an attacker's viewpoint. In PacSec Conference, Tokyo, Japan.
Blum, A. L., and Furst, M. L. 1997. Fast planning through planning graph analysis. Artificial Intelligence 90(1-2):281–300.
Burns, B.; Killion, D.; Beauchesne, N.; Moret, E.; Sobrier, J.; Lynn, M.; Markham, E.; Iezzoni, C.; Biondi, P.; Granick, J. S.; Manzuik, S.; and Guersch, P. 2007. Security Power Tools. O'Reilly Media.
Chen, Y.; Wah, B. W.; and Hsu, C. 2006. Temporal planning using subgoal partitioning and resolution in SGPlan. Journal of Artificial Intelligence Research 26:369.
Fox, M., and Long, D. 2003. PDDL2.1: An extension to PDDL for expressing temporal planning domains. Journal of Artificial Intelligence Research 20:61–124.
Futoransky, A.; Notarfrancesco, L.; Richarte, G.; and Sarraute, C. 2003. Building computer network attacks. Technical report, CoreLabs.
Futoransky, A.; Miranda, F.; Orlicki, J.; and Sarraute, C. 2009. Simulating cyber-attacks for fun and profit. In 2nd International Conference on Simulation Tools and Techniques (SIMUTools '09).
Ghosh, N., and Ghosh, S. K. 2009. An intelligent technique for generating minimal attack graph. In First Workshop on Intelligent Security (Security and Artificial Intelligence) (SecArt '09).
Hoffmann, J. 2001. FF: The fast-forward planning system. AI Magazine 22(3):57.
Hoffmann, J. 2002. Extending FF to numerical state variables. In Proceedings of the 15th European Conference on Artificial Intelligence (ECAI-02), 571–575.
Jajodia, S.; Noel, S.; and O'Berry, B. 2005. Topological analysis of network attack vulnerability. Managing Cyber Threats: Issues, Approaches and Challenges, 248–266.
Lippmann, R., and Ingols, K. 2005. An annotated review of past papers on attack graphs. Technical report, MIT Lincoln Laboratory.
Noel, S., and Jajodia, S. 2005. Understanding complex network attack graphs through clustered adjacency matrices. In Proceedings of the 21st Annual Computer Security Applications Conference, 160–169.
Noel, S.; Elder, M.; Jajodia, S.; Kalapa, P.; O'Hare, S.; and Prole, K. 2009. Advances in Topological Vulnerability Analysis. In Proceedings of the 2009 Cybersecurity Applications & Technology Conference for Homeland Security, 124–129. IEEE Computer Society.
Phillips, C. A., and Swiler, L. P. 1998. A graph-based system for network-vulnerability analysis. In Workshop on New Security Paradigms, 71–79.
Richarte, G. 2003. Modern intrusion practices. In Black Hat Briefings.
Ritchey, R., and Ammann, P. 2000. Using model checking to analyze network vulnerabilities. In IEEE Symposium on Security and Privacy, 156–165. IEEE Computer Society.
Sarraute, C., and Weil, A. 2008. Advances in automated attack planning. In PacSec Conference, Tokyo, Japan.
Sarraute, C. 2009. New algorithms for attack planning. In FRHACK Conference, Besançon, France.
Sheyner, O.; Haines, J.; Jha, S.; Lippmann, R.; and Wing, J. 2002. Automated generation and analysis of attack graphs. In IEEE Symposium on Security and Privacy, 273–284. IEEE Computer Society.

[13] Our implementation uses Core Impact, but the same ideas can be extended to other tools such as the open-source project Metasploit.


Toward the Semantic Interoperability of the Security Information and Event Management Lifecycle

Mary C. Parmelee
The MITRE Corporation, 7515 Colshire Drive, McLean, VA 22102-7539, USA
[email protected]


Abstract  The rapid growth in magnitude and complexity of the Security Information and Event Management (SIEM) has spurred a trend toward automated security management tools and industry standards for interoperability. Making Security Measurable (MSM) is a program that is funded by multiple United States government agencies. MSM produces a family of security related standards for the automation and interoperability of the SIEM lifecycle. There are three major challenges that are impeding the adoption of MSM standards: an unsustainable vocabulary management process, ineffective interoperability methods, and domain complexity that exceeds the representation capability of current technologies and methods. The MSM Team is currently developing a SIEM ontology architecture, a family of related SIEM ontologies, and a reference implementation that will link related security information from disparate sources and proprietary security-related tools. Coupled with rules and inference engines, we will also semi-automate the security information and event management process across incompatible security-related products and domains.

Introduction

Through its Making Security Measurable and related efforts for supporting standardized expression and reporting of security-related information, MITRE leads the development of a family of community standards, many of which have been adopted by NIST's Security Content Automation Protocol (SCAP) program. MITRE support of the standards is largely funded by the DHS, NSA, and NIST. These standards cover most of the security information and event management (SIEM) lifecycle, including: vulnerability management, intrusion detection, asset security assessment, asset management, configuration guidance, patch management, malware response, incident management and threat analysis.

The Challenges

There are three major challenges that are impeding the adoption of the MSM and NIST SCAP programs: an unsustainable vocabulary management process, ineffective interoperability methods, and domain complexity that exceeds the representation capability of current technologies and methods. Many of the standards require some form of common enumeration, taxonomy or other controlled vocabulary. These vocabularies are implemented both at the value and the representation level. The standards (and their vocabularies) have each been developed independently of each other and are at various stages of maturity. Value-level enumerations are currently developed and managed mostly manually using Excel spreadsheets, while vocabulary representations are encoded in XML schema. Further, the interrelated and overlapping nature of the vocabulary subject areas has resulted in cross-implementation requirements. However, there has not been any cross-vocabulary analysis or alignment. The SCAP Validation program requires security tool vendors to translate their representation and reporting output to the shared common value enumerations and descriptions in order to achieve compliance. This interoperability is largely accomplished with manual mappings between representations and syntactic matching methods, such as string matching at the value level. The information and events of the SIEM lifecycle are growing rapidly in volume and complexity. For example, the Common Platform Enumeration (CPE) standard must standardize inconsistent and ambiguous software product names and versioning schemes as well as complex relationships between products.



The MSM team has recognized that semantic technologies could provide the flexibility and robust representation that they need in order to manage and align MSM vocabularies, improve interoperability capabilities and provide a more robust representation of the SIEM domain. I am currently supporting two of the standards: the Common Configuration Enumeration (CCE) standard, for semantic vocabulary management, and the CPE standard, for representing complex software products and relationships.


The SIEM Ontology Architecture    We are developing an ontology architecture that consists of a modular family of related ontologies that will represent the information and workflow of the SIEM lifecycle. We have already developed a basic architecture and a few common modules including a Software ontology, a Common Configuration Enumeration ontology, a MSM Resource Manager ontology, a Point of Contact ontology and an OWL (Web Ontology Language) standard representation of the Dublin Core Metadata standard for resource management.

 


Workshop Discussion

We are actively developing the architecture and its component ontologies. By the end of June we will have a first draft of the SIEM ontology architecture, a few more ontologies, a presentation, a full-length paper, and a reference implementation that will demonstrate how the family of interrelated ontologies will link related security information from disparate sources and proprietary security-related tools. Coupled with rules and inference engines, we will also semi-automate the security information and event management process across incompatible products and domains.

References

Barnum, Sean. 2007. An Introduction to Attack Patterns as a Software Assurance Knowledge Resource. OMG SwA Workshop.
CCE Board. 2007. Common Event Expression (CEE). Technical Report, Department G026, The MITRE Corporation.
CPE: Common Platform Enumeration Specification, Version 2.2. http://cpe.mitre.org/files/cpe-specification_2.2.pdf
Deshayes, Laurent; Foufou, Sebti; Gruninger, Michael. 2006. An Ontology Architecture for Standards Integration and Conformance in Manufacturing. In Proceedings of the 6th International Conference on Integrated Design and Manufacturing in Mechanical Engineering (IDMME), Grenoble, France, May 17-19. http://stl.mie.utoronto.ca/publications/P0057paper.pdf
Mann, David. 2008. An Introduction to the Common Configuration Enumeration (CCE). Technical Report, Department G026, The MITRE Corporation.
Parmelee, Mary; Nichols, Deborah; Obrst. 2009. A Net-Centric Metadata Framework for Service Oriented Environments. International Journal of Metadata, Semantics and Ontologies (IJMSO) 4(4): 250–260.
World Wide Web Consortium (W3C). 2009. OWL 2.0. http://www.w3.org/TR/2009/PR-owl2-new-features-20090922
World Wide Web Consortium (W3C). 2004. Resource Description Framework (RDF) Semantics, W3C Recommendation. http://www.w3.org/TR/rdf-mt/
XCCDF: Specification for the Extensible Configuration Checklist Description Format (XCCDF), Version 1.1.4. http://csrc.nist.gov/publications/nistir/ir7275r3/NISTIR-7275r3.pdf



  Unbelievable Agents for Large Scale Security Simulation     Jerry Lin, Jim Blythe, Skyler Clark, Nima Davarpanah, Roger Hughston and Mike Zyda   University of Southern California    [email protected][email protected][email protected][email protected][email protected][email protected]   

Abstract

Human error arguably accounts for more than half of all security vulnerabilities, yet few frameworks for testing secure systems take human actions into account. We describe the design of an experimentation platform that models human behaviors through intelligent agents. Our agents share some desired features with believable agent systems, but believable interaction with a human is less important than accurate reproduction of security-related behaviors. We identify three main components of human behavior that are important in such a system: (1) models of emotion and other cognitive state that may increase the probability of errors, (2) flexible reasoning in the face of a compromised system and (3) realistic task-based patterns of communication among groups. We describe an agent framework that can support these behaviors and illustrate its principles with a scenario of an insider attack. We are beginning the implementation of the framework, and finish with a discussion of future work.

Introduction

Human error is widely recognized as one of the most important sources of vulnerability in a secure system. In a survey taken in 2006, approximately 60% of security breaches were attributed to human error by security managers (Crawford 06, Cranor 08). Humans often ignore or misunderstand warnings, underestimate danger, and download infected files or simply disable security mechanisms because of their slowness or complexity (Whitten and Tygar 99). Consider the old statement that the only secure computer is one that is turned off and/or disconnected from the network. A social engineering attack exploiting the human element would simply be to convince someone to plug it back in (Mitnick 02). But frailties are only one aspect of human behavior that impacts our understanding of security. Compared with software systems, humans are flexible and resourceful problem solvers, able to find alternate ways to accomplish their tasks despite failures of resources or services. Different people often perform the same task in different ways, providing a diversification defense from some attacks. Dourish and Redmiles (02) introduce the concept of "effective security" as a more realistic measure of the security of a system than a formal evaluation of the security mechanisms installed. The level of effective security is almost always below the level of theoretical security that is technically feasible in a system, largely due to human error. On the other hand, effective security must be measured end-to-end, taking into account the entirety of the system and the purpose it solves. In this context a high level of theoretical security may be both expensive and unnecessary. Cranor (08) proposes a framework for reasoning about the security of systems with humans in the loop. She models the human as an information processor based on the warnings science literature (Wogalter 06). However, this model only captures the human response to warning messages and ignores many important aspects of human behavior, such as the task being performed, collaboration that leads to structured communication, and stress, emotions and tiredness that will affect a human's propensity to make errors. Cranor's approach allows a checklist-style evaluation of a security system. In this paper we outline a research agenda to enable a more detailed and encompassing evaluation of human-in-the-loop security systems, using intelligent agents (Giampapa and Sycara 02, Chalupsky et al. 01). We are designing agents capable of simulating a shared task, in which individual agents have different roles, different basic skills and also different emotional responses. Such a framework should be able to answer far more detailed questions about the effective security of a system in a range of different scenarios. One of our goals is to capture those aspects of human nature that often prove to be crucial in the security of modern systems, for use in large-scale simulations at a level of fidelity that allows for end-to-end scientific evaluation.

Distribution Statement "A" (Approved for Public Release, Distribution Unlimited), DISTAR case #15423, 30 Apr 2010.


Given this goal, what aspects of human behavior are  important to capture? We focus on three aspects of human  behavior that have an important influence on the likelihood  of success and severity of cyber attacks: (1) errors,  particularly under time­related stress, (2) flexibility of  response to problems and (3) non­random patterns of  communication centered around a collaborative task.  While many believable agent­based simulation systems  have been built (Bates 93, Choi et al 07, Marsella and  Gratch 09), many of them are concerned with believability  to humans through interaction (Tambe et al 95). None of  these systems have the specific goal of capturing the  human element that proves to be the weakness in computer  security. Here, we are not so concerned with believable  human­to­agent interaction, but in sufficiently similar  action compared with human behavior to make simulation  results valid. This is why we have used the term  “unbelievable agents” to describe our approach.

Importance of Human Behavior in Security

Given that human frailties are an important aspect of computer security, to what degree do they need to be reproduced in software agents in order to improve end-to-end evaluation? To achieve our research goals, we need to model frailties in the context of human-computer interaction. This does embody some understanding of how humans communicate, consume information, publish information and distribute information without a computer, but not the full scope of human behavior. For experimenters, the ability to capture human behavior at different levels of fidelity is important. The benefit of accurately capturing the full range of human behavior on computers is clear. For partial capture of human behavior, we believe an experiment may want to focus on a specific phenomenon related to just a few human traits, and capturing too much may add too much complexity and hinder analysis. One of the open questions we aim to answer is whether there is an equivalent "uncanny valley" in simulating humans in such a manner. In other words, are there simulations which appear better but actually get worse results because we fall into specific errors near a good simulation?

Scenario  As we describe our agents’ desired properties and  architecture we will make reference to the following  scenario: Three organizations are working on a joint  project. Within their respective companies, there are team  leaders, workers, and IT professionals. Each company may  have a point of contact with the others and knowledge of  how to communicate with other workers.   Two of the organizational teams gather information  from different sources and primarily communicate between  themselves. After gathering data, they update a cloud  service spreadsheet with data they have collected and  packaged for analysis. The third organizational team  simply reviews the data, analyzes it, and updates with  results.  There is a worker who is interested in infecting company  computers for financial gain.  He is a part of one of the  teams and is aware of trust relationships within the work  groups.  He waits until another worker goes on lunch break  and jumps on his computer, uses a password that was  written on a post­it, and uploads a worm that propagates  through email. An outsider coordinating with the malicious  insider then gains access to information on various  systems. At some point, a normal worker notices  something is not right and contacts an IT worker he is  familiar with. The IT workers attempt to coordinate and fix  the issue.   Some security questions that may be answered through  agent simulations are: What kinds of organizational  structures are more resilient to cross­organizational attacks  such as this one? What kind of policy is most effective?  Was a piece of security hardware effective? How much of  legitimate vs. malicious traffic is blocked by our security  systems? How does this affect productivity? What kind of  procedures can IT professionals take to mitigate damages  once they are done? 

Agent Properties  We divide the different properties we consider into  properties of individual agents, and properties that govern  patterns of communication within and between groups.  Attributes of individuals are important to achieve a base  level of fidelity as well as to provide a way to incorporate  human frailties and behavioral diversity into the  simulations. We will extend a standard BDI agent (beliefs,  desires, intentions)  (Bratman 87; Rao and Georgeff 91) in  two main ways. First, we will incorporate modular goal­ based planners. Second, we will integrate a  cognitive/emotional state including several factors such as  emotional response based on appraisal theory (Gratch  Marsella 04), biorhythems (e.g. hunger, fatigue) focus  level, stress, creativity/agility, and technical competency to  adjust the planning and execution processes. For example,  an agent who is more creative would be able to devise new  plans to achieve their goals; or one who is fatigued and less  technically competent might incorrectly override security  mechanisms. These influences are dynamic. For example,  as the simulation progresses, the agents will become more  fatigued or if agents were given training, their technical  competency could rise.  In order to model realistic patterns of communication,  we will create and keep track of a social network for our  agents. This will track whom they may be familiar with,  the types of relationships they have, and their  understanding of the other agents in the simulation. Agents  can then reason about who they may think would be  interested in a funny Youtube video or who they would  contact first for help.  In our scenario, a worker who  suspects a worm would contact a person in IT he knows as 



a friend, who then may be more inclined to listen and  investigate rather than be annoyed. This aspect is important  to simulate attacks that traverse a social network such as  Facebook viruses or old Trojan viruses which used  victim’s instant messaging account to propagate. This  happens in our example scenario where a worm propagates  itself through email using the victim’s address book.  When the network under attack contains an organization  performing a task, as in our scenario, the needs of the task  itself probably dominate the patterns of communication. In  this case, one would expect denser communication within  working groups than between them. The temporal pattern  of communication will probably follow the working day  and also the organization’s deadlines. The social network  is important within the organization for modeling leisure­ related communication and determining who an agent is  likely to approach for help with technical or security  concerns.  Detailed tracking of an agent’s social network is  important for its emotional influence on decision making.   For example, humans will sometimes avoid admitting fault  to co­workers or superiors in an attempt to maintain the  best possible relationship with others, because of pride.   Alternatively, a person might also admit fault because of  moral traits or feelings of guilt.  Certain prejudices towards  people such as those claiming to be from IT may also lead  to an agent being more compliant to requests such as  deleting files or turning on a machine.  The timing of agent behaviors is another important  aspect for realistic simulations. Cyber attacks can take  place over very short periods of time, far too quickly for  humans to react. Our agents’ decision­act cycles must  match those of humans well enough to capture this.  Similarly, changes to our agents’ cognitive states, e.g.  tiredness, hunger and frustration, should take place over  reasonably human time scales.  Many of these desired properties are shared with agents  that behave in a believable fashion in other domains, for  example in games and for training, and we intend to make  use of this work where possible. However the security  domain makes some properties of believable agents almost  irrelevant and other properties that have not been much  studied are more important. For example, interaction with  other humans is not, at least initially, a required aspect of  believability in this domain, allowing us to finesse natural  language understanding and generation or body language.  There are also properties that we do not intend to model in  the first version of our framework although they are  important in the long run. These include the ability to learn  from observations of the world and of each other, and the  ability to influence the views and beliefs of another agent.  An interesting example of these differences is in the  diversification of agents, i.e. to what extent different agents  should perform the same tasks differently. In our domain,  one way this is important is in how many individuals are  vulnerable to an attack that relies on using a particular 

feature of some software that has a vulnerability. In the  real world, it may be that a third of the users use this  feature, while the rest perform the task in other ways.  Without some diversification, all or none of the agents  might be vulnerable to the attack. This can be contrasted  with the game Halo, where players found the actions of the  automated agents to be less believable if they were too  varied. The designers made adjustments reducing the  diversification of the agents.   

Architecture and Implementation  Our agents are based on a well known BDI model,  however we are extending it with what we call the agent’s  cognitive state.  The cognitive state will influence normal  intentions, goals, and possibly available actions and  methods.  Other modules such as planners, state analyzers,  and learners may be integrated as plug­ins.  This is also  intended to allow for extensibility for specialized needs.   The agent architecture is shown in figure 1. 

Figure 1. Our agent architecture is based on a BDI model with a cognitive state that includes emotions and aptitudes. A separate desktop interface provides an abstraction through which the agent interacts with its software environment.

Each agent is initialized with goals, beliefs, intentions, and a cognitive state, depending on the role that agent plays in the overall simulation. In the case of our scenario, the inside attacker's goal would be to make money; it has certain beliefs about information that flows through company machines and the technical competencies of its coworkers, and its intentions are to use a worm to gather


information for financial gain.  With cognitive state,  however, if the agent had sufficient laziness, for example,  he may never follow through with his intentions or choose  to pursue an easier path. It should be noted that these  choices within the agent are stochastic and will rely on a  psychological model of how different factors affect  reasoning. This model may vary between agents to account  for more flexibility and variance in behavior. The cognitive  state is shown in figure 1.  Example attributes in the cognitive state include  tiredness and stress level. As agents complete tasks without  a break their tiredness increases. The stress level may  increase if they notice evidence that the computer  environment may be compromised, or if goals are  obstructed.  Elevated levels of tiredness and stress increase  the probability that an agent will make mistakes, for  example ignoring warning messages, or turning off  security software to save time. The reason for this change  in behavior is agent’s reaction to their feelings and  decision to cope with these feelings by becoming more  careless.  The need to cope with certain feelings may lead  to other decisions such as taking breaks or giving up.   Carelessness could be a primary reason the worm in our  scenario goes unnoticed for a certain length of time.  Eventually someone, perhaps from the IT group, will either  suspect a strange email or notice strange system behavior.  For the actual implementation of our agents, we plan to  leverage either Soar or SPARK (Morley and Myers 04).   Soar is based on a unified cognitive architecture. In Soar,  knowledge (actions and methods) is specified in a series of  statements roughly in “if…then” form commonly seen in  expert systems. Agents built on Soar have been shown to  be very robust in the face of failure or uncertainty, which is  important in our domain. Agents based on Soar are also  capable of abstraction and learning from experience.  SPARK is a descendent of Georgeff et al.’s Procedural  Reasoning System (Georgeff and Lansky 87) that was  central in the development of BDI systems. SPARK is  much smaller in scope than SOAR and concentrates on an  efficient, flexible language for agent behaviors with a  sound formal basis. Its representation for agent operators is  more procedural, and similar to that of RAP systems (Firby  89). SPARK supports multiple execution threads for agents  and the interruption and resumption of tasks.  Much of the reasoning about cognitive state concerns  emotions. In common with several research groups, we  view emotions as arising from goal achievement or failure  and modifying the agent’s actions. Relatively simple  models have been implemented that have validity from  cognitive science, for example Em (Neal Reilly 96), which  is based on the models of Ortony et al (88) or EMA  (Marsella and Gratch 09) based on appraisal theory (Smith  and Lazarus 90).  Emotions in our agents are based on the work (Gratch  Marsella 04) which is based on the appraisal variables of  relevance, desirability, causal attribution, likelihood, 

unexpectedness, urgency, ego involvement, and coping  potential.  Every time an observation is made, appraisals  are generated for these variables which contribute to the  affective state, and lead to coping behavior or changes in  cognitive state.   The flexible behavior required of our agents will be  implemented through planning systems. Although both  SOAR and SPARK are capable of simple planning, we  anticipate the need for more sophisticated planning tools to  operate quickly in large domains. These can be included as  plug­in tools, as shown in Figure 1, where the agent will  invoke a planner to help choose a next step, allowing the  planner a filtered view of its beliefs and goals, and  incorporating the result as intentions. For this reason we  plan to use a Blackboard model for the agent’s dynamic  state (Engelmore and Morgan 86).  To work in teams, groups communicate in a hierarchical  structure as shown in figure 2. Organizations are  represented by agents who act as team leaders.  The team  leaders are assigned high level tasks or sets of tasks and  decompose them into finer grain tasks which are delegated  to team members.  This not only reflects organizational  structures in our scenario, but also many real human  structures. Workers in the scenario will also rely on this  organizational structure to collaboratively produce a  spreadsheet. 
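As a hedged illustration of the cognitive-state mechanism described above (and not of the Soar- or SPARK-based implementation the authors plan), the following toy Python fragment shows how tiredness and stress could stochastically raise the probability that an agent ignores a security warning. All class names and numeric parameters are invented for illustration.

```python
# Illustrative sketch only: a cognitive state whose tiredness and stress levels
# increase the chance that an agent carelessly dismisses a warning.

import random
from dataclasses import dataclass


@dataclass
class CognitiveState:
    tiredness: float = 0.0   # grows as tasks are completed without a break
    stress: float = 0.0      # grows when goals are blocked or anomalies appear

    def error_probability(self, base=0.02):
        # Error likelihood rises with tiredness and stress, capped at 0.9.
        return min(0.9, base + 0.3 * self.tiredness + 0.3 * self.stress)


@dataclass
class WorkerAgent:
    name: str
    state: CognitiveState

    def handle_warning(self, warning):
        """Decide whether to heed or carelessly dismiss a security warning."""
        if random.random() < self.state.error_probability():
            return f"{self.name} dismisses '{warning}' (careless)"
        return f"{self.name} reports '{warning}' to a familiar IT contact"


if __name__ == "__main__":
    agent = WorkerAgent("worker_1", CognitiveState(tiredness=0.6, stress=0.4))
    print(agent.handle_warning("unexpected outbound SMB traffic"))
```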

Figure 2. A hierarchical agent communication framework is realistic and also supports scalability.

One agent is distinguished as the scenario director that keeps track of critical events that take place as the scenario unfolds. The director maintains a high-level view of the system's overall state, through communication with team leaders, and triggers critical events that move the scenario forwards at the appropriate moment. Actions that require agency are assigned to team leaders or key agents in the scenario. Examples of similar directors include Moe, part of the Oz project, which used adversarial search to ensure


that plot points were met while the user explored an  interactive world (Weyhrauch 97, Kelso et al 93).  In our scenario, each of the three companies is created as a  separate team, each with a team leader, workers, and IT  professionals respectively. Two of the teams are designed  to gather, collect and package data, and the other team  reviews, analyzes, and updates with a final result. The  outsider can be defined as his own team, but simply stays  out of the story until the inside attacker is triggered by the  scenario director. Most of the agents in this example have  the ability to manipulate the data as a spreadsheet saved on  a cloud service.  The inside attacker is given a special set of goals that  can be triggered directly, and at a given time will  compromise another agent’s computer. This begins the  attack phase of the simulation in which we can model and  measure a number of features, including the size and speed  of the attack, the amount of data compromised before IT  professionals can re­secure, and the loss in overall  productivity.    

Important questions remain about how to evaluate systems such as this and what conclusions can be drawn from experiments run with this system. Here we intend to follow work from other multi-agent or believable agent systems. Ultimately we aim to enable useful research making security tests that incorporate human error, the probable vulnerability point of more than half of all successful cyber attacks.

References  Engelmore, R., and Morgan, A. eds. 1986. Blackboard Sys­ tems. Reading, Mass.: Addison­Wesley.  Bates,  J.  1993.  The  Nature  of  Character  in  Interactive  Worlds and the Oz Project, Virtual Realities: Anthology of  Industry and Culture, Loeffler, ed.  Blythe,  J.  2005.  Task  Learning  by  Instruction  in  Tailor,  Intelligent User Interfaces  Bratman,  M.  1987.  Intentions,  Plans,  and  Practical  Reasoning, U Chicago Press  Cranor,  L.  2008.  A  Framework  for  Reasoning  about  the  Human in the Loop, Usability, Psychology and Security  Chalupsky, H., Gil, Y., Knoblock, C., Lerman, K., Oh, J.,  Pynadath,  D.,  Russ,  T.,  Tambe,  M.  2001,  Electric  Elves:  Applying  Agent  Technology  to  Support  Human  Organizations,  Innovative  Applications  of  Artificial  Intelligence.  Crawford, M. Whoops, Human Error Does it Again, CSO  Online, http://www.csoonline.com.au/index.php/id ;25583  0211;fp;32768;fpid;20026681  Dourish,  P.  and  Redmiles,  D.  2002.  An  Approach  to  Usable  Security  Based  on  Event  Monitoring  and  Visualization, New Security Paradigms Workshop.  Firby,  J.,  1989.  Adaptive  Execution  in  Complex,  Dynamic  Worlds, PhD Thesis, Yale.  Giampapa,  J.  and  Sycara,  K.  2002.  Team­Oriented  Agent  Coordination  in  the  RETSINA  Multi­Agent  System,  AAMAS  2002  Workshop  on  Teamwork  and  Coalition  Formation   Georgeff,  M.  and  Lansky,  A.,  1987.  Procedural  Knowledge, IEEE 74, 1383­1398  Kelso,  M.,  Weyhrauch,  P.  and  Bates,  J.,  1993.  Dramatic  Presence,  PRESENCE:  The  Journal  of  Teleoperators  and  Virtual Environments, 2, 1  Marsella, S. and Gratch, J., 2009. EMA: A Process Model  of  Appraisal  Dynamics,  Journal  of  Cognitive  Systems  Research, 10, 1.  Morley,  D.  and  Myers,  K.  2004.  The  SPARK  Agent  Framework, in Intl Conf on Autonomous Agents and Multi­ Agent Systems (AAMAS 04)  Neal  Reilly,  S.  1996.  Believable  Social  and  Emotional  Agents, PhD Thesis, CMU­CS­96­138.  Ortony,  A.,  Clore,  A. and  Collins, G. 1988  The Cognitive  Structure of Emotions, Cambridge University Press  Rao,  A.  and  Georgeff,  M.,  1991,  Modeling  Rational  Agents within a BDI Architecture, Int Conf. on Knowledge  Representation 

Conclusions and Future Work  Understanding the human element is critical in evaluating  systems for security. We have outlined an architecture  based on autonomous agents that will improve researchers’  ability to incorporate human behavior into experiments  with security systems. By allowing all the agents to be  simulated, the approach maintains the benefits of automatic  testing, such as scale and potentially accelerated timelines.  We have also outlined a scenario in which team oriented  behavior, human frailties and human flexibility of  approaches play an important role and shown how it will  be modeled within our framework.  We are currently in process of implementing a prototype  of the framework. We intend to perform full evaluation on  the system and improve upon our current design decisions.  Two of our central questions moving forward will be  scalability and user authoring of agents and behaviors.   We want to support experiments with perhaps thousands  of agents performing loosely coupled tasks over a realistic  hardware and network landscape. We believe our approach  will scale, even with a single scenario director, if we allow  the scenario director to offload the oversight of key plot  events to team leaders as necessary. Experiments of  security systems may take days or weeks to run, creating a  challenge to the longevity of our autonomous agents.   In the long term we intend to construct a toolkit for  security researchers, allowing them to instantiate human  behaviors as appropriate for their experiment. This will  rely on powerful authoring tools that will allow users to  define the key plot points of a scenario and a set of agents,  probably by retrieving agents from a library and modifying  their capabilities and profile. We intend to build on earlier  work in procedure editing already integrated with Spark as  a starting point (Blythe 05). 


Smith, C. and Lazarus, R. 1990. Emotion and Adaptation.  Handbook of Personality: Theory and Research, Guildford  Press.  Weyhrauch,  P.  1997.  Guiding  Interactive  Drama,  PhD  Thesis, CMU­CS­97­109  Whitten,  A.  and  Tygar,  D.  1999.  Why  Johnny  Can’t  Encrypt: A Usability Study of PGP 5.0, Proc. 8th USENIX  Security Symposium  Wogalter,  M.  2006.  Communication­Human  Information  Processing  (C­HIP)  Model.  In  Handbook  of  Warnings,  Lawrence­Erlbaum, NJ.  Mitnick, K. D., 2002 The Art of Deception: Controlling the  Human  Element  of  Computer  Security,  Hoboken,  NJ:  Wiley  Tambe, M., Johnson, W.L., Jones,  R. M., Koss, F., Laird,  J.  E.,  Rosenbloom,  P.  S.,  Schwamb,  K.,  1995  Intelligent  Agents  for  Interactive  Simulation  Environments,  AI  Magazine 16(1): 15  Gratch,  J.,  Marsella,  S.,  2004.  A  Domain­independent  Framework  for  Modeling  Emotion,  Journal  of  Cognitive  Systems Research, 5, 4 


Typed Linear Chain Conditional Random Fields And Their Application To Intrusion Detection Carsten Elfers and Mirko Horstmann and Karsten Sohr {celfers,mir,sohr}@tzi.de Center for Computing and Communication Technologies 28359 Bremen, Germany

Abstract

Intrusion detection in computer networks faces the problem of a large number of both false alarms and unrecognized attacks. To improve the precision of detection, various machine learning techniques have been proposed. However, one critical issue is that the amount of reference data that contains serious intrusions is very sparse. In this paper we present an inference process with linear chain conditional random fields that aims to solve this problem by using domain knowledge about the alerts of different intrusion sensors represented in an ontology.

Introduction

Computer networks are subject to constant attacks, both targeted and unsighted, that exploit the vast amount of existing vulnerabilities in computer systems. Among the measures a network administrator can take against this growing problem are intrusion detection systems (IDS). These systems recognize adversary actions in a network through either a set of rules with signatures that match against the malicious data stream or detection of anomalous behavior in the network traffic. Whereas the former will not recognize yet unknown vulnerability exploits (zero-day) due to the lack of respective signatures, the latter has an inherent problem with false positives. Anomalies may also be caused by a shift in the network users' behavior even when their actions are entirely legitimate (see (Garcia-Teodoro et al. 2009)). One strategy is to combine the signature and anomaly based detection to a hybrid IDS by learning which detection method is reliable for a given situation (e.g. (Gu, Cárdenas, and Lee 2008)). In this setup detecting false positives is the challenging task to avoid overwhelming the users of an IDS with irrelevant alerts but without missing any relevant ones. Several well-known machine learning methods have already been applied to the domain of intrusion detection, e.g. Bayesian networks for recognizing attacks based on attack-trees (Qin and Lee 2004) and (hidden colored) Petri nets to infer the actions of the attacker by alerts (Yu and Frincke 2007). For the detection of multi-stage intrusions in alert sequences especially hidden Markov models have been successfully investigated (e.g. (Lee, Kim, and Jung 2008; Ourston et al. 2003)). These models suffer from an implicit modeling of past alerts with the Markov assumption. However, in this domain the threat of an alert may highly depend on the context, e.g. the previously recognized alerts. This problem can be addressed by using Conditional Random Fields (CRF) (Lafferty, McCallum, and Pereira 2001) that can consider several (past) alerts to reason about the current state. It has been shown that CRFs are very promising for detecting intrusions from simulated connection information in the KDD cup '99 intrusion domain [1] compared to decision trees and naive Bayes (Gupta, Nath, and Ramamohanarao 2007; 2010). However, the high amount of reference data as in the KDD data set is only available in simulated environments and is not available in real network domains (cf. (Anderson 2008)). The sparse reference data problem is due to the infrequent occurrence of successfully accomplished critical intrusions. This leads to the problem that most of the possible alerts are even unknown at the training phase of the alert correlator. One possibility to face this problem is described in this paper: Typed Linear Chain Conditional Random Fields. This method uses type information of feature functions for the inference in linear chain conditional random fields and is motivated by filling the gap of missing reference data by considering semantic similarities. Earlier work has already considered semantic similarity between states for the inference, e.g. in Markov models (Anderson, Domingos, and Weld 2002), in hidden Markov models (Elfers and Wagner 2010) and in input output hidden Markov models (Oblinger et al. 2005). The latter is very similar to linear chain conditional random fields. The inference can also be regarded as mapping a sequence of input values to a sequence of labels. The paper is organized as follows: in the next section we describe the representation of the intrusion detection domain in an ontology and its use for preprocessing the alerts from the different IDSs. We then address the problem of sparse reference data by using this domain knowledge, evaluate the type extension to linear chain conditional random fields on some real examples from the intrusion detection domain, and finally conclude with an outlook on future research.

[1] KDD '99 data set: http://kdd.ics.uci.edu/databases/kddcup99

Preprocessing and Domain Knowledge

Hybrid IDSs that use both signature-based and anomaly-based detectors are a promising strategy to improve the precision of intrusion detection. Our approach therefore involves correlation of alarms from several detectors that can be added if they are present in a particular network. As a first step, we use a syntactic normalization in the IDMEF [2] format, which is done by Prelude Manager [3], a well-known open-source interface. This is followed by a semantic normalization that enables the system to handle each sensor's alarms according to their meaning and a burst filtering that eliminates duplicates of alarms produced by several sensors or as a result of similar observations. The semantic normalization is based on an ontology in OWL-DL [4] representation. This ontology contains several aspects of the security domain, including, e.g., the topology of the network in question, its computers (assets) and general configuration. Of particular interest for the recognition of multi-step attacks are definitions of possible observations that can be made by the sensors, organized in a hierarchy of concepts. Among the concepts are some that have been derived from classes introduced by Snort. Individuals that belong to these concepts are possible observations and can be imported from Snort's rule set by an automatic parser. When analysing multi-step attacks, these observations can be considered as describing adversary actions of an attacker, but from a security expert's perspective. Furthermore, the hierarchy denotes semantic similarity between nearby concepts and thereby supports the further correlation process. If knowledge about further sensors is added to the ontology, several observations from one or more sensors can be unified when they are instances of the same concept from the observation ontology. E.g., if an observation according to the ET EXPLOIT MS04-007 Kill-Bill ASN1 exploit attempt rule has been made by the Snort IDS and the Prelude logfile parser LML recognizes a match of the Admin login rule in a log file it observes, they may be normalized to one concept AttemptedAdminObservation to which they both belong.
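A minimal sketch of this semantic normalization step, assuming a simple lookup table in place of the OWL-DL ontology. The sensor identifiers are assumptions, and all entries except the Kill-Bill and Admin-login examples taken from the text above are hypothetical.

```python
# Hedged sketch: mapping sensor-specific rule names to shared observation
# concepts, standing in for the ontology-driven normalization described above.

NORMALIZATION = {
    ("snort", "ET EXPLOIT MS04-007 Kill-Bill ASN1 exploit attempt"):
        "AttemptedAdminObservation",
    ("prelude-lml", "Admin login"):
        "AttemptedAdminObservation",
    ("snort", "ICMP PING NMAP"):
        "MiscActivityObservation",   # hypothetical example entry
}


def normalize(sensor, rule_name):
    """Return the ontology concept for a sensor alert, or None if unknown."""
    return NORMALIZATION.get((sensor, rule_name))


if __name__ == "__main__":
    print(normalize("snort", "ET EXPLOIT MS04-007 Kill-Bill ASN1 exploit attempt"))
    print(normalize("prelude-lml", "Admin login"))
```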

Prerequisites: Linear Chain Conditional Random Fields

The purpose of linear chain conditional random fields compared to hidden Markov models is to take multiple observations (at several time slices) into account for computing the probability of the labels (cf. (Lafferty, McCallum, and Pereira 2001)). Thereby the strong independence assumptions inherent in hidden Markov models are relaxed. In the following the simplified notation from (Wallach 2004) for linear chain conditional random fields is used, with a sequence of observations X, a sequence of labels Y, and a normalization function Z:

p(y \mid x, \lambda) = \frac{1}{Z(x)} \exp\Big(\sum_j \lambda_j F_j(y, x)\Big) \qquad (1)

The inference problem is to determine the probability distribution over a vector of labels y from a vector of observations x. Conditional random fields are generally not restricted in the dependencies among the nodes; in linear chain conditional random fields, however, the nodes depend only on their predecessor and on the vector of observations. Each feature function F_j has a corresponding weight \lambda_j that is computed during training.
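The following toy Python fragment illustrates Eq. (1) on a two-label alert sequence by brute-force normalization over all label sequences. The feature functions and weights are invented for illustration, and exhaustive enumeration of Z(x) is only feasible at toy sizes.

```python
# Toy illustration of Eq. (1): global features F_j sum a local indicator over
# the chain, and p(y|x) is obtained by normalizing exp(sum_j lambda_j F_j(y,x)).

from itertools import product
from math import exp

LABELS = ["normal", "attack"]


def f_port_scan(y_prev, y_t, x, t):
    # fires when a port-scan alert is labeled as attack
    return 1.0 if x[t] == "port_scan" and y_t == "attack" else 0.0


def f_attack_persists(y_prev, y_t, x, t):
    # fires when an attack label follows an attack label
    return 1.0 if y_prev == "attack" and y_t == "attack" else 0.0


FEATURES = [f_port_scan, f_attack_persists]
WEIGHTS = [1.5, 0.8]   # the lambdas, normally estimated from training data


def score(y, x):
    """exp(sum_j lambda_j F_j(y, x)) with each F_j summed over positions."""
    total = 0.0
    for lam, f in zip(WEIGHTS, FEATURES):
        total += lam * sum(f(y[t - 1] if t else None, y[t], x, t)
                           for t in range(len(x)))
    return exp(total)


def p_of_y_given_x(y, x):
    """Eq. (1): normalize over all label sequences (toy-sized chains only)."""
    z = sum(score(y_prime, x) for y_prime in product(LABELS, repeat=len(x)))
    return score(y, x) / z


if __name__ == "__main__":
    x = ["port_scan", "login_failure", "port_scan"]
    print(p_of_y_given_x(("attack", "normal", "attack"), x))
```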

Typed Linear Chain Conditional Random Fields One issue with linear chain conditional random fields is that if the observations/features are not known at training time there is a lack of information for computing the probability of the labels. Our suggestion is to use a type hierarchy of feature functions to find the most similar feature functions that handle the observation. E. g. if no feature function matches a tcp port scan observation it is dangerous to assume that the tcp port scan observation belongs to a normal system behavior. If there is a feature function matching a udp port scan observation and the type hierarchy expresses a high similarity between udp and tcp port scans, the feature function for udp port scan observation could be assumed to match instead. In our case we can derive the type hierarchy from the semantic normalization (cf. Section ). The computation of the conditional probability is therefore extended by a parameter for the type hierarchy over feature functions T :

Typed Linear Chain Conditional Random Fields

p(y|x, λ, T ) =

! 1 exp( λj Fj! (y, x)) Z(x) j

(2)

In the case of not having a matching feature function for a given x we propose to instantiate a new feature function F ! to match the currently unknown observation, i.e. the feature function is fulfilled (returns 1) iff the given observation arrives.

In this section we briefly introduce conditional random fields and extend them by using a type hierarchy to fill the gap of missing feature functions due to insufficient reference data. For ease of demonstration this paper assumes that each observation corresponds to one feature function, however the problems in the other case are also mentioned.

However, the corresponding weights for the new feature function need to be determined. The original weights of the most similar feature functions should be taken into account, but with a discount that reduces the likelihood that the sequence of observations really belongs to that label.


Figure 1: Excerpt from the observation taxonomy. Concrete observations (as defined by the sensors’ rules) are instances of concepts in a hierarchy. Each observation is also associated with the respective sensor that issues the alarm.

The weights λ_F' of the new feature function F' are determined by the weights of the most similar feature functions. The most similar feature functions are given by a similarity measurement. The set of most similar feature functions S_F is given by:

S_F = { F_s(x, y) | s ∈ argmin_k sim(F_j(x, y), F_k(x, y), T), F_k(x, y) ∈ B(x, y) }   (3)–(4)

B(x, y) ⊆ T is the set of bound feature functions, i.e. the feature functions that have a value for the given parameters. The corresponding weight of the new feature function F', based on the most similar feature functions S_F, is given by:

λ_F' = (1/|S_F|) Σ_s λ_s · sim(F_j, F_s, T)   (5)

As mentioned, there is the need for a similarity score between feature functions with regard to the type hierarchy, denoted as sim(a, b, T), a ∈ T, b ∈ T. There are different possibilities to determine the similarity. In this paper we investigated the method of Zhong et al. (Zhong et al. 2002). This method uses the distance from a to the closest common parent in the type hierarchy, denoted as d(a, ccp, T), and the distance from b to the closest common parent, d(b, ccp, T), where the distance is defined as:

d(a, b, T) = | 1/(2k^l(a,T)) − 1/(2k^l(b,T)) |   (6)

l(n, T) is the depth of n ∈ T from the root node in the corresponding type hierarchy, where the depth of the root node is zero (l(root, T) = 0). k is a design parameter that indicates how fast the distance increases depending on the depth in the hierarchy; in this paper we use k = 2 as proposed by Zhong. The similarity of two feature functions is given by the distances to the closest common parent:

sim(a, b, T) = 1 − d(a, ccp, T) − d(b, ccp, T) ∈ [0, 1]   (7)

Preliminary Evaluation

The implementation of typed linear chain conditional random fields is based on the code of Léon Bottou (http://leon.bottou.org/projects/sgd) using stochastic gradient descent for learning conditional random fields. The evaluation consists of two real intrusions performed with the Metasploit Framework (http://www.metasploit.com/): (1) the Kill-Bill exploit (Metasploit: windows/smb/ms04_007_killbill) and (2) the Net-Api exploit (Metasploit: windows/smb/ms08_067_netapi). The gathered sequences of alerts from the Snort detector and the normalized alerts from the preprocessor are presented in Fig. 2 and 3. The two kinds of linear chain conditional random fields (typed and untyped) have been trained to detect the Net-Api exploit as an attack and the MiscActivityObservation as normal system behavior. Both methods share exactly the same reference data and take the two preceding, the current, and the two succeeding alerts into account for the labeling. The linear chain conditional random field has been tested against the typed linear chain conditional random field by performing the untrained Kill-Bill exploit. As expected, the typed conditional random field detects the Kill-Bill exploit by using the domain knowledge for this unknown observation.

Alert                               Similarity
ProtocolCommandDecodeObservation    0.882812
ShellcodeDetectObservation          0.882812
MiscActivityObservation             0.519531
MiscAttackObservation               0.519531

Figure 4: Similarity of the different feature functions to the feature for AttemptedAdminObservation.
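To make the measure concrete, the following sketch computes the distance and similarity of Eqns. 6 and 7 and the weight approximation of Eqn. 5 for a small, invented type hierarchy. The concept names are borrowed from the examples above, but the depths and weights are illustrative and do not reproduce Figure 4.

K = 2  # design parameter k as used in the paper

# child -> parent; the root ("Observation") has no parent entry
PARENT = {
    "PortScanObservation": "Observation",
    "TCPPortScanObservation": "PortScanObservation",
    "UDPPortScanObservation": "PortScanObservation",
    "MiscActivityObservation": "Observation",
}

def depth(node):
    d = 0
    while node in PARENT:
        node = PARENT[node]
        d += 1
    return d

def ancestors(node):
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def closest_common_parent(a, b):
    anc_a = ancestors(a)
    for n in ancestors(b):
        if n in anc_a:
            return n
    return None

def dist(a, b):
    # d(a, b, T) = | 1/(2 k^l(a,T)) - 1/(2 k^l(b,T)) |   (Eqn. 6)
    return abs(1.0 / (2 * K ** depth(a)) - 1.0 / (2 * K ** depth(b)))

def sim(a, b):
    # sim(a, b, T) = 1 - d(a, ccp, T) - d(b, ccp, T)     (Eqn. 7)
    ccp = closest_common_parent(a, b)
    return 1.0 - dist(a, ccp) - dist(b, ccp)

def approximate_weight(similar_weights):
    # Eqn. 5: mean of the similar features' weights, each discounted by its similarity
    return sum(w * s for w, s in similar_weights) / len(similar_weights)

print(sim("TCPPortScanObservation", "UDPPortScanObservation"))   # siblings: high similarity
print(sim("TCPPortScanObservation", "MiscActivityObservation"))  # distant concepts: lower
print(approximate_weight([(0.8, 0.9), (0.5, 0.9)]))              # weight for an unseen feature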


Time  Normalized alert (after preprocessing)   Snort message
1     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic
2     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic
3     AttemptedAdminObservation                ET EXPLOIT MS04-007 Kill-Bill ASN1 exploit attempt
4     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic
5     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic

Figure 2: Alert sequence for the Kill-Bill exploit.

Time  Normalized alert (after preprocessing)   Snort rule
1     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic
2     ShellcodeDetectObservation               ET EXPLOIT x86 JmpCallAdditive Encoder
3     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic
4     ProtocolCommandDecodeObservation         NETBIOS SMB-DS IPC$ share access
5     MiscActivityObservation                  BAD-TRAFFIC udp port 0 traffic

Figure 3: Alert sequence for the Net-Api exploit.

Figure 5: Excerpt from the ontology with highlighted concepts used in the evaluation examples.

The typed conditional random field does not know the observation AttemptedAdminObservation from training, but it searches the available feature functions with the highest degree of similarity. These are the features for ProtocolCommandDecodeObservation and ShellcodeDetectObservation (cf. Fig. 4 and 5), and the corresponding weights are computed as described in Eqn. 5. Both features refer to an attack, and therefore the classification comes to the conclusion that the unknown observation AttemptedAdminObservation also refers to an attack. In contrast, the untyped linear chain conditional random field has not detected the Kill-Bill exploit and therefore produced a critical classification in the intrusion detection domain: a false negative. This shows how typed linear chain CRFs enrich traditional linear chain CRFs. Another experiment performing the trained Net-Api exploit showed that both classifiers behave exactly the same for already known reference data. In conclusion, the typed linear chain CRF outperforms the traditional linear chain CRF if the necessary domain information is available, at the cost of an increased inference effort for classifying untrained features. Another experiment has shown that the gain in inference accuracy is limited by the quality of the type hierarchy. In this experiment the MiscActivityObservation has been attached in the type hierarchy such that it has exactly the same similarity as the other similar observations ProtocolCommandDecodeObservation and ShellcodeDetectObservation. The high belief of the model that the MiscActivityObservation feature corresponds to normal system behavior led to the misclassification of the AttemptedAdminObservation as normal system behavior, as in linear chain conditional random fields. Beyond the scope of this work, there are two open issues for further improving typed linear chain conditional random fields. The first is the semantic context sensitivity of the feature functions: the weights of the known similar feature functions used to approximate the unknown one may be highly dependent on the context in which they appear in the reference data. E.g., a MiscActivityObservation alone may be harmless, but with a preceding ProtocolCommandDecodeObservation it may belong to an attack. This means that the semantic meaning of the feature functions may vary with the parameters (the feature functions' context), which leads to the problem that the semantic similarity of feature functions cannot be represented by a simple type hierarchy; the type hierarchy itself may need to consider the context information too. The second issue is the semantic decomposition of observations: generally, multiple feature functions may have a value for one observation. Each feature function may represent another semantic part of the observation. Further, the individual feature functions belonging to one observation may have a high dissimilarity to each other. If an observation is unknown in a typed linear chain conditional random field, only one feature function is approximated, with possibly too few similar feature functions.


E.g., five feature functions may have a significant value for one observation, but this observation is unknown during the training phase. In typed linear chain conditional random fields only one feature function is approximated, and probably with only some of the five feature functions taken into account. This is a challenging issue because there is currently no reasonable information in the type hierarchy about the number of feature functions to approximate.

Conclusion and Future Work

Typed linear chain conditional random fields offer an improved way to handle missing feature functions. The missing feature functions' weights are approximated at runtime by searching for semantically similar feature functions in a type hierarchy. The type hierarchy is extracted from an ontology, and the semantic similarity between the concepts in the ontology (respectively the type hierarchy) is determined by a measurement from Zhong et al. (Zhong et al. 2002). Fortunately, the training process remains the same as for conditional random fields; only the inference process is adapted. Further, the computational effort of the inference process only increases if missing reference data influences the inference result; all other cases are not affected. First experiments in the domain of intrusion detection have shown that this is a useful extension to linear chain conditional random fields and that with this method variations of already known kinds of intrusions can be detected more reliably. In the future, the evaluation should be extended to a more expressive data set. Currently the benchmark sets of real intrusions are either very limited in the amount and kinds of intrusions or are only available for low-level analysis. Besides the mentioned issues of semantic context sensitivity and semantic decomposition of observations, the search for similar features may be improved by suitable search algorithms. Also, the similarity measurement might be extended by not only considering a type hierarchy but also considering different object properties/relations in the ontology, e.g. IP-to-subnet relations and host-to-asset relations. Overall, typed linear chain conditional random fields are a promising step in the direction of using complex domain knowledge to improve reasoning over time with only little reference data.

References

Anderson, C.; Domingos, P.; and Weld, D. 2002. Relational markov models and their application to adaptive web navigation. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Anderson, R. 2008. Security Engineering, 2nd Ed. Wiley Publishing, Inc. 664.
Elfers, C., and Wagner, T. 2010. Learning and prediction based on a relational hidden markov model. In International Conference on Agents and Artificial Intelligence.
Garcia-Teodoro, P.; Díaz-Verdejo, J.; Maciá-Fernández, G.; and Vázquez, E. 2009. Anomaly-based network intrusion detection: Techniques, systems and challenges. In Computers & Security.
Gu, G.; Cárdenas, A. A.; and Lee, W. 2008. Principled reasoning and practical applications of alert fusion in intrusion detection systems. In ASIACCS '08.
Gupta, K. K.; Nath, B.; and Ramamohanarao, K. 2007. Conditional random fields for intrusion detection. In 21st International Conference on Advanced Information Networking and Applications Workshops (AINAW'07).
Gupta, K. K.; Nath, B.; and Ramamohanarao, K. 2010. Layered approach using conditional random fields for intrusion detection. In IEEE Transactions on Dependable and Secure Computing.
Lafferty, J.; McCallum, A.; and Pereira, F. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In 18th International Conf. on Machine Learning.
Lee, D.; Kim, D.; and Jung, J. 2008. Multi-stage intrusion detection system using hidden markov model algorithm. In Proceedings of the 2008 International Conference on Information Science and Security.
Oblinger, D.; Castelli, V.; Lau, T.; and Bergman, L. D. 2005. Similarity-based alignment and generalization. In Machine Learning: ECML.
Ourston, D.; Matzner, S.; Stump, W.; and Hopkins, B. 2003. Applications of hidden markov models to detecting multi-stage network attacks. In Proceedings of the 36th Hawaii International Conference on System Sciences.
Qin, X., and Lee, W. 2004. Attack plan recognition and prediction using causal networks. In Annual Computer Security Applications Conference.
Wallach, H. M. 2004. Conditional random fields: An introduction. Technical Report MS-CIS-04-21, University of Pennsylvania.
Yu, D., and Frincke, D. 2007. Improving the quality of alerts and predicting intruder's next goal with hidden colored petri-net. In Computer Networks: The International Journal of Computer and Telecommunications Networking.
Zhong, J.; Zhu, H.; Li, J.; and Yu, Y. 2002. Conceptual graph matching for semantic search. In Proceedings of the 2002 International Conference on Computational Science.


A Dynamic Knowledge Base for Intrusion Detection Mirko Horstmann and Carsten Elfers and Karsten Sohr Technologie-Zentrum Informatik und Informationstechnik Am Fallturm 1, 28359 Bremen, Germany

Abstract

We propose an approach to event correlation in security incident and event management systems (SIEMS) that uses machine learning methods combined with an ontology-based knowledge base. The knowledge base represents various aspects of the network, including necessary information about the network sensors in use, definitions of possible actions an intruder may take during the course of a multi-step attack, and the observations that can be made by network intrusion detection sensors. The correlation engine uses the knowledge base to normalise events from various existing intrusion detectors to infer plausible sequences of adversary actions.

Introduction

Intrusion detection in computer networks has traditionally been approached by two techniques: Signature-based intrusion detection finds given patterns in data streams that correspond to either malware code or the traffic produced by this malware. Signature-based IDSs like Snort (http://www.snort.org) use a large and extensible set of rules, each of which contains a regular expression to match against the network traffic and an alarm message to be brought to the security manager's attention. On the other hand, anomaly-based IDSs observe the network traffic to form profiles of normal network usage and will report deviations thereof. The problem with both kinds of detection is that false alarms are too common – because signatures frequently match against random attack attempts that are obviously doomed to fail, and a shift in the network users' behaviour sometimes triggers anomaly-based alarms even though their behaviour may be entirely legitimate. Additionally, signature-based approaches will fail when malware code has changed compared to the signatures and when facing completely new attacks (zero-day exploits). These shortcomings limit the acceptance of intrusion detection systems and, therefore, a great deal of research is devoted to techniques that aim to lower the number of false alarms. Some authors have proposed the use of artificial intelligence methods such as machine learning to improve the precision of security events: Rieck et al. (Rieck et al. 2008) observe the behaviour of malware in a sandbox environment, extract features using techniques from the field of information retrieval and text processing, and finally train a support vector machine using the resulting data. The most important features in a ranking serve as an explanation for the classification. Perdisci et al. (Perdisci, Gu, and Lee 2006) use a combination of one-class support vector machines trained on different feature spaces to improve the accuracy of their anomaly-based IDS PAYL (Wang, Cretu, and Stolfo 2005). Forrest et al. (Forrest, Hofmeyr, and Somayaji 1997) introduced the idea of using artificial immune systems (AIS) for intrusion detection. More recently, Luther et al. (Luther et al. 2007) presented an approach inspired by biological immune system principles. In their framework, detectors are created randomly and are discarded if they match normal patterns, or cloned if they match abnormal patterns. This results in a number of detectors that collectively react to anomalies in an agent-based fashion. Even though these detectors may have an advantage over rule-based intrusion detection, they still produce false positives that have to be dealt with on a higher level. Security Incident and Event Management Systems (SIEMS) like ArcSight (http://www.arcsight.com) or Enterasys (http://www.enterasys.com) analyse events from a higher perspective and correlate alarms that belong to the same incident, usually based on a set of rules. The FIDeS project (http://www.fides-security.org) aims to improve this kind of correlation. The FIDeS system correlates successive alarm messages from different existing detectors by a combined application of artificial intelligence techniques such as Relational Hidden Markov Models (Elfers and Wagner 2010) that recognise multi-step attacks within a sequence of events. One important aspect of the system is that in addition to the events provided by the sensors it uses additional, explicitly modelled knowledge for the correlation: FIDeS takes into account available information about the network it monitors as well as knowledge about the way an attack is usually carried out by an intruder and observed by the intrusion detection system. Therefore, in addition to the correlation engine, the FIDeS system needs a comprehensive knowledge base, which is partly implemented as an ontology.


The remainder of this paper is structured as follows: The next section shows how knowledge is organised inside the FIDeS knowledge base, then we describe challenges and possible solutions for its maintenance and an example use case for the acquisition of new knowledge for use in the system. Finally, we explain how the knowledge is used by the correlation engine.

Ontology-based Representation of Security Knowledge

At the heart of the FIDeS knowledge base is an ontology representing domain knowledge by concepts from the domain and their properties, i.e., attributes of single concepts and relations between them. The specific ontology language used here is OWL-DL, a language specified by the W3C that emerged from DAML+OIL and that uses RDF(S) notation (http://www.w3.org/TR/owl-features/). An ontology in this sense consists primarily of a number of concepts (classes) and their properties (attributes or relations between objects) and defines a vocabulary (the terminology, or TBox). Based on this vocabulary, facts about individuals (the assertions, or ABox) can be formulated in statements like HostA hasInstalled Solaris10, where "HostA" and "Solaris10" are individuals and "hasInstalled" is a relation between these two individuals that denotes the fact that the computer named HostA is running an operating system named Solaris10. Together, the ontology and the individuals constitute a part of the knowledge base of the system. The ontology used in the FIDeS project describes various aspects of the security domain like, e.g., the topology of the network in question, its computers and general configuration. Of particular interest for the recognition of multi-step attacks (as described in the last section) are definitions of possible actions of an intruder and the observations that can be made by the sensors. When analysing multi-step attacks, these representations are important considerations as both describe adversary actions: The former describes them from an attacker's perspective, the latter from a security expert's perspective. Atomic attack steps or actions in the knowledge base are the smallest units used to describe multi-step attacks and constitute all known instruments an attacker can use to intrude into a system. These attack steps are part of the ontology and are organised hierarchically by possible goals of the intruder. The hierarchical representation allows the system to generate explanations on different levels of abstraction and may later help a user to query information according to his level of knowledge. Fig. 1 shows an excerpt of this hierarchy. Security-relevant events are recognised by IDSs and reported to the administrator based on a set of rules that define the possible observations the system is able to make. The standard rule set of the Snort IDS uses 34 categories like "trojan-activity" or "policy-violation" to classify the events. These are used as a basis for the current definition of observations inside the ontology. Fig. 2 shows an excerpt of this – e.g., ReportedTCPPortScanEvent is a subclass of ReportedPortScanEvent. Events are modelled as individuals of these category concepts with relations to their respective sensor (Snort in this case).
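As an illustration of such TBox/ABox statements, the following sketch asserts the HostA hasInstalled Solaris10 fact with the rdflib library; the namespace URI is invented for the example and the snippet is not part of the FIDeS implementation.

from rdflib import Graph, Namespace, RDF, RDFS

FIDES = Namespace("http://example.org/fides#")  # hypothetical namespace
g = Graph()

# TBox: a tiny fragment of the terminology
g.add((FIDES.Host, RDF.type, RDFS.Class))
g.add((FIDES.OperatingSystem, RDF.type, RDFS.Class))

# ABox: facts about individuals
g.add((FIDES.HostA, RDF.type, FIDES.Host))
g.add((FIDES.Solaris10, RDF.type, FIDES.OperatingSystem))
g.add((FIDES.HostA, FIDES.hasInstalled, FIDES.Solaris10))

# list everything installed on HostA
for os in g.objects(FIDES.HostA, FIDES.hasInstalled):
    print(os)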

Figure 1: Excerpt of the attack step taxonomy. All known single-step attack schemes are ordered by the attackers' goals.

Figure 2: Excerpt of the FIDeS ontology. All possible observations are sub-classes of the Observation concept, specific rules are individuals of such concepts and are generated by their respective Analyser (Snort IDS, in this case).

Maintenance of the Knowledge Base

As we will show in the last section, the correlation engine relies on the knowledge base on several occasions. While it is desirable to maintain the knowledge by automatic means, a challenge lies in the fact that at least some parts need to be modelled manually. Whereas the terminology for the observations and attack steps is rather comprehensible, there will be a lot of individuals (e.g., IDS rules attached to observation concepts in the ontology). As new sensors are added to the system, their observations will have to be added to the existing knowledge – the knowledge base is therefore a dynamic one. The particular set of sensors may of course be different in every network and is the responsibility of the local system administrator. However, the features of a certain sensor that have to be modelled in the ontology are valid for all installations of that sensor. The modelling process is therefore a combination of local and distributed work. This poses the question of how users and domain experts can work together on the ontology to incorporate all the knowledge needed. A number of methodologies and tools for the engineering of ontologies have been developed over the years (see (Gomez-Perez, Fernandez-Lopez, and Corcho-Garcia 2004) for a review), among them METHONTOLOGY (Fernandez-Lopez, Gomez-Perez, and Juristo 1997) and the method developed by Uschold and King (Uschold and King 1995). However, of particular interest in recent years have been means for the distributed development of dynamic ontologies: The DILIGENT (Tempich et al. 2005) argumentation framework aims to assist in finding consensus within a group of people working on an ontology.

The framework supports the discussion between participants with a wiki-based tool and an argumentation ontology, which defines concepts to describe and keep a record of the ongoing argumentation. The framework thereby allows new participants in the process to inform themselves about former decisions and their rationale. The process focuses not so much on the initial design as on the evolution of an ontology and is described by the following five steps: (1) A panel of domain experts develops an initial ontology that is not necessarily complete. This ontology is distributed among the users, who (2) adapt the ontology to their needs and possibly to their own, differing vocabulary. (3) The panel members collect suggestions for changes, analyse them and decide which changes should be accepted into the global ontology. (4) The ontology is then revised by the panel and the users. (5) Finally, local copies of the ontology are updated by the users. This process is based on the assumption that work on a dynamic ontology is distributed among a number of users in different places and a panel of experts that coordinates this work – a situation that is not unlike the one described for our system above, which is deployed at various users' sites and provided with knowledge that, although it can partly be acquired in a central place, must be recombined according to a particular user's environment and supplemented with his own additions. We propose a semi-formalised way to support discussion among all participants, like that used by the DILIGENT framework, which will be a good basis for the ongoing maintenance of the knowledge base.

The ontology of observations currently consists of concepts that are derived from Snort's categories and individuals that correspond to the events that can be observed by Snort. As new sensors are added to a network, these individuals will have to be supplemented with new events – and possibly with new concepts, if the new sensors' categories differ considerably from Snort's. For this reason, the intended knowledge base editor needs editing features for the concept hierarchy. As the same sensors are used in many users' networks, it is desirable to have the knowledge about these sensors modelled in one place rather than by each user. However, at the user's site, there will only be one observation hierarchy that describes observations from all sensors.

The following is an example use case that shows how a network administrator would add a new sensor to the ontology: What actions and knowledge are necessary if a security expert needs to add a new sensor (in this case the logfile parser Prelude-LML, https://dev.prelude-technologies.com/wiki/1/PreludeLml) to the knowledge base? When changing the knowledge base, the user is faced with two problems: First, the existing concepts and the new concepts may describe aspects of the domain from different perspectives. The user needs knowledge not only about the domain in general, but also about the perspectives that underlie the vocabularies he is going to rely on. Second, as the knowledge base is used by particular inference mechanisms, it is important that these mechanisms can still use the knowledge base in its new state. One cannot assume that a security manager will have a deep understanding of, e.g., the mechanisms described in the last section. Therefore, a suitable editing tool needs to ensure that the knowledge base remains in a state that is usable by both the users and the inference mechanisms. To add the new vocabulary of Prelude-LML, the user needs to know the concept semantics. The rules of both Snort and Prelude-LML contain descriptions of the event that can help the modelling user make the right decisions. The following is an excerpt from a Prelude-LML rule that matches if someone logs in successfully as the root user:

regex=Accepted (\S+) for root from (\S+) port (\d+);
classification.text=Admin login; [...]
analyzer(0).class=Authentication; [...]
assessment.impact.description=
  Root logged in from $2 port $3 using the $1 method;

The rule description contains the classification text for the event (Admin login) and a class (Authentication).

The class may be seen as equivalent to Snort's notion of a category and can therefore lead the user to the place in the existing ontology where this particular rule or a whole set of rules should be integrated. This works as long as the class is compatible with the existing categories and the rest of the ontology.

A more complex case occurs when it becomes necessary to add a new concept. If new rules cannot be assigned to an existing concept, the ontology needs to be extended at a higher level. Consider the hierarchy inside the ontology from Fig. 2, with its concepts partly derived from Snort's categories, and the Prelude-LML rule from above. The ontology currently has a concept SuspiciousLoginObservation with only one sub-concept, DefaultLoginAttemptObservation. The latter is a rather special case of suspicious login – one with a default password. There is no sub-concept of SuspiciousLoginObservation that would be specifically suited for a successful root login. We now assume that the user decides that a root login is always suspicious and such an observation should therefore be integrated under the concept SuspiciousLoginObservation. He also decides that a distinction should be made between successful logins and mere attempts, because an unsuccessful login attempt is uninteresting for intrusion detection in most cases. As a result, he introduces two sub-categories to SuspiciousLoginObservation: SuccessfulLoginObservation and AttemptedLoginObservation. The formerly existing branch is then subcategorised under AttemptedLoginObservation and the new rule is added to the category SuccessfulLoginObservation. Fig. 3 shows the new version of the observation hierarchy inside the ontology.

Figure 3: New version of the taxonomy of observations, with new sub-concepts of SuspiciousLoginObservation.

Improved attack recognition by domain knowledge

In the FIDeS system, the knowledge base described above is used by the correlation engine that aims to recognise multi-step attacks from incoming alert streams in a number of steps (see Fig. 4):

1. Collecting, aggregating and normalising alerts from different sensors
2. Refining alerts based on predefined patterns considering domain knowledge
3. Attack sequence recognition by classification algorithms based on matched patterns

Figure 4: Overview of the information flow in FIDeS.

Alert collection and preprocessing

FIDeS is designed to use several alert sources for the attack recognition process. Therefore it is necessary to have a common interface: A well-known open source interface for this task is the Prelude manager (http://www.prelude-technologies.com/), which is used to collect syntactically normalised messages in the IDMEF format (RFC 4765) from different sources, e.g., network signature detectors like Snort, logfile detectors like Prelude-LML, or anomaly-based detectors like IAS (see (Bastke, Deml, and Schmidt)). The FIDeS aggregation and normalisation component is directly connected to the Prelude manager and receives the alerts from the different sensors. In the context of analysing alerts from different sources one generally suffers from an overwhelming amount of alerts. Using complex artificial intelligence methods for all of these alerts is not feasible, so one has to decide which alerts are the most suitable/promising for detecting intrusions. At first, a semantic normalisation of the incoming alerts based on the hierarchy of alerts from the knowledge base is a necessary step to decide which alerts are the most promising. The semantic normalisation is necessary to enable the system to handle each sensor in the same way. Fig. 5 shows an example of how the information for semantic normalisation is described in the ontology. For example, if an alert is generated from an analyser with the label snort in the analyser field and the label (portscan) TCP Distributed Portscan in the alert classification field of the IDMEF message, we can normalise it to the corresponding concepts, i.e. the value of the analyser field to the concept Snort and the value of the classification field to the concept TCPPortScanObservation.

Figure 5: Example of the normalisation information in the ontology.

The next step is to detect and aggregate bursts/duplicates in the alert stream. Therefore, we compare the normalised source IP, destination IP and classification information. If duplicates in a given time window are detected, the preprocessor combines them into one aggregated alert; in most cases it is sufficient to analyse only the aggregated alert, but each of the aggregated alerts can still be inspected in detail if necessary. Additionally, this provides robustness against the failure of alert sources and also allows adding and removing redundant sensors at runtime. The preprocessing is also necessary to abstract the attack knowledge from the special learning domain (there is a certain amount and type of sensors detecting attacks) and to transfer this information to other domains (with another type and amount of sensors).
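A short sketch of this burst/duplicate aggregation is given below; the alert field names and the time window are illustrative assumptions and do not reflect the actual FIDeS component.

from collections import defaultdict

def aggregate(alerts, window=10.0):
    # group alerts with identical (source IP, destination IP, normalised class) that fall
    # into the same time window into one aggregated alert that keeps its members
    buckets = defaultdict(list)
    for a in alerts:  # a is a dict-like normalised IDMEF alert
        key = (a["src_ip"], a["dst_ip"], a["classification"], int(a["ts"] // window))
        buckets[key].append(a)
    aggregated = []
    for (src, dst, cls, _), members in buckets.items():
        aggregated.append({"src_ip": src, "dst_ip": dst, "classification": cls,
                           "count": len(members), "members": members})
    return aggregated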

Alert refinement

To determine which alerts are indicators for real attacks we need to refine the alerts using domain knowledge. E.g., a port scan from outside the network to the web server is in most cases harmless, but a port scan from the network's own web server to internal hosts is mostly very dangerous. In this example we need to consider knowledge from the asset management to distinguish these two cases. To realise alert refinement, we need a set of patterns describing complex situations based on the alerts' properties. Patterns contain a set of constraints in the form of: property of the alert, relation, concept in the ontology. For example, one constraint may be that the source IP address is from the internal network, so we define the constraint alert_ip_src rdf:subClassOf fides:InternalAddress, where alert_ip_src is instantiated with the corresponding field value from the IDMEF message, rdf:subClassOf is a relation or object property from the ontology and fides:InternalAddress is a concept from the ontology.

Another constraint may be that the alert classification must be a port scan alert, so we additionally define the constraint alert_class rdf:subClassOf fides:PortScanObservation and name our pattern in this case Internal Portscan. The created pattern instances can be formalised as SPARQL queries (see Fig. 6) that can be evaluated by the knowledge base, which can decide whether the patterns' constraints are satisfied or not and report the result to the matchmaker. The Internal Portscan pattern can probably be instantiated in SPARQL as depicted in Fig. 6.

PREFIX rdf:
PREFIX fides:
ASK {
  "194.68.120.42" rdf:subClassOf fides:InternalAddress .
  fides:TCPPortScanObservation rdf:subClassOf fides:PortScanObservation .
}

Figure 6: SPARQL query example of checking the example constraints.

Another problem that occurs when refining alerts with context information is that not all necessary context information may be available. In our example, perhaps nobody has ever modelled a port scan from inside. The consequence is that this information is not available for further reasoning and that this attack will not be detected. We propose to do imprecise matchmaking by using the hierarchy of concepts. Consider, e.g., a ping alert with a source address value from the internal IP address range without a matching pattern: In this case, we can search for similar patterns considering the conceptual similarity in the constraint parameters, for example based on the method of Zhong et al. (Zhong et al. 2002). If we have a pattern with a constraint for the classification TCPPortScanObservation but no pattern for the classification UDPPortScanObservation and we receive an alert with the second classification, we can reason over the concept PortScanObservation (see Fig. 5) that both classifications are very similar because they are both descendants of the port scan concept. This is helpful for ranking previously unknown alerts, but it does not avoid the need for the user to model special patterns for this alert in the long term. However, if no exact match has been found in the set of patterns, the distance to the imprecisely matching pattern must be added to the matching information to adjust the parameters of further inference methods, e.g., to compute a penalty for the certainty of the results. The imprecise matchmaking can be done in the knowledge base by extending the SPARQL language with so-called magic properties as in iSPARQL (Kiefer, Bernstein, and Stocker 2007), or by integrating the search for similar concepts into the matchmaking process itself. After checking patterns against the alerts and the domain knowledge, this information can be used for intelligently classifying the sequence of matches to corresponding alerts and to known attack types.
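A minimal sketch of evaluating such a pattern with rdflib is shown below. It instantiates the Internal Portscan constraints with an alert's field values and runs a SPARQL ASK query; the ontology file name, the fides namespace and the fides:ipValue data property linking an address individual to its textual IP value are assumptions, not the actual FIDeS model.

from rdflib import Graph

kb = Graph().parse("fides.owl", format="xml")  # placeholder path for the knowledge base

ASK_TEMPLATE = """
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX fides: <http://example.org/fides#>
ASK {{
  ?addr rdf:type fides:InternalAddress .
  ?addr fides:ipValue "{src}" .
  fides:{cls} rdfs:subClassOf fides:PortScanObservation .
}}
"""

def matches_internal_portscan(src_ip, classification):
    # instantiate the pattern with the alert's field values and ask the knowledge base
    q = ASK_TEMPLATE.format(src=src_ip, cls=classification)
    return bool(kb.query(q).askAnswer)

# example call, assuming the alert fields have already been normalised:
# matches_internal_portscan("194.68.120.42", "TCPPortScanObservation")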

Attack sequence recognition

Recognising attacks in the stream of refined alerts (matches of patterns) can be done by several methods of reasoning over time. Specifically, probabilistic methods are reasonable due to possibly missing alerts and varying attack sequences, e.g., Hidden Markov Models (Lee, Kim, and Jung 2008). In our case we can also benefit from the domain knowledge in the knowledge base, as used for the imprecise matchmaking before, e.g., by Relational Hidden Markov Models (Elfers and Wagner 2010). This model uses the observation and attack step concepts for statistical smoothing to improve the inference performance. Therefore we are specifically interested in the conceptual similarities between the attack steps and between the matches to handle currently unknown attack types by reasoning over previously known similar attack types. Because of the limitation of the number of hypotheses to take into account, due to a high computational effort and the problem that we do not know which alerts belong to an attack (e.g. between two alerts belonging to an attack there may be thousands of alerts not belonging to it), we propose a method we call hypotheses pool. The hypotheses pool consists of several priority queues of alert sequences, with one priority queue of hypotheses for each specific length of hypotheses (one queue for hypotheses with one attack step, one queue for hypotheses with two attack steps and so on) and with each hypothesis given a priority based on the probability that this sequence belongs to an attack. If a new alert arrives from the preprocessing, each hypothesis is expanded by the new alert and it is checked whether the probability belonging to the sequence is higher than the lowest probability of the hypotheses in the corresponding priority queue. The hypothesis with the lowest probability is then dropped so that the most promising alert sequences remain. If the expanded hypothesis' probability is lower than all the corresponding hypotheses in the queue, the expanded hypothesis is dropped. To identify the most promising hypotheses for further investigation we have specified a quotient called threat confidence. This is the probability that the corresponding alerts of a hypothesis belong to the most likely attack sequence (determined by the Viterbi algorithm (Viterbi 1967)) divided by the probability that all the alerts belong to normal system behaviour:

TC = P(a1, ..., an | e1, ..., en) / P(n1, ..., nn | e1, ..., en)   (1)

where a are the most probable attack steps, e are the observed alerts and n is the state for normal system behaviour. Preliminary experiments have shown that this is a reasonable confidence for the certainty that the given sequence belongs to an attack. The attack sequences with the highest threat confidences are presented to the user, who can decide whether this is a real attack or not. If it is not an attack, the user may simply drop this hypothesis (to make room for further hypotheses) or may annotate the recognised attack to improve the system's behaviour, e.g., by adding missing domain knowledge (adding patterns for the matchmaking process) and/or by adding classification data to this sequence (perhaps classifying the sequence as normal system behaviour, i.e., as false positives).
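The bookkeeping of the hypotheses pool and the threat confidence quotient can be sketched as follows. The probability model is reduced to stub functions and the per-queue bound is an arbitrary choice, so this only illustrates the data structure, not the Relational Hidden Markov Model itself.

import heapq
import itertools

MAX_PER_LENGTH = 100          # arbitrary bound on hypotheses kept per sequence length
_tie = itertools.count()      # tie-breaker so the heap never compares hypotheses directly

def p_attack(alerts):
    # stub for P(a1..an | e1..en): most likely attack sequence probability (Viterbi)
    return 0.5 ** len(alerts)

def p_normal(alerts):
    # stub for P(n1..nn | e1..en): probability of normal system behaviour
    return 0.9 ** len(alerts)

def threat_confidence(alerts):
    # Eqn. 1: TC = P(attack sequence | alerts) / P(normal behaviour | alerts)
    return p_attack(alerts) / p_normal(alerts)

class HypothesesPool:
    def __init__(self):
        self.queues = {}      # hypothesis length -> min-heap of (probability, tie, hypothesis)

    def on_alert(self, alert):
        # expand every stored hypothesis (and start a fresh one) with the new alert
        candidates = [[alert]]
        for heap in self.queues.values():
            candidates.extend(hyp + [alert] for _, _, hyp in heap)
        for hyp in candidates:
            heap = self.queues.setdefault(len(hyp), [])
            entry = (p_attack(hyp), next(_tie), hyp)
            if len(heap) < MAX_PER_LENGTH:
                heapq.heappush(heap, entry)
            elif entry[0] > heap[0][0]:          # better than the worst hypothesis in this queue
                heapq.heapreplace(heap, entry)   # the worst one is dropped

    def best(self, n=3):
        hyps = [hyp for heap in self.queues.values() for _, _, hyp in heap]
        return sorted(hyps, key=threat_confidence, reverse=True)[:n]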

Conclusion and Outlook

In this paper, we propose the modelling of security-relevant properties of a network domain with ontologies. Ontologies are a well-founded way of knowledge representation that enables the user to model heterogeneous knowledge in one representative language. The domain-specific knowledge can be applied to improve the process of correlating security alarm messages and is used internally for matchmaking and classification purposes and for a user-specific presentation of attacks. Specifically, the matchmaking process benefits from this representation because complex domain information allows the user to express meaningful patterns, and we show how to formulate these complex patterns in SPARQL. Additionally, we show the challenge of adding new knowledge to the ontology in a use case: To add a new detector to the network, an administrator may have to edit the ontology within the knowledge base, for which he needs to be provided with additional information. How the modelling process of this security-related knowledge can be supported for a distributed group of users is still a matter of research and will be an important step for the acceptance of this kind of representation by the intended users. An evaluation of the first part of the detection process is currently in progress.

Acknowledgements

This work was supported by the German Federal Ministry of Education and Research (BMBF) under the grant 01IS08022A.

References

Bastke, S.; Deml, M.; and Schmidt, S. Combining statistical network data, probabilistic neural networks and the computational power of GPUs for anomaly detection in computer networks. First Workshop on Intelligent Security (SecArt).
Elfers, C., and Wagner, T. 2010. Learning and prediction based on a relational hidden markov model. In International Conference on Agents and Artificial Intelligence.
Fernandez-Lopez, M.; Gomez-Perez, A.; and Juristo, N. 1997. Methontology: from ontological art towards ontological engineering. In Proceedings of the AAAI97 Spring Symposium, 33–40.
Forrest, S.; Hofmeyr, S. A.; and Somayaji, A. 1997. Computer immunology. Communications of the ACM 40(10):88–96.
Gomez-Perez, A.; Fernandez-Lopez, M.; and Corcho-Garcia, O. 2004. Ontological Engineering. Springer-Verlag.
Kiefer, C.; Bernstein, A.; and Stocker, M. 2007. The fundamentals of iSPARQL – a virtual triple approach for similarity-based semantic web tasks. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, volume 4825 of LNCS, 295–308.
Lee, D.; Kim, D.; and Jung, J. 2008. Multi-stage intrusion detection system using hidden markov model algorithm. In Proceedings of the 2008 International Conference on Information Science and Security.
Luther, K.; Bye, R.; Alpcan, T.; Müller, A.; and Albayrak, S. 2007. A cooperative AIS framework for intrusion detection. In IEEE International Conference on Communications, 1409–1416.
Perdisci, R.; Gu, G.; and Lee, W. 2006. Using an ensemble of one-class SVM classifiers to harden payload-based anomaly detection systems. In IEEE International Conference on Data Mining, 488–498.
Rieck, K.; Holz, T.; Willems, C.; Düssel, P.; and Laskov, P. 2008. Learning and classification of malware behavior. In SIG SIDAR Conference on Detection of Intrusions and Malware & Vulnerability Assessment, volume 5137 of Lecture Notes in Computer Science, 108–125.
Tempich, C.; Pinto, H. S.; Sure, Y.; and Staab, S. 2005. An argumentation ontology for DIstributed, Loosely-controlled and evolvInG Engineering processes of oNTologies (DILIGENT). In Gómez-Pérez, A., and Euzenat, J., eds., Proceedings of the Second European Semantic Web Conference, volume 3532, 241–256.
Uschold, M., and King, M. 1995. Towards a methodology for building ontologies. In Workshop on Basic Ontological Issues in Knowledge Sharing, held in conjunction with IJCAI-95.
Viterbi, A. J. 1967. Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Transactions on Information Theory IT-13(2):260–269.
Wang, K.; Cretu, G. F.; and Stolfo, S. J. 2005. Anomalous payload-based worm detection and signature generation. In Valdes, A., and Zamboni, D., eds., RAID, volume 3858 of Lecture Notes in Computer Science, 227–246.
Zhong, J.; Zhu, H.; Li, J.; and Yu, Y. 2002. Conceptual graph matching for semantic search. In Proceedings of the 2002 International Conference on Computational Science.


Identifying Malware Behaviour in Statistical Network Data Sascha Bastke and Mathias Deml and Sebastian Schmidt and Norbert Pohlmann Institute for Internet-Security University of Applied Sciences Gelsenkirchen {bastke, deml, schmidt, pohlmann}@internet-sicherheit.de

Abstract Today simple signatures of malware binaries lose importance for malware detection. The reason for this is the polymorphic character of malware. An alternative for that is behaviour based detection. Within the scope of this work we focus on the detection of malware communication using statistical data that describes network traffic. For classification we used different approaches such as feed forward neural networks.

1. Introduction

Figure 1: Overview of infrastructures

Today simple signatures of malware binaries lose importance for malware detection. The reason for this is the polymorphic character of current malware such as Conficker (see (NERC 2008)). Examples for this are simple changes of filenames or the usage of compression schemes. An alternative is behaviour-based detection, which can be divided into host- and network-based detection. To get information about the behaviour of a malware it should be executed. This is normally done in sandboxes. In these sandboxes malware can be executed and its activity is logged. The data includes actions on the host itself and the communication between the malware and other systems over the network. In our work we focus on the detection of infected hosts by identifying malware communication. This means we try to identify malware communication based on prior knowledge about its network behaviour. An important note is that we only use statistical descriptions of network traffic and we try to avoid information relevant to data privacy. To get the necessary information for detection we need two infrastructures. First of all we must get a description of the network behaviour of the malware. For this we need a sample of the malware binary, so we need an infrastructure that collects current malware binaries. These binaries must be executed, so we can measure their network communication and gather information about the malware behaviour. For the collection of malware, honeypots like nepenthes (Bächer et al. 2006) and honeyclients can be used. The execution can be done with the help of sandboxes. An example of a sandbox is CWSandbox (Willems, Holz, and Freiling 2007). A combination of honeypots and sandbox systems is the InMAS system described in (Engelberth et al. 2009).

If only the network traffic of the malware is needed, a normal Windows installation can also be used where the network traffic is recorded with a tool like tcpdump. For this work we used nepenthes as honeypot and a simple Xen installation of Windows XP as execution environment. From this data we try to generate profiles for detection. The second infrastructure collects live data from network traffic. In this data we search for the profiles generated by the first infrastructure. The whole setup is shown in figure 1. To describe this more formally, let us assume we have n malware binaries. Then the first infrastructure generates a set S = {(m1, p1), (m2, p2), ..., (mn, pn)} of pairs where mi is a malware and pi is a profile for the network behaviour of the malware. The profiles p = {o1, o2, ..., ok} consist of a set of observations made by the execution environment. The second infrastructure generates a set O = {o1, o2, ..., om} of observations from the monitored network. The question to answer is now whether any subset G ⊆ O of these observations is similar to one of the profiles pi. In this case we possibly found malware activity. The term 'possibly' is due to the fact that a profile may not be good enough to identify only malware behaviour; it may also identify normal behaviour. It is also possible that an observation that describes one malware also describes another, because today malware is highly modularized and modules are exchanged between different malware developers. With the honeypot infrastructure we collected 245 malware binaries. These binaries are used for the experiments. The collected binaries were executed in a Windows XP system for five minutes and the generated traffic was dumped with tcpdump. The binaries were also analyzed by different virus scanners and with a tool that extracts the information from the PE header. In this work we only use the network traffic. The dump data is then processed to generate the observations and profiles for description. Normal traffic is taken from the network of the department of computer science at the University of Applied Sciences in Gelsenkirchen. Later we also got 3000 malware binaries from other sources that we used for analysis. We must emphasize that this is work in progress and we are at the very beginning, so the results are not final and need much deeper investigation.
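Read as pseudocode, the matching question above can be stated in a few lines; the similarity function is only a stub, since the following sections discuss concrete choices (cross-correlation, neural networks), and the descriptor names are invented for the example.

def similarity(profile, observations):
    # stub: degree to which the observed traffic resembles a malware profile
    overlap = len(set(profile) & set(observations))
    return overlap / len(profile) if profile else 0.0

def find_malware(S, O, threshold=0.8):
    # S: list of (malware, profile) pairs; O: observations from the monitored network
    hits = []
    for malware, profile in S:
        if similarity(profile, O) >= threshold:
            hits.append(malware)
    return hits

# toy example with observations represented as descriptor names
S = [("conficker-like", ["irc_nonstd_port", "tcp_445_syn"]),
     ("other", ["http_no_user_agent"])]
O = ["tcp_445_syn", "irc_nonstd_port", "dns_query"]
print(find_malware(S, O))   # -> ['conficker-like']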

2. Related Work

There are many publications in the area of botnet detection or detection of malware communication. A survey of botnets and botnet detection is given in (Feily, Shahrestani, and Ramadass 2009). The paper (AsSadhan et al. 2009) analyzed the command & control communication of botnets. The authors found out that this traffic shows periodic behaviour that can be used for detection. Many papers focus on particular protocols like IRC, HTTP or P2P. Examples for such papers in the field of IRC are (Wang et al. 2009), (Lu and Ghorbani 2008), (Mazzariello 2008a), (Livadas et al. 2006) and (Lin, Chen, and Tzeng 2009). The field of P2P botnet detection is covered by (Schoof and Koning). In (Strayer et al.) an approach for flow-based classification of IRC botnet traffic is introduced. In their work they focus on behavioural features of malware traffic.

3. Monitoring Network Traffic

The network data we analyzed was captured with a sensor technology called Internet-Analysis-System (Hesse and Pohlmann 2008). This sensor processes network packets and stores counters for the occurrence of different network traffic parameters called descriptors. Therefore it dissects and decodes the different protocols of each packet in a way similar to Wireshark (Wireshark 2010) to extract the values of the defined parameters. Descriptors are similar to signatures in intrusion detection systems, but instead of looking for communication that is typical for a specific malware or remote attack they count more general aspects of different protocols. Examples for these descriptors are the occurrence of a TCP packet with a set SYN flag, a TCP packet with destination port 22, or more complex descriptors like the one that counts the occurrence of IRC communication on a non-standard port using protocol detection with L7-Filters (L7Filter 2010). The advantage of this method over a completely statistical description of network packets, for example with n-grams, is that we have a more comprehensive description of the packets and we can exclude parameters more precisely that are irrelevant for the classification or that lead to bad results. With this method we also add additional expert knowledge to the statistical description, because we are able to compare information with the same semantic meaning even if the same parameters in different packets have different packet offsets because of optional parameters. The result of the counting process are sets of packet measurements represented by descriptor value pairs mp = {(d1, v1), (d2, v2), ..., (dn, vn)} for each network packet. These packet measurements are aggregated in two different ways for the classification process in this paper. One method aggregates all mp of packets that passed the sensor in a given time window. Relations between packets get lost in this case. Figure 2 shows time series created from this data with a time window size of five minutes. The figure visualizes inbound and outbound packets with syn or syn+ack flags over a period of about three days. The rapidly increasing packet amount, shown by the blue series at the last day, shows an anomaly we classified as a weak syn-ack flood from a Chinese server. The hill in the time series just before the syn-ack flood is the normal traffic of a Monday; it is not present on the two days before because there is only low traffic at the weekend at this sensor location.

Figure 2: Time series of descriptor measurements.

Figure 3 shows descriptors and their counter values for one time window in detail as a histogram. This is an example of the amount of different descriptors and their values which can occur in five minutes. The distribution of these descriptors can change strongly between different sensor locations or different temporal conditions like daytime, weekday and so on. For the research of this paper we introduced a second method for aggregating counters by extending the basic functionality of the sensor to make it able to count descriptors per TCP stream to get a statistical description of each stream. A stream comprises all packets with the same set {(ip, port)1, (ip, port)2} where tuple 1 or 2 can respectively be source or target to cover both directions of the communication. The first packet of each stream has the TCP SYN flag set and the stream ends with FIN ACK, Reset or a timeout.

Figure 3: Histogram of descriptor measurements for one time window.
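A simplified sketch of the two aggregation modes (per time window and per TCP stream) follows; the packet fields and descriptor predicates are invented stand-ins for the Internet-Analysis-System's protocol dissection.

from collections import Counter

# illustrative descriptor predicates; the real sensor derives them from full dissection
DESCRIPTORS = {
    "tcp_syn":      lambda p: p.get("proto") == "tcp" and "SYN" in p.get("flags", ()),
    "tcp_dport_22": lambda p: p.get("proto") == "tcp" and p.get("dport") == 22,
    "udp_port_0":   lambda p: p.get("proto") == "udp" and 0 in (p.get("sport"), p.get("dport")),
}

def measure(packet):
    # descriptor/value pairs m_p for one packet (value 1 if the descriptor occurs)
    return {name: 1 for name, match in DESCRIPTORS.items() if match(packet)}

def aggregate_time_window(packets, window=300):
    # sum the packet measurements of all packets within each `window`-second slot
    slots = {}
    for p in packets:
        slot = int(p["ts"] // window)
        slots.setdefault(slot, Counter()).update(measure(p))
    return slots

def aggregate_per_stream(packets):
    # sum the packet measurements per TCP stream, keyed by the unordered endpoint set
    streams = {}
    for p in packets:
        key = frozenset([(p["src"], p["sport"]), (p["dst"], p["dport"])])
        streams.setdefault(key, Counter()).update(measure(p))
    return streams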

4. Identification Process

The process of malware identification can be seen as a two stage process. This is described further in this section. As mentioned in section 1.the profiles pi consist of a set of observations oi . As also mentioned the sensors used for network monitoring generate a set O = {o1 , o2 , . . . , om } of observations from actual network traffic. In the first stage of the

40

tures from the transport layer do not have any positive effect to the detection rate. The reason is that malware is also an application which uses the TCP/IP communication stack of an operating system like normal applications. So there is a great similarity in this features between malware and normal traffic. Even ports cannot be used as unique classifier since they can change between executions of the same binary. Furthermore there are other features on this layer which differs in every execution environment - for example the IP time to live. For this reason we skip that features and focus on features from the application layer and other meta information which can be produced by the sensor. For example the IAS can detect the usage of common protocols on non-standard ports or produce a statistical distribution over the entire payload. First of all we want to give an overview about the observations we want to correlate to each other. As mentioned in the sections before we are collecting statistical network data from normal traffic and malware traffic. We made an analysis about the occurrences of different descriptors in malware and normal communication. For that we fetched malware streams from 3000 malware samples, executed it in our environment and collected the statistical data from that streams. Also we took a 2-day traffic dump from our university traffic to get some normal streams. Then we cut the descriptors from the transport layer descriptors as mentioned before and measured the occurrences of the descriptors. Figure 4 shows a bar plot of the occurrences. The x-axis shows a subset of available descriptors sorted by its occurrence higher than one percent. The y-axis shows the percentage of occurrences in malware and normal streams. As you can see there are several descriptors occurring only in malware traffic or normal traffic. These descriptors should be good features. For example there are many IRC descriptors occurring only in malware traffic and HTTP user agent descriptors only occurring in normal traffic. This is due to the fact that IRC is not a commonly used service in our university network. Also the malware we analyzed does not use common HTTP user agents for HTTP communication. For this reason the usage of our descriptors for identifying malware communication seems useful. The question is which features we have to monitor and how to put them together to get an appropriate detection rate.

Figure 3: Histogram of descriptor measurements for one time window

identification process, the observations from actual network traffic must be matched against the observations that describe the malware behaviour. After this stage we have identified a set of observations O_M = {o_{M,1}, o_{M,2}, ..., o_{M,l}} that could be caused by malware. An observation can be, for example, a single descriptor as described in section 3, or a complete stream that is described by a set of descriptors. This is described in more detail in section 5, where we describe the matching of observations to identify malware. In the next stage we must decide which of these observations correlate with each other and group them according to these correlations. After this we compare the groups to the malware profiles, and if a group is similar to a profile defined for a malware, we have found that malware. The easiest and most helpful correlation in this case is the correlation based on the host in the monitored network: all observations that can be associated with the same host are grouped together. The set of observations O_M is partitioned by the equivalence relation R based on the host of the monitored network, so we get a partition O_M / R = {[H_1], [H_2], ..., [H_Q]}. Against these partitions we match the generated malware profiles. To further improve the matching process in the last stage, we also consider generating profiles for the normal behaviour of a host in the network. This would help to filter out observations that are normal for the host and should leave only anomalous behaviour.
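A minimal sketch of this host-based grouping step is given below; the observation layout (a dict carrying a "host" field) and the similarity callback are assumptions made only for illustration.

```python
from collections import defaultdict

def partition_by_host(observations):
    """Partition O_M by the host an observation belongs to (O_M / R)."""
    classes = defaultdict(list)
    for obs in observations:
        classes[obs["host"]].append(obs)
    return classes

def match_profiles(classes, profiles, similarity, threshold=0.8):
    """Compare every host's observation group against every malware profile."""
    hits = []
    for host, group in classes.items():
        for name, profile in profiles.items():
            if similarity(group, profile) >= threshold:
                hits.append((host, name))
    return hits
```

The similarity function and the threshold stand in for whatever matching method is used in the first stage.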

Time-Window Method with Cross-Correlation

In the first approach we used the time-window method introduced in section 3 to compare the behaviour of malware traffic and network traffic. We used the sensor to generate profiles from malware and network traffic. In this experiment the observations are single descriptors as described in section 3. A malware is then described by a set of such observations; this can be seen in figure 5 for two malwares. Because of our sensor technology we cannot use a simple matching algorithm, since our profiles only describe network behaviour statistically. A simple matching would also be vulnerable to minor changes of the malware communication, caused for example by a higher occurrence of the malware or by behaviour changes. For this reason we first tried a method from signal processing called

5. Classification of Network Data

In this section we show some results of our experiments. So far we have used three different approaches for detection. We especially concentrate on the first stage of the detection process, since this is work in progress and we are at the very beginning of this work. In the next subsections we explain the approaches we used, the experiments we performed and the results we obtained so far. As described in section 3, the sensor technology we use has the capability to analyze the transport-oriented protocols IP, TCP, UDP and ICMP. The analysis of communication generated by malware and normal traffic has shown that fea-


Figure 5: Normal and malware traffic signals.

equation 1 are the observations generated by the sensor and the o_i are the observations from the malware profile. With this method no normalization is needed, because the cross-correlation compares the structure of the data, not the absolute values. We select malware and network profiles at random, apply the function for three different test cases and determine the detection performance rates. In the first test case we compare a network profile to a malware profile to measure the false-positive rate. In the second test case we compare a malware profile with a network profile that includes the same malware profile to measure the false-negative rate. The third test case is similar to the second, but we give the malware profile a higher weight. We repeated the tests with different malware and network traffic and calculated the average performance rates. Table 1 shows our results. In the first experiment we tried to detect the occurrence of malware communication caused by a single malware. As can be seen, the performance is very poor: the algorithm detected only 55 percent of the malware and has problems with false positives and false negatives. One problem is the normal traffic, which overlays the malware communication. Even if the malware communication is included in the traffic, the transmission signal will not correlate with the malware signal if the malware signal is not significant. Further analysis showed that what counts as 'significant' depends on the normal network traffic at the measuring points. One idea to fix this could be to decrease the threshold, but this would produce a high false-positive rate; on the other hand, increasing the threshold would shift the problem to the false-negative rate. So as a next step we tried to eliminate the normal traffic. For this we analyzed different methods for estimating the normal traffic: interval-based arithmetic mean, Holt-Winters, linear regression and smoothing average. Interval-based means that we calculated a mean value for every measuring interval of one day. For example, if we want to estimate

Figure 4: Occurrences of descriptors in normal and malware traffic

cross-correlation (cro 1988). The cross-correlation function calculates a value that describes the similarity between two signals; in other words, the function can determine whether a given signal contains another signal. More formally, \vec{o}_N = (o_{N,1}, o_{N,2}, ..., o_{N,n}) is a vector containing the observations of the network traffic and \vec{p} = (o_1, o_2, ..., o_n) a vector containing the malware observations. Both have the same size and equivalent indices. Figure 5 shows an example of two signals that are compared to each other: the green line shows a transmission containing normal network traffic, the red line shows malware traffic, and each signal element represents a descriptor counter value. Equation 1 gives the calculation of the correlation coefficient. The function calculates a value between zero and one using the values from the given vectors. Zero means that there is no correlation between the network traffic and the malware, whereas a value of one means that there is a very high correlation between them and they are nearly equal. The advantage of this method is that it is insensitive to noise and easy to calculate; on the other hand, the problem is to determine a threshold value.

COR(\vec{o}_N, \vec{p}) = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(o_{N,i} - \bar{o}_N)(o_i - \bar{p})}{\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(o_{N,i} - \bar{o}_N)^2}\;\sqrt{\tfrac{1}{n}\sum_{i=1}^{n}(o_i - \bar{p})^2}}    (1)

For the identification process we order the set of observations generated as a profile for a malware to build a sequence. This sequence is then matched against the actual traffic using the cross-correlation method described before. The o_{N,i} in
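A direct rendering of this coefficient might look like the sketch below (essentially a normalized covariance between the two observation vectors); it is an illustration of equation 1, not the authors' implementation.

```python
import math

def correlation(o_n, p):
    """Correlation coefficient between a network observation vector and a
    malware profile vector of the same length (cf. equation 1)."""
    n = len(o_n)
    assert n == len(p) and n > 0
    mean_n = sum(o_n) / n
    mean_p = sum(p) / n
    cov = sum((a - mean_n) * (b - mean_p) for a, b in zip(o_n, p)) / n
    var_n = sum((a - mean_n) ** 2 for a in o_n) / n
    var_p = sum((b - mean_p) ** 2 for b in p) / n
    if var_n == 0 or var_p == 0:
        return 0.0          # a constant signal carries no structure to compare
    return cov / math.sqrt(var_n * var_p)

# A window would then be flagged when
# correlation(network_counts, malware_counts) exceeds the chosen threshold.
```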


Malware  NT  Pos./Neg.  False-Pos.  False-Neg.
1            0.5565     0.2379      0.2065
1        X   0.6623     0.0286      0.3091
500      X   0.9511     0.0372      0.0117

Table 1: Results of time-window method with cross-correlation

          Correct  Wrong
All data  0.8939   0.1061
Normal    0.8527   0.1473
Malware   0.9999   0.0001

Table 2: Classification results for Neural Networks.

distinguish between the observations. The results show very high false-positive rates because of the cross-correlation's property of being insensitive to small differences. One idea to improve the detection was to treat the streams in relation to each other: we grouped the streams per host and applied the detection process, which improved the detection rate slightly, but not enough to get useful results. As mentioned, this is due to the fact that the difference between the observations is very small. Based on these results we decided to use a different method for the classification of streams, one that does not only consider the structure, as the cross-correlation method does.

the normal traffic from 14:00 till 14:05 on a Monday, we only use the same measuring intervals from the last n weekdays in the past. We found that the arithmetic mean was the best approach for estimating the normal traffic of the test data set. With this in place we repeated our experiment. We also raised the threshold, which causes fewer false positives because the normal communication must be more similar to the malware communication to produce a false positive. The results are shown in the second row of the table. As can be seen, the detection rate increased to 66 percent but is still too low for an effective detection. We analyzed this in more detail and found that the normal traffic is still a problem because our estimation is insufficient. If we multiply the malware communication by 500 within the transmission, the third row shows that we can identify it with a detection rate of 95 percent. This means that we either need a high occurrence of malware communication or a much better method to eliminate normal traffic in order to detect it.
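The interval-based arithmetic mean can be sketched as follows: the baseline for a measuring interval is the element-wise mean over the same interval of the last n comparable days, and this baseline is subtracted from the observed counters. The data layout is an assumption for illustration.

```python
def interval_mean(past_vectors, n=4):
    """Element-wise mean over the same measuring interval of the last n days."""
    recent = past_vectors[-n:]
    if not recent:
        return None
    return [sum(vec[i] for vec in recent) / len(recent)
            for i in range(len(recent[0]))]

def remove_background(observed, past_vectors, n=4):
    """Subtract the estimated normal traffic from an observed counter vector."""
    baseline = interval_mean(past_vectors, n) or [0.0] * len(observed)
    return [max(o - b, 0.0) for o, b in zip(observed, baseline)]
```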

Neural Network for Stream Classification

In this experiment we decided to use a feed-forward neural network for classification. With this approach our observations o_i are streams, and for a first simple experiment we tried to classify whether a stream is a malware stream or not. For this we trained the neural network to distinguish between malware and normal streams. To train the network we chose a training set at random from the data set consisting of about 19000 normal and malware streams. In total the training set consists of 2000 streams, 1000 streams from each class. The data was normalized with the standard score method. After training we gave the whole set to the neural network for testing. With such a trained neural network we get the results shown in table 2. The first row of the table shows the classification rate over all data. The second row shows the classification rate if we only look at the normal streams; this means that around 14 % of all normal streams in the data were classified wrongly. The third row shows the classification rate of the malware streams; only one malware stream was classified wrongly. A deeper look at the results shows that about 35 % of the wrongly classified streams were HTTP, about 5 % were HTTPS, and the remaining 60 % had no application layer protocol we could detect, which results in an empty set of descriptors for these streams. In the next experiment we omitted a part of the descriptors used for describing the streams, namely the descriptors that describe the payload of the application layer (n-grams). The results of this experiment are shown in table 3. In this case most of the wrongly classified normal streams are HTTP (about 70 %); the others are HTTPS (about 30 %). Another interesting observation is that we get many more wrongly classified malware streams. A detailed look shows that here, too, most of the wrongly classified streams are HTTP (about 76 %); the rest are HTTPS (about 24 %).
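The paper does not name the neural network library used; the sketch below reproduces the setup with scikit-learn as an assumed stand-in. X is a NumPy array with one descriptor vector per stream and y holds the 0 (normal) / 1 (malware) labels; 1000 streams per class are drawn for training, descriptors are standard-score normalized, and the whole data set is classified afterwards.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

def train_and_evaluate(X, y, per_class=1000, seed=0):
    rng = np.random.default_rng(seed)
    train_idx = np.concatenate([
        rng.choice(np.where(y == label)[0], per_class, replace=False)
        for label in (0, 1)
    ])
    scaler = StandardScaler().fit(X[train_idx])      # standard-score normalization
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=seed)
    clf.fit(scaler.transform(X[train_idx]), y[train_idx])
    pred = clf.predict(scaler.transform(X))           # test on the whole data set
    correct = (pred == y)
    return {
        "all": correct.mean(),
        "normal": correct[y == 0].mean(),
        "malware": correct[y == 1].mean(),
    }
```

The hidden layer size and iteration count are illustrative choices, not values reported by the authors.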

Stream Method with Cross-Correlation

The main problem of the first approach was that the noise generated by normal network traffic overlays the malware communication. We found that a good detection rate was only possible with a significant amount of malware traffic, and in most cases malware does not generate such a high amount of traffic. To counter this problem we considered methods for eliminating the background traffic. As described in the previous section, an estimation of the background traffic does not work well, so we must describe the communication in more detail. For this reason we changed our view of the communication slightly. In section 3 we described that we look at every packet that passes the sensor separately and sum up the results into one set of descriptors. A closer look at the packet flow that passes the sensor reveals that it consists of packets belonging to different communications, called streams. Instead of describing the packet flow as one set of descriptors, we can describe it as a set of descriptor sets, one for each communication. Each of these sets covers only the packets of this communication, without noise from other communications. Accordingly, the profiles for the malware changed: the single observations that describe the malware are now streams, every stream is described by a set of descriptors, and the profile consists of a set of streams. We also tried the cross-correlation for detection. We did solve the problem with the normal traffic, but ran into a new one: there is very little difference in the structure of the observations, e.g. the used descriptors. The cross-correlation is therefore not able to


          Correct  Wrong
All data  0.9912   0.0088
Normal    0.9928   0.0072
Malware   0.9892   0.0108

Table 3: Classification results for Neural Networks.

6. Feature Analysis

To better understand the results obtained so far, we analyzed the features used for classification. Based on these results we also consider new features that could improve the classification process. This analysis was done on the data set used in the classification tests, which is not the same data set used before to analyze the occurrence of descriptors.

Figure 7: Barplot of divergence (logarithmic)

Divergence Analysis of Features

protocol detection and a part of the payload descriptors. As mentioned, the divergence values for some of the payload descriptors shown on the right side of the bar plot are relatively low; they only reach values of about 10^2, which means their contribution to the classification is very small. As we have seen in section 5, we got better results without these features, but in our experiments we omitted all of them, so we must be more careful about which features to omit. In the next step we compared the values of the features of the malware and the normal traffic samples for every feature. For this we built a histogram of each feature for the malware and the normal samples; an example is shown in figure 8. In most cases we get the result shown in figure 8: the features of the normal data have a broader distribution. This means that if the value of a feature exceeds a limit, we can say the sample stream is normal.

The divergence of a feature i with respect to two classes c_1, c_2 is defined by equation 2, where \sigma_{j,i} is the standard deviation of feature i in class j and \mu_{j,i} is the mean value of feature i in class j. For \sigma and \mu we use the values estimated on the data set.

D(c_1, c_2, i) = \frac{(\sigma_{1,i} - \sigma_{2,i})^2 + (\sigma_{1,i} + \sigma_{2,i})(\mu_{1,i} - \mu_{2,i})^2}{2\,\sigma_{1,i}\,\sigma_{2,i}}    (2)

A high divergence value of a feature means that the feature distinguishes well between the two classes; a low value means that it does not.
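A per-feature implementation of this measure could look like the following sketch; the small epsilon in the denominator is an added safeguard against zero standard deviations and is not part of equation 2.

```python
import numpy as np

def divergence(feature_c1, feature_c2, eps=1e-12):
    """Divergence of one feature between two classes (cf. equation 2)."""
    s1, s2 = np.std(feature_c1), np.std(feature_c2)
    m1, m2 = np.mean(feature_c1), np.mean(feature_c2)
    return ((s1 - s2) ** 2 + (s1 + s2) * (m1 - m2) ** 2) / (2 * s1 * s2 + eps)

def rank_features(X_malware, X_normal):
    """Feature indices sorted by decreasing divergence between the classes."""
    scores = [divergence(X_malware[:, i], X_normal[:, i])
              for i in range(X_malware.shape[1])]
    return sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
```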

Figure 6: Barplot of divergence

Figure 6 shows the divergence of the used features as a bar plot; the two classes are malware and normal streams. In the figure you can see a set of features that have a high divergence. These features describe protocols that only occur in one class and not in the other, for example IRC, which is only used in the malware traffic, or FTP, which only occurs in our normal traffic. This shows that our normal traffic is not representative enough. Figure 7 shows the same plot as figure 6 but with a logarithmically scaled y-axis. From this figure one can see that there are only a few other features with high divergence values, such as the descriptors for

Figure 8: Histograms of feature 657 for malware and normal data.

These results show that more work must be done to identify the right features to get a good classification. Our analysis also shows that the normal data used for the tests is not very representative: not all protocols are included in this data, e.g. there is no IRC traffic. We are also sure that we have not yet seen all kinds of traffic that


is, for example, normal in HTTP. This means we cannot be sure that the classification results shown above (section 5) hold for different normal traffic data. It is also important to describe the payload in much more detail to get the information needed for classification.

we analyzed the data manually. Many of the analyzed malware samples used IRC on non-common ports like 6660 to 6669, so this on its own is a good indicator for botnet traffic, making Emerging Threats Snort signatures (Threats 2010) for IRC commands on non-standard ports very valuable. Nevertheless, there is a huge number of malware samples that use common ports, so this is not the only indicator that should be monitored. Also, in large environments like university networks, IRC on non-common ports could appear continuously, causing a high false-positive rate. During the manual analysis we found two aspects that could help to classify malware traffic. In the first case we saw continuous repetitions of IRC commands like NICK or JOIN with exactly the same parameters. This should not occur in human communication, as humans normally react to ERROR messages and change their behaviour instead of typing the same command twenty times in series. This behaviour can be seen in figure 9. We will therefore use the number of exactly equal command-parameter pairs in a single stream as a new descriptor. We also saw that botnets use different sets of characters for their NOTICE and PRIVMSG command parameters, as they do not talk about human-related topics but only exchange machine-readable commands for attacks and other tasks (see figure 10). This could lead to an abnormal byte distribution in botnet IRC packets and should be analyzed with a new set of descriptors. We will finally use a set of features similar to the one described in (Mazzariello 2008b), as it shows classification rates of one hundred percent. Additionally, we will try to use the byte distribution features.
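The repetition descriptor can be sketched as a simple counter over exactly identical command/parameter lines within one stream; the simplified line format (no IRC prefix parsing) is an assumption made for illustration.

```python
from collections import Counter

IRC_COMMANDS = {"NICK", "JOIN", "PRIVMSG", "NOTICE"}

def repeated_command_count(stream_lines):
    """Number of exact repetitions of command+parameter lines in one IRC stream."""
    commands = [line.strip() for line in stream_lines
                if line.strip().split(" ", 1)[0].upper() in IRC_COMMANDS]
    counts = Counter(commands)
    return sum(c - 1 for c in counts.values() if c > 1)

# Example: three identical JOINs and one NICK yield a repetition count of 2.
# repeated_command_count(["JOIN #chan key", "JOIN #chan key",
#                         "JOIN #chan key", "NICK bot1"]) == 2
```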

Additional Features

Fields of protocols between the physical layer and the transport layer are used to classify packets as scan or flooding traffic and are not analyzed further in this section. The reason is that this paper addresses the analysis of streams in isolation from each other, and we cannot gain more information about a scan or DDoS stream on its own without correlating it with other streams. In this section we document our first results on analyzing application layer protocols transported over UDP and TCP in order to find new features we want to analyze to improve the malware stream classification in further work. We can also use descriptors describing meta information of the streams themselves. The following subsections discuss different protocols and parameters that are promising for better classification results in further work. As one stream normally contains only one application layer protocol, such as HTTP, we can divide the selection of new features into several subsections. Many protocols are already discussed in several papers addressing the classification of packets or streams belonging to malware or normal traffic, and we try to combine features from these papers with our own ideas to get a good overall classification.

TCP-Streams Combined with other parameters, the number of transferred bytes of a stream, the number of packets and the temporal length of streams could be useful. For example, we observed that human-generated IRC streams are longer or contain more packets than botnet IRC streams. Descriptors of this kind could also help to classify encrypted traffic: in some situations it could be possible to identify encrypted malware traffic by packet frequency, packet length or more complex descriptions of packet streams.


IRC IRC is used by many botnets as a communication channel. The analysis of about 3000 pcap files containing only malware communication showed us that this protocol is still used by a significant number of bots in the wild. Abnormal ratios between IRC commands like JOIN, NICK, PRIVMSG and ERROR are already covered by the descriptors we currently use, but there is still room for descriptors that allow a more precise detection of malware streams. To select new features for our classification process we analyzed the traffic of about three thousand malware samples produced in a secure environment. Many of the samples used IRC for communication with their masters. Therefore we focused on extracting good features to differentiate IRC traffic generated by human users from traffic generated by malicious bots. This was done without looking for signatures typical for a specific malware, but for descriptors that allow us to describe the behaviour of the communication partners more generally. As a first step, to get an overview of the botnet IRC communication, we extracted the payload of all TCP streams that appeared to be IRC traffic with tcpflow (Elson 2010). First

JOIN #bociknaet 55
JOIN #bociknaet 55
JOIN #bociknaet 55
NICK Serverdrgiro917
NICK Serverdrgiro917
NICK Serverdrgiro917
...

Figure 9: Example of IRC communication (Equal commands)

HTTP In many cases HTTP was used to load binaries for additional functionality. As these binaries are compiled for the Windows operating system, we should find PE header data (Wikipedia 2010) or typical elements of Windows binaries in HTTP streams that transport binaries. One example is the well-known string "This program cannot be run in DOS mode". Packets that have content types like text/html but contain elements of PE headers could be considered suspicious. New descriptors that count Windows executables transported over HTTP could be helpful for classification. Like the application layer protocol detection we use to detect protocols on non-standard ports, it would be possible to create a content-type detection for the payload of protocols like HTTP or SMTP. This would make it possible to count payloads with wrong content types and to analyze the impact of this descriptor on our classification rates.
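A first version of such a content-type check might look like the sketch below, which flags payloads declared as text that carry typical Windows PE markers; the message layout (a headers dict plus raw body bytes) is an assumption for illustration.

```python
PE_MARKERS = (b"MZ", b"This program cannot be run in DOS mode")

def looks_like_windows_binary(body: bytes) -> bool:
    """Heuristic check for PE header material inside a payload."""
    return body.startswith(PE_MARKERS[0]) or PE_MARKERS[1] in body

def mismatched_content_type(message) -> bool:
    """True if a payload is declared as text but looks like a Windows binary."""
    content_type = message["headers"].get("Content-Type", "").lower()
    return content_type.startswith("text/") and looks_like_windows_binary(message["body"])

# A per-stream descriptor could simply count such mismatches:
# descriptor = sum(mismatched_content_type(m) for m in messages_of_stream)
```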


1988. Elements of Statistical Computing: Numerical Computation.
Elson, J. 2010. tcpflow - a TCP flow recorder. http://www.circlemud.org/~jelson/software/tcpflow/.
Engelberth, M.; Freiling, F.; Goebel, J.; Gorecki, C.; Holz, T.; Trinius, P.; and Willems, C. 2009. Frühe Warnung durch Beobachten und Verfolgen von bösartiger Software im deutschen Internet: Das Internet-Malware-Analyse-System (InMAS). In 11. Deutscher IT-Sicherheitskongress.
Feily, M.; Shahrestani, A.; and Ramadass, S. 2009. A survey of botnet and botnet detection. In The Third International Conference on Emerging Security Information, Systems, and Technologies, 268–273.
Hesse, M., and Pohlmann, N. 2008. Internet situation awareness. In eCrime Researchers Summit, 2008, 1–9.
L7Filter. 2010. L7-filter web page. http://l7-filter.sourceforge.net/protocols.
Lin, H.-C.; Chen, C.-M.; and Tzeng, J.-Y. 2009. Flow based botnet detection. In International Conference on Innovative Computing, Information and Control, 1538–1541.
Livadas, C.; Walsh, R.; Lapsley, D.; and Strayer, W. 2006. Using machine learning techniques to identify botnet traffic. In Annual IEEE Conference on Local Computer Networks, 967–974.
Lu, W., and Ghorbani, A. A. 2008. Botnets detection based on IRC-community. In GLOBECOM, 2067–2071.
Mazzariello, C. 2008a. IRC traffic analysis for botnet detection. In International Symposium on Information Assurance and Security, 318–323.
Mazzariello, C. 2008b. IRC traffic analysis for botnet detection. In IAS '08: Proceedings of the Fourth International Conference on Information Assurance and Security, 318–323. Washington, DC, USA: IEEE Computer Society.
NERC. 2008. Background: Industry advisory CIP: Conficker polymorphic worm.
Schoof, R., and Koning, R. Detecting peer-to-peer botnets.
Strayer, W.; Lapsely, D.; Walsh, R.; and Livadas, C. Botnet detection based on network behavior. In Botnet Detection, 1–24.
Threats, E. 2010. Emerging Threats. http://www.emergingthreats.net.
Wang, W.; Fang, B.; Zhang, Z.; and Li, C. 2009. A novel approach to detect IRC-based botnets. In NSWCTC '09: Proceedings of the 2009 International Conference on Networks Security, Wireless Communications and Trusted Computing, 408–411. Washington, DC, USA: IEEE Computer Society.
Wikipedia. 2010. Portable executable. http://en.wikipedia.org/wiki/Portable_Executable.
Willems, C.; Holz, T.; and Freiling, F. 2007. CWSandbox: Towards automated dynamic binary analysis. IEEE Security and Privacy 5(2).
Wireshark. 2010. Wireshark. http://www.wireshark.org/.

:u. PRIVMSG fegrddno :!get http://stashonline.info/build/setup10.exe
:u. PRIVMSG fegrddno :!get http://down0129.iwillhavesexygirls.com:88/erdown.txt
:u. PRIVMSG fegrddno :!get http://pozeml.com/oc/box.txt
:u. PRIVMSG fegrddno :!get http://pozemle.cn/sv/s2.txt
:u. PRIVMSG fegrddno :!get http://av.lometr.pl/inst.php?lang=de&id=32&sid=0

Figure 10: Example of IRC communication (Divergence from human communication)

Peer to Peer Peer-to-peer detection by analyzing flows independently of each other seems more and more complicated, as botnets do not use standard ports and nowadays encrypt their P2P traffic, like the Conficker worm does. We therefore will not concentrate on detecting P2P at this detection stage, where we analyze streams separately, because detecting P2P is very challenging even when correlating multiple streams and the behaviour of hosts in a network.

7. Further Work

In the next steps we want to analyze the usage of the additional features described in section 6 to improve the classification performance. Another important aspect is the collection of more representative test data. We need a wider range of protocols and different traffic behaviour in the normal data; in particular, we need normal data from protocols that could also be used by malware. This prevents distorted classification results caused by a protocol used by a malware not being contained in the normal data. We also have to spend more effort on the second stage of the identification process; as mentioned in section 5, we have spent little time on this topic so far. For this we think it is necessary to pre-classify the streams into one of the categories scan, flooding, and the different kinds of application layer protocols. The reason is that this gives a more detailed description of the malware behaviour, with which we hope to get better detection rates.

References

AsSadhan, B.; Moura, J. M.; Lapsley, D.; Jones, C.; and Strayer, W. T. 2009. Detecting botnets using command and control traffic. In IEEE International Symposium on Network Computing and Applications, 156–162.
Bächer, P.; Kötter, M.; Holz, T.; Dornseif, M.; and Freiling, F. C. 2006. The nepenthes platform: An efficient approach to collect malware. In Proceedings of the 9th Symposium on Recent Advances in Intrusion Detection (RAID'06), 165–184.


Efficient Automated Generation of Attack Trees from Vulnerability Databases

Henk Birkholz, Stefan Edelkamp, Florian Junge, and Karsten Sohr
TZI, University of Bremen

Abstract

approaches (Qin and Lee 2004; Yu and Frincke 2007), however, try to predict the probability of related near-future security events based on the recognition of attack plans, which contain individual attack steps, final attack goals and possible pre- or post-conditions. It is possible to derive attack plans manually from preprocessed security events in order to match them with upcoming sequences of live events. The ultimate goal, though, is to provide a set of plausible (and probable) attack plans corresponding to occurring security events, without the need to detect a similar sequence of real events beforehand. In this paper we develop an efficient step-by-step process to generate attack trees that are suitable for further automatic processing by security management systems. The configuration and properties of these attack trees are focused on computer networks and largely based on requirements coming from SIEM systems utilizing plan recognition. The paper is structured as follows. We first discuss attack trees, attack graphs and communication graphs in our setting and outline their known properties and issues. We show how these apply in the context of this paper and which of them can be mitigated. Afterwards we describe the formal definition of the graph variants we use in the step-by-step construction of attack trees and analyze the resulting computing costs. Next, we provide computation times of a prototype implementation. Finally, we conclude and discuss interface standardization and future research directions.

In this paper we generate attack trees, in the form of unfolded attack graphs, from vulnerability databases fully automatically. The inference algorithm merges this information with knowledge about the computer network, the software installed, and firewall logs to generate a vulnerability graph of the computer network. From this graph an attack tree is automatically extracted by applying a modified version of Dijkstra's shortest path algorithm that detects highly vulnerable attack paths. The derived attack tree is extended to process firewall rules and can be labeled with additional information that is contained in the vulnerability database. Theoretical considerations show that – under a plausible set of assumptions – attack tree generation is a linear-time operation. Moreover, experimental results on randomly generated computer networks illustrate that the approach scales well.

Introduction

In order to ensure information security and to establish a corresponding risk management in today's organizations, various security measures are commonly implemented. One option is attack trees, which exploit knowledge about the possible plans of an intruder in a computer network. This paper proposes attack trees as a viable interface between two of the major types of security management systems. Vulnerability Management Systems (VM systems) combine infrastructure audits and CERT advisories as aspects of asset management, in order to enhance the transparency of current risks. Security Incident and Event Management Systems (SIEM systems) consolidate the continuously growing number of monitored security events. They aggregate events based on manually encoded complex rule sets or (semi-)automated methods from Artificial Intelligence such as plan recognition and causal event correlation. To increase the informative value of singular security events regarding specific infrastructure assets, it is common to associate them with additional environment information, which can be provided by up-to-date VM systems. This includes protection requirements, identified vulnerabilities and misconfigurations, incident histories or risk scoring. Sophisticated SIEM systems like ArcSight1 and related

Concept

The information involved in the generation process of attack trees ranges from basic host information and associated vulnerabilities to reachability information derived from the network topology and firewall policies. VM systems are already able to provide detailed information obtained by automatic vulnerability scans, semi-automated security audits and software management databases. This information can be accessed on a per-host basis in combination with inventory databases or asset management systems. The number of systems involved shows the growing complexity of acquiring extensive and relevant information in order to assess the current condition of a given network. Unfortunately, not only the complexity of graph computation increases, but also the amount of incoming events a SIEM system has to process. Our proposed solution focuses on the

www.arcsight.com


reduction of graph complexity while maintaining the graphs' usefulness in the context of SIEM systems.

Attack graphs are commonly distinguished by their definition of nodes and edges (Lippmann and Ingols 2005). In this paper nodes represent hosts in a computer network. Edges are directed and represent network vulnerabilities which can be exploited from the source host. Hosts may offer services voluntarily, which can be vulnerable to exploit attacks or through misconfiguration, or involuntarily, through backdoors implemented by malware. Attacking services of a host by exploiting an existing vulnerability can cause a change in the network state of the targeted host. The primary goal of an attacker regarding a specific host often depends on the host's relative position in the attack path. A host as final target can be subject to the compromise of virtually every security requirement. Usually, compromising the availability of an intermediary host in an attack path is not a valid goal: non-available hosts in the attack path cannot function as the origin for subsequent attack actions, and the corresponding attack path would be rendered invalid. The generation of attack graphs faces specific challenges. Lippmann et al. describe general problems and limitations regarding the computation of attack graphs: obtaining attack details, scaling to large networks, and computing reachability as a basis for attack graph generation (Lippmann and Ingols 2005). Given the application-specific scope of attack trees in this paper, not all of these limitations are critical. Basic attack details can be extracted from observed security events. These events are generated, for example, by network and host intrusion detection systems (NIDS and HIDS). These systems already apply abstract attack information in the form of signatures, anomaly indicators, thresholds or trained classifiers. Sources of attack details independent from observed attacks are, for example, CERT advisories and vulnerability databases. Both vulnerability databases and CERT advisories usually employ standardized and semi-formalized data structures2. Standardization of scoring (e.g., the metric value in US-CERT vulnerability notes3) is no simple task: most of the values are only approximations, which can lead to difficulties in comparing them. In our solution, we simply employ the CVE-based (Common Vulnerability and Exposure standard (Martin 2004)) service offered by the National Vulnerability Database (NVD) project, and especially the Common Vulnerability Scoring System (CVSS), to obtain vulnerability information. CVSS encodes metrics which describe the characteristics of vulnerabilities. NVD employs the Base Metric Group, which contains two sets of metrics as intrinsic parameters. The first set defines the metrics access vector, access complexity and authentication and can be interpreted as a precondition. The second set models the severity of impact (e.g., the degree of compromise after a successful attack) for each of the security requirements and can be interpreted as a postcondition. Matching pre-conditions (access metrics) and post-conditions (impact metrics) specifies the prerequisites under which a series of attacks can actually be conducted. It can be used as an op-

Attack Trees

Attack trees were made prominent by the work of Bruce Schneier (Schneier 2001; 1999). Their primary use is to evaluate and estimate security issues in a computer network. Roughly speaking, an attack tree is an AND/OR tree annotated with different attributes. Attributes can be Boolean, to denote whether a certain condition holds, or numerical, to judge quantities like time, cost or security impact. Originally, attack trees were designed to be assembled manually: the root node represents a specific attack goal, leaf nodes represent different attacks and intermediary nodes constitute partial attack goals. In contrast, the attack trees we propose are ultimately based on vulnerability and network topology information. Intermediary nodes represent partial attacks in the context of a larger multi-stage attack. Both the root node and the leaf nodes can either represent the starting point or the final target of an attack. Attack paths are simple attack plans. They can be used to assist a process of modeling new attack plans, or to enrich a real-time risk analysis of correlated security events concerning specific hosts and possible (or reachable) final targets in a multi-stage scenario (Yu and Frincke 2007). Automatically computed attack trees, on- or off-line, are a means to efficiently provide attack paths. Depending on the nature of a security event a SIEM system processes, different views on these attack paths (representing a sequence of vulnerability exploits) are interesting, e.g.:
• a tree representing possible attack paths from multiple starting points to one target,
• possible paths from one starting point to multiple targets, or
• all possible paths from one starting point to one final target, including a certain amount of detours in the attack plan.
The automatic generation of attack trees based on vulnerability databases poses several requirements on the inference algorithm. First of all, vulnerability score values in common vulnerability databases are integers, often in some small range [0, ..., C]. The higher the score, the more vulnerable a system is; the lower the score, the harder an attack is. Our goal is to find highly vulnerable paths, so we minimize the total of the edge costs on the path, with edge cost defined as C − score.

Attack Graphs

Attack trees are derived from attack graphs, which identify critical bottlenecks in computer networks by showing ways an attacker can compromise hosts (Ingols, Lippmann, and Piwowarski 2006). The conventional use is to prioritize the implementation of efficient counter-measures concentrated on the identified bottlenecks in the attack graph. To a large extent this is a manual process, often resulting in security policies wrt. bottlenecks (gateways) or hosts with high security requirements.

2 www.cve.mitre.org/docs/docs2005/transformational_standards.pdf
3 www.kb.cert.org/vuls/html/fieldhelp


Visibility Graph and Vulnerability Graph over the nodes Attacker, Firewall, Workstation 1, Workstation 2 and Backup Server; in the vulnerability graph the edges carry the CVSS scores of their target hosts (2.3, 4.2, 6.8, 7.7).

Figure 1: Graph Transformation.

edges denoting network visibility. That is, for the inference process we start with a graph G = (V, E), where V is the set of host nodes and E ⊆ V × V denotes network visibility.

Definition 1 (Communication Graph, Attack Path) A communication graph is defined as an undirected graph G_C = (V_C, E_C), where V_C is a set of hosts and E_C ⊆ V_C × V_C denotes the interconnections over which hosts can communicate. The attacker's start location s and the intrusion target host t are part of G_C, i.e., s, t ∈ V_C. An attack path in G_C is a sequence of nodes (s = v_0, v_1, ..., v_k = t) with v_i ∈ V_C, i ∈ {0, ..., k}.

We may assume at least one existing attack path. For the sake of simplicity the edges are undirected in this generation step, as we do not impose a direction for communication once it is established. A simple example of a communication graph in which a firewall separates the attacker from the internal network is provided in Fig. 1 (top). Additionally, we have a list of installed software, provided by software management and VM systems. This list includes the version of the software or installation packages (automatically derived or made available through asset management systems), because vulnerabilities are often exploited in specific code and binary versions only. The list is modeled as a mapping from V to some set of software S. Next, we have the vulnerability databases, which – in a first approximation – can be thought of as a mapping from S to a Boolean value. In a more detailed view, each vulnerability

Implementation

The automated construction of attack trees is cast as a graph search problem. This connects modeling the network with the algorithmic generation of the attack tree. In the simplest form the network is a set of hosts, connected via undirected


can be associated with a set of attributes, which denote, e.g., its kind, its costs and its probability. Each attribute is an individual mapping, calling for multi-objective optimization. Attributes can, however, be superimposed into a single number that indicates how vulnerable a piece of software installed on some host computer is. In our approach we use the set of attributes only to annotate the attack tree once it is generated. For the construction phase, we merely look at the CVSS impact score associated with the software installed on the system(s). The complexity is dominated by the lookup of potentially vulnerable software in the database, which is supposed to be fast (constant time per edge). If there are no vulnerabilities, a host is supposed to be safe4. For further processing, edges are assigned weights according to the computed vulnerabilities. Moreover, edges are now directed, pointing to the host that has vulnerable software installed; in case both hosts are vulnerable, a backward edge is introduced. More precisely, the weight of an edge is the highest CVSS score from the set of vulnerabilities of a host (with metric access vector and assertion network or adjacent network). Such network vulnerabilities are the most dangerous to a host and the most likely to be exploited by an attacker. The higher the score, the more vulnerable the host. As absent vulnerabilities correspond to infinite cost of exploiting them, without loss of generality we may assume a connected graph of vulnerabilities, so that for all possible attacker locations s in V and all possible hosts t to be attacked we have a path from s to t.
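The construction of the vulnerability graph from the communication graph can be sketched as follows; the data layouts are assumptions, and this Python rendering is illustrative only (the authors' prototype is written in Ruby).

```python
def host_score(host, installed, cvss):
    """Highest CVSS score over the software installed on a host (0 if none)."""
    return max((cvss.get(software, 0.0) for software in installed.get(host, [])),
               default=0.0)

def build_vulnerability_graph(visibility_edges, installed, cvss):
    """Directed, weighted edges towards every endpoint with vulnerable software."""
    edges = {}
    for u, v in visibility_edges:              # undirected communication edges
        for src, dst in ((u, v), (v, u)):
            score = host_score(dst, installed, cvss)
            if score > 0:                      # only vulnerable hosts get incoming edges
                edges[(src, dst)] = score
    return edges
```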

Optionally, it is possible to reduce the size of the graph by eliminating edges which do not pass a filter that matches pre- and post-conditions of vulnerabilities. As the number of pre- and post-conditions in CVSS is bounded by 3, this filtering with 3 × 3 operations per edge does not affect the linear time complexity. The vulnerability graph for the example network is depicted in Fig. 1 (bottom). We have assumed that the CVSS score is 7.7 for the backup server, 6.8 for workstation 1, 4.2 for workstation 2, and 2.3 for the firewall. It is immediate that the vulnerability graph is of high importance for both the attacker and the security administrator. The size of the graph, large branching factors, and the presence of low score values make it rather unattractive to work with in practice. For concentrated work, it is better to compact the more plausible attack paths into trees, i.e. weighted acyclic graphs with one root node from which every other node is reachable on a unique path.

Attack Tree Inference

Once the vulnerability graph is generated, the next step is to infer an attack tree. Attack trees as introduced by Bruce Schneier are a common tool in security administration, but are usually manually encoded. Here, we aim at inferring attack trees from vulnerability graphs fully automatically.

Definition 3 (Attack Tree) An attack tree T_A = (V_A, E_A, w) is a weighted AND/OR tree of |V_A| nodes and |E_A| = |V_A| − 1 edges, where nodes are either labeled OR or AND. The weight function is a simple numerical score, optionally together with a vector of vulnerabilities.

Definition 2 (Vulnerability Graph, Score) A vulnerability graph G_V = (V_V, E_V, w) is a weighted and directed subgraph of the communication graph G_C, i.e., V_V ⊆ V_C and E_V ⊆ E_C. For the set of vulnerable hosts V_V, the set E_V ⊆ V_V × V_V denotes the vulnerabilities that can be exploited in the network, with w : E_V → {0, ..., C} being the induced vulnerability score.

For the network exploits we are mainly interested in, the attack tree is a mere OR-graph with {v ∈ V_A | v is an OR node} ⊆ V_V.

One problem in applying graph algorithms like Dijkstra's single-source shortest path procedure (Dijkstra 1959) to convert the attack graph into a tree is that we are naturally interested in attacks for which vulnerability scores are high, while Dijkstra's algorithm is designed to find shortest paths minimizing the sum of the edge weights. Fortunately, we can work with reweighted edge costs (such as w'(u, v) = 10 − w(u, v)) and benefit from the fact that scores are bounded. These reweighted costs match the interpretation that a vulnerability score value of 10 imposes virtually no effort (e.g., in time, money, computational power, etc.) on the attacker, while a score of 0 imposes a large effort to be exploited5. Once the graph is computed, re-weighting is a linear-time operation. Since the resulting edge weights are not negative, Dijkstra's algorithm remains correct. For general scores, by using Fibonacci heaps the running time of Dijkstra's algorithm is O(|V_V| log |V_V| + |E_V|) (Cormen, Leiserson, and Rivest 1990). As weights are bounded by a constant, linear-time algorithms with at most O(|V_V| + |E_V|) operations apply. This linear time

Function w is induced, since the vulnerability score is defined for hosts, so that for all edges (u, v) we have w(u, v) = score(v). Based on the limited value range of entries in the CVSS vulnerability database (maximal score 10 using at most two decimals), by scaling the scores with a factor of 100 we can safely assume that w(u, v) is an integer bounded by C = 1,000. This is important for the efficiency of the underlying graph search algorithms, since bucket-based data structures result in practically linear-time shortest-path search algorithms: 1-level buckets lead to a running time of O(C · |V_V| + |E_V|), 2-level buckets to O(√C · |V_V| + |E_V|), radix heaps to O(log C · |V_V| + |E_V|), and refined implementations to O(√(log C) · |V_V| + |E_V|) steps in the worst case (Ahuja et al. 1990). For ease of reading, we stick to the fractional representation of the vulnerability scores. As applying a filter to an edge can be considered an O(1) operation, applying the filter to all edges is possible in linear time.

4 In the scope of risk management there is always a certain amount of residual risk; e.g., a workstation is probably never to be considered safe.
5 The minimization we aim at is different from finding the longest path wrt. w, which is NP-hard (Björklund, Husfeldt, and Khanna 2004).



complexity is optimal, as it means that finding the shortest path tree is as fast as reading the input graph. The next issue is that we usually do not know the attacker's location in the early stages of intrusion detection (although this can improve after observing further security events, narrowing down further attack actions and thus starting points by SIEM systems), so that we would have to run Dijkstra's algorithm in forward fashion for each possible location. Fortunately, we can take the inverse of the graph G_V = (V_V, E_V), in which each edge is reversed, i.e., E_V' = {(u, v) | (v, u) ∈ E_V}. In the inverse graph we only need one run, namely the one starting at the host we are analyzing and trying to protect. The inversion of a graph is a linear-time operation. Instead of quadratic time complexity O(|V_V| · (|V_V| + |E_V|)) we are back to linear time complexity O(|V_V| + |E_V|). We can thus efficiently compute (backward) shortest path trees for each critical host and (forward) shortest path trees for each possible attacker location where an exploit has been reported, in linear time and space. Another problem is that each host appears only once, so that no alternative attack path is contained in the shortest path tree. Even if we could spot the attacker, we wish to know several plausible paths to predict his course of actions. Our compromise between representing singleton paths to the target and representing an exponential (in acyclic networks) or even an infinite number (in cyclic networks) of paths is based on two shortest path trees, one rooted at the attacker and one rooted at the target. The algorithm for generating attack trees executes Dijkstra's single-source shortest paths search on the original and on the inverse vulnerability graph. The shortest path tree rooted at the attacker is transformed into a container of paths that can be used to extend the shortest path tree rooted at the target host so that it definitely ends at an attacker's location. By Bellman's principle of optimality applied to Dijkstra's algorithm we know that subpaths of optimal paths are optimal. Moreover, by the triangle inequality of shortest paths, for all intermediate nodes i we have δ(a, i) + δ(i, t) ≥ δ(a, t) and |δ(a, i) − δ(i, t)| ≤ δ(a, t), with δ denoting the shortest path distance. The basic idea to generate an attack tree is to merge the forward and the backward shortest path trees. One tree remains intact and the other one is converted into a set of (root-to-leaf-node) paths, which are then selected to be attached to the other shortest path tree. To avoid multiple appearances of the attacker location a on one path, in the shortest path search starting at t we do not enqueue the successors of a, that is, we avoid expanding this node. Next, we use the data structure of paths to pad the paths starting at the attacker onto the shortest path tree leading to the target host. As one shortest path tree is expanded into a set of paths, the merging step is mainly determined by the total path length in the shortest path tree, which can be assumed to be linear, since the vulnerability graphs are known to have a small constant diameter d = max_{u,v} {δ(u, v)}. This implies that the running time for padding, which expands one shortest

Shortest Path Tree and Attack Tree over the nodes Attacker, Firewall, Workstation 1, Workstation 2 and Backup Server, with the edge weights taken from the vulnerability graph (2.3, 4.2, 6.8, 7.7).

Figure 2: Conversion of Trees.

path tree to a set of paths from the root node to the leaves, can still be assumed to be a linear-time operation. This transformation is visualized in Fig. 2: we see a small shortest path tree and its extension to an attack tree that has been turned upside-down. We summarize the theoretical observations on the running time of the construction in the following result.

Theorem 1 (Complexity Attack Tree Generation) Assuming the diameter d, the maximal score value C and the potentially vulnerable software installed on a host to be bounded by small constants, for a given pair of attacker and target locations the generation of an attack tree T_A can be obtained in time O(|G_C|), i.e., linear in the size |G_C| = |V_C| + |E_C| of the communication graph G_C.

Proof. After computing the score value based on the software installed on a host, the transformation of G_C into G_V is available in time O(|G_C|). As C is bounded by a constant, shortest path search both in backward and in forward direction, including computing the inverse graph, is available in O(C · |V_V| + |E_V|) = O(|G_V|) = O(|G_C|). As d is bounded by a constant, the conversion of one shortest path tree into a set of paths as well as the padding onto the other shortest path tree is available in time O(d · |V_V|) = O(|G_V|) = O(|G_C|). □
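The core of the inference can be sketched as Dijkstra runs over reweighted edges; this is an illustrative Python rendering of the procedure described above, not the authors' Ruby prototype, and the graph representation is an assumption.

```python
import heapq
from collections import defaultdict

MAX_SCORE = 10.0

def dijkstra(edges, source):
    """edges: {(u, v): cvss_score}; returns distance and predecessor maps."""
    adj = defaultdict(list)
    for (u, v), score in edges.items():
        adj[u].append((v, MAX_SCORE - score))   # reweighting keeps costs non-negative
    dist, pred = {source: 0.0}, {}
    queue = [(0.0, source)]
    while queue:
        d, u = heapq.heappop(queue)
        if d > dist.get(u, float("inf")):
            continue                            # stale queue entry
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], pred[v] = d + w, u
                heapq.heappush(queue, (d + w, v))
    return dist, pred

def inverse(edges):
    """Reverse every edge, for the backward run rooted at the target host."""
    return {(v, u): score for (u, v), score in edges.items()}

def most_vulnerable_path(edges, attacker, target):
    """Extract one highly vulnerable path from the forward predecessor tree."""
    _, pred = dijkstra(edges, attacker)
    path, node = [target], target
    while node != attacker:                     # assumes the target is reachable
        node = pred[node]
        path.append(node)
    return list(reversed(path))
```

The backward shortest path tree would be obtained by calling dijkstra(inverse(edges), target); the padding step then attaches paths from the attacker-rooted tree to it, as described in the text.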



obtained in time linear in the size |G_F| = |V_F| + |E_F| of the firewall graph G_F.


Local Exploits So far we have only looked at network exploits applicable to hosts. Additional local exploits on a system the intruder has entered would usually result in AND-nodes in the attack tree. Nonetheless, they can usually be compiled into a sequential execution, enforcing that first the network exploit and then the local exploit is executed. As indicated earlier, our implemented solution instead tries to match post-conditions of local exploits to pre-conditions of subsequent exploits to generate a filter that can be applied to eliminate edges from the vulnerability graph before the inference mechanism applies.


Experiments Our benchmarking is conducted on (semi-)randomly generated vulnerability graphs, thus skipping the process step of constructing a communication graph. User-specified absolute upper limits like the number of vertices, the number of edges or the number of vulnerabilities define the configuration of the generated test graphs. The resulting computer vulnerability network graph is formalized in dot notation, and thus has a natural visualization interface. The input information on software installations (and thus the corresponding vulnerabilities per host) is also provided as a plain ASCII file. For this prototype the entire implementation is written in the scripting language Ruby. The NVD is stored in an XML database, which is queried by XPath expressions in order to retrieve vulnerability data. The computer we used for the experiments has an Intel Core2Duo 1.6 GHz and is equipped with 3 GB RAM. The communication vulnerability graph as well as the software installed is randomly sampled: until some stopping criterion like the absolute number of vertices or the number of vertices per host is met, two nodes are sampled and connected by drawing an edge.6 The test results confirm an almost linear progression of computing time in the step-by-step generation process. The slight super-linearity reflects that we have used binary heaps instead of a bucket implementation of the shortest path search, resulting in an implementation that performs O((|V| + |E|) log |V|) instead of O(|V| + |E|) steps. By efficiently employing low-cost operations like SSSP (single source shortest path) search, ad-hoc generation of attack trees stays a viable option. Both shortest path and backward shortest path trees correlate in computing time. Fig. 4 shows computation times in relation to an increasing number of vertices of the given vulnerability graph. The total runtime of the algorithm can be seen in Fig. 5. The time cost for padding is also shown in Fig. 4 and demonstrates the applicability as an


Figure 3: Integration of Firewall Information (firewall graph between Attacker, Firewall, Webserver and Workstation 1 with port/protocol edges 22 TCP and 88 TCP, and the derived communication graph).

Integration of Firewall Information

Computer networks are usually protected by firewalls using a set of rules on the communication protocols used and the ports opened. In the following we extend attack tree generation to this setting. The firewall graph is generated on top of the communication graph, formed by the set of ports policed by the firewall and the protocols it allows to be forwarded or routed.

Definition 4 (Firewall Graph) The firewall graph G_F = (V_F, E_F) is a graph of nodes V_F and edges E_F, where edges are assigned pairs (o, p) of ports o ∈ [0, ..., 2^16 − 1] and protocols p ∈ {TCP, UDP, ...}.

Based on the set of firewall rules (e.g., obtained from iptables), a firewall graph GF can be transformed into a communication graph GV, for which we already know how to derive attack trees. The translation is exemplified in Fig. 3 and can be seen as a precomputation step for the derivation of the attack tree(s). As firewall information acts like a filter, the time complexity of the inference algorithms will be affected by the time for matching the communication infrastructure against the firewall rules, which is assumed to be an efficient operation.
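A minimal sketch of the firewall filter as such a precomputation step; the edge and rule layout are assumptions made for illustration:

```python
def apply_firewall_rules(comm_edges, allowed):
    """Filter communication-graph edges against firewall rules.

    comm_edges: iterable of (src, dst, port, protocol) tuples.
    allowed:    set of (port, protocol) pairs the firewall forwards or routes.
    Only edges whose (port, protocol) pair is permitted survive.
    """
    return [(s, d, o, p) for (s, d, o, p) in comm_edges if (o, p) in allowed]

# Example: apply_firewall_rules([("attacker", "webserver", 22, "TCP")],
#                               {(22, "TCP"), (88, "TCP")})
```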

6 We are aware that there are many other and more realistic options to sample the graphs, e.g., a spanning tree may be constructed connecting every unprocessed node with an undirected edge to the existing tree, plus a few edges to produce cycles. As we expect the approach to be adapted to an asset management system, we will not dwell on different network topologies.

Corollary 1 (Attack Trees from Firewall Graphs) Assuming the matching of the firewall rules to an edge to be bounded by a constant, for a given pair of attacker and target locations the generation of an attack tree TA can be obtained in time linear in the size |GF| = |VF| + |EF| of the firewall graph GF.


Figure 4: Performance of Attack Tree Inference, Individual Timings (time in s vs. number of vertices in the vulnerability graph; series: all nodes to final target, padding final target, attacker to tree nodes, padding attacker).

Figure 5: Performance of Attack Tree Inference, Total Timings (time in s vs. number of vertices in the vulnerability graph).

Table 1 details the measured time costs for vulnerability graphs of different sizes. The number of vertices in the generated attack trees (including padding with inverse paths) also increases linearly with the number of vertices of the input vulnerability graph (Fig. 6). This keeps the number of paths small enough to remain computable, while still including a certain amount of path variation for a specific attack goal, taking into account the different approaches an attacker can use to reach a specific final target in the network.

Graph Size   Time SSSP Target   Time SSSP Attacker   Total Time
200             0.15      0.26      0.55
400             0.92      0.84      2.14
1,000           6.46      8.11     15.12
1,200           8.71     10.16     20.17
1,400          10.56     11.52     23.22
1,600          11.35     11.72     24.48
1,800          21.12     26.92     49.57
2,000          16.22     19.32     37.36
2,200          29.64     36.94     68.52
2,400          27.23     30.52     60.15
2,600          37.25     41.11     81.04
2,800          52.03     64.05    118.52
3,000          54.00     63.03    119.75
3,200          71.63     88.59    163.34
3,400          55.29     61.65    120.13
3,600          79.71     86.45    170.10
3,800         109.59    133.73    247.06
4,000         100.06    116.67    220.66
4,200         110.18    132.28    246.67
4,400         132.40    164.45    301.17
4,600         118.48    130.96    254.61
4,800         175.66    222.50    403.50
5,000         160.47    178.00    343.92
5,200         171.02    203.74    379.64
5,400         179.36    201.29    387.07
5,600         241.14    300.60    547.93
5,800         269.54    337.42    614.22
6,000         251.79    306.79    566.37
6,200         245.96    285.12    539.18
6,400         259.43    271.12    539.27
6,600         325.66    406.68    740.24
6,800         184.61    180.76    376.81
7,000         247.30    266.29    523.68
7,200         370.06    471.40    849.31
7,400         312.11    354.91    675.29
7,600         411.94    479.50    900.31
7,800         351.21    380.23    740.80
8,000         480.72    589.63  1,079.25
8,200         355.84    380.11    745.64
8,400         478.71    546.91  1,034.80
8,600         470.53    532.20  1,013.16
8,800         362.68    374.19    750.50
9,000         710.09  1,055.45  1,776.10
9,200         424.82    467.58    904.26
9,400         535.38    587.01  1,135.53
9,600         666.24    797.51  1,475.18
9,800         759.68    906.43  1,676.44
10,000        727.18    911.52  1,652.68

Table 1: Performance of Attack Tree Generation on Random Graphs (times in seconds).

Related Work One of the closest matches we found is (Paul, Wijesekera, and Kaushik 2002), which models the problem of an attacker reaching a target as a graph problem. The work refers to precursors (Sheyner et al. 2002) that use model checking with SMV for state space exploration, which are reported to have difficulties scaling. Besides network exploits N that define



the edges in the graph, the authors take into account access privileges encoded in the form of attributes A and derive approximate pre- and postcondition matching algorithms for generating partial attacks that are polynomial in |N| and |A|. In a preprocessing step the graph based on A and E is constructed and annotated, and in a second step, with respect to a set of satisfied attributes, algorithms for finding one, all, or a minimum partial attack are given and analyzed. In general the minimum cardinality attack problem is NP-hard (Sheyner et al. 2002). In contrast, our attack tree set generator provides an infrastructure to derive plausible paths of an attacker that look most promising for the attacker and most dangerous from the target's perspective. Ritchey et al. introduced a network security model called Topological Vulnerability Analysis which is similar to attack graphs and includes layer 2 information in the representation of network topologies (Ritchey and Noel 2002). This results in a more realistic representation of given networks, but also in a considerable increase of graph complexity. An exploit model is used as a basis for mapping pre- and postconditions in a connectivity matrix. While the exploit model approach is similar to our proposed filter matrix based on CVSS metrics, it is difficult to match it with the corresponding security requirements of availability, integrity, and confidentiality utilized in risk management systems. In (Yu and Frincke 2007) the authors propose a statistical variant of Colored Petri Nets (CPNs). Such Hidden Colored Petri Nets share similarities with Hidden Markov Models (Rabiner 1989) and include transitions to CPNs that are not directly observable. Hidden CPNs are used to discover and predict actions of an intruder from observed alerts. For alert confidence fusion Dempster-Shafer theory (Shafer 1976) is used. The advantage of Colored Petri Nets is detecting concurrent processes, including attacks like a distributed denial of service attack. The disadvantage is that the model itself has to be known as an input. We share treating the alert correlation problem as an inference problem and the general aim to use a data structure underneath to predict the next action of the intruder.

Figure 6: Attack Tree Growth wrt. Input Graph Size (number of vertices in the generated tree vs. number of vertices in the vulnerability graph).

Nonetheless, we are mainly concerned about constructing a deterministic model rather than drawing inferences on a non-deterministic one. Efficient generation of attack graphs is demonstrated in (Ingols, Lippmann, and Piwowarski 2006). The paper has a strong focus on scalability concerns, addressing the reduction of computing time by using simplified multiple-prerequisite graphs. While the runtime scales linearly and satisfies increasing demands in ad-hoc availability of attack graphs, the output graphs are intended to be processed manually. The goal there is to assist defenders in evaluating the security impact of infrastructure and policy changes. Bhattacharya and Ghosh (Bhattacharya and Ghosh 2008) proposed to employ AI planning algorithms by casting the construction of attack graphs as a planning problem. The computation focuses on pruning redundant attack paths while generating the graph itself. The required domain information however – including vulnerabilities, network topology, and security policies – has to be assembled manually and then converted into PDDL (Planning Domain Definition Language) (McDermott 2000).

Conclusion We have presented a promising approach to generate attack trees, in the form of unfolded attack graphs, fully automatically, given the topology of the computer network and the software installed together with the information provided in vulnerability databases. Time-space trade-offs have been analyzed, and practical experiments on random graphs show that the approach is fast. We expect that the derived efficiencies, together with a container for pre-computed shortest path trees to be queried with attacker-target pairs, will lead to a new generation of highly responsive SIEM systems. In the future we will extend our approach to the inputs generated by network scanners like Nessus and more complex asset information maintenance systems. Given that the prototype is written and interpreted in Ruby, we expect much better performance using compiled imperative programming languages together with efficient graph search libraries.

Acknowledgements This work was supported by the German Federal Ministry of Education and Research (BMBF) under grant 01IS08022A. We thank the anonymous reviewers for their critical comments, which greatly helped to increase the quality of the paper.

References
Ahuja, R. K.; Mehlhorn, K.; Orlin, J. B.; and Tarjan, R. E. 1990. Faster algorithms for the shortest path problem. Journal of the ACM 37(2):213–223.
Anderson, R. 2001. Security Engineering. Indianapolis, Ind.: Wiley, 1st edition.
Bhattacharya, S., and Ghosh, S. K. 2008. An attack graph based risk management approach of an enterprise LAN. Journal of Information Assurance and Security 3.


Björklund, A.; Husfeldt, T.; and Khanna, S. 2004. Approximating longest directed paths and cycles. In Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP).
Cheung, S.; Lindqvist, U.; and Fong, M. W. 2003. Modeling multistep cyber attacks for scenario recognition. In DARPA Information Survivability Conference and Exposition (DISCEX), 284–292.
Cormen, T. H.; Leiserson, C. E.; and Rivest, R. L. 1990. Introduction to Algorithms. MIT Press.
Cuppens, F., and Miège, A. 2002. Alert correlation in a cooperative intrusion detection framework. In Proceedings of the IEEE Symposium on Security and Privacy, 202–215. IEEE Computer Society.
Dewri, R.; Poolsappasit, N.; Ray, I.; and Whitley, D. 2007. Optimal security hardening using multi-objective optimization on attack tree models of networks. In ACM Conference on Computer and Communications Security. ACM.
Dijkstra, E. W. 1959. A note on two problems in connexion with graphs. Numerische Mathematik 1:269–271.
Flanagan, D., and Matsumoto, Y. 2008. The Ruby Programming Language. O'Reilly.
Fu, C., and Fu, L. 2008. Comprehensive assessment model of network vulnerability based upon refined Mealy automata. In International Conference on Computer Science and Software Engineering (CSSE), 595–600.
Graff, M. G., and Wyk, K. R. V. 2003. Secure Coding: Principles and Practices. O'Reilly.
Ingols, K.; Chu, M.; Lippmann, R.; Webster, S. E.; and Boyer, S. 2009. Modeling modern network attacks and countermeasures using attack graphs. In Annual Computer Security Applications Conference (ACSAC), 117–126.
Ingols, K.; Lippmann, R.; and Piwowarski, K. 2006. Practical attack graph generation for network defense. In Annual Computer Security Applications Conference (ACSAC), 121–130.
Lippmann, R., and Ingols, K. 2005. An annotated review of past papers on attack graphs. Technical report, MIT Lincoln Laboratory, USA.
Maggi, P.; Pozza, D.; and Sisto, R. 2008. Vulnerability modelling for the analysis of network attacks. In International Conference on Dependability of Computer Systems (DepCoS-RELCOMEX), 15–22.
Martin, R. A. 2004. Managing vulnerabilities in your commercial-off-the-shelf (COTS) systems using an industry standards effort (CVE). In COTS-Based Software Systems, Third International Conference (ICCBSS), 206–208.
McDermott, D. 2000. The 1998 AI Planning Competition. AI Magazine 21(2).
Mell, P.; Scarfone, K.; Romanosky, S.; Members, I. G.; Brook, I. B.; Hanford, S.; Raviv, S.; Reid, G.; and Theall, G. 2007. A complete guide to the common vulnerability scoring system version 2.0.
Nanda, S., and Deo, N. 2008. The derivation and use of a scalable model for network attack identification and path prediction. JNW 3(4):64–71.

Ning, P., and Xu, D. 2003. Learning attack strategies from intrusion alerts. In ACM Conference on Computer and Communications Security, 200–209.
Odubiyi, J. B., and O'Brien, C. W. 2006. Information security attack tree modeling: An effective approach for enhancing student learning. In Seventh Workshop on Education in Computer Security.
Ou, X.; Govindavajhala, S.; and Appel, A. W. 2005. MulVAL: A logic-based network security analyzer. In 14th USENIX Security Symposium.
Paul, A.; Wijesekera, D.; and Kaushik, S. 2002. Scalable, graph-based network vulnerability analysis. In Proceedings of the ACM Conference on Computer and Communications Security (CCS), 217–224. ACM.
Qin, X., and Lee, W. 2004. Attack plan recognition and prediction using causal networks. In Annual Computer Security Applications Conference (ACSAC), 370–379.
Rabiner, L. R. 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE 77(2):257–286.
Ritchey, R., and Noel, S. 2002. Representing TCP/IP connectivity for topological analysis of network security. In Proceedings of the 18th Annual Computer Security Applications Conference, Las Vegas, 25.
Rogers, R.; Carey, M.; Criscuolo, P.; and Petruzzi, M. 2008. Nessus Network Auditing, 2nd ed. Burlington, MA: Syngress.
Schneier, B. 1999. Attack trees – modeling security threats. Dr. Dobb's Journal.
Schneier, B. 2001. Secrets. dpunkt.
Shafer, G. 1976. A Mathematical Theory of Evidence. Princeton University Press.
Sheyner, O.; Haines, J. W.; Jha, S.; Lippmann, R.; and Wing, J. M. 2002. Automated generation and analysis of attack graphs. In IEEE Symposium on Security and Privacy, 273–284.
Tidwell, T.; Larson, R.; Fitch, K.; and Hale, J. 2001. Modeling internet attacks. In Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5–6 June 2001.
Yu, D., and Frincke, D. A. 2007. Improving the quality of alerts and predicting intruder's next goal with hidden colored Petri-net. Computer Networks 51(3):632–654.


Efficient Text Discrimination Gary Coen Boeing Research & Technology PO Box 3707 MC 7L-43 Seattle WA 98124-2207 [email protected]

Abstract This paper presents an efficient text discrimination method for writing systems that use phonemic alphabets in electronic communication. Text discrimination by this method can provide general support for cybersecurity efforts to detect malware packaged within messages as transfer encodings. The method exploits the relation between sound segments in the phonemic inventory and alphabetic symbols in the writing system to discriminate between texts with exclusively linguistic content and those with (at least some) nonlinguistic content. The basic insight is simple: just as linguistic constraints govern how sound segments combine to formulate syllables and words, elements of the alphabet combine under related linguistic constraints to formulate written language. The text discrimination method exploits a universal system of phonemic sequence constraints within syllables to recognize electronic texts which encode only natural language contents, thereby discriminating between them and others with non-linguistic content, such as binary data encoded in alphabetic symbols.

Problem Automatic discrimination of ASCII encoded content as text or non-text has utility for computer applications like network security and USENET archive processing. For the purpose of discussion, let a delimited segment of an ASCII data stream be known as a message. In this context, a text/non-text discrimination capability recognizes natural language messages passing through the data stream and discriminates between messages which contain only natural language text and those which contain non-text contents, such as ASCII encoding of binary data in UNIX-to-UNIX, MIME base64, or similar transfer encodings (cf. [1]:§7). In this way, text discrimination distinguishes between messages with only natural language content, including misspellings and similar anomalies, and messages with non-linguistic content (perhaps embedded within natural language text). Efficient text discrimination by this method may provide general support for cybersecurity

efforts to detect malware packaged within messages as transfer encodings. Research publications on this topic are difficult or impossible to find. In the late 1980s (coincident with the commercial appearance of optical character recognition software products), a focus on text/non-text classification problems emerged in the adjacent field of digital image processing. Instead of character sequences, however, this problem space featured black, white, halftone, and multicolor pixels, and the research used general classification methods to extract low-level textural features from digitized images in order to determine a semantic class for each image constituent [2, 3, 4]. This paradigm discriminated between text and non-text image regions using methods from signal processing (US patents 4547811, 4577235, 4707745, etc.), statistical pattern recognition (US patents 5296939, 5768403, 7313275, etc.), and machine learning [5, 6]. Text discrimination to identify linguistic messages in a stream of alphabetic symbols lay outside the scope of this research paradigm. Text discrimination also contrasts with the text classification discipline that arose in the 1990s. Text classification associates topics with texts, where potential topics are provided by a gold standard inventory or discovered as a byproduct of feature extraction. Initially, the underlying research discipline depended on statistical pattern recognition or machine learning. As novel information resources became available on the maturing internet, stochastic methods were combined with other techniques in order to associate texts with topics. One research program manually removed uuencodings from input in order to prevent word statistics from being miscued by unnaturally repetitive tokens [7], thus insinuating how text discrimination might complement text classification by automating a data preparation task. Despite the apparent need, however, the research literature offers no evidence of automated text discrimination of the sort described above. The research reported here defines a method for text discrimination that focuses on the relation between sound segments and alphabetic symbols. The following sections of this paper sketch linguistic aspects of the problem to solve; linguistic universals governing natural language syllable structure as elements of a solution; an implementation of these universals in a word recognizer that differentiates between linguistic and non-linguistic sequences of alphabetic characters; and a control mechanism for encapsulating the recognizer within a text discrimination capability. Finally, test results for a concept demonstrator implementation are presented and discussed.


Approach Like manuscripts, electronic texts are linguistic artifacts encoded in human writing systems. Linguistic elements represented in writing systems range along a scale bounded by logograms (graphic representations of lexical material with no indication of phonology) and phonograms (graphic representations of phonology with no indication of lexical content). Traditional taxonomies position the segmental phonographic alphabet as the evolutionary end-point of a teleology initiated with pictographic writing [8]. Accordingly, phonographic writing systems encode phonological information, but logographic systems do not [9], although variation occurs—Egyptian, Sumerian, Greek Linear B, and Mayan writing systems appear to possess both logographic and phonographic elements [10]. Linguists often assume that human languages sound the way they do as a consequence of two processes—one provides language-dependent phonetic rules that relate phonological structure to physical parameters of speech, and the other provides language-independent constraints that presumably arise as a consequence of human speech production and perception. Every natural language has an inventory of sound segments and phonetic rules which characterize it and, in some fashion, differentiate it from other languages. This language-dependent process coexists with language-independent linguistic universals which constrain the patterns of sound segment combinations. To illustrate the distinction, the phonetic vowel qualities of a particular language are native to its phonemic inventory and rules for vowel realization, but the fact that its vowels have longer duration before voiced (as opposed to voiceless) consonants is arguably a universal phonetic process. Phonology relates sound segments with linguistic units through a description of distinctive features; likewise, alphabetic writing systems relate sound segments with linguistic units through alphabetic symbols. By correlating sound segments with characters, the text discrimination method presented here exploits linguistic constraints on the sequencing of sound segments within syllables to produce a recognizer for well-formed syllables, the composition of syllables into words, and, ultimately, the natural language texts embodied by sequences of words. Logically, since the recognizer distinguishes between messages with natural language content and those with other content, it also serves as a text discriminator. In this sense, this text discrimination method embodies something fundamental about the structure of human language. As a practical matter, ASCII provides an appropriate representational platform for text discrimination. It is, for instance, ubiquitous in computing. In the exchange of texts

over networks, non-ASCII character encodings are often converted into ASCII for client processing requirements. Often data loss occurs when source texts encode characters beyond the range of client processes. Under these circumstances, conversion logic customarily provides a similar, substitute character in the client character range or a default substitute, often the ASCII question mark. Hence, ASCII encoding contexts are assumed for the text discriminator, although there exists no known obstacle for application of the method in non-ASCII contexts. A more substantive restriction on the method is its confinement to segmental phonographic alphabets—this text discrimination method does not appear extensible in an obvious way to logographic writing systems.


Technical Description: the Syllable The syllable is a linguistic universal. The number of syllables in a word coincides with the number of rhythmic units, and this usually equals the number of vowels. As an abstraction, the syllable embodies general constraints on the distribution of sounds in natural languages. The broadest rule of this kind for any given language describes its canonical syllable pattern, the serial ordering of consonants and vowels in stress domains. Canonical syllable patterns are most often represented as a string of C and V symbols (respectively, consonant and vowel), where V may include complex vowel elements like diphthongs. The one kind of syllable which occurs in every language is CV (a consonant followed by a vowel). In some languages, this is the only pattern permitted. Usually, languages permit an optional initial consonant annotated as (C)V, where parentheses indicate optionality.

Figure 1: Simple Syllable

Figure 2: Complex Syllable

Syllables have structure. Every syllable in every language has a rhyme composed of a nucleus with a sonorant quality. The syllable onset is a structural sister to the rhyme. Together, the onset and rhyme constitute the principal syllable constituents. (See Figures 1-2, where σ represents the syllable.) When the syllable contains additional sound segments, they are attached as a coda, a daughter constituent of the rhyme. The syllables of all languages have onsets (or opening segments prior to the principal vocalic nucleus), but not all languages have codas (closing segments subsequent to the nucleus). More elaborate syllable structures add one or more consonants at the beginning of the syllable or in the final position. Syllable patterns like CVC and CCV, for instance, are


modest expansions of the simple CV syllable type. Languages may allow multiple consonants in the onset position, but they typically constrain the permitted combinations. For example, the second of two consonants is commonly limited to being one of a small set belonging to either the liquids (sounds commonly represented by the letters r and l) or the glides (vowel-like consonants like those beginning English wet and yet). Languages which permit freer combinations of two consonants in the position before a vowel, or which allow three or more consonants in the onset position and two or more consonants in the position after the vowel, are classified as having complex syllable structure (v. Figure 2). English, for example, exhibits complex syllable structure inasmuch as its canonical syllable pattern is (C)(C)(C)V(C)(C)(C)(C), where the initial optional consonant, when present, must be /s/ (cf. Table 3). Full instantiation of this pattern only occurs in a few words such as strengths, but it is relatively easy to find syllables beginning or ending with three consonants, as in split, texts, spasms, and struck. A universal principle constraining syllable structure depends on the phonetic notion of sonority. Sonorants (e.g., vowels, liquids, glides, and nasals) contrast phonetically with non-sonorants (the obstruents: plosives, fricatives, and affricates). Further, sonority is a matter of degree: for instance, the vowel /a/ is more sonorant than the consonant /m/ as well as the vowels /i/ and /u/. This universal linguistic principle provides for a sonority scale which arranges classes of sound segments from least to most sonorant (cf. [11:88-91]): obstruents < sonorant consonants < vowels. The notion of relative sonority is critical to an accurate description of syllable structure—the peak or nucleus of a syllable is always the most sonorant element, while onset sound segments tend to increase in sonority towards the nucleus, and coda segments tend to decrease in sonority as they occupy positions further away from the nucleus [12:116]. Thus relative sonority imposes class-based sequential constraints on sound segments within syllables, and this is arguably the most important linguistic principle governing syllable structure across the natural languages. It also marks the key insight in the text discrimination method presented here.

Data Representation The classification of sequential constraints associated with the relative sonority scale exposes a two-dimensional syllable structure organized around vocalic peaks. This two-dimensional perspective reveals a consistent rise and fall of sonority as phonetically definable sound segments demonstrate membership in natural classes. The sonority grid in Figure 3 (where O=obstruent, N=nasal, L=lateral approximant, F=fricative, G=glide, A=affricate, and V=vowel) illustrates the effect on the phrase crispy fried chicken. The sonority constrained structure of stress domains is a linguistic universal—just as sonority rises and falls according to this classification across the syllables of monosyllabic and polysyllabic word forms in this example,

the general process associated with relative sonority impacts the syllable structure and sound sequence of all natural language utterances. To facilitate discrimination between sonority sequences compatible with canonical syllable structures native to a group of related languages, it is necessary to refactor the sonority scale to differentiate between sound segments more finely. Consider the relative sonority classification of ASCII characters and character sequences presented in Table 1. The granularity of representation presented here is sufficient to combine with knowledge of syllable structure to crudely distinguish between character sequences that instantiate syllables (for many western European languages) and those that do not. This constraint-based approach to phonotactics has explanatory force. For example, one might observe that */pv-/, */bz-/, */gx-/, */zm-/, */fn-/, and */nl-/ are rare or unattested syllable onsets in natural languages. One might explain this observation with a phonotactic constraint on syllable onsets such that they may not be composed of adjacent segments from classes 1-5 of Table 1 (excepting anomalous /s-/). Alphabetic character sequences such as ASCII binary data encodings are remarkably indifferent to phonotactic constraints on stress domains, and the text discrimination method presented here exploits this crucial distinction.

Figure 3: Relative Sonority Grid

Sound Segment Class    Character Sequence Members          Sonority Class
Vowel                  a, e, i, o, u                       11
Glide                  y, w                                10
Rhotic approximant     wr, rh                              9
Lateral approximant    r, l                                8
Nasal A                m, n                                7
Nasal B                ng, gm, gn                          6
Voiced fricative       v, z, j                             5
Voiced stop            b, d, g                             4
Voiceless fricative    f, h, th, s, sh                     3
Voiceless affricate    ch, gh, ph, sch, tch, tz, wh, x     2
Voiceless stop         c, ck, ct, p, q, t, k               1

Table 1: Relative Sonority Classes for ASCII Character Sequences
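A minimal sketch of how the classification in Table 1 can be consulted in code; the dictionary contents mirror the table, while the greedy longest-match segmentation is an illustrative assumption rather than the paper's implementation:

```python
# Sonority class per character sequence, as in Table 1 (11 = most sonorant).
SONORITY = {
    "a": 11, "e": 11, "i": 11, "o": 11, "u": 11,
    "y": 10, "w": 10,
    "wr": 9, "rh": 9,
    "r": 8, "l": 8,
    "m": 7, "n": 7,
    "ng": 6, "gm": 6, "gn": 6,
    "v": 5, "z": 5, "j": 5,
    "b": 4, "d": 4, "g": 4,
    "f": 3, "h": 3, "th": 3, "s": 3, "sh": 3,
    "ch": 2, "gh": 2, "ph": 2, "sch": 2, "tch": 2, "tz": 2, "wh": 2, "x": 2,
    "c": 1, "ck": 1, "ct": 1, "p": 1, "q": 1, "t": 1, "k": 1,
}

def sonority_profile(word):
    """Segment a lowercase token greedily (longest sequence first) and return
    the sonority class of each segment; None marks characters outside the table."""
    profile, i = [], 0
    while i < len(word):
        for length in (3, 2, 1):                  # longest multi-letter sequence wins
            seg = word[i:i + length]
            if seg in SONORITY:
                profile.append((seg, SONORITY[seg]))
                i += length
                break
        else:
            profile.append((word[i], None))
            i += 1
    return profile

# Example: sonority_profile("blast") -> [('b', 4), ('l', 8), ('a', 11), ('s', 3), ('t', 1)]
```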

This crude filter fails on rare data like Yiddish kvetshn (meaning to squeeze or complain). Importantly, such exceptions are relatively infrequent and often a matter of historical accident. (In this case, kvetshn is derived from Middle High German quetzen.) Should a partial order like that presented in Table 1 prove inadequate to formulate the basis of a text discriminator for a particular mix of natural languages, it is possible to achieve targeted results by refactoring the relative sonority classification to an appro-


priate granularity up to the limit condition of a total order (cf. [13:222]). An obvious inconsistency of the relative sonority scale is its treatment of /s/, which combines with most onsets in languages like English to create a consonant cluster even when the following segment is a plosive. Since plosives are less sonorous than fricative /s/ by this logic, English words like spray, stink, and scurrilous should be impossible, which is contrary to fact. Furthermore, relative sonority does not explain why /sp-/, /st-/, and /sk-/ onsets occur but */ps-/, */ts-/, and */ks-/ do not. To overcome these discrepancies, this text discrimination method treats /s/ as a special case, assigning it a normal fricative sonority level except when it occurs at syllable margins, where it is assigned 0 to preserve the observed relative sonority distribution within the syllable. This treatment acknowledges the exceptional status of stress domains marked by adjacent /s/ and voiceless stops at syllable margins. Admittedly, the relationship between the sound system of a language and its writing system is not always straightforward, nor is it exhaustively rule-governed. For instance, English caught and plea, French eau and plait, and Spanish que each contain more than one letter associated with a vowel, even though each word has only one syllable. Conversely, two or more letters may often denote just one sound, as in the English letter sequences th, gh, sh, ch, tch, dg and the first two letters of Spanish guacamole. Sometimes letters are written which represent no sound whatsoever, like p in English receipt, g in French doigt, h in Italian hanno, or e in German ein. For the most part, observations such as these reveal vestiges of historical change which, when examined closely, are managed correctly by the underlying logic of carefully defined syllable structure and the relative sonority of sound segments. The operation of the text recognizer described in the next section is embedded within control logic with an effective empirical basis to ensure an acceptable outcome when the relative sonority system alone fails to produce desired results.

plexity. Figure 4, for instance, illustrates a simple FSA encoding the first sonority class from Table 1. This text discrimination method recognizes well-formed syllables and their combination into words, but it incorporates no lexical resource. For example, a text discriminator may be encoded to permit any of the stops from Table 1 classes 1 and 4 to form onset clusters with a class 8 lateral approximant as the second consonant (e.g., English crawl, bland, etc.) and to disallow clusters with lateral approximants as the initial consonant. At this level, the text discriminator performs syllable recognition and word recognition, yet the word recognizer is abstract in the sense that it allows any potential word realizable within the space of phonological possibility designated by a particular syllable. This property enables the text discriminator to recognize a superset of the lexical stock of target languages, including neologisms that arise over time, but it is useless otherwise as an acceptor or lemmatizer for those languages. As a classifier, the abstract word recognizer should demonstrate poor recall; its precision, on the other hand, may be markedly higher since it recognizes only alphabetic sequences compatible with target language syllables. Serially-ordered phonotactics mandate leftmost derivation of syllable structure. The syllable recognizer of the text discriminator depends on finite-state technology to detect the initial edge of a syllable and to determine which character classes may follow in sequence. As subsequent input is processed, the FSA rejects as non-words any whitespace delimited sequences incompatible with designated syllable structure constraints it encodes. To ensure the necessary capabilities, the word recognizer is constructed according to the following procedure: 1. Create a regular expression for each constituent of a syllable structure pattern 2. Compile each regular expression thus obtained into a finite-state device 3. Concatenate the finite-state devices thus obtained in the syllable constituent sequence specified in the syllable structure pattern to obtain a syllable recognizer 4. Apply the Kleene-plus operator to the syllable recognizer to obtain a word recognizer The first step above is accomplished with the regular expressions Table 2 (on the next page) assigns to Table 1 sonority classes. With these regular expressions as components, additional expressions are defined to capture the patterned behavior observed in syllable onsets of target languages, and the onset for the syllable recognizer is then constructed from the generalized union of FSAs compiled from these regular expressions. The rhyme and coda are constructed analogously. Regular expressions for the word, syllable, and various syllable constituents are presented in Table 3 in the Appendix. (Note that names for sonority classes and defined constituents appear in the second column of that table as space-saving substitutes for the regular expressions they denote.) For the reader’s convenience, Table 3 correlates

Regular Languages and Syllable Recognition This method presented here uses regular languages to encode sonority scale structural constraints on syllable patterns. With regular language technology, constraints on particular syllable constituents can be articulated as regular expressions over an appropriate alphabet. Kleene’s theorem [14] demonstrates that these regular expressions can be embodied in finite-state automata capable of recognizing the elements of such regular languages. Furthermore, each finite-state automaton (FSA) thus obtained can be combined with others via concatenation, union, or other operations to assemble recognizers for onsets, codas, Figure 4: rhymes, syllables, and FSA for Sonority Class 1 words of arbitrary com-


For the reader's convenience, Table 3 correlates each regular expression with example English words instantiating the indicated patterned sequence.

To illustrate the information structure assembled at this point, the output FSA for the generalized union of syllable onsets O1-O10 from Table 3 is displayed in Figure 5 in the Appendix. This moderately complex FSA consists of more than a dozen states, 80 arcs, and 400 paths. When combined with automata embodying other syllable constituents, more complex networks result. For example, the FSA for WORD from that same table manages the entire English word structure, including phonotactic constraints on syllable onsets, rhymes, and codas, plus it provides for 1-8 syllables per word. This FSA consists of 1,507 states and 32,811 arcs, making it too complex graphically for display here.

Sound Segment Class    Regular Expression                               Sonority Class
Vowel                  a|e|i|o|u                                        C11
Glide                  y|w                                              C10
Rhotic approximant     wr|rh                                            C9
Lateral approximant    r|l                                              C8
Nasal A                m|n                                              C7
Nasal B                n g | g [m | n]                                  C6
Voiced fricative       v|z|j                                            C5
Voiced stop            b|d|g                                            C4
Voiceless fricative    f | (s | (p) t) h | s                            C3
Voiceless affricate    [[c | g | p | s c | t c | w] h | t [l | z] | x]  C2
Voiceless stop         [c (k | t) | p | q | t (h) | k]                  C1

Table 2: Regular Expressions for Relative Sonority Classes
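To make the four-step construction concrete, here is a compact sketch in which Python's re module stands in for a finite-state toolkit; the onset, rhyme, and coda expressions below are drastically simplified illustrative abbreviations, not the full Table 2/3 grammar:

```python
import re

# Abbreviated stand-ins for a few of the sonority-class expressions of Table 2.
C11 = r"(?:a|e|i|o|u)"            # vowels
C8  = r"(?:r|l)"                  # lateral approximants
C1  = r"(?:ck|ct|c|p|q|t|k)"      # voiceless stops

# Steps 1-2: regular expressions for syllable constituents (toy O1/K-style patterns).
onset = rf"(?:s?{C1}{C8}?)"       # e.g. 'c', 'pl', 'scl'
rhyme = rf"{C11}{{1,2}}"          # one or two vowel letters
coda  = rf"{C8}?{C1}?s?"          # optional approximant, stop, final s

# Step 3: concatenate constituents into a syllable recognizer.
syllable = onset + rhyme + coda

# Step 4: Kleene-plus over syllables gives an abstract word recognizer.
word = re.compile(rf"^(?:{syllable})+$")

for token in ["scat", "treats", "xqzt"]:
    print(token, bool(word.match(token)))
```

The real recognizer is a single compiled automaton rather than a backtracking regex engine, but the constituent-by-constituent composition illustrated here is the same.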



Testing and Discussion Testing of the text discriminator was conducted with a text recognizer FSA assembled from the values presented in Tables 2 and 3. Based on multiple trials, the text discriminator control logic threshold parameter v was set to 1.8. Test data included 210 messages of varying lengths (comprising approximately 3,000 total words selected randomly from a list of approximately 260,000 English words). The test set decomposed into 110 messages containing only natural language text and 100 messages containing transfer encodings.






Test results indicate word recognition at 98% and text discrimination at 100% precision. The text discriminator accepted as texts only the 110 valid messages, rejecting the other 100 as non-text. Messages with misspellings were accepted, and those with transfer encodings were rejected. The results are summarized in Chart 1, where accepted messages are indicated wherever l values (gray) are plotted beneath corresponding R values (black):

Text Discriminator Control Logic


The control logic responsible for operation of the text discriminator possesses a threshold value v (v > 0). For each message it computes a value l and a corresponding reference value R: if l > R, the message is rejected as non-text; otherwise, it is accepted. In this procedure, the inherently efficient operation of the text discrimination automaton is governed by a process responsible for counting character sequences and performing one multiplication, one division, and two subtraction operations per message, a modest overhead. As a practical matter, this control logic balances precision and recall in text discrimination against the inevitable substrate of noise (e.g., acronyms, abbreviations, initialisms, alphanumerics, and misspellings) that occurs naturally in texts.


Chart 1: Text Discriminator Test Results

For testing purposes, uuencodings were selected to exemplify transfer encodings in general. This seems reasonable—transfer encoding formats are similar, uuencodings have the longest history among the different types, and


uuencodings manifest the full range of problems typically encountered in managing transfer encodings of any type. For instance, over decades of use, non-standard uuencoding practices have been conventionalized [15], thus creating problems for software products that try to accommodate standards and non-standard variation in uuencoded format (cf. recognition of uuencoded email attachments [16]). Further complicating matters, contemporary use in service to malicious software deployment often intentionally obscures standard formats from detection.

As a convenience for testing, no punctuation was included in the test messages. This artificial condition is easily reproduced in practical computing environments associated with network management and cybersecurity, and punctuation in general is not germane to the core problem addressed. No abbreviations, initialisms, or acronyms were included in the test messages either. These word forms are easily managed by exhaustively enumerating them in a closed vocabulary approach, compiling the list of such forms into a separate finite-state automaton, and finally unifying it with the abstract word recognizer FSA, thus extending its vocabulary with a local grammar technique. Hence, the test results remain relevant for text discrimination without factoring punctuation, abbreviations, initialisms, or acronyms into the analysis.
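A minimal sketch of the closed-vocabulary extension just described, using regex union as a stand-in for FSA union; names and patterns are illustrative:

```python
import re

def extend_recognizer(word_pattern, closed_vocabulary):
    """Union an enumerated closed vocabulary (acronyms, initialisms, abbreviations)
    with an existing word-recognizer pattern, mirroring the local-grammar extension
    described above."""
    listed = "|".join(re.escape(form) for form in sorted(closed_vocabulary))
    return re.compile(rf"^(?:{listed}|{word_pattern})$")

# Example: extend_recognizer(r"[a-z]+", {"NASA", "FSA", "e.g."})
```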

In testing, word recognizer performance was deficient for English words of Slavic origin and common proper nouns like Amdahl, McKenzie, Johns, etc. Careful adjustments to Tables 1-3 were sufficient to correct for recognition of these words. Even with these deficiencies, the precision of the abstract word recognizer embedded in the text discriminator is good enough to support desirable text discriminator performance. At a detailed level, the randomly selected test message word forms (inclusive of misspellings) varied between 3 and 21 characters in length, whereas bodies of uuencoding sequences in test messages varied from 6 to 61 characters. The characteristically longer uuencoded subsequences were invariantly identified by the word recognizer as non-conformant to syllable structure constraints. These observations suggest that the high precision of these text discrimination test results is most likely due to two factors: uuencoded strings are indifferent to natural language syllable structure, and the distribution of whitespace among uuencoded string segments is sparser than in natural language text.

Acknowledgements This paper’s content has benefitted from fruitful discussions with Hugh L. Taylor, whose expertise in multi-level network security was indispensible to the formulation of ideas presented here. Any omissions and misconceptions which remain in the paper are the responsibility of the author alone.

Acknowledgements This paper's content has benefitted from fruitful discussions with Hugh L. Taylor, whose expertise in multi-level network security was indispensable to the formulation of ideas presented here. Any omissions and misconceptions which remain in the paper are the responsibility of the author alone.

References
[1] Whistler, K. et al. 2008. Unicode Character Encoding Model. TR#17, Rev. 7. Last accessed July 8, 2009 at http://www.unicode.org/unicode/reports/tr17-7.html.
[2] Tamura, H. et al. 1978. Textural Features Corresponding to Visual Perception. IEEE Transactions on Systems, Man and Cybernetics 8:460–473.
[3] Amadasun, G. and R. King. 1989. Textural Features Corresponding to Textural Properties. IEEE Transactions on Systems, Man and Cybernetics 19(5):1264–1274.
[4] Pass, G. et al. 1996. Comparing Images Using Color Coherence Vectors. Proceedings of the Fourth ACM Conference on Multimedia (Boston MA).
[5] Wang, D. and S.N. Srihari. 1989. Classification of Newspaper Image Blocks Using Texture Analysis. Computer Vision, Graphics and Image Processing 47:327–352.
[6] Inglis, S. and I.H. Witten. 1995. Document Zone Classification Using Machine Learning. Proceedings of Digital Image Computing: Techniques and Applications. 631–636.
[7] Scott, Sam and Stan Matwin. 1998. Text Classification Using WordNet Hypernyms. Proceedings of the COLING/ACL Workshop on Usage of WordNet in Natural Language. 45–51.
[8] Gelb, Ignace. 1963. A Study of Writing. Chicago IL: University of Chicago Press.
[9] Sampson, Geoffrey. 1985. Writing Systems. Stanford CA: Stanford University Press.
[10] Macri, Martha. 1996. Mayan and Other Mesoamerican Scripts. In The World's Writing Systems. Peter T. Daniels and William Bright, Eds. pp. 172–182. Oxford University Press, NY.
[11] Spencer, Andrew. 1996. Phonology: Theory and Description. Wiley-Blackwell.
[12] Selkirk, Elizabeth. 1984. On the Major Class Features and Syllable Theory. In Aronoff, M. and R. Oehrle (Eds.), Language Sound Structure: Studies in Phonology Presented to Morris Halle by his Teacher and Students. Cambridge MA: MIT Press.
[13] Ladefoged, Peter. 1982. A Course in Phonetics. 2nd Ed. Harcourt Brace Jovanovich.
[14] Kleene, S.C. 1956. Representation of Events in Nerve Nets and Finite Automata. In Shannon, C. and J. McCarthy (Eds.), Automata Studies. Princeton NJ: Princeton University Press. 3–42.
[15] IEEE and The Open Group. 2004. IEEE Std 1003.1. Last accessed August 4, 2009 at http://www.opengroup.org/onlinepubs/009695399/utilities/uuencode.html.
[16] Microsoft. 2006. The text in a newsgroup message or in an e-mail message is incorrectly interpreted as a blank attachment in Outlook Express. Article ID: 898124. Last Review: October 11, 2006. Revision: 3.3. Accessed 8/4/09 at http://support.microsoft.com/default.aspx/kb/898124.





Appendix


Name    Regular Expression                                 English Examples
O1      (s) C1 (C8) (C10)                                  cat, play, sclera, twin, try
O2      (s) C2 (C7 | C8) (C10)                             child, gnome, phone, wry
O3      (p) C3 (C7 | C8) (C10)                             flow, sylph, smug, thrip, thwack
O4      C4 (C8) (C10)                                      being, bye, dye, dry, grow, gyre
O5      C5 (C8) (C10)                                      jazz, vim, vraisemblance, zoos
O6      C6 (k | p | m)                                     gnarly, gnat, gnaw, gnome, gnu
O7      C7 (C8) (C10)                                      me, myiasis, myotic, nye
O8      C8                                                 rack, role, ruby, lady, lull, lupine
O9      C9                                                 rhinoceros, rhubarb, wring, wry
O10     C10                                                yank, yes, yule, yurt, will, worst
K1      C10 (C8) (C7) [C3 | C2 | C1]^{1,2} (s)             awl, dough, eyry, growth, yawn
K2      C10 (C8) C7^{1,2} (s)                              awning, drowns, strewn
K3      C10 C8^{1,2}                                       myrrh, platyrrhine, pyre, pyrrhic
K4      C8 (C7) [C2 | [C3 | C1]^{1,2}] (s)                 alpha, larch, schmaltz, sylphs
K5      C8 (C7) C4^{1,2} (s)                               alb, bird, burgh, orb, pulgar, scold
K6      C8 [C3 | C4] C1^{1,2} (s)                          first, torsk, twelfth, width, whilst
K7      C8 [C3 | C4]^{1,2} (s)                             albs, barbs, dwarfs, guild, petard
K8      C8^{1,2} (C4) (s)                                  all, errant, jelly, larrup, nulls, well
K9      C7 (C4) C1^{1,2} (s)                               oink, length, strengths, thousandth
K10     C7 C4^{1,2} (s)                                    bung, clang, lambda, sand, thumb
K11     C7 [C3 | C2] (s)                                   bench, bumph, inch, larynx
K12     C7 (C3) C1^{1,2} (s)                               against, bunk, empty, pants, romp
K13     C7 C3^{1,2} (s)                                    absinth, bumf, canst, tenth
K14     (C8) C7^{1,2}                                      column, damn, hymn, inn, limn
K15     C6 (C3) (s)                                        amongst, angst, dinghy, lengthen
K16     C5 (C1) (s)                                        cozy, daze, eve, fez, hajj, nozzle
K17     C5^{1,2}                                           buzz, fizz, frizz, jazz, navvy
K18     C4^{1,2} (s)                                       add, babble, ebb, eggs, rubber
K19     (C4) C3 C1 (s)                                     asp, bisque, drift, lisps, tusks
K20     (C4) C3^{1,2} (s)                                  bluffs, chuff, golfs, gruff, mess
K21     C2 (C1) (s)                                        ache, aitch, bush, cash, patch
K22     (C4) C1^{1,2} (s)                                  acts, buck, check, receipts, tipple
ONSET   0 | O1 | O2 | O3 | O4 | O5 | O6 | O7 | O8 | O9 | O10    Any
RHYME   C11^{1,5} | C10                                    Any
CODA    K1 | K2 | K3 | K4 | K5 | K6 | K7 | K8 | K9 | K10 | K11 | K12 | K13 | K14 | K15 | K16 | K17 | K18 | K19 | K20 | (s)    Any
SYLBL   ONSET RHYME CODA                                   Any
WORD    SYLBL^{1,8}                                        Any

[] indicates regular expression grouping, () optionality, | the union operator, and ^{m,n} iteration.

Table 3: Onset Constituents (On), Coda Constituents (Kn), Syllable Structure, and Word


Figure 5: FSA unifying O1-O10


Lexical Ambiguity and its Impact on Plan Recognition for Intrusion Detection. Christopher W Geib University of Edinburgh School of Informatics 10 Crichton St. Edinburgh EH3 6ED, United Kingdom [email protected] Abstract

not assume that a PR system’s observations all form a single goal, but worse yet we can imagine multiple instances of exactly the same goal (for different attackers or from different sources.)

Introduction Given a plan library and a set of observations, the problem of identifying an agent's plans and goals on the basis of their observed actions is called plan recognition (PR), and is a well-studied problem in AI. Previous work (Harp and Geib 2003; Geib and Goldman 2002) has suggested using PR both for 1) recognizing the high level plans of someone that is attacking or misusing a computer system as well as 2) lower level intrusion detection (to identify exploits). However, there are two major issues that make a straightforward application of most prior PR research infeasible.

1. Multiple concurrent goals: Much of the prior work in PR has assumed that an agent is engaged in a single goal at any given time.1 However this assumption is simply not supported in computer security domains. Any given networked system will more often than not be under attack from multiple, possibly cooperating, possibly competitive sources. Each source may have a single or multiple goals. We can easily imagine a collection of "hackers" that are all using a set of scanning tools to attempt to identify machines to host their software or just for bragging rights. They may very well be scanning, attempting exploits, and performing other activities at the same time in an effort to achieve different instances of the same goals. Thus, not only can we

not assume that a PR system's observations all form a single goal, but worse yet we can imagine multiple instances of exactly the same goal (for different attackers or from different sources).

2. High ambiguity of individual action observations: Many, if not all, of the actions that are part of compromising a computer's security have a large number of legitimate uses as well as hostile ones. In fact, often almost all of the actions taken to compromise a computer system are individually completely legal and acceptable. It is only within the context of the collection of actions that they become problematic. This means that individual actions are not highly diagnostic of malicious intent. It is only collections of actions within specific contexts that are diagnostic. Unfortunately, this kind of ambiguity is problematic for much of the prior research in PR (Bui, Venkatesh, and West 2002; Avrahami-Zilberbrand and Kaminka 2005; Geib 2006; Kautz 1991).

To address these problems, we have formulated PR based on Combinatory Categorial Grammars (CCGs) (Steedman 2000), a grammatical formalism developed for use in natural language parsing (NLP). This has been implemented in the ELEXIR (Geib 2009) system, which uses CCGs to represent plan libraries. This formulation requires us to introduce the new idea of plan heads. We will show that making the correct choices about plan heads will allow us to control runtimes even in highly ambiguous domains like computer network security. The rest of this paper will be organized in the following way. First, we will discuss related work in plan recognition, then we will provide an overview of ELEXIR. We will continue with some synthetic studies that demonstrate ELEXIR's ability to control runtimes in highly ambiguous domains, and then discuss limitations of the algorithm.

Background PR has seen a significant increase in interest due to the availability of large quantities of real world sensor data. Following (Carberry 1990) and (Pynadath and Wellman 2000), we are interested in viewing the problem as one of parsing a sequence of observations to produce plans. Starting from real world data and viewing PR as a parsing task, we can see the problem as made up of three major

1 Some exceptions to this are the early work by (Kautz and Allen 1986) and much of the probabilistic work of (Charniak and Goldman 1990).


tasks: 1) recognizing actions, 2) assigning syntactic categories to each of the actions, and 3) combining the actions based on their categories into plans. Each of these tasks must address ambiguity that we will refer to as: action ambiguity, syntactic ambiguity, and attachment ambiguity respectively. This paper will focus on syntactic ambiguity; however, to clearly disambiguate this work, we will briefly discuss related research on the other two forms. Action ambiguity is typically a result of sensor noise. The observation of a single real world action is usually made up of multiple, temporally extended, noisy sensor reports. These reports must be converted into a usable sequence of observations of actions, for example, labeling video frames showing an agent reaching for a coffee mug as part of a grasp-mug action. This problem is typically called activity or behavior recognition. Starting from real world noisy data (video, sonar, passive RF, GPS data and others), successful activity recognition research has used Hidden Markov Models (HMMs) (Bui, Venkatesh, and West 2002), Conditional Random Fields (CRFs) (Liao, Fox, and Kautz 2007; Vail and Veloso 2008), and other forms of Bayesian reasoning (Avrahami-Zilberbrand and Kaminka 2005; Hoogs and Perera 2008; Liao, Fox, and Kautz 2005). In many cases, researchers have shown impressive results with significant variation in the sensor noise. However, even a perfect activity recognizer cannot eliminate all ambiguity from the problem. Consider unambiguously recognizing a grasp-mug action. The agent's goal is still unclear. Is the agent going to drink out of the mug? Place it on the table? Clean it? It is only by considering the larger plans created by sequences of observed actions that we can recognize the goals of the agent. Previous work in PR has not distinguished between syntactic and attachment ambiguity; however, in their work on natural language parsing, (Sarkar, Xia, and Joshi 2000) (SXJ) clearly lay out the differences between choosing a syntactic category for a given observation (syntactic ambiguity) and finding the correct attachments between the categories (attachment ambiguity) that will result in a sentence (in our case a plan). SXJ show that, in the case of natural language parsing, the computational cost of attachment ambiguity is far less than that of syntactic ambiguity. Their results do not directly transfer to the PR domain due to differences between action grammars and natural language grammars. However, this result suggests exploring the cost of syntactic ambiguity in PR. We leave the study of attachment ambiguity in PR as an area for future work. We can also see this difference in domains that have effectively no sensor noise. Let us return to computer network security. In a well instrumented network, it is possible to observe all of the packets and all of the commands run on the system for any given interval. In this case there is no action level ambiguity; an action is observed in the system or it is not. Again, this doesn't mean there is no ambiguity about the plans being followed by a user or a hacker. Suppose we have a hacker that has gotten super user access to a computer. This will not tell us what his/her end goal is

data theft, denial of service, or simply bragging about having broken system security. We know of no systematic analysis of the effect of varying syntactic ambiguity on PR or specific ways to control it. In this paper, we will first review the PR algorithm ELEXIR (Geib 2009), based on parsing plans represented as Combinatory Categorial Grammars (CCGs) (Steedman 2000). We will then discuss how to compute plan level ambiguity. We will then discuss how to use plan heads in the CCG representation to control the effects of plan level ambiguity on ELEXIR's runtime.

ELEXIR Overview

The ELEXIR system (Geib 2009) performs probabilistic PR using a weighted model counting algorithm, given a set of observations and a CCG specification of the plans to be recognized in a plan lexicon. To perform plan hypothesis construction, ELEXIR parses the observations, based on the plan lexicon, into the complete and covering set of explanations, each of which contains one or more plan structures. ELEXIR then establishes a probability distribution over the explanations to reason about the most likely goals and plans of the agent. The first step is to encode the plans in CCGs. We refer the interested reader to (Geib 2009) for complete details of the formalization and algorithm behind ELEXIR. However, an understanding of how plans are represented in CCGs will be important for our discussion.

Representing Plans in CCG

Consider the simple abstract hierarchical plan drawn as a partially ordered AND-TREE shown in Figure 1.

Figure 1: An abstract plan with partial order causal structure. (The goal G decomposes into actions A, B, C, and D, which are realized by the observable actions a, b, c, and d.)

In this plan, to execute action G the agent must perform actions A, B, C, and D. A and B must be executed before C but are unordered with respect to each other, and finally D must be performed after C. To represent this plan in a CCG, each observable action is associated with a set of syntactic categories. A set of possible categories, C, is defined recursively by:

Atomic categories: a finite set of basic action categories, C = {A, B, ...}.

Complex categories: ∀Z ∈ C and non-empty set {W, X, ...} ⊂ C, both Z\{W, X, ...} ∈ C and Z/{W, X, ...} ∈ C.

Complex categories represent functions that take a set of arguments ({W, X, ...}) and produce a result (Z). The direction of the slash indicates where the function looks for its arguments. Therefore, an action with category A\{B} is a function that results in performing action A when an action with category B has already been performed. Likewise, A/{B} is a function that results in performing A if an action with category B is executed later.
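As a concrete (purely illustrative) encoding, the Figure 1 plan can be written as a decomposition plus a set of ordering constraints; the field names below are our own, not part of ELEXIR.

# Hypothetical encoding of the Figure 1 plan: G decomposes into A, B, C, D;
# A and B must precede C, and C must precede D; each abstract action is
# realized by one observable action.
plan_G = {
    "root": "G",
    "decomposition": ["A", "B", "C", "D"],
    "orderings": [("A", "C"), ("B", "C"), ("C", "D")],  # partial order
    "realized_by": {"A": "a", "B": "b", "C": "c", "D": "d"},
}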


Definition 1.1 We define a plan lexicon as a tuple PL = ⟨Σ, C, f⟩ where Σ is a finite set of observable action types, C is a set of possible CCG categories, and f is a function such that ∀σ ∈ Σ, f(σ) → Cσ ⊆ C, where Cσ is the set of categories an observation of type σ can be assigned.

We may provide just the function that maps observable action types to categories to define a plan lexicon. For example:

a := A,  b := B,  c := (G/{D})\{A, B},  d := D.  (1)

defines one plan lexicon for our example plan.

Definition 1.2 We define a category R as being the root or root-result of a category G if it is the leftmost atomic result category in G. For a category C we denote this root(C).

Thus, G is the root-result of (G/{D})\{A, B}. Further,

Definition 1.3 We say that observable action type a is a possible head of a plan for C just in the case that the lexicon assigns to a at least one category whose root-result is C.

In our lexicon, c is a possible head for G. In general, a lexicon will allow multiple categories to be associated with an observed action type. This is the source of syntactic ambiguity, as the parser must choose between them.

In CCGs, combinators (Curry 1977) are used to combine the categories of the individual observations. We will only use three combinators, defined on pairs of categories:

rightward application:  X/α ∪ {Y}, Y ⇒ X/α
leftward application:   Y, X\α ∪ {Y} ⇒ X\α
rightward composition:  X/α ∪ {Y}, Y/β ⇒ X/α ∪ β

where X and Y are categories, and α and β are possibly empty sets of categories. To see how a lexicon and combinators parse observations into high-level plans, consider the derivation in Figure 2, which parses the sequence of observations a, b, c.

Figure 2: Parsing observations with CCGs. (a, b, and c are assigned the categories A, B, and (G/{D})\{A, B}; two leftward applications then yield (G/{D})\{A} and finally G/{D}.)

Notice that all the hierarchical structure from the original plan for achieving G is included in c's category. Thus, once c is observed and assigned its category, we can use leftward application twice to combine both the A and B categories with c's initial category to produce G/{D}.
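To make the category machinery concrete, here is a small Python sketch of complex categories and the leftward-application combinator applied to the derivation of Figure 2. The class and function names are our own; ELEXIR's actual data structures are described in (Geib 2009).

# Illustrative CCG plan categories and leftward application (not ELEXIR code).
class Atomic:
    def __init__(self, name): self.name = name
    def __repr__(self): return self.name

class Complex:
    # result: a category; slash: '/' or '\'; args: set of argument categories
    def __init__(self, result, slash, args):
        self.result, self.slash, self.args = result, slash, set(args)
    def __repr__(self):
        return f"{self.result}{self.slash}{{{','.join(map(str, self.args))}}}"

def leftward_apply(done, cat):
    # Y, X\(alpha u {Y}) => X\alpha : discharge one leftward argument.
    if isinstance(cat, Complex) and cat.slash == '\\' and done in cat.args:
        rest = cat.args - {done}
        return Complex(cat.result, '\\', rest) if rest else cat.result
    return None

A, B, D, G = Atomic("A"), Atomic("B"), Atomic("D"), Atomic("G")
c_cat = Complex(Complex(G, '/', {D}), '\\', {A, B})   # category of observation c
after_a = leftward_apply(A, c_cat)                    # (G/{D})\{B}
after_b = leftward_apply(B, after_a)                  # G/{D}
print(c_cat, after_a, after_b)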

Empirical Studies of Ambiguity in ELEXIR

Using synthetic grammars and observation streams we can test the impact on ELEXIR's runtime of varying the syntactic ambiguity of the grammar. Constructing plan lexicons while keeping the underlying plan structure fixed and varying the number of observable actions in the grammar provides a simple way to control the number of categories associated with each observable action, and hence the associated syntactic ambiguity.

To see if syntactic ambiguity has a measurable effect even on simple problems, we use totally ordered plans with a tree height of two and a branching factor of five. Thus, each plan has twenty-five steps. For our lexicon we generated sixty-one such unambiguous plans. For each of these plans, the leftmost depth-first action of each sub-tree was chosen as the head for the sub-tree. Thus the CCG categories can be thought of as encoding the plan as a series of leftmost depth-first sub-trees. With these CCG categories in hand, we generate multiple lexicons with differing levels of ambiguity by controlling the number of observable actions in the grammar. We measure the ambiguity, A, as a real value between zero and one, where the number of observable actions, |C|, is given by:

|C| = (1 − A) ∗ |l|,   (2)

where |l| is the number of leaf actions in the plans represented by the lexicon. Given a set of plans encoded in CCGs we can then systematically vary the ambiguity of the resulting lexicon. We use formula (2) to compute the number of observable actions for the lexicon, given the desired ambiguity, and then randomly assign each category to an observable action while guaranteeing that each observable action gets at least one category. Using this method we generated lexicons for the same set of underlying plans with ambiguities of 0.0, 0.1, 0.2, 0.3, 0.4, and 0.5.

Given these CCG plan lexicons, we generated observations to test the system by randomly selecting a root-result category and producing a plan instance for it based on the plan lexicon. ELEXIR is then timed computing the conditional probability of all the root-results found by the algorithm given the CCG plan lexicon and the sequence of observations. All of our experiments measuring the runtime of our C++ implementation of ELEXIR were conducted on a MacBook with 4 GB of main memory and two 2.2-GHz CPUs.

Figure 3 shows the average, minimum, and maximum runtimes, testing fifty plan instances each for ambiguity values of 0, 0.1, 0.2, 0.3, and 0.4 with a runtime bound of one minute. All three statistics show significant increases for even very limited amounts of ambiguity. The reason the maximum and average statistics decrease after 0.2 is that the vast majority of the experiments did not return in under a minute. The number of experiments that did return in under a minute is given along the X-axis in the figure. Once the ambiguity exceeds 0.2, more than half of the test cases jumped from runtimes of under ten seconds to over a minute. As the ambiguity increases, the number of successful sub-minute tests drops until none of the tests returned in under a minute when the ambiguity reached 0.5 (an average of two categories per observation).

Figure 3: Ambiguity increases the minimum, maximum, and average runtime. Notice the significant ceiling effect above A = 0.2.
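The lexicon-generation procedure described above can be sketched as follows. This is our own reconstruction under stated assumptions, not the authors' generator: we compute |C| from formula (2), treating the categories of the unambiguous grammar as one per leaf action, and distribute them randomly so that every observable action type receives at least one.

import random

def make_lexicon(categories, ambiguity):
    # |C| = (1 - A) * |l|, formula (2); each observable type gets >= 1 category.
    categories = list(categories)
    random.shuffle(categories)
    num_types = max(1, round((1 - ambiguity) * len(categories)))
    types = [f"obs{i}" for i in range(num_types)]
    lexicon = {t: [c] for t, c in zip(types, categories[:num_types])}
    for extra in categories[num_types:]:
        lexicon[random.choice(types)].append(extra)   # remaining go at random
    return lexicon

# 25 leaf categories at ambiguity 0.5: roughly two categories per observable type.
lex = make_lexicon([f"cat{i}" for i in range(25)], 0.5)
print({t: len(cs) for t, cs in lex.items()})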

Choosing Heads in Plan Lexicons

The critical choice made during lexicon construction is which action types will be the plan heads. Different choices for heads result in different lexicons. For example, the following is an alternative lexicon for our G plan:

a := A,  b := B,  c := C,  d := (G\{A, B})\{C}.  (3)

We can also represent the plan for G with the following lexicon, which has two possible categories for action a:

a := { ((G/{D})/{C})/{B}, ((G/{D})/{C})\{B} },  b := B,  c := C,  d := D.  (4)

There are also a number of still more complex lexicons where other choices are made for the plan heads. Geib (2009) has pointed out that correctly choosing plan heads can have a significant impact on the runtime of ELEXIR. We hypothesize that correctly choosing plan heads can also help in addressing syntactic ambiguity.

It will be helpful to have a value, h, to quantify where the head occurs within a plan. We establish a canonical order of the plan's actions that obeys the plan's ordering constraints,² and define the headedness of a particular plan as the rank of the plan's head action in this ordering divided by the length of the plan. Thus, grammar (3) would have a headedness value of one for the plan for G, grammar (4) would have a headedness value of 0.25 for the plan for G, and our original grammar (1) would have a headedness value of 0.75 for the plan for G.

² For the purposes of our experiments, it will not be significant that actions that are actually unordered with respect to each other can have differing values for headedness. The fact that we can systematically move through the plan's actions is more important.
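Headedness is straightforward to compute from a canonical ordering of a plan's actions; the following small function is our own illustration of the definition, not code from the paper.

def headedness(canonical_order, head_action):
    # rank of the head action (1-based) divided by the plan length
    return (canonical_order.index(head_action) + 1) / len(canonical_order)

order = ["a", "b", "c", "d"]          # canonical order for the example plan
print(headedness(order, "c"),         # grammar (1): 0.75
      headedness(order, "a"),         # grammar (4): 0.25
      headedness(order, "d"))         # grammar (3): 1.0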

Reducing Runtimes by Choosing Plan Heads

Our previous experiment held the headedness of the plans constant at 0.0. In order to explore the impact that varying headedness might have on controlling ambiguity, we ran experiments systematically varying the headedness of the plans over five values: 0.0 (the same as our previous experiment), 0.25, 0.5, 0.75, and 1.0. Our hypothesis in this experiment is that larger headedness values will delay commitment to high-level goals and thereby reduce the runtime of the algorithm.

To create these different lexicons, we used the same set of sixty-one totally ordered plan trees. These plans were then converted to a CCG lexicon by starting at the root of the plan and recursively descending the tree, following the actions with the indices given by (h ∗ plan-branching-factor) and collecting the siblings that are to the left and the right of the chosen action. When a leaf is reached, a CCG category is built that maintains the ordering constraints of the original plan. This process is repeated for all sub-plans not covered by the initial category. This results in five grammars in which the head of each plan moves from left to right over the actions of the plan as the value of h is increased.
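The extracted text does not specify how (h ∗ plan-branching-factor) is rounded to a child index, so the sketch below assumes a simple floor with clamping; under that assumption the five experimental h values pick out the five child positions from left to right, which matches the description above.

import math

def head_index(h, branching_factor):
    # 0-based index of the child followed at each level; the rounding is our assumption
    return min(branching_factor - 1, math.floor(h * branching_factor))

print([head_index(h, 5) for h in (0.0, 0.25, 0.5, 0.75, 1.0)])
# -> [0, 1, 2, 3, 4]: the head moves left to right as h increases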

Varying both headedness and ambiguity resulted in thirty distinct grammars. For each grammar, we ran fifty tests recognizing a single plan. The minimum runtimes for each of the test conditions are graphed in Figure 4. As we have already seen, placing a one-minute bound on the runtime is sufficient to prevent some of the test cases from completing. Therefore, rather than an average, we have graphed the minimum runtimes and remind the reader that these figures represent a lower bound on the runtimes for these problems. Keep in mind, however, that in the first experiment none of the test cases with an ambiguity of 0.5 returned in under a minute.

Figure 4: Increasing headedness (moving the head to the right) helps control the cost of ambiguity.

Figure 4 provides convincing evidence for our hypothesis. Each of the lines for the higher headedness values starts with a faster minimum runtime (sometimes by two orders of magnitude), remains below the 0.0 line, and even enables many of the test cases for ambiguity 0.5 to return in under one second.


Further evidence of the ability of headedness to aid in controlling ambiguity in plans is seen in Figure 5. This table presents the number of test cases that returned within the one-minute bound.

        h=0.0   h=0.25  h=0.5   h=0.75  h=1.0
A=0.0    50      50      50      50      50
A=0.1    42      45      45      49      49
A=0.2    25      41      43      37      40
A=0.3     5      39      40      40      39
A=0.4     1      18      36      25      29
A=0.5     0      10      17      25      26

Figure 5: Number of test cases with runtimes under one minute.

The table shows that moving the head to the right in a plan increases the number of test cases with a runtime under one minute, relative to plans with the same ambiguity. Thus, even as ambiguity increases, increasing headedness in the plans allows ELEXIR to run fast enough to return an increasing number of results within the one-minute bound. For example, a headedness value of 0.75 enables half of the tests to return in under a minute at an ambiguity (0.5) that had prevented any of the test cases from returning when the lexicon had a headedness value of 0.0. Thus we can conclude that not only is the minimum runtime of the algorithm kept low by moving the head of the plan away from the first actions of the plan, but the number of cases that can be brought within a reasonable runtime also increases.

Discussion and Limitations

These experiments show that a PR algorithm based on CCGs and headedness is viable and provides a principled way to control ambiguity. However, we have not provided an answer for how to choose plan heads during lexicon design. These decisions have to be made by considering three key factors:

1. Criticality of early recognition: In cases where early recognition is critical, choosing a head that occurs early in the plan is better. Earlier heads allow earlier recognition, and this must be weighed against the runtime. We can certainly imagine domains where the need for early recognition outweighs the runtime costs.

2. Runtime: In general, as we have shown, to minimize runtime it is better to choose actions that fall later in the plan as heads.

3. Causal structure: We can see in these experiments that aligning the choice of plan heads with the causal structure produces the greatest computational wins.

Thus, all three of these features must be considered by the system builder when encoding a PR domain.

While we have spent considerable time describing the way in which ELEXIR addresses the issue of delaying commitment to root goal hypotheses, we have spent comparatively little time talking about its handling of multiple root goals. This actually falls naturally out of the algorithm that we have outlined here. Since nothing about the algorithm or the probability model requires that an explanation contain only a single category, it is perfectly acceptable for any or all of the hypotheses to have multiple root goals. Since we are producing the complete set of such explanations, hypotheses with multiple root goals naturally fall out of the explanation-generation algorithm given here. However, the probability model does have a bias against unnecessarily complex explanations through the root priors. Since each root goal's prior is included within the probability of the explanation, an explanation that has multiple root goals will (depending on the specific priors) usually be less likely than an explanation that uses fewer root goals, creating a natural bias for probabilistically “simpler” explanations of the observed actions. However, when required, the algorithm does correctly consider the less likely explanations.
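As a rough numerical illustration of this bias (our own example, not taken from the paper), suppose every root goal has a prior of 0.1 and all other terms in the explanations are equal: an explanation that accounts for the observations with one root goal contributes that prior once, while an explanation positing two root goals multiplies two priors together and is therefore about ten times less likely.

# Illustrative only: explanation probability includes one prior per root goal.
prior = 0.1
one_root_explanation = prior           # remaining factors assumed equal
two_root_explanation = prior * prior
print(one_root_explanation / two_root_explanation)   # ~10x less likely with two roots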

Conclusions

This paper has discussed different sources of ambiguity in plan recognition. We have provided a systematic study of syntactic ambiguity for a particular PR algorithm. We have shown that even relatively low levels of syntactic ambiguity can be crippling to the runtime of PR algorithms. Finally, we have shown that introducing the idea of heads in plans, and moving the heads of plans away from the initial actions of a plan, can be a powerful tool to help control the runtime of PR even in the face of significant syntactic ambiguity, a critical problem for computer security domains.

Acknowledgments

The work described in this paper was conducted within the EU Cognitive Systems project PACO-PLUS (FP6-2004-IST-4-027657) funded by the European Commission.

References

Avrahami-Zilberbrand, D., and Kaminka, G. A. 2005. Fast and complete symbolic plan recognition. In Proceedings of the International Joint Conference on Artificial Intelligence.

Bui, H. H.; Venkatesh, S.; and West, G. 2002. Policy recognition in the Abstract Hidden Markov Model. Journal of Artificial Intelligence Research 17:451–499.

Carberry, S. 1990. Incorporating default inferences into plan recognition. 471–478. MIT Press.

Charniak, E., and Goldman, R. P. 1990. Plan recognition in stories and in life. In Henrion, M.; Schachter, R.; and Lemmer, J., eds., Uncertainty in Artificial Intelligence 5. 343–351.

Curry, H. 1977. Foundations of Mathematical Logic. Dover Publications Inc.

Geib, C. W., and Goldman, R. P. 2002. In Recent Advances in Intrusion Detection (RAID) Conference, 2002.

Geib, C. 2006. Plan recognition. In Kott, A., and McEneaney, W., eds., Adversarial Reasoning. Chapman and Hall/CRC. 77–100.

Geib, C. W. 2009. Delaying commitment in probabilistic plan recognition using combinatory categorial grammars. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), 1702–1707.

Harp, S. A., and Geib, C. W. 2003. Principles of skeptical systems. In Proceedings of the AAAI 2003 Spring Symposium on Human Interaction with Autonomous Systems in Complex Environments.

Hoogs, A., and Perera, A. A. 2008. Video activity recognition in the real world. In Proceedings of the Conference of the American Association for Artificial Intelligence (2008), 1551–1554.

Kautz, H., and Allen, J. F. 1986. Generalized plan recognition. In Proceedings of the Conference of the American Association for Artificial Intelligence (AAAI-86), 32–38.

Kautz, H. A. 1991. A formal theory of plan recognition and its implementation. In Allen, J. F.; Kautz, H. A.; Pelavin, R. N.; and Tenenberg, J. D., eds., Reasoning About Plans. Morgan Kaufmann. Chapter 2.

Liao, L.; Fox, D.; and Kautz, H. A. 2005. Location-based activity recognition using relational Markov networks. 773–778.

Liao, L.; Fox, D.; and Kautz, H. 2007. Extracting places and activities from GPS traces using hierarchical conditional random fields. International Journal of Robotics Research 26:119–134.

Pynadath, D., and Wellman, M. 2000. Probabilistic state-dependent grammars for plan recognition. In Proceedings of the 2000 Conference on Uncertainty in Artificial Intelligence, 507–514.

Sarkar, A.; Xia, F.; and Joshi, A. 2000. Some experiments on indicators of parsing complexity for lexicalized grammars. In Proceedings of the COLING 2000 Workshop on Efficiency in Large-Scale Parsing Systems.

Steedman, M. 2000. The Syntactic Process. MIT Press.

Vail, D. L., and Veloso, M. M. 2008. Feature selection for activity recognition in multi-robot domains. In Proceedings of the Conference of the American Association for Artificial Intelligence (2008), 1415–1420.

