
Using Swarming Agents for Scalable Security in Large Network Environments
(Invited Paper)

Michael B. Crouse, Jacob L. White, Errin W. Fulp, and Kenneth S. Berenhaut
Departments of Computer Science and Mathematics
Wake Forest University, Winston-Salem, NC 27109
Email: [email protected]

Glenn A. Fink and Jereme Haack
Pacific Northwest National Laboratory
Richland, WA, USA
Email: [email protected]

Abstract- The difficulty of securing computer infrastructures increases as they grow in size and complexity. Network-based security solutions such as IDS and firewalls cannot scale because of exponentially increasing computational costs inherent in detecting the rapidly growing number of threat signatures. Host-based solutions like virus scanners and IDS suffer similar issues that are compounded when enterprises try to monitor them in a centralized manner. Swarm-based autonomous agent systems like digital ants and artificial immune systems can provide a scalable security solution for large network environments. The digital ants approach offers a biologically inspired design where each ant in the virtual colony can detect atoms of evidence that may help identify a possible threat. By assembling the atomic evidence from different ant types, the colony may detect the threat. This decentralized approach can require, on average, fewer computational resources than traditional centralized solutions; however, there are limits to its scalability. This paper describes how dividing a large infrastructure into smaller, managed enclaves allows the digital ants framework to operate effectively in larger environments. Experimental results show that using smaller enclaves allows for a more consistent distribution of agents and results in faster response times.

I. INTRODUCTION

Computing infrastructures continue to grow to provide the computational resources required for various large-scale and multi-tenant computing applications. Grid computing and cloud computing are examples of services that offer large amounts of dynamic computing power using large, highly distributed compute servers. However, the size of the infrastructure cannot be measured by the number of physical servers alone; virtualization can provide multiple hosts on one physical server, creating a much larger virtual infrastructure. This is the approach of NSF's GENI [1], DHS's DETER [2], and other Emulab-based environments [3]. Unfortunately, as these infrastructures grow in size they also become more complex, making traditional system management approaches increasingly impractical.

Providing security in large-scale environments is especially challenging. Firewalls and IDS typically form the first line of defense against the misuse of computing resources. These measures are commonly deployed as part of the implementation of a host-based security policy. Given the diverse range of dynamically changing services, virtualization, and per-user policy exceptions, managing thousands of individual host policies is quite challenging. Administrators may not even be aware of which services are being offered by individual machines at any given time. Scalable new approaches are needed to manage security efficiently.

Swarm-based approaches map nicely to computer security problems precisely because of computer infrastructures' dynamic and decentralized properties. Swarm solutions prescribe relatively simple rules for interaction that produce emergent behaviors, sometimes referred to as self-organization. The result is swarm intelligence that can address problems that appear to be more complex than the rules used to define the behaviors. Swarm-based approaches adapt to changing threat levels and have been demonstrated to be more efficient than traditional security approaches [4]. In addition, swarm-based approaches are robust, since swarms select for colony survival and do not depend on particular individual agents [5]. These features of swarm solutions are important attributes for securing large dynamic infrastructures.

The swarm-based security management approach described in this paper leverages the Sensor and Sentinel levels of the digital ants framework hierarchy described in [6]. While Sensors are simple, lightweight, ant-like agents that roam from host to host searching for evidence of problems, Sentinels are immobile, host-level agents that protect and configure a single host or a collection of similarly configured hosts. A Sentinel uses evidence presented by multiple Sensors to determine whether a threatening or suspicious condition exists on its host computer. When a Sensor presents evidence that the Sentinel cannot account for or that it determines is truly suspicious, the Sentinel will reward the Sensor, causing it to enter an active, pheromone-dropping mode for a period of time. This pheromone attracts other Sensors to the suspect host, producing stigmergic communication among Sensors. The pheromone trails dissipate over time so that solved problems no longer attract more Sensors. Pheromone-based systems have been demonstrated to simply and effectively solve highly constrained problems where logic-based, optimizing approaches prove intractable [7].
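To make this stigmergic mechanism concrete, the minimal sketch below models pheromone-weighted wandering on a toroidal grid. It is our own illustration under simplifying assumptions, not the framework's implementation; the names (step, reward, EVAPORATION) and the particular reward and decay rules are ours.

```python
import random

DIM = 16                        # dimension of the toroidal grid geography
EVAPORATION = 0.9               # fraction of pheromone retained each tick

# Pheromone level per host; rewarded Sensors raise it, time decays it.
pheromone = {(x, y): 0.0 for x in range(DIM) for y in range(DIM)}

OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
           if (dx, dy) != (0, 0)]          # 8-way neighborhood

def neighbors(pos):
    """Eight neighboring hosts, with toroidal wrap-around."""
    x, y = pos
    return [((x + dx) % DIM, (y + dy) % DIM) for dx, dy in OFFSETS]

def step(pos):
    """Random stagger, biased toward neighbors carrying pheromone."""
    cands = neighbors(pos)
    weights = [1.0 + pheromone[c] for c in cands]  # uniform when trail is cold
    return random.choices(cands, weights=weights, k=1)[0]

def reward(pos, amount=10.0):
    """A Sentinel rewards a Sensor: deposit pheromone at the suspect host."""
    pheromone[pos] += amount

def evaporate():
    """Dissipate trails so solved problems stop attracting Sensors."""
    for p in pheromone:
        pheromone[p] *= EVAPORATION
```

Calling evaporate() once per tick makes old trails fade geometrically, which is one simple way to realize the dissipation behavior described above.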


Management of the Sensor population is critical for the digital ants to remain responsive to threats in large infrastructures. In a large network environment, Sensors may have to travel long distances to reach a host that will reward them. We have found that subdividing a large network into smaller localities called enclaves improves Sensor distribution and produces faster response times to threats. However, dividing the infrastructure into enclaves also requires increasing the population of Sensors in the infrastructure, which unfortunately increases the computational cost of the approach. This paper will describe how to balance the need for responsiveness and efficiency when securing the infrastructure using the digital ants approach.

The remainder of the paper is structured as follows: Section II will review the digital ants framework and describe the challenges associated with deploying digital ants to protect a large computer infrastructure. Section III will empirically show the performance of different agent population management approaches and provide some deployment guidelines. Finally, Section IV will summarize the results and describe open areas of research.

II. A SWARM-BASED APPROACH TO SECURITY

The digital ants framework is a hierarchy consisting of the Supervisor, Sergeant, Sentinel, and Sensor levels [6]. The different levels form a mixed-initiative approach, where human administrators' decision-making and authority is complemented with the computational resources of rational agents. At the highest level of the hierarchy, human administrators, called Supervisors, provide overall governance of the infrastructure and interact primarily with the top level of software agents, called Sergeants, which are responsible for a local subset of a computer infrastructure called an enclave. We define an enclave as a set of geographically or topologically collocated machines that has a single owner and is managed under a common policy. Thus, enclaves resemble the Internet's Autonomous Systems (ASes), except that they add geographic or topological locality [8]. We also envision that enclaves would be much smaller than ASes. Sergeants provide situational awareness to the Supervisor and create enclave policies based on Supervisor guidance. The Sergeant can be aware of issues that span multiple hosts within the enclave and may communicate with peer Sergeants of other enclaves. Sentinel agents provide status to their Sergeant and enforce the Sergeant's policies on enclave hosts. Sentinels also enable Sensors to traverse the geography, the digital ants overlay network.

Each Sensor searches for a single, atomic indicator such as network connection frequency, number of zombie processes, or number of open files. Their power is in their numbers, diversity, and stigmergic communication. As they wander, Sensors randomly adjust their current direction, similar to the movement of real ants. They compare findings at the current host with findings from their recent visits. If the findings are outliers, the Sensor reports this to the Sentinel. If the Sentinel cannot explain the findings as part of its normal operating state, it will reward the Sensor, creating pheromone-reinforced feedback behaviors. The Sentinel will use evidence from the diverse attracted Sensors to diagnose a problem.
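The four-level hierarchy just described can be summarized by the following skeleton. The class and field names follow the paper's terminology, but the structure itself is our own illustrative assumption, not the framework's API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sensor:
    """Mobile agent that checks a single atomic indicator on each host."""
    indicator: str              # e.g. "network connection frequency"
    active: bool = False        # rewarded Sensors drop pheromone for a while

@dataclass
class Sentinel:
    """Immobile agent protecting one host (or similarly configured hosts)."""
    host: str
    evidence: List[str] = field(default_factory=list)   # reports from Sensors

@dataclass
class Sergeant:
    """Software agent responsible for one enclave; reports to a Supervisor."""
    enclave: str
    sentinels: List[Sentinel] = field(default_factory=list)

@dataclass
class Supervisor:
    """Human administrator providing overall governance."""
    sergeants: List[Sergeant] = field(default_factory=list)
```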

A. Considerations for Large-Scale Infrastructures

While higher Sensor populations provide greater responsiveness [4], higher populations are also more expensive, both in computational power and communication bandwidth. An advantage of the digital ants framework is that it adaptively varies the Sensor population, increasing it during attacks and decreasing it when no threat is apparent. However, the system does need to maintain a minimum Sensor population to remain responsive. It will be shown experimentally that this minimum depends on the size of the infrastructure: larger infrastructures require larger minimum populations. If there are proportionally few Sensors in a very large enclave, it may take an unacceptably long time for the Sensors to reach a compromised host and jointly diagnose a problem. An example of this would be threats occurring close to the maximum distance from one another, leading Sensors to be gathered at a host farthest from the new threat. As the size of the infrastructure increases, the time for the Sensors to react to new threats can become inadequate.

Consider the scalability problem associated with Sensor distribution on an infrastructure that consists of h hosts, where h is a perfect square. Assume the geography of the h hosts presented to the Sensors is a toroidal grid (square matrix) where every host has eight neighbors; edges wrap to the opposite side, forming an 8-way regular graph. For example, a geography consisting of h = 65,536 hosts can be arranged as a 256 x 256 grid. The farthest distance between any two hosts in this geography is the diameter, which is half of the matrix dimension. For the 65,536-host example, the diameter is 128. It is important to note that the diameter is the longest direct path in the grid; it is likely that the initial Sensor (the first Sensor to discover a potential problem, which is then rewarded and moves away from the suspicious host) will take a longer path due to its random stagger. However, once the first Sensor finds a compromised host, other Sensor types should find this compromised host more quickly by following the pheromone trail.

Although pheromone can reduce the number of steps required [4], the delay associated with an initial Sensor is still an issue for large networks. Therefore it might be beneficial to divide a large infrastructure grid into smaller sub-grids, where every sub-grid is still toroidal. Sensors are then evenly distributed and remain within the bounds of their assigned areas. This approach ensures that Sensors are better distributed across the entire infrastructure. Using the current geography, the h hosts in the toroidal grid can be split into m sub-grids, each of size n², i.e.,

h = m · n²    (1)

with h, m, and n all integers. Given this representation there are several possible sub-grid configurations. For example, h = 65,536 hosts could be represented as a single 256 x 256 grid, 4 sub-grids of size 128 x 128, or 16 sub-grids of size 64 x 64.
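As a concrete check on this geometry, the following sketch (our own construction, not code from the framework) computes the toroidal 8-way distance between hosts and the resulting diameter for a perfect-square host count.

```python
import math

def toroidal_distance(a, b, dim):
    """Fewest 8-way steps between hosts a and b on a dim x dim torus."""
    dx = abs(a[0] - b[0]); dx = min(dx, dim - dx)   # wrap on each axis
    dy = abs(a[1] - b[1]); dy = min(dy, dim - dy)
    return max(dx, dy)          # diagonal moves advance both axes at once

def diameter(h):
    """Farthest direct distance between any two of h hosts."""
    dim = math.isqrt(h)
    assert dim * dim == h, "h must be a perfect square"
    return dim // 2

print(diameter(65_536))         # -> 128, matching the 256 x 256 example
```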

However, an arrangement consisting of a large number of small sub-grids may result in a higher total number of Sensors, which is computationally more expensive. Therefore, if there is an upper bound on the number of Sensors, it will limit the number and size of the sub-grids. This relationship is explored empirically in the next section.

III. EXPERIMENTAL RESULTS

In this section, we investigate the scalability of the digital ants framework using simulations. We experimentally examine the impact of increasing the enclave size and Sensor populations. The analysis shows how appropriate initial Sensor populations for a given enclave size can be identified. The experiments also give insight into the impact of implementing sub-grids within the enclave on the distribution of Sensors in large-scale environments.

The simulations considered enclaves with geographies represented as square grids, as described in Section II-A; the grids consisted of 4,096, 16,384, and 65,536 hosts. In the experiments we employed three types of simulated Sensors and required evidence from one of each of the three types to identify the compromised host [6]. Although the initial number of Sensors deployed might vary depending on the experiment, for simplicity the population remained constant for the duration of the simulation (no Sensor birth or death once a simulation has started).

We evaluate the performance for each grid size using variations of two metrics, hitting time and cover time, which are commonly used to measure the performance of random-walk processes and are also relevant to the digital ants framework. We performed each experiment multiple times and recorded the average result for each configuration.

A. Hitting Time Analysis

As described in the previous section, the total number of Sensors present in the enclave affects the responsiveness of the system: more Sensors typically yield faster responses. One way of measuring responsiveness is to consider the hitting time, i.e., the number of random steps required for an agent to reach node u from node v in a graph. For these experiments, hitting time was the number of steps required for at least one Sensor of each type to visit a given compromised host. The initial Sensor population and the compromised host were located at a distance equal to the diameter of the network; the expected hitting time can therefore be considered, in some sense, an expected worst case for the number of steps (time) required to discover the threat.

Figure 1 shows the hitting time associated with increasing population densities of Sensors for three different grid sizes (4,096, 16,384, and 65,536 hosts). The x-axis measures the Sensor density, which is the number of Sensors, per type, divided by the number of hosts. A density of one represents one Sensor of each type per host. As seen in the graph, the hitting time drops dramatically as the density increases.
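The measurement is straightforward to reconstruct in a rough Monte Carlo form. The sketch below is our simplified version, not the authors' simulator: it uses pure random walks without the pheromone reinforcement the framework adds, starts all Sensors a full diameter from the target, and uses illustrative names throughout.

```python
import random

OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
           if (dx, dy) != (0, 0)]          # 8-way moves on the torus

def hitting_time(dim, sensors_per_type, types=3, max_steps=1_000_000):
    """Steps until one Sensor of each type has visited the compromised host."""
    target = (0, 0)
    start = (dim // 2, dim // 2)           # a full diameter from the target
    sensors = [(t, start) for t in range(types)
               for _ in range(sensors_per_type)]
    hit = set()                            # Sensor types that reached the target
    for step_count in range(1, max_steps + 1):
        moved = []
        for t, (x, y) in sensors:
            dx, dy = random.choice(OFFSETS)
            pos = ((x + dx) % dim, (y + dy) % dim)
            if pos == target:
                hit.add(t)
            moved.append((t, pos))
        sensors = moved
        if len(hit) == types:
            return step_count
    return max_steps

# Average over repeated trials, as in the experiments:
runs = [hitting_time(64, sensors_per_type=4) for _ in range(10)]
print(sum(runs) / len(runs))
```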

Fig. 1. Hitting time as agent density increases for various network sizes (average hit time vs. agent density; 4,096, 16,384, and 65,536 hosts).

However, at approximately 0.001 density the reduction in hitting time becomes less significant; there is a diminishing return for higher Sensor populations. The point at which this occurs will be referred to as the effective density. Furthermore, it is important to note that the effective density is independent of grid size. For example, a density of approximately 0.001, or 0.1%, was appropriate for all three grids simulated. Figure 1 also shows, as expected, that larger grids tend to have longer average hitting times. For example, the hitting time associated with the network of 65,536 hosts is roughly three times larger than that for the network of 4,096 hosts.

However, as described in the previous section, a grid can be subdivided into smaller sub-grids, which may result in better performance. Assume s is the effective number of Sensors (based on the effective density), where s ≥ m since there must be at least as many Sensors as there are sub-grids (assuming only one type of Sensor for simplicity). Substituting s for m in equation (1) and solving for n yields

n = √(h/s)    (2)

where n is the dimension of the sub-grid and must be a positive whole number. Therefore, as s increases, the size of the sub-grid can decrease, which yields lower hitting times.

Consider the grid consisting of 65,536 hosts (a 256 x 256 grid) depicted in Figure 1. Using equations (1) and (2) and an agent density of 0.1%, this grid can be divided into 16 sub-grids of 4,096 hosts each. For this example, this is the smallest sub-grid dimension possible since each sub-grid must have at least three Sensors (one of each type). Again, a configuration with many small sub-grids is preferred since it tends to yield lower hitting times. Given only one compromised host, that host will reside in only one sub-grid. As a result, the hitting time for the infrastructure is reduced to the hitting time of the smaller sub-grid. Figure 1 shows the curve for 4,096 hosts at the same agent density.
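A small sketch of the sub-grid arithmetic implied by equations (1) and (2); the function name and the search strategy are our own illustration, under the assumption that the number of sub-grids m must divide h evenly.

```python
import math

def smallest_subgrid(h, s):
    """Smallest sub-grid dimension n with h = m * n^2 and m <= s sub-grids."""
    for n in range(1, math.isqrt(h) + 1):
        if h % (n * n) == 0 and h // (n * n) <= s:
            return h // (n * n), n      # (number of sub-grids, dimension)
    return None

# The paper's example: 65,536 hosts split into 16 sub-grids of 64 x 64
# (4,096 hosts each):
print(smallest_subgrid(65_536, 16))     # -> (16, 64)
```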

Fig. 2. Cover time as agent density increases for various network sizes (average cover time vs. agent density; 4,096, 16,384, and 65,536 hosts).

Therefore, using sub-grids at the same density as the single grid of 65,536 hosts, the hitting time was reduced by a factor of three.

B. Cover Time Analysis

Another metric employed to measure Sensor performance is the cover time, i.e., the number of iterations, or steps, it takes for all hosts to be visited, or covered, by at least one Sensor of every type. Applied to the digital ants framework, this represents the case where the compromised host is only discovered by the last Sensor type that visits it. This differs from the hitting-time experiments, where the findings from the first and second visiting Sensors (of different types) are considered useful. Our definition of cover time considers only the spread of agents to every node, without considering a particular compromised host as a detection target. Thus, we are measuring the time it takes for at least one Sensor of each kind to visit every node. Therefore, the number of steps required to discover the attacker will be considerably higher than observed in the hitting-time experiments.

Figure 2 shows the cover time for the three different grid sizes as the agent density increases. As with hitting time, the cover time decreases as the agent density increases; however, the diminishing return occurs at higher densities. As expected, larger grid sizes tend to have higher cover times than smaller grids: the cover time for 65,536 hosts is roughly 2.5 times higher than for 4,096 hosts. The density at which the cover time stabilizes is also higher than the density associated with the hitting-time experiments. This is primarily due to the lack of pheromone; as a result, Sensors only move in a random-walk fashion. As in the hitting-time experiments, these findings can be used to partition the original grid into smaller sub-grids. However, this approach requires higher minimum Sensor populations, which yield smaller sub-grid dimensions but may be computationally prohibitive.
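For comparison with the hitting-time sketch above, here is a rough reconstruction of the cover-time measurement under the same assumptions (pure random walks, no pheromone, illustrative names); it tracks the set of hosts each Sensor type has visited.

```python
import random

OFFSETS = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
           if (dx, dy) != (0, 0)]

def cover_time(dim, sensors_per_type, types=3, max_steps=10_000_000):
    """Steps until every host is visited by at least one Sensor of every type."""
    hosts = dim * dim
    visited = [set() for _ in range(types)]     # hosts seen, per Sensor type
    sensors = []
    for t in range(types):
        for _ in range(sensors_per_type):
            pos = (random.randrange(dim), random.randrange(dim))
            sensors.append((t, pos))
            visited[t].add(pos)
    for step_count in range(1, max_steps + 1):
        moved = []
        for t, (x, y) in sensors:
            dx, dy = random.choice(OFFSETS)
            pos = ((x + dx) % dim, (y + dy) % dim)
            visited[t].add(pos)
            moved.append((t, pos))
        sensors = moved
        if all(len(v) == hosts for v in visited):
            return step_count
    return max_steps

print(cover_time(16, sensors_per_type=4))   # small grid keeps the demo fast
```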

IV. CONCLUSIONS

The goal of the digital ants framework is to defend large infrastructures while using minimal computational resources and network bandwidth. Traditional, always-on approaches are well suited to defense but can fall short in computational efficiency, especially when used to manage large computing infrastructures. The digital ants framework can provide a scalable, more efficient approach that addresses both needs.

One concern with deploying digital ants in large environments is the responsiveness of the system. In large networks, agents may be located far from a host requiring assistance, resulting in long response times. Experimental results showed that agent density (the ratio of agents to hosts) is critical for the responsiveness of the digital ants framework in large environments. For example, it was observed that an agent density of 0.1% provides good response times at the lowest agent density; higher agent densities did not provide significantly better performance. In addition, by dividing the large network infrastructure into smaller parts (enclaves), the response time becomes that of a much smaller network.

Future work will examine pheromone trail lengths and dissipation rates to determine appropriate values. Another interesting area of future work is the appropriate creation and termination of agents. This paper addressed the minimum population required; however, more work is needed to better understand the appropriate lifetime of agents within the system.

ACKNOWLEDGEMENTS

This work was funded by the U.S. Department of Energy and Pacific Northwest National Laboratory. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors of this work.

REFERENCES

[1] GENI Testbed. [Online]. Available: http://www.geni.net/
[2] The DETER Network Security Testbed. [Online]. Available: http://www.isi.edu/deter/
[3] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar, "An integrated experimental environment for distributed systems and networks," in Proceedings of the Fifth Symposium on Operating Systems Design and Implementation, December 2002, pp. 255-270.
[4] B. C. Williams, "A comparison of static to biologically modeled intrusion detection systems," Master's thesis, Wake Forest University, 2010.
[5] H. V. D. Parunak, "Go to the ant: Engineering principles from natural multi-agent systems," Annals of Operations Research, vol. 75, pp. 69-101, 1997. [Online]. Available: http://www.jacobstechnology.com/vrc/pdf/gotoant.pdf
[6] J. N. Haack, G. A. Fink, W. M. Maiden, D. McKinnon, and E. W. Fulp, "Ant-based cyber defense," in Proceedings of the 8th International Conference on Information Technology: New Generations, 2011.
[7] M. Dorigo and L. M. Gambardella, "Ant colony system: a cooperative learning approach to the traveling salesman problem," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 53-66, 1997.
[8] A. Tanenbaum and D. Wetherall, Computer Networks. Prentice Hall, 2011.
