Network Profiling and Data Visualization

Proceedings of the 2000 IEEE Workshop on Information Assurance and Security United States Military Academy, West Point, NY, 6-7 June, 2000

Stephen C. Fortier & Lee A. Shombert
AverStar, Inc.
1593 Spring Hill Road, Suite 700
Vienna, Virginia 22182
(703) 827-2606
[sf, las]@averstar.com

6 June 2000

Abstract

This paper describes an on-going research and development effort to identify security threats through user activity profiling and data visualization. Our premise is that we should be able to characterize user behavior in a so-called model of "good behavior." We then apply this model to logs of user activity to identify deviations from good behavior; these deviations are candidate attacks. The approach was prototyped with network logs taken from a firewall. Traffic through the firewall was evaluated against an ad-hoc model, and the results were analyzed with data visualization tools. The prototype showed that suspicious (but not necessarily threatening) behavior can be rapidly identified through the model-based technique. The prototype also highlighted the need for a formal approach to building the models. This paper describes the prototype activity and then proposes a model-based methodology for characterizing good behavior. We believe that this approach is especially suitable for identifying insider attacks, a notoriously difficult task.

Research Methods

In the prototype, we examined the Company firewall logs to build a model of expected behavior. This data was a log of all connections (but not packet content) through the firewall.

We selected this higher-level data because it was readily available and the results would be applicable to any firewall setup. Two days' worth of data from the Company firewall was examined, accounting for about 1.1 million records. An initial model was defined, and the logs were filtered to eliminate records that conformed to the model. The initial model was quite simple (a code sketch follows the list):

1. All outside traffic destined for Company boundary servers was eliminated. A boundary server is a server whose function is to provide a service to outside entities. The current filter eliminates from analysis connections to Company web servers (port 80) and Company SMTP servers (port 25).

2. Traffic originating inside a Company intranet and destined for an Internet site was removed if it corresponded to a well-known service. The two services of interest were HTTP (port 80) and SMTP (port 25). Note that a more robust model might look at this traffic for signs of rogue programs communicating back to "home base," but such traffic would appear only after an attack had been successfully mounted.

3. All traffic between Company intranets (several Company intranets connect through the firewall) was ignored. Again, this traffic could be analyzed to detect compromised systems within the Company intranet.
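For illustration, this first-level filter amounts to only a few lines of code. The sketch below is in Python; the record fields and the intranet membership test are hypothetical assumptions, since the actual log format is not reproduced here.

    # Hedged sketch of the first-level filter; field names ("src", "dst",
    # "dport") and the address test are illustrative assumptions.
    BOUNDARY_PORTS = {80, 25}    # HTTP and SMTP offered at the boundary
    WELL_KNOWN_PORTS = {80, 25}  # HTTP and SMTP allowed outbound

    def is_internal(addr):
        # Hypothetical membership test for Company intranet addresses.
        return addr.startswith("10.")

    def conforms_to_model(rec):
        src_in, dst_in = is_internal(rec["src"]), is_internal(rec["dst"])
        if not src_in and dst_in and rec["dport"] in BOUNDARY_PORTS:
            return True   # rule 1: outside traffic to boundary servers
        if src_in and not dst_in and rec["dport"] in WELL_KNOWN_PORTS:
            return True   # rule 2: inside traffic to well-known services
        if src_in and dst_in:
            return True   # rule 3: traffic between Company intranets
        return False

    def first_level_filter(records):
        # Keep only records that deviate from the model of expected behavior.
        return [r for r in records if not conforms_to_model(r)]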


Later analysis of the "inside traffic" led us to postulate a repeatable methodology for modeling insider threats. By their nature, boundary servers should receive many connections, and it would be hard to identify attacks from connections alone. Analyzing for attacks against such servers requires access to the packet contents, for instance, the web page actually being requested by an outside HTTP client.

Initial Results

The firewall records connections that have been accepted, rejected, and denied. The firewall also logs connections that are authorized based on S/Key and VPN keys. The prototype examined only IP connections that had been accepted; these constituted well over 99.9 percent of the total connections logged. The analysis clearly must be extended to other kinds of firewall records (reject, deny, authorize, etc.). After this first level of filtering, we were left with about 400,000 records on which to attempt data mining. Data mining allows automatic acquisition, generation, and exploitation of knowledge from large volumes of heterogeneous information [1]. These records were then displayed in various dimensions with the SGI MineSet data visualization and mining tool. MineSet provides the capability to examine large amounts of data and quickly identify patterns or relationships in the data [2], [3].


The dataset was still overwhelming for data visualization, so additional metrics were computed. A "relay event" was defined for hosts on the Company intranet. A single relay event occurs when a host makes an outbound connection after accepting an inbound connection. Relay events would be characteristic of hosts that have been compromised by Trojan horses. Relay metrics were computed for each transaction in the firewall log. If a transaction was sourced by a host, and the host was the destination of its previous transaction, then the current transaction is a relay event. This is shown in the figure below: host A contacts B, and then some time later B contacts C. The amount of time between the two transactions is not important (for the definition of an event), but there may be no intervening transactions involving host B. During the transaction, host A is the relay host, host B is the source host, and host C is the destination host. See Figure 1.

Figure 1. Abstract model of a Trojan horse.
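Relay events can be found in a single pass over the time-ordered log. The following is a minimal sketch, assuming each transaction is a (time, source, destination) tuple; the representation is ours, not the firewall's.

    # Detect relay events: a transaction is a relay event if its source
    # host was the destination of that host's previous transaction, with
    # no intervening transactions involving the host.
    def find_relay_events(transactions):
        last_role = {}   # host -> ("src" or "dst", most recent transaction)
        relays = []
        for t in sorted(transactions, key=lambda x: x[0]):
            _, src, dst = t
            prev = last_role.get(src)
            if prev is not None and prev[0] == "dst":
                relays.append((prev[1], t))   # (A contacts B, B contacts C)
            last_role[src] = ("src", t)
            last_role[dst] = ("dst", t)
        return relays

    log = [(0, "hostA", "hostB"), (5, "hostB", "hostC")]
    print(find_relay_events(log))  # one relay event: A->B followed by B->C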

For the initial data analysis, we computed the following information for each network connection (a code sketch follows the list):

• The cumulative number of relay events for the source host.

• The relay delay, which is the time between the two transactions that constitute the relay event.

• The network of the relay host. For this analysis we used the canonical breakdown into class A, B, C, or D networks.

• The network of the destination host, computed as for the relay network.
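These metrics follow directly from the relay events found above. The sketch below is again illustrative; hosts are assumed to be dotted-quad addresses, and the classful breakdown is taken from the first octet.

    from collections import defaultdict

    def network_class(addr):
        # Canonical classful breakdown from the first octet.
        first = int(addr.split(".")[0])
        if first < 128:
            return "A"
        if first < 192:
            return "B"
        if first < 224:
            return "C"
        return "D"

    def relay_metrics(relay_events):
        # relay_events: ((t1, A, B), (t2, B, C)) pairs; in the paper's
        # terms A is the relay host, B the source, C the destination.
        counts = defaultdict(int)
        rows = []
        for (t1, a, b), (t2, _, c) in relay_events:
            counts[b] += 1   # cumulative relay events for the source host
            rows.append({"source_host": b,
                         "cum_relay_events": counts[b],
                         "relay_delay": t2 - t1,
                         "relay_net": network_class(a),
                         "dest_net": network_class(c)})
        return rows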

We then filtered the firewall logs to examine only those transactions that were relay events, for hosts with fewer than 20 relay events. We used MineSet to visualize the resulting records, looking for anomalous events [4]. One such event showed a burst of connections from 10 Company intranet hosts to a network registered to a university in Eastern Europe. While not necessarily indicative of an attack, the events would have warranted further investigation; they could have represented a burst of activity by Trojan horses. However, because of the age of the firewall data, we did not pursue the events, other than to ascertain that the Company hosts were uncompromised.

The chief result of this effort was not to identify particular security threats or breaches, but to assess the utility of the model-based technique for doing so. Our ability to drill down through 1.1 million records to find a relatively small subset of suspicious records encouraged us to pursue a more formal approach. We do note, however, that one way to deploy this "system" would be to filter firewall logs periodically (e.g., overnight) and generate the MineSet views offline. The network administrator would review the MineSet reports and quickly identify anomalous traffic. The filtering model, of course, would evolve. Eventually, this must be extended to real-time analysis of network traffic, with alarms that are triggered when an attack is in progress. It is certainly desirable for the model to look at data besides firewall logs, too.


We believe that the initial results of our research will lead to a practical way for organizations to profile their network traffic and visualize known good behavior while highlighting anomalous behavior. This methodology would apply to stand-alone locations as well as multiple-node models. However, we felt that the insider threat problem was grander in scale and less well understood [5]. We therefore propose extending the approach to address insider threat scenarios.

Insider Threat Methodology

The purpose of this methodology is to identify insider threats based on activity within an information system. We believe that the proper approach is to construct a formal model of an "activity trace" within an information system and to define a methodology for identifying threats based on activity. There has been discussion of developing methods that focus on abstract models as a means for understanding the diverse security needs in an information system [6]. We plan to conduct experiments to evaluate the methodology and will present an analysis of, and implementation strategies for, the methodology as applied to critical information systems, such as those deployed by the DoD and the financial and banking industry.

Reliable identification of insider threats requires distinguishing the normal behavior from the abnormal behavior of entities in the information system. This can only be done by modeling the good behavior of the system and then hunting for deviations from good behavior. Previous techniques, such as virus scanners, depend on models of abnormal behavior. This approach fails, however, every time a new threat appears.


We propose to develop and test a methodology that is robust against novel threats. It is based on a formal model of the proper behavior of a system and applies diagnostic techniques to identify abnormal behavior. The abnormal behavior may or may not be a threat; the methodology assigns probabilities to its conclusions to reflect the likelihood that some behavior is, in fact, unfriendly.

The formal model is built in two stages. First, the key objects in an information system are modeled, using a notation like UML. The object model naturally includes the methods offered by the objects. Then the attributes of the objects, and of transactions between objects, are described in an information model; here we propose a notation like EXPRESS [7]. Descriptions of normal behavior are couched in terms of the information model and use the Rosetta system-level description language [8].

The methodology is customizable, which allows the definition of good behavior, and therefore the threshold for categorizing behavior as threatening, to be applied with more or less rigor. Particularly sensitive information systems will apply a very narrow definition of good behavior, to rapidly detect insider threats, while more casual systems will apply a more flexible definition of good behavior.


The formal nature of the methodology gives it a firm foundation that will persevere as the activity within an information system evolves. It also makes the methodology amenable to formal analysis techniques (yielding provably secure systems), although we do not intend to pursue such analysis immediately.

Model System Behavior

The first step is to construct a model of system behavior. This is a model of the general behavior of an information system, without regard to the threat nature of that behavior. The model comprises two parts: an object model and an information model. The object model identifies the major entities in the system (users, processes, and resources such as files and network sockets) and the possible interactions. This model does not distinguish between threatening and non-threatening interactions but merely identifies the set of possible interactions. Interactions are, naturally, methods on the objects. Figure 2, below, highlights the technical approach proposed herein.

Figure 2. Technical Approach

The object model captures the classes of actors in the system. The purpose of the object model is to derive the information model, which defines how an instance of system behavior may be described.
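Purely as an illustration (the class and method names below are our own, not drawn from the actual model), such an object model might begin as follows:

    # Illustrative object model: major entities, with interactions as methods.
    class Resource:
        """A passive entity, e.g., a file or a network socket."""
        def __init__(self, name):
            self.name = name

    class Process:
        """An active entity; its methods are its possible interactions."""
        def __init__(self, pid, owner):
            self.pid = pid
            self.owner = owner
        def read(self, resource): ...
        def write(self, resource): ...
        def connect(self, socket): ...

    class User:
        """A human actor; a session is bounded by login and logout."""
        def __init__(self, name):
            self.name = name
        def login(self): ...
        def spawn(self, program): ...   # start a Process
        def logout(self): ...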

Figure 2 shows system activity “filtered” through the information model to produce a trace of relevant system activities.

System behavior can be described as a time-ordered set of interactions between system entities. This description is called the behavior trace. The behavior trace is structured around the key concept of an activity, during which events occur. Activities are hierarchical, allowing for tree-like decompositions. Some activities correspond to object lifetimes; process activities are an example. Other activities are bounded by transactions; a user session is an example, bounded by the login sequence and the logout sequence.
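A minimal sketch of this structure, with names of our own choosing, might read:

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Event:
        time: float
        actor: str        # e.g., a user or process identifier
        interaction: str  # e.g., "read", "connect", "login"
        target: str       # e.g., a file, socket, or host

    @dataclass
    class Activity:
        name: str                     # e.g., "user session" or "process"
        start: float
        stop: Optional[float] = None  # open until the bounding event occurs
        events: List[Event] = field(default_factory=list)
        children: List["Activity"] = field(default_factory=list)  # hierarchy

    # The behavior trace is a time-ordered list of top-level activities.
    BehaviorTrace = List[Activity]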

The information model defines how system behavior is described. The information model therefore establishes a vocabulary and semantics against which rules may be formulated. Such rules are used to define patterns of appropriate system behavior; system behavior that fails to match any pattern is considered to be a threat candidate.

Attributes will be identified for the entities in the information model. Some of these are characteristics of an object, such as file permissions, and some are characteristics of transactions, such as the time interval between file accesses by a process. Activities have their own attributes, the most obvious of which are the activities' start and stop times. Any instance of system behavior can be cast as an instance of the information model. This does not mean that the system behavior will be fully captured, of course; only information deemed relevant to insider threat profiling will be part of the model.

Behavior Matching Patterns

To best define patterns, a declarative rule language will be chosen. The advantage of a declarative approach is that new rules can be added to an existing rule-set, either to define alternative good behavior or to restrict the current model of good behavior. To allow control over false negatives and false positives, the rules have associated confidence levels. Thus, having matched a rule-set, an instance of system behavior may be deemed a threat candidate with some level of confidence; matches exceeding a customizable confidence threshold would trigger alerts. Figure 3 shows a pattern matching the stream of events in the execution trace.



Figure 3. Application of a declarative rule language supports pattern matching.
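The rule language itself is future work (we intend to use Rosetta, described below). Purely for illustration, a rule-set with confidence levels might be evaluated as in the following sketch, where the rules, attributes, and numbers are all invented:

    from collections import namedtuple

    # A toy stand-in for an activity; the real vocabulary would come from
    # the information model.
    Activity = namedtuple("Activity", "start_hour relay_events")

    # (description, predicate, confidence that a match means good behavior)
    GOOD_BEHAVIOR_RULES = [
        ("session during business hours",
         lambda a: 8 <= a.start_hour <= 18, 0.6),
        ("no relay events",
         lambda a: a.relay_events == 0, 0.9),
    ]

    ALERT_THRESHOLD = 0.8  # customizable confidence threshold

    def threat_confidence(activity):
        # Behavior that matches no pattern of good behavior is a threat
        # candidate; confidence is one minus the best matching rule's.
        best = max((conf for _, rule, conf in GOOD_BEHAVIOR_RULES
                    if rule(activity)), default=0.0)
        return 1.0 - best

    suspect = Activity(start_hour=3, relay_events=4)  # matches no rule
    if threat_confidence(suspect) >= ALERT_THRESHOLD:
        print("threat candidate, confidence", threat_confidence(suspect))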

We intend to apply the Rosetta language for rule definition. Rosetta is a system-level specification language currently being pursued for large-scale commercial and DoD systems. The language can be used to state constraints on how a system should behave; its purpose is to define system-level requirements that can be propagated down into subsystems as the system is defined. AverStar has a lead role in Rosetta language and tool development, and we believe that the Rosetta work can be leveraged for threat detection rules.

Experimental Evaluation

Upon completion of the information model, we plan a series of experiments to both tune and evaluate the models. The experiments will run on a simple test bed consisting of at least two computer hosts networked together. The test bed will be instrumented to collect data items as defined in the information model. The purpose of the test bed is to provide a source of activity trace information. Given that information, we can cast it into the form of the model and develop rules that distinguish non-threat activities from threat activities. Because the purpose of the test bed is validation of the model, we will not spend much effort making the instrumentation non-intrusive. However, lessons learned from building the test bed should be useful in the eventual deployment of this methodology. Concepts and lessons will be documented in a report that will propose an implementation and deployment strategy.

Scenarios corresponding to threat and non-threat activities will be played on the test bed, and the corresponding data will be collected. We will then construct a model that categorizes the "degree of threat" of activities in the system.

We will use data visualization and mining tools, such as the SGI MineSet product, for this part of the investigation. Based on the analysis, we may modify the test bed to collect new information. Eventually we will have a set of rules that distinguishes threat activities from non-threat activities, with useful associated confidence levels. The process of analyzing the data, modifying the test bed, and then modifying the rules mimics the process by which a deployed system would evolve.

Conclusions

We began with a simple prototype effort to test the idea that system activity can be described by a model of good behavior, and that deviations from the model warrant investigation as potential security breaches. The prototype was successful in that we were able to quickly identify unusual activity. We have now embarked on a plan to expand the scope of the model. By including more data than simple firewall logs, we believe the model can identify insider threat activity, a well-known but little-understood problem.

References

[1] Gabrielson, B.C., "Security Using Intelligent Agents and Data Mining." Proceedings of the National Security Space Architect MIM Technology Forum, Chantilly, VA, 29 June 1999.

[2] Mortoni, S., et al. "MineSet 3.0 Enterprise Edition User's Guide for Windows," Doc. No. 007-4005-001, Silicon Graphics Inc., 1999.

[3] Mortoni, S., Vanderberg, H. "MineSet 3.0 Enterprise Edition Reference Guide," Doc. No. 007-3558-001, Silicon Graphics Inc., 1999.

[4] Banks, D.L., et al. "Statistical Visualization for Managing Network Intrusion and Anomaly Detection." WWW site of the Computer Security Division, ITL, 2000.

[5] Baskerville, R., "Information Systems Security Design Methods: Implications for Information Systems Development." ACM Computing Surveys, Vol. 25, No. 4, December 1993.

[6] Allen, J., et al. "State of the Practice of Intrusion Detection Technologies," Technical Report CMU/SEI-99-TR-028, ESC-99-028, January 2000.

[7] International Standards Organization, Standard 10303, Part 11, EXPRESS Information Modeling Language, 1991.

[8] Alexander, P., Kamath, R., Barton, D.L., and Fortier, S.C., "Facets and Domains in Rosetta," Proceedings of the Forum on Design Languages 1999, Lyon, France, September 1999.
