
Statistical Visualization Methods in Intrusion Detection

J. L. Solka, Senior Scientist, NSWCDD Code B10, Dahlgren, VA 22448-5100

D. J. Marchette, Lead Scientist, NSWCDD Code B10, Dahlgren, VA 22448-5100

B. C. Wallet, Director of Pattern Recognition Research, Chroma, Inc., Burlingame, CA 94010-2020

Abstract

This paper describes some of our recent efforts in the application of modern statistical visualization methodologies to the problem of detecting intruders on computer systems. We illustrate the application of color histograms, parallel coordinates, and clustering methods to various problems within the intrusion detection arena. We also discuss the use of these visualization frameworks as an aid to the human analyst in the interpretation of network-based intrusion detection information. This work has been performed in support of the Secondary Heuristic Analysis for Defensive On-line Warfare (SHADOW) intrusion detection system, an operational intrusion detection system that has been deployed at numerous facilities world-wide. Some rudimentary background material on the SHADOW system is also provided.

1 Introduction

The Secondary Heuristic Analysis for Defensive On-line Warfare (SHADOW) system is a network-based intrusion detection system created at the Dahlgren Division of the Naval Surface Warfare Center. SHADOW analyzes the headers of the packets that are sent to a monitored site. Packet headers that meet certain pre-defined Boolean rules are dumped to a web-based file for examination by a human operator. The SHADOW sensor sits outside the firewall so that it can see all suspicious traffic that is sent to the site. The sensor's information is periodically sent through the firewall via a secure shell encrypted channel. SHADOW is a freeware package that has been deployed at numerous sites throughout the world. The system focuses exclusively on the headers of the packets rather than on their content. This approach reduces the computational complexity of the system and also sidesteps many associated privacy issues. It does not, however, allow one to search for certain content-based keywords that characterize particular attacks.

The current SHADOW system relies on the ability of the human operator to parse the HTML-based output files in order to identify anomalous activities. Our goal has been the development of new visualization frameworks that support the human in the interpretation of this plethora of data. We have sought to develop these frameworks using tools that are well known to the modern statistics community.

There were several initial capabilities that we wished to provide to the user. First, we wanted to be able to evaluate network activity at both the network and system levels. Second, we wished to be able to infer machine utilization from network traffic patterns. Third, we wished to be able to readily detect abnormal network activity. Finally, we wanted to provide some rudimentary capability to examine communication patterns between machines and to study the temporal nature of such standard system activities as web and mail accesses.

The statistical tools that we have used to accomplish these goals include parallel coordinate plots [Weg90], color histograms/data images [Weg90] [MW98], and various clustering methodologies [Eve93]. In addition we have used standard circle plots to display system-to-system communications. We have found that this toolkit allows us to elucidate structure in the collected network information.

In the remainder of this article we present preliminary results obtained by applying these techniques to various network monitoring data sets that we have collected. We assume that the reader is familiar with the basics of the various network protocols. The reader is referred to [Ste94] for a thorough discussion of the basics of network communications.

2 Results

We first discuss the use of the data image as a means to study network and system activity. We are initially interested in determining which machines on our network are active and which are inactive. A particular machine's identification is provided by its internet protocol (IP) address. IP addresses are 32-bit numbers which are generally written as 4 octets in decimal, each octet being a value from 0 to 255. Thus a typical IP address might be 10.10.45.7. A packet sent to an address ending in 255, e.g. 10.10.45.255, is broadcast to all machines with IP addresses of the form 10.10.45.x. A class B address space consists of the roughly 65K addresses (256 x 256) obtained by varying the last two octets of the IP address: 10.10.x.x. We can display activity for all the machines in the space by plotting a 256x256 pixel image, with the columns corresponding to the third octet and the rows to the fourth.
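The layout just described can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to produce the figures: the function name `activity_image` and the `min_count` parameter are hypothetical, the input is assumed to be a list of dotted-quad destination addresses (one per observed packet), and no anonymizing transformation of the octets is applied.

```python
def activity_image(addresses, min_count=1):
    """Return a 256 x 256 grid of booleans for one class B address space.

    grid[row][col] is True when the address whose fourth octet is `row`
    and whose third octet is `col` appears at least `min_count` times.
    (Hypothetical helper; names and threshold are illustrative only.)
    """
    counts = {}
    for address in addresses:
        octets = [int(part) for part in address.split(".")]
        if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
            raise ValueError(f"not a dotted-quad IPv4 address: {address!r}")
        key = (octets[3], octets[2])  # (row, column) = (fourth, third octet)
        counts[key] = counts.get(key, 0) + 1

    grid = [[False] * 256 for _ in range(256)]
    for (row, col), n in counts.items():
        if n >= min_count:
            grid[row][col] = True
    return grid


# Three observed packets to two distinct machines.
observed = ["10.10.45.7", "10.10.45.7", "10.10.3.200"]
image = activity_image(observed)
busy_only = activity_image(observed, min_count=2)
```

Raising `min_count` keeps only the machines with repeated activity, which is one simple way to suppress clutter in such an image.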

Figure 1 displays the pixel image of our local group of machines. Each of the octets has been subjected to a transformation in order to maintain the anonymity of our network topology. So instead of plotting the ordered pair (octet 3, octet 4) at each point, we plot (f(octet 3), g(octet 4)). The functions f and g are simple transformations chosen so that the reader cannot reconstruct the underlying network architecture. The picture is based on data collected from around 7/16/99 to 8/15/99. There are roughly 38,000 machines represented in the picture. A machine's octet combination is plotted with a black dot if it has been involved in syn/ack activity. So these are the machines that have responded with an acknowledgement to a request for a connection. It is difficult to discern much structure in this figure. However, certain columns show no activity at all and therefore correspond to unused cables.

Figure 1: Pixel image of the activity of around 38,000 machines during a one month time frame.

Figure 2: Pixel image of the machines in Figure 1 with 10 or more connections during the month.

Figure 2 is the same plot as Figure 1 with the exception that we plot only those IPs with 10 or more syn/acks during the time period. Much of the clutter has been suppressed. The vertical line proceeding up the plot corresponds to one of our cables that has numerous machines on it. This picture might be improved through the use of small glyph-based plotting symbols, where the glyph would be chosen based on, say, the log base 2 of the number of counts. This would allow one to identify the relative activity levels of the various machines in the network. This capability would be particularly beneficial if one could "drill down" to reveal additional information about a machine merely by clicking on its associated glyph.

Figure 3 uses the same data set as the previous two figures. In this case a black dot is placed at a port's location if a machine has been involved in communication on that port during the time period. The ports begin in the lower left-hand corner of the picture and run left to right and bottom to top. In this manner the usual 65,536 ports have been converted to a 256 by 256 data image. The lowermost band of activity corresponds to the numerous system services along with ftp traffic. As one proceeds toward the top of the image we next notice a relatively sparse band. This is followed by a band of busy activity and then finally a sparse band. The high band of activity has yet to be identified but is probably the result of numerous applications using these as temporary ports for data transfer.

Figure 3: Pixel image of the port activities of the data set of Figure 1.

Figure 4 is the same plot as Figure 3 with the caveat that only those ports that have had 10 or more packets sent to them are plotted. As discussed previously, there are numerous ways in which this graph can be improved. The sparseness of the matrix suggests using a dot chart to portray this information. This would also allow the viewer to assess the amount of traffic at any given port. We are particularly interested in the dot that occurs in the top leftmost portion of the graph. Once again some sort of "drill down" capability could be very useful in this case.

We now shift our focus to a different problem. In this case we are interested in being able to ascertain machine functionality at a site without a priori knowledge of the intended purpose of the various machines at the site. This sort of situation can arise when one is asked to document the inner workings of an existing network. Even in the case where one must fill out accreditation paperwork, one cannot be sure that the machine in question is performing the function for which it had previously been accredited. Often a user will change the "mission" of a particular machine during its lifetime, or else will choose to use IP address x for a machine that does not match the accreditation paperwork associated with x.

Figure 4: Pixel image of the port activities of the data set of Figure 1 with 10 or more packets during the time period.

One of the simplest ways to characterize a machine's activity is to monitor how many packets are sent to each of the ports on the machine. Our initial approach in this arena has simply treated the packets as non-temporally correlated entities and hence has not embraced any sort of session concept. Given the counts for each of the ports on a particular machine, one can convert these to a probability of accessing each port. These activity vectors can then be clustered as if they were iid observations in some high-dimensional space.

Figure 5 is a dendrogram of a few machines' activity vectors obtained from site H. Site H is a site that asked us to ascertain their machine activities but has chosen to remain anonymous. We have blurred the IP addresses of the machines for security reasons. This tree/dendrogram illustrates the typical structure obtained via hierarchical clustering, in that one may obtain numerous levels of clustering depending on where one cuts the tree. We have taken the liberty of labeling the leaves of the tree with the major service (largest port activity value) offered at that leaf. We have also indicated the name of the service that is typically associated with that port. This type of picture allows one to rapidly ascertain the various services that are being run on one's network. It would not be difficult to cross-match this clustering against a database that contained each machine's reported functionality as indicated during the accreditation process. This dendrogram also allows one to rapidly identify those machines that do not fall into any of the clusters. A prudent investigation into the uniqueness of these machines could be very important.

Figure 5: Dendrogram of activity vectors from site H.

Figure 6 is a data image of several thousand site H machines. The set of observations was clustered via a standard hierarchical clustering procedure and the observations were laid out based on the ordering obtained from this method. The vertical bands in the image represent the clusters. The activity vectors were estimated based on syn/ack activities at the site. We have taken the liberty of labeling each cluster with the port with the largest activity among the observations in that cluster. We apologize for the small fonts on the x-axis. Data imaging is not the optimal means for cluster assessment, due in part to the sparsity of the data matrix. One can actually obtain more success with this technique by applying it to the interpoint distance matrix. In this case the various clusters reveal themselves as rectangular regions in the image.

Figure 7 presents a data image for site N. It is interesting to examine a few of these clusters more closely. The leftmost cluster consists of 16 machines. A post-clustering examination of these machines reveals that they are serving as ftp servers. This is indicated by a high probability value in the column associated with port 21 and is manifested in the data image by a short horizontal line segment. The next cluster as we proceed toward the right of the figure consists of 80 machines. An examination of this cluster indicates that these machines are functioning as mail servers, as indicated by a large probability of accessing port 25, which offers email service. The next cluster consists of 16 machines.
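The construction of activity vectors and their hierarchical clustering, as described above, can be sketched as follows. The paper does not state which distance or linkage was used, so the L1 distance and single linkage below are assumptions chosen for brevity, and all function names (`activity_vector`, `single_linkage`, `major_service`) and the sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def activity_vector(ports):
    """Convert a list of destination ports (one per packet) to probabilities."""
    counts = Counter(ports)
    total = sum(counts.values())
    return {port: n / total for port, n in counts.items()}


def distance(p, q):
    """L1 distance between two sparse probability vectors (an assumed choice)."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))


def single_linkage(vectors, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(distance(vectors[a], vectors[b])
                    for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


def major_service(vectors, cluster):
    """Port with the largest summed activity among the cluster's machines."""
    totals = Counter()
    for name in cluster:
        totals.update(vectors[name])
    return totals.most_common(1)[0][0]


# Made-up packet logs: two ftp-like machines and two mail-like machines.
packets_by_machine = {
    "m1": [21, 21, 21, 80],
    "m2": [21, 21, 20],
    "m3": [25, 25, 25],
    "m4": [25, 25, 113],
}
vectors = {name: activity_vector(p) for name, p in packets_by_machine.items()}
clusters = single_linkage(vectors, n_clusters=2)
labels = {major_service(vectors, c): sorted(c) for c in clusters}
```

Stopping the merging at different values of `n_clusters` corresponds to cutting the dendrogram at different heights, and `major_service` mirrors the labeling of each cluster by its most active port.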

The machines in this third cluster of the site N data image turn out to have a high probability of access to port 80, which is associated with http, i.e. World Wide Web (WWW), traffic. The next cluster, the fourth, consists of 16 machines. Most of the machines in this cluster also have high probabilities of accessing port 80. The fifth cluster consists of 16 machines that have a high probability of accessing port 443. This port is usually associated with https, or secure WWW, traffic. The sixth cluster consists of 240 machines. These machines have a high probability of accessing port 113. This port is usually associated with authentication service traffic. Jumping ahead to the rightmost, seventeenth, cluster in the data image we see a group of 16 machines. These machines have a high probability of accessing port 515. This port is associated with printer traffic, and an examination of the machines in this cluster indicates that they are functioning in this role.

Figure 6: Data image of several thousand site H machines.

Figure 7: Data image of site N machines with associated annotations.

In Figure 8 we examine the internal structure of cluster 33 from the site H plot. We use the standard dot chart to display the log base 2 of the counts on the active ports in this cluster. Each IP address in the cluster is represented by a differently colored symbol. We blocked out the associated IP addresses for the obvious security reasons. This type of chart is handy for examining the internal structure of a particular cluster. In Figure 9 we use the color histogram approach to take another look at this same cluster. White indicates the highest value in this plot while red indicates the lowest. The x-axis indicates machine name while the y-axis indicates port. This really is just an alternative portrayal of the same information as in the dot chart of Figure 8.

Figure 8: Dot chart for cluster number 33 of the site H data.

Figure 9: Data image for cluster number 33 of the site H data.

Figure 10 presents a parallel coordinates plot of 1 hour's worth of syn/ack activity. Each IP address was converted into an integer via $a.b.c.d \mapsto a \cdot 256^3 + b \cdot 256^2 + c \cdot 256^1 + d \cdot 256^0$. Each source IP connection was represented using one of 10 colors. Each axis has been scaled to lie between 0 and 1 using a standard linear transformation. As is common with the standard parallel coordinates plot, the plot suffers from a fair amount of overplotting.

Figure 10: Parallel coordinates plot of one hour of activity colored by source IP address.

Figure 11 is a parallel coordinates plot of a scan that is directed against numerous machines at site N. The observations have been colored according to source IP address, where the machine doing the scanning is colored in red and the other machines are in black. Yes, there really is a red line buried among the black between the source IP and source port axes. This type of scan is called a telnet scan in that all of the scan packets are directed at port 23 on the various machines. This is a typical means of ascertaining whether that particular service is available on the machines in question. One can actually scan all the machines on a given subnet quite easily.

Figure 12 portrays a slightly different type of scan. We have colored the observations according to which of two machines was scanned using nmap, a readily available hacker package. So one of the machines that was scanned is colored black and the other is colored red. It is interesting to note that nmap does not scan each and every port on the target machines but rather a subset consisting of those ports that host interesting services or else offer up interesting reconnaissance information. So one could think of this figure as portraying a parallel coordinates signature of an nmap scan.

Figure 13 presents a parallel coordinates plot of telnets to site N's machines within an hour. The source IPs are mostly off site and the destination IPs are mostly on site. We note that the last axis has been scaled between port 23 and the maximum source port. This is done to prevent division by 0 in our scaling routine, which would otherwise occur because the axis has zero range. A close examination of the plot seems to suggest one off-site company that is supporting a number of machines telnetting in. This is indicated by the group of lines on the source port plot that originate from the same source IP.

Figure 14 is a parallel coordinates plot of incoming syns that are destined for port 25, i.e. mail. The source IPs are typically off site and the destination IPs are typically on site. One can nearly count the number of mail servers at the site based on a careful examination of the plot.

Figure 15 represents a picture of who is talking to whom. This picture was created with a freely available on-line package named traffic-vis. There is unfortunately no information as to who initiated the connection. The software allows one to impose additional rules that prune down the number of connections in the picture and actually make the picture useful. One can actually go hunting for specific types of connection relationships using this approach.

Figure 16 illustrates what happens when one attempts to use the previous traffic visualization method on a realistic amount of data. In this case we used it with 1 hour's worth of data. One could impose certain rule sets that would render this picture more useful. On a more fundamental level, one might wish to modify the plot in order to handle the extreme amount of ink that it currently contains. Some sort of binning/smoothing method might allow one to capture the salient information without suffering from the extreme overplotting problems.

Figure 17 represents one way to portray the time series associated with traffic on an email server. The bottommost circles represent port 25 email accesses and the topmost circles represent port 113 ident accesses. The attempts to access port 80 are very interesting. These could represent either attempts to access the wrong service on the wrong machine or actual exploits attempting to compromise the security of the mail server.

Our last set of figures portrays our planned future work in the visualization area. Our previously included plots have focused for the most part on static snapshots of system activity. Ultimately the activity on our network varies as a function of time. We believe that it would be particularly fruitful to view the information as functional curves and to apply techniques from functional data analysis to the data [RS97]. In this manner one could look for particular days or other time periods that might be indications of intrusive activity on our systems.
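As a minimal sketch of what viewing the traffic as curves might involve, the following turns a list of access timestamps into one 24-point curve of hourly counts per calendar day and separates weekdays from weekends. The function names `daily_curves` and `split_by_day_type` and the sample timestamps are hypothetical, and the hourly binning is an assumption; nothing here implements the functional data analysis techniques of [RS97].

```python
from collections import defaultdict
from datetime import datetime


def daily_curves(timestamps):
    """Map each calendar date to a list of 24 hourly access counts."""
    curves = defaultdict(lambda: [0] * 24)
    for ts in timestamps:
        curves[ts.date()][ts.hour] += 1
    return dict(curves)


def split_by_day_type(curves):
    """Separate the curves into weekday (Mon-Fri) and weekend (Sat-Sun) sets."""
    weekday = {d: c for d, c in curves.items() if d.weekday() < 5}
    weekend = {d: c for d, c in curves.items() if d.weekday() >= 5}
    return weekday, weekend


# Made-up accesses: three on a Monday, one on the preceding Saturday.
accesses = [
    datetime(1999, 7, 19, 9, 15),
    datetime(1999, 7, 19, 9, 40),
    datetime(1999, 7, 19, 14, 5),
    datetime(1999, 7, 17, 4, 30),
]
curves = daily_curves(accesses)
weekday_curves, weekend_curves = split_by_day_type(curves)
```

Each resulting list can be drawn as one curve against hour of day, with weekdays and weekends on separate axes.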

In Figure 18 each curve in the plot represents a different day for the same mail server. These plots were made using roughly 4 weeks of activity. We believe that the purple dashed curve in the weekday plot represents a particular spam attack against our site. It would certainly be interesting to analyze these curves in order to ascertain what mechanisms are responsible for their various functional components. We are curious as to the underlying cause of the troughs in some of the weekend curves. Figure 19 represents traffic on one of our web servers during the same 4 week period. There appears to be a general "classic" curve that represents the majority of the weekday accesses. There is, however, at least one curve that clearly seems to represent some sort of outlier behavior. Examining the bottom axes, one sees two particular curves that possess a peak around hour 4. It would be nice to know whether these occurred on a consecutive Saturday/Sunday pair.
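One crude way to single out an unusual day among such curves is to measure each curve's distance from the pointwise median curve and flag the days whose distance is far larger than is typical. The sketch below does this with a median absolute deviation scale. It is an illustrative stand-in, not the authors' method and not a functional data analysis technique from [RS97]; the names `pointwise_median` and `outlier_days`, the threshold of 3, and the sample curves are all assumptions.

```python
from statistics import median


def pointwise_median(curves):
    """Median across days at each time point; `curves` maps day -> counts."""
    return [median(values) for values in zip(*curves.values())]


def outlier_days(curves, threshold=3.0):
    """Return days whose curve lies unusually far from the median curve."""
    center = pointwise_median(curves)
    # Euclidean distance of each day's curve from the median curve.
    dist = {
        day: sum((a - b) ** 2 for a, b in zip(curve, center)) ** 0.5
        for day, curve in curves.items()
    }
    typical = median(dist.values())
    mad = median(abs(d - typical) for d in dist.values())
    scale = mad if mad > 0 else 1.0  # guard against a zero spread
    return sorted(day for day, d in dist.items()
                  if (d - typical) / scale > threshold)


# Made-up counts at four time points for five weekdays; "thu" has a surge.
sample = {
    "mon": [5, 20, 22, 6],
    "tue": [4, 21, 20, 5],
    "wed": [6, 19, 21, 7],
    "thu": [5, 90, 85, 6],
    "fri": [5, 20, 21, 6],
}
flagged = outlier_days(sample)
```

A flagged day is only a candidate for closer inspection; whether it reflects a spam attack, an exploit attempt, or benign load still has to be judged by the analyst.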

3 Conclusions

We have examined the application of numerous statistical visualization techniques to computer security data. The statistical/visual analysis of such data is fraught with numerous difficulties. The data itself tends to be unwieldy, and the data sets are massive and extremely high dimensional. For the most part we have focused our attention on the application of color histograms/data images and parallel coordinates. These techniques have shown themselves to be a fruitful starting point for the development of visualization systems to aid a human operator in the discernment of unusual activities at the network and machine level.

Figure 11: Parallel coordinates plot of a network scan by one machine against a group of machines targeting a single port.

Figure 12: Parallel coordinates plot of a port scan by one machine against two machines. nmap was used to conduct the scan.

Acknowledgements

The statistical/visualization work discussed in this poster session is being performed at least in part to support future upgrades to SHADOW and to other yet-to-be-developed intrusion detection systems. This work was sponsored by the Office of Naval Research Code 311, Dr. Wendy Martinez.

References

[Eve93] Brian S. Everitt. Cluster Analysis. John Wiley and Sons, New York, third edition, 1993.

[MW98] Michael C. Minnotte and R. Webster West. The data image: a tool for exploring high dimensional data sets. Proceedings of the ASA Section on Statistical Graphics, 1998.

[RS97] J. O. Ramsay and B. W. Silverman. Functional Data Analysis. Springer, New York, 1997.

[Ste94] Richard Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley, Reading, Massachusetts, 1994.

[Weg90] Edward J. Wegman. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85:664-675, 1990.

Figure 13: Parallel coordinates plot of telnet accesses in 1 hour from off site.

Figure 14: Parallel coordinates plot of mail accesses in 1 hour from off site.

Figure 15: Circle plot of who is talking to whom.

Figure 16: Disastrous nature of circle plot given too many observations.

Figure 17: Activity on a mail server.

Figure 18: Weekday and weekend mail server traffic.

Figure 19: Weekday and weekend web server traffic.