Machine Learning for Post-Event Timeline Reconstruction

Muhammad Naeem Khan
Department of Engineering and Design, University of Sussex, Brighton BN1 9QH, UK
Email: [email protected]

Ian Wakeman
Department of Informatics, University of Sussex, Brighton BN1 9QH, UK
Email: [email protected]

Abstract— In this paper, we present a novel approach to post-event timeline reconstruction using machine learning techniques. Post-event timeline reconstruction plays a critical role in forensic investigation and serves as evidence of digital crime. A variety of digital forensic tools have been developed during the last two decades to assist computer forensic investigators with digital timeline analysis, but most of them cannot handle large volumes of data efficiently. The focus of this paper is to show the effectiveness of employing a machine learning methodology for computer forensic analysis by tracing previous file-system activities and preparing a timeline of the events. Our approach consists of monitoring file-system accesses, taking file-system snapshots at discrete intervals of time while running different applications, and using this data to train a recurrent neural network to recognize the execution patterns of individual applications. The trained network can subsequently be used to generate a post-event timeline for a seized hard disk and to verify the execution of different applications at different time intervals.

I. INTRODUCTION

We are developing a machine learning paradigm which can reconstruct the timeline of applications that were running on a computer, based on a snapshot of the file-system. Our aim is to allow the generation of toolkits which can determine if and when particular applications ran upon a computer, possibly accompanied by appropriate likelihood measures, using only evidence left upon the file-system.

Brian Carrier [1] has categorised digital forensics into three major phases: Acquisition, Analysis and Presentation. The Acquisition Phase saves the state of a digital system so that it can later be analysed. The Analysis Phase examines the acquired data to identify evidence of malicious activity, and this is the focus of our paper. The Presentation Phase is based entirely on policy and law, which differ for each country and are beyond the scope of this paper.

When dealing with a seized computer, the first act of a forensic analyst is to acquire an image of the hard disks within the computer. Analysts then look for evidence that can assist the case being built for court, based on the discovery of documents and images.


Preparing a timeline of the file-system activity on a computer system is crucial for a digital forensic investigation. The timeline is used to answer questions such as who was using the computer at a particular time, which applications were running on the system at a specific time, and which system and data files were accessed, modified or deleted during that time. However, there is a latent vulnerability associated with the timeline, as the timestamps may be intelligently manipulated by craftily designed computer programs [2].

In the presentation of evidence from a seized computer to a court, the process by which the evidence was derived must be clearly shown to abide by accepted procedure and rules [3]. If machine learning techniques are to be used in the preparation of evidence, then the chains of reasoning must be comprehensible to the court. The use of pattern-matching approaches such as neural networks would seem to work against this requirement, and it may be that our approach is only useful in indicating the areas of investigation that may be most fruitful for the forensic expert, rather than being used directly as evidence. However, machine learning can obviously be a helpful aid to the expert, allowing a reliable and improved analysis based on previous authentic results obtained from example data sets [4]. A machine learning model for post-event reconstruction can cope with larger and more complex data sets in a shorter span of time, and thereby enhances the analysis procedure.

Our target operating system is Windows XP, for the obvious reasons that it is both popular and frequently attacked [5]. However, much of our experience is directly transferable to other operating systems, such as Unix or previous versions of Windows. We should add the caveat that we are assuming that the computer has been in normal use, without any attempt to hide or disguise the activities upon the machine. As we shall see below, our approach is trivially balked by a simple clean-up after the computer has been used. A danger to the validity of our approach is the growth in publicly available machines, such as in Internet cafes. As developers become aware that their applications are used on public machines, they may become accustomed to building applications that clean up after they have run, so that sensitive data is not retrievable by subsequent users, possibly to the extent of protecting users' privacy by removing evidence that the applications have run.

We present a brief overview of the motivation for the project and a survey of the background literature in Section II. We describe our proposed design for post-event timeline reconstruction in Section III. The methodology for training the recurrent neural network for post-event timeline reconstruction and the details of our experiments using Internet Explorer as a case study are summarised in Section IV and Section V respectively. We discuss some ideas about prospective future work in Section VI, and finally we conclude in Section VII.

II. BACKGROUND LITERATURE

Digital evidence is required in a wide range of computer-related crimes. The earliest proposed methods for digital forensics focused on the application of statistical methods to identify anomalous activity [6]. More recent anomaly detection methods employ a wide variety of classification schemes to identify anomalous activities. These schemes include rule induction, artificial neural networks, fuzzy set theory and classical machine learning algorithms. In [7] it is proposed that artificial neural networks could be used as an alternative to the statistical analysis component of an anomaly detection system. In contrast to statistical techniques, machine learning techniques are well suited to learning patterns with no prior knowledge; therefore, they are recognized as bottom-up learning techniques [8]. RIPPER is a well-known rule-based anomaly detection tool that employs machine learning to predict system calls [9]. The generalization capabilities of machine learning methods can be employed in creating user profiles based on the selection and subsequent classification of their permissions and system usage habits [10].

Given the large amount of data held in unstructured formats, the commonly used rule-based and data mining methods do not work effectively without ample pre-processing and restructuring of the data. Therefore, in the recent past, researchers have looked into employing artificial intelligence methods for digital analysis. Motivated by the demands of this challenging field, we present a machine learning based model for constructing a timeline of the activities that happened on a computer system in the past.

III. DESIGN GOALS

Our approach is based upon investigating the footprint that an application leaves upon the hard drive, and then developing a neural network that can reliably detect application activity based on input parameters derived from the disk image. The key parts of the footprint are:

1) Log files: When an application runs, some of its activities are recorded within log files, such as the creation of network sockets. However, full logging is rarely switched on, so most logs have only a few events from each application.

2) Registry entries: Some applications will modify the registry as a part of their execution. This is obviously Windows specific, but is nonetheless a useful source of information.

Fig. 1. Initial design for timeline analysis

3) File system: Applications create, access, modify and delete files, with certain file name patterns and in distinct areas of the file system.

4) Free blocks: After a file has been deleted, the blocks constituting that file are added to the free list, but the contents within the blocks are not erased. By examining the blocks within the free list, deleted files can sometimes be reconstructed.

The problem of deriving the application footprint is complicated by the fact that much of the footprint is rewritten the next time the application runs. It is therefore easy to find evidence of the last time that an application ran - e.g. the last access time on the application binary - but it is more difficult to find shadowed application runs. Evidence for these runs has to be derived from events within the log files and from temporary files detected either directly on the file system or by searching through the free blocks of the file system. As we move further into the past, the evidence for these runs becomes less reliable, e.g. blocks within the free list are re-used and over-written, and eventually we plan to attach a measure of likelihood to the claim that an application ran at a given time.

A key part of the process of designing the classifier is generating appropriate inputs. The inputs are generated from overlapping time windows. The input events have to be pre-processed into a form suitable for input to a neural network, and a fixed set of inputs has to be chosen for the network, such as the conversion of filenames to a numeric value. A sketch of this windowing step is given below.
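The paper does not give an implementation of the windowing step, so the following is a minimal sketch under stated assumptions: each mined event is a (timestamp, path) pair, and window size and step are chosen as discussed in Section IV. The event structure, function names and example paths are illustrative, not the authors' code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FsEvent:
    timestamp: float   # seconds since the start of the mined period
    path: str          # file or registry path touched by the event

def sliding_windows(events: List[FsEvent], window: float, step: float):
    """Group events into overlapping time windows.

    With step = window / 2 the windows overlap by 50%, the value the
    paper reports as working well.
    """
    if not events:
        return
    events = sorted(events, key=lambda e: e.timestamp)
    start, end = events[0].timestamp, events[-1].timestamp
    t = start
    while t <= end:
        yield [e for e in events if t <= e.timestamp < t + window]
        t += step

# Example: 10-second windows sliding by 5 seconds (50% overlap).
demo = [FsEvent(0.5, r"C:\WINDOWS\Prefetch\IEXPLORE.EXE.pf"),
        FsEvent(6.2, r"C:\Documents and Settings\user\Cookies\index.dat"),
        FsEvent(12.9, r"C:\WINDOWS\system32\wininet.dll")]
for w in sliding_windows(demo, window=10.0, step=5.0):
    print([e.path for e in w])
```

Each yielded window is then pre-processed into a fixed-length feature vector before being fed to the network, as described in the next section.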

IV. NEURAL NETWORK TRAINING METHODOLOGY

As described above, the expected output from the neural network is a timeline of when the application ran. To train the neural network to recognize these instances of application execution, we use the following methodology.

A. Scenario Generation

We design a set of scenarios in which the application is run within an increasingly complex environment. We design the scenarios in pairs, so that we can generate data for training and data for testing. We then encode the scenarios within a Visual C# script that can start and manipulate the application to specific timings using OLE, and which is repeatable. We then collect disk images from these scenario scripts by running the scripts within VMware virtual machines. The scenario scripts log the times at which the application ran, by tracking the CPU usage of the managed application.

B. Event Mining

The collection of candidate events for input to the classifier is first mined for a specific time period from the disk image. Start and end times are specified, and our mining tools sift through the disk image, pulling out events within the time window and inserting them into a database.

C. Window Size

Events are collected from a series of overlapping time windows. The window size, from 1 second to 60 seconds, and the degree of overlap have to be chosen. From observation, an overlap of 50% works well.

D. Input Parameter Selection

A decision is made about which input events are used and how they should be represented. We have experimented with a number of statistics and representations, such as the 5 largest file sizes created in the window, the 5 most frequently accessed files in the window, and the alphanumeric sum over the file names or registry keys accessed. A sketch of such a per-window feature vector is given below.
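The exact encoding is not specified in the paper; the sketch below is one plausible way to turn the statistics named above into a fixed-length feature vector per window. The alphanumeric-sum encoding follows the text, while the zero-padding convention and function names are assumptions.

```python
from collections import Counter
from typing import List

def alnum_sum(name: str) -> int:
    """Encode a file name or registry key as the sum of its alphanumeric character codes."""
    return sum(ord(c) for c in name if c.isalnum())

def window_features(paths: List[str], sizes: List[int], k: int = 5) -> List[float]:
    """Fixed-length feature vector for one time window.

    paths - file/registry paths touched in the window
    sizes - sizes (in bytes) of files created in the window
    Returns the k largest created-file sizes followed by the alphanumeric sums
    of the k most frequently accessed paths, zero-padded when the window is sparse.
    """
    top_sizes = sorted(sizes, reverse=True)[:k]
    top_sizes += [0] * (k - len(top_sizes))

    freq = Counter(paths).most_common(k)
    top_names = [alnum_sum(p) for p, _ in freq]
    top_names += [0] * (k - len(top_names))

    return [float(v) for v in top_sizes + top_names]

# Example for a sparse window containing three events and one created file.
print(window_features(
    paths=[r"C:\WINDOWS\system32\wininet.dll",
           r"C:\WINDOWS\system32\wininet.dll",
           r"Content.IE5\index.dat"],
    sizes=[32768]))
```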

E. Network Training

The input parameters and the expected outputs from the script logs are used to train a recurrent network within Matlab, using the disk image generated from the training part of the scenario pair. As the file access patterns of an application form a time series, we trained on the file-system data with a two-layer back-propagation Elman neural network. By virtue of a feedback path from the output of the hidden layer back to the input layer, the recurrent neural network correlates multiple inputs in a repeated fashion, and this feature helps in recognising time-series relationships in the file-system manipulation by the applications. The files accessed by an application were fed to the neural network in the sequence in which the application accessed them and were classified as known instances. The files which were indigenous to other applications were classified as unseen for the target application. On the basis of these two categories, the network was trained for four different kinds of experiments (details are given in the next section). A minimal sketch of this training step is given below.
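The authors trained their Elman network in Matlab; as a hedged illustration only, the sketch below uses PyTorch's nn.RNN (an Elman-style recurrent layer with tanh units) to classify each window of a sequence as belonging to the target application or not. The layer sizes, optimiser, hyperparameters and the random stand-in data are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ElmanClassifier(nn.Module):
    """Elman-style RNN over a sequence of per-window feature vectors."""
    def __init__(self, n_features: int = 10, hidden: int = 16):
        super().__init__()
        # nn.RNN with the default tanh nonlinearity is the classic Elman recurrence.
        self.rnn = nn.RNN(input_size=n_features, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)   # per-window logit: target application active or not

    def forward(self, x):                 # x: (batch, time, n_features)
        h, _ = self.rnn(x)                # h: (batch, time, hidden)
        return self.out(h).squeeze(-1)    # logits: (batch, time)

# Toy training loop on random data standing in for mined feature sequences.
torch.manual_seed(0)
x = torch.randn(4, 20, 10)               # 4 scenarios, 20 windows, 10 features each
y = (torch.rand(4, 20) > 0.5).float()    # 1 = window attributed to the target application
model = ElmanClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()
for epoch in range(50):
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
print("final loss:", float(loss))
```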

F. Network Testing

We use the other disk image from the scenario pair to test the validity of the neural network. If the network is insufficiently accurate, we go back, adjust the choice and pre-processing of input parameters, and repeat the cycle.

As can be seen, the process of training the neural network requires a high level of expertise about the application and its interaction with the operating system. As we discuss below, we hope eventually to move to a methodology where machine learning techniques, such as a genetic algorithm, can be used to select the input parameters and their representations.

V. A CASE STUDY OF INTERNET EXPLORER

We took Internet Explorer (IE) as the case study for our experiments. We present details of four of the experiments conducted to determine the file access patterns of IE, running it with a blank homepage setting at various depths of internet surfing. IE version 6.0 was used for this purpose. A description of the experiment scenarios is given in Table I. The recurrent neural networks were trained on the 39 training instances listed in Table II and subsequently tested on the test set to evaluate the accuracy of correctly identifying the files accessed or not accessed by IE. From the results of these experiments it became clear that the regularity of the file manipulation patterns is an important factor in our approach: the longer the instances, the higher the accuracy we obtain. A graph showing the trade-off between accuracy and the size of the data set is depicted in Fig. 2.

TABLE I
SCENARIO DESCRIPTIONS

Exp 1: Flat execution of Internet Explorer (IE), i.e. running IE without opening or surfing any webpage, by typing 'about:blank' in the URL address text box. The experiment was conducted by launching IE and then instantly closing the IE window.

Exp 2: Launched IE and opened the website 'mail.yahoo.com'. Checked email and then signed out, followed by terminating IE.

Exp 3: Launched IE and opened the website 'www.bbc.co.uk'. From there, various links for news and stories were opened, followed by surfing of other newspaper websites: 'cnn.com', 'washingtonpost.com', 'jang.com.pk', 'rediff.com'. After a long session of about 40 minutes IE was closed.

Exp 4: Launched IE and opened the University of Sussex website 'www.sussex.ac.uk', followed by random surfing of various websites. The surfing was halted twice to launch other applications such as Notepad, Windows Media Player and some Windows-based games. The timing of launching these applications was noted down so that the neural network could be trained on two separate categories of files: those accessed by IE and those accessed by other applications.

TABLE II
ACCURACY OF THE DERIVED NETWORK OVER THE SCENARIOS

Exp # | Duration of file-system activity | Window size (seconds) | Step for sliding window (seconds) | Total training instances | Accuracy
1     | 2 sec                            | 1                     | 1                                 | 2                        | 79.8 %
2     | 117 sec                          | 5, 10, 20, 30         | 5, 10, 15                         | 9                        | 82.1 %
3     | 2430 sec                         | 5, 10, 20, 30, 60     | 5, 10, 20, 30                     | 14                       | 83.8 %
4     | 3329 sec                         | 5, 10, 20, 30, 60     | 5, 10, 20, 30                     | 14                       | 88.9 %

Fig. 2. Graph showing the relationship between the size of the data set and the percentage of true identification of files manipulated by IE.

VI. FUTURE DIRECTIONS

The limitation of our approach is that the run-time overhead is considerably high. However, we think that we could achieve better performance by using more advanced statistical analysis techniques which support training data in an unsupervised mode. The key issues in our research are automatic input parameter learning and extracting useful information from large data sets.

Bayesian neural networks could possibly be one of the methods applied for automatic input parameter learning, to make the forensic analysis more robust. By calculating the probabilities of different parameters in the model and employing a maximum likelihood estimator (expectation maximization), we can sift out the parameters required to provide a clear picture of the file-system activity. The idea is to use the Boltzmann machine learning method to model the underlying probability distribution of a data set and then to derive a conditional probability distribution for pattern classification and for obtaining a sequence of events.

Another possibility is to use Genetic Algorithms (GAs) to determine the neural network input parameters for an automated forensic analysis tool. The optimization can be enhanced by evaluating randomly picked input parameters and choosing the ones with the maximum likelihood of exhibiting a stronger relationship in the forensic model. The generation-and-test strategy of GAs can identify and exploit regularities in the data. A minimal sketch of this idea is given at the end of this section.

The successful application of Support Vector Machines (SVMs) to classification and regression problems over huge amounts of data could also benefit this project as a future direction. For real-world forensic analysis there could be millions or billions of data records to be analysed; therefore, an efficient data mining tool is required to handle such large data sets. As an extension to this project, a data management tool seems necessary, consisting of data structures for storing different data types in a database or text files, a set of programs for accessing these data structures, and routines for reformatting the data to make it acceptable to the machine learning tool.
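The GA idea above is future work, not something the paper implements; the following is a minimal sketch under assumptions. It evolves binary masks that select which candidate input parameters are fed to the classifier. The population size, mutation rate and the placeholder `evaluate_network` fitness function (which would, in a real system, train and test the recurrent network on the selected features) are all hypothetical.

```python
import random

N_FEATURES = 10                         # number of candidate input parameters
POP_SIZE, GENERATIONS, MUT_RATE = 20, 30, 0.1

def evaluate_network(mask):
    """Placeholder fitness: accuracy of a network trained on the selected features.
    In a real system this would train and test the recurrent network on them."""
    rng = random.Random(hash(tuple(mask)))          # deterministic stand-in score
    return rng.uniform(0.5, 0.9) + 0.01 * sum(mask)

def mutate(mask):
    return [1 - b if random.random() < MUT_RATE else b for b in mask]

def crossover(a, b):
    cut = random.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]

population = [[random.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(POP_SIZE)]
for gen in range(GENERATIONS):
    scored = sorted(population, key=evaluate_network, reverse=True)
    parents = scored[:POP_SIZE // 2]                # keep the fitter half
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP_SIZE - len(parents))]
    population = parents + children

best = max(population, key=evaluate_network)
print("selected feature mask:", best)
```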

VII. CONCLUSION

A timeline of computer evidence can provide a broader picture of the sequence and timing of events relating to computer misuse. In this paper we have presented a flexible and end-user friendly approach to forensic analysis for post-event timeline reconstruction using a recurrent neural network. A framework for this experiment was developed and a number of experiments were carried out to learn the file-system manipulation performed by applications. A layers-of-abstraction methodology was employed for extracting, collating, processing and training on the file-system data. A parallel validation approach, generating various SQL scripts in the manner of a rule-based method, was also adopted to cross-check the accuracy of the results. A constraint of this methodology is that separate neural networks have to be trained for different applications, and a new set of clean-state instances of file-system manipulation would be required for newer versions of applications. Resolving this issue is a prospective topic for future work.

REFERENCES

[1] Carrier, B. D. Open Source Digital Forensics Tools: The Legal Argument. http://www.digital-evidence.org/papers/opensrc_legal.pdf.
[2] Bishop, M. Computer Security: Art and Science. Pearson Education, Inc., 2003.
[3] Thomas, D. S. and K. Forcht. Legal methods of using computer forensics techniques for computer crime analysis and investigation. Issues in Information Systems 5(2). 2004.
[4] Chan, P. K., M. V. Mahoney, et al. A Machine Learning Approach to Anomaly Detection. Technical Report CS-2003-06. 2003.
[5] Stolfo, S. J., et al. A Comparative Evaluation of Two Algorithms for Windows Registry Anomaly Detection. Department of Computer Science, Columbia University, New York, NY 10027, USA.
[6] Helman, P. and G. E. Liepins. Statistical Foundations of Audit Trail Analysis for the Detection of Computer Misuse. IEEE Transactions on Software Engineering 19: pp. 886-901. 1993.
[7] Ryan, J., M. Lin, et al. Intrusion Detection with Neural Networks. AI Approaches to Fraud Detection and Risk Management: Papers from the 1997 AAAI Workshop (Providence, Rhode Island). Menlo Park, CA: AAAI: pp. 72-79. 1997.
[8] Carbone, P. L. Data mining or knowledge discovery in databases: An overview. In Data Management Handbook. New York: Auerbach Publications. 1997.
[9] Cohen, W. Fast effective rule induction. In 12th International Conference on Machine Learning (ICML 95): pp. 115-123. 1995.
[10] Marin, J., D. Ragsdale, et al. A hybrid approach to profile creation and intrusion detection. In Proceedings DARPA Information Survivability Conference and Exposition II (DISCEX'01). IEEE Computer Society, Los Alamitos, CA, USA, 1: pp. 69-76. 2001.
