Phase Space Detection of Virtual Machine Cyber Events through Hypervisor-level System Call Analysis

Joel Dawson
School of Computing, University of South Alabama
Mobile, AL USA
jad1324@jagmail.southalabama.edu

Jeffrey T. McDonald
School of Computing, University of South Alabama
Mobile, AL USA
[email protected]

Lee Hively
College of Engineering, Tennessee Tech University
Cookeville, TN USA
[email protected]

Mark Yampolskiy
School of Computing, University of South Alabama
Mobile, AL USA
[email protected]

Charles Hubbard
Government Accountability Office
Washington, D.C. USA
[email protected]

Abstract— The growth of the cloud computing ecosystem has afforded many new opportunities to businesses and consumers alike; however, with this new computing context comes new risks, and much attention has been given to the security dangers inherent in the architecture of cloud-based systems. Researchers, however, have done little to address the risk of advanced persistent threat intrusions, specifically in regard to the use of rootkits, which are powerful, stealthy pieces of malware that have grown in popularity with cybercriminals and nation state actors. These programs threaten a system by acquiring root privilege and then, using a variety of stealth tactics, evading detection and removal by modern anti-malware tools. In this research, we validate that the approach of Oak Ridge National Laboratory's Beholder project is applicable to the context of rootkit detection within a running virtual machine. We do this by collecting and analyzing system calls collected on the hypervisor level. The analysis employs a novel nonlinear, phase-space algorithm to derive time-serial cyber dynamics, and then uses these dynamics to characterize potentially anomalous system behavior through the comparison of nominal and test behavior profiles. Our results demonstrate that this technique is effective in flagging variance between the timing traces of an infected and an uninfected machine, thus indicating the presence of a running rootkit.

Keywords— Cloud computing security, virtual machine, cyber anomaly detection, phase-space analysis, graph theory, malware, rootkits

I. INTRODUCTION

The recent rise to prominence of large-scale cloud computing has revolutionized the process of consumer- and enterprise-level computation provisioning. People and businesses can now, through the paradigms of software-, platform-, and infrastructure-as-a-service, take advantage of flexible, distributed, and easily allocated computational power through the medium of the internet at reasonable costs of time and money.

This material is based in part upon work supported by the National Science Foundation under grant DUE-1241675.

Todd R. Andel
School of Computing, University of South Alabama
Mobile, AL USA
[email protected]

However, this new model of computing is not without its risks. Per NIST's formal definition of cloud computing [1], large-scale commercial cloud computing providers necessarily exercise wide-ranging control over the computing infrastructure and software that they offer as products. Clients of these companies have limited say over the configuration and administration of the cloud infrastructure, and must trust that the sensitive information they submit to the cloud is securely transmitted, processed, and stored. With few exceptions, the onus for securing the cloud ecosystem falls on the service providers.

The challenges of securing this new ecosystem have not been lost on researchers or industry journalists. Rashid [2] distinguishes twelve major security threats facing vendors and customers, which run the gamut from denial-of-service attacks to insufficient provider diligence. Data breaches and advanced persistent threat infiltration are the most relevant to this research. Cloud security has also been the focus of recent academic attention. Contributors to this discussion have addressed a range of issues, from high-level architectural or conceptual analysis [3], [4] to detailed technical solutions [5]. Despite this recent attention, and as recent surveys show [6]-[9], the body of work surrounding cloud security is still developing and much work remains to be done.

The detection and remediation of virtual machine (VM) malware infections has been a focus of recent research. The risk of detection or damage by malware prevents detection solutions from being implemented in the guest OS's user or kernel space; however, all other layers of the virtualization architecture, including hardware, have played host to research on anomaly- and malware-detection solutions [9].
Of particular interest to this research is the topic of VM introspection, in which an admin-level VM or the hypervisor itself invokes hypervisor-level functions to elicit information from a guest OS without the guest's knowledge. This technique is advantageous in that it can provide detailed information about a VM without the risk of alerting an infection to the act of introspection. The disadvantage, however, is that information retrieved in this manner must undergo translation from the semantics of the guest OS to the semantics of the underlying hypervisor architecture. This translation process creates a semantic gap that can obscure relevant features of introspection-based data.

In this paper we investigate the following: Can malware-indicative data obtained through introspection or side-channel measurements retain enough detail through the translation process to indicate the presence of a rootkit or other malware? If it can, can we apply physics-based analysis techniques to perform the same tasks that are typically reserved for statistical or machine learning algorithms?

In this research, we implement an experimental system for detecting malware infection in a running VM. This system operates by invoking hypervisor-level functions to collect system call data from the VM, then applying a nonlinear, phase-space algorithm to assess the presence of anomalous data in the collected traces. The algorithm performs classic anomaly detection: establish a baseline of uninfected data and compare it with data collected during normal operation. Our research builds on previous work by researchers at Oak Ridge National Laboratory's Beholder project [10]-[12], who successfully demonstrated the efficacy of the algorithm for cyber event modeling and thresholding in a standard Linux host environment. We attempt to validate the accuracy of the algorithm in our research, as well as apply it to a new data collection context.

In Section 2, we discuss work related to our research question, while in Section 3, we discuss the experimental methodology that we employ.
Finally, in Section 4, we present the results of our research, and in Section 5, we discuss our conclusions and potential future directions for research.

II. RELATED WORK

A. Rootkits

There are numerous detection and prevention solutions for rootkits [10], [13]-[16]. Many of these solutions focus on detecting rootkit activity at the kernel level, through system metrics [17], or through analysis of the hard drive or other storage medium [18]. The limitation of these techniques is that they operate within the rootkit's domain of influence, and could potentially fall victim to active (e.g., direct tampering via kernel-level operations, evasion after detection) or passive (e.g., hiding malicious files on the hard drive with a new steganographic technique) rootkit defenses. A cloud architecture can, for VM rootkit infections, sidestep this issue by locating the data collection and analysis software on a lower abstraction layer than the rootkit in question.

Provided that the infection has no way of exploiting a vulnerability that allows it to escape a virtualized environment, any detection solution located outside the VM is effectively unreachable by the rootkit and can operate with impunity. However, unreachability does not necessarily mean invisibility, and a piece of malware that detects that it is being monitored may alter its behavior to further stymie data collection activities. The judicious use of instrumentation and built-in diagnostic functions is therefore critical.

To respond to this challenge, VM introspection has become an active area of research [9]. Introspection is used to perform a number of anti-malware tasks inside a VM, including detection [9], [19] and behavior profiling [20]; current research has investigated a number of mechanisms to enable introspection, including admin VM polling, masquerading as common OS processes, and comparing the semantic views from inside and outside a VM [9]. The work of Dinaburg et al. [21] is a representative example of this line of research. They propose an analyzer called Ether that detects known malware, in spite of known obfuscation techniques, by analyzing memory writes and system call traces from outside an infected VM. The solution succeeds in spotting a large number of obfuscators, and because it conceals several markers of scanner presence, it remains transparent to the malware under analysis. However, the authors of the paper admit limitations in their implementation environment that hindered success. Also, Ether is not an anomaly detector per se, but an analyzer of a pre-selected executable or process; it does not provide the wide-ranging scrutiny of an anomaly detector.

The study most relevant to ours is that performed by Luckett et al. [22], which attempts rootkit detection by applying neural network analysis to data gathered from an infected, running VM.
The researchers gathered system call data from the hypervisor level by invoking the strace function on a running VM both before and after an infection with the KBeast rootkit, then trained a feed-forward and a recurrent neural network on call sequence data with the Levenberg-Marquardt algorithm. The resulting confusion matrix for each network reveals poor performance for the feed-forward network (67.7%), but shows that the recurrent network is quite accurate (95.9%). While our research shares a number of marked similarities with the work of Luckett et al., their study also differs from ours in significant ways. Firstly, the work of Luckett et al. relies on the system call timing data of a full system call trace, while ours focuses on system call timing for isolated calls; we made this choice for the flexibility of polling specific calls, as well as for the research goal of investigating single-call efficacy. Secondly, our work applies a unique, nonlinear, phase-space algorithm to the problem of rootkit detection through the analysis of VM side-channel data.

B. Dynamical, Phase-Space Analysis

Fig. 1 Phase-space analysis algorithm [11]

Dynamical systems are mathematical systems that utilize a function to describe the disposition of a point in space according to its corresponding time value. Because many natural or physical phenomena can be described by time-serial data, dynamical systems are a ubiquitous presence in many scientific disciplines; weather behavior, physiological phenomena, and population dynamics, among many others, can be represented by a time-dependent mathematical function. Recent research has demonstrated that computer systems exhibit behavior typical of a dynamical, deterministic nonlinear system [23], [24]. Researchers used a nonlinear, time-delay embedding technique to model the cyber dynamics of a simple running loop program, and discovered that a low-dimensional attractor was present in the reconstructed dynamical model, as well as distinctly chaotic behavior.

The algorithm used in this research has previously been shown effective in applying nonlinear dynamics to prediction problems in biomedical science [25] and in mechanical and structural failures [26], [27]. Recent efforts [10]-[12] have expanded this focus to include the detection of cyber events through nonlinear, phase-space analysis of residual side-channel data from a running host computer. We propose that this algorithm can be used to detect the presence of malicious software through the collection and analysis of data from a nonintrusive side channel.

A condensed description of the nonlinear algorithm follows; for a more complete treatment, refer to [25]. Process-indicative, time-serial data is collected and divided into equal-size cutsets, which hold information about the behavior of the system within a specific time segment. The algorithm then applies, if appropriate, a quadratic-fit filter to remove noise and other trace artifacts from the data sample. Because of the reliability of our data source and collection setup, however, we chose not to apply the filter in this research.
The algorithm then applies a symbolization process to the data: a predetermined number (S) of symbols is evenly substituted for the data points (gi) over the range between the minimum (gn) and maximum (gx) values of the first cutset.
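The symbolization step can be illustrated with a short sketch. The following Python fragment is our own minimal interpretation of the description above; the function name `symbolize` and the bin-mapping details are our assumptions, not taken from the Forewarning code:

```python
import numpy as np

def symbolize(cutset, S, g_min=None, g_max=None):
    """Map raw data points (gi) onto S evenly spaced symbols (0..S-1).

    Per the text, the symbol range is fixed by the minimum (gn) and
    maximum (gx) of the first cutset and reused for later cutsets.
    """
    cutset = np.asarray(cutset, dtype=float)
    if g_min is None:
        g_min = cutset.min()
    if g_max is None:
        g_max = cutset.max()
    # Evenly partition [g_min, g_max] into S bins; values outside the
    # baseline range are clipped into the end bins.
    scaled = (cutset - g_min) / (g_max - g_min)
    symbols = np.floor(S * scaled).astype(int)
    return np.clip(symbols, 0, S - 1)

# A toy timing trace with S = 10 symbols:
print(symbolize([0.0, 0.25, 0.5, 0.75, 1.0], S=10))  # [0 2 5 7 9]
```

Later cutsets would reuse gn and gx from the baseline via the optional arguments, so that symbol boundaries remain comparable across cutsets.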

In the next step, the algorithm performs time-delay embedding on the symbolized data. Takens' theorem [28] guarantees that, with a time-delay embedding of sufficiently high dimension, the topological (connectivity) features of a dynamical system can be reconstructed from a finite data set. The embedding converts the symbolized data into a series of unique, delay-embedded nodes; the lag and dimension parameters (L and d) define, respectively, the spacing and number of data points combined into the final embedding vector. The algorithm defines a second set of time-delay states via the variable M. The connected states form linked graphs, which provide topologically invariant measures via graph theorems. The algorithm uses these measures to define the degree of variance between two cutset-derived graphs. The dissimilarity measures used in this research are 1) nodes in graph A that are not in graph B; 2) nodes in graph B that are not in graph A; 3) links in graph A that are not in graph B; and 4) links in graph B that are not in graph A.

A combination operation among the baseline cutsets produces V̄, the mean dissimilarity measure, and σ, the average standard deviation for the baseline cutsets. The algorithm then compares each test cutset to each cutset in the baseline block, and averages the results to produce an average dissimilarity measure for the ith cutset in the body of test data; an associated normalized dissimilarity measure is then created for each test cutset using V̄ and σ. The normalized dissimilarity value defines, in relation to a predefined threshold, the degree to which a given test cutset varies from the baseline. Exceeding the threshold dictates a tabulation response from the algorithm; if K successive threshold crossings are recorded, the algorithm indicates an event. Both the size of the threshold and the number of successive threshold crossings that indicate an anomaly are trainable parameters, and are optimized during program execution.
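As a concrete illustration of the embedding and graph-comparison steps, the following Python sketch builds delay-embedded nodes, forms the M-linked graph, and computes the four dissimilarity measures. This is our own reading of the description above; the function names and data layout are assumptions, not the Forewarning implementation:

```python
def embed(symbols, d, L):
    """Time-delay embedding: each node is a d-tuple of symbols
    spaced L apart, i.e. y_i = (s_i, s_{i+L}, ..., s_{i+(d-1)L})."""
    n = len(symbols) - (d - 1) * L
    return [tuple(symbols[i + j * L] for j in range(d)) for i in range(n)]

def phase_space_graph(nodes, M):
    """Node set plus the links y_i -> y_{i+M} between time-delay states."""
    return set(nodes), set(zip(nodes, nodes[M:]))

def dissimilarity(graph_a, graph_b):
    """The four graph-invariant measures: nodes in A not in B,
    nodes in B not in A, links in A not in B, links in B not in A."""
    nodes_a, links_a = graph_a
    nodes_b, links_b = graph_b
    return (len(nodes_a - nodes_b), len(nodes_b - nodes_a),
            len(links_a - links_b), len(links_b - links_a))

# Two toy symbol sequences, embedded with d = 2, L = 1, linked with M = 1:
g_a = phase_space_graph(embed([0, 1, 0, 1, 0, 1], d=2, L=1), M=1)
g_b = phase_space_graph(embed([0, 1, 2, 0, 1, 2], d=2, L=1), M=1)
print(dissimilarity(g_a, g_b))  # (1, 2, 2, 3)
```

Identical cutsets would yield (0, 0, 0, 0); larger values indicate greater topological divergence between the two cutset graphs.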
The algorithm's use of dissimilarity measures and thresholds lends itself well to concrete indications of detection success. For our purposes, we define success as a record of detection with a high incidence of true positives and negatives, and a low incidence of false positives and negatives. Clearly, accurate detection is of paramount importance in a real-world implementation; false positives and negatives, beyond contraindicating the proper response to a cybersecurity incident, waste time, money, and resources by engendering mistrust in a security countermeasure.

Table 1 describes the trainable parameters of the algorithm, as well as the dangers of choosing values that lie too far toward either end of the recommended range. Seven of the parameters (B, d, L, M, N, S, and w) are "phase-space" parameters, whose values determine the topology and embedding properties of the phase-space model. A computationally intensive training period can select the best phase-space values through repeated Monte Carlo runs. Random selection of phase-space parameters is efficient because the prediction results exhibit markedly fractal behavior, and thus cannot reliably be chosen with a genetic or statistical algorithm.
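The normalization and threshold-crossing logic described above can be sketched minimally as follows; V̄, σ, and the successive-crossing rule follow the text, but the function names and details are our own assumptions:

```python
import numpy as np

def normalize(test_means, baseline_means):
    """Normalize each test cutset's mean dissimilarity against the
    baseline mean (V-bar) and standard deviation (sigma)."""
    v_bar = np.mean(baseline_means)
    sigma = np.std(baseline_means)
    return (np.asarray(test_means) - v_bar) / sigma

def detect_event(normalized, threshold, soat):
    """Indicate an event after `soat` successive threshold crossings;
    returns the index of the detecting cutset, or None."""
    run = 0
    for i, value in enumerate(normalized):
        run = run + 1 if value > threshold else 0
        if run >= soat:
            return i
    return None

# Toy run: the last three cutsets exceed the threshold, and two
# successive crossings trigger detection at cutset index 2.
print(detect_event([0.1, 2.5, 3.0, 3.2], threshold=2.0, soat=2))  # 2
```

The threshold and SOAT values used here are purely illustrative; in the algorithm they are trained at runtime, as described below.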

TABLE 1. SUMMARY OF TRAINABLE PARAMETERS

Parameter                                        | Small value       | Large value         | Typical
B, base cases                                    | short baseline    | long baseline       | 7-15
d, dimension                                     | under-fitting     | over-fitting        | 2-26
J, features                                      | few features      | many features       | 1-4
K, successive occurrences above threshold (SOAT) | short forewarning | long forewarning    | 1-50
L, time-delay                                    | small unfolding   | excessive unfolding | 10-80
M: yi → yi+M                                     | short correlation | long correlation    | 10-80
N, points per cutset                             | scarce statistics | blurred change      | 20,000-50,000
S, symbols                                       | noise rejection   | too precise         | 2-4
UT, threshold                                    | small change      | large change        | -5 to +5
w, filter width                                  | fast artifact     | slow artifact       | 2-70

TABLE 2. BEHOLDER DETECTION RESULTS [12]

Two of the parameters (K and UT) are "detection" parameters, whose values influence the speed and accuracy of event detection. Because the available range of these parameters is relatively small, optimization can be performed efficiently at runtime through an exhaustive numerical search. Finally, the parameter J is an optional "feature extraction" parameter that can be applied at runtime to emphasize certain data features.

C. The Beholder Project

The Beholder Project [10], [12] applied the nonlinear algorithm laid out in [25]-[27] to the modeling of cyber dynamics in the domain of rootkit detection. In their experiment, Beholder researchers collected system call traces from two systems: an uninfected system and a system infected with the KBeast rootkit. During installation, KBeast replaces pointers to system call functions and alters the tcp4 process table. Because KBeast performs these actions, the researchers hypothesized that comparing system call traces from infected and uninfected systems would show measurable differences.

The researchers implemented a collection protocol that used a custom Linux kernel module on a client host computer to record the system calls, convert them per cutset to a nonlinear, phase-space graph, generate the dissimilarity measures for these graphs, and write the measures to a local JSON file. The collection took place over an 18-minute period, during which time user activity was simulated with five consumer-level programs. At the end of the call collection period, the JSON file was transferred over a LAN to a server host, which stored the results in a database and archived the file.

The researchers submitted files to analysis that featured two million data points in total: one million points of uninfected, "baseline" data and one million points of infected data. The research objective was to measure a variance between the uninfected and infected data traces. They used data from two system calls – open and read – and observations from five different applications, plus no application, to derive twelve distinct data sets. They then linearly skewed the data and applied four MATLAB functions, FLOOR, CEIL, ROUND, and FIX, to further process it. The results demonstrate a 100% variance rate, with varying rates of detection speed in terms of prediction distance. Table 2 summarizes their results.

While very promising, the researchers acknowledged the limitations of this first phase of their work, which was intended as a proof-of-concept. Their published work tested a single piece of malware and did not implement certain features of the nonlinear algorithm, namely the artifact filter and the Monte Carlo parameter training phase. Furthermore, no uninfected test data was used alongside the infected test data, so the Beholder researchers were unable to test the algorithm's proficiency in distinguishing between uninfected and infected test traces. Finally, critical parts of the system are implemented in user space in the infected system, which makes the software vulnerable to tampering by kernel-level malware. While this factor can be controlled for in an experimental setting, a real-world implementation would require countermeasures to malware interference. The Beholder researchers cited the testing of more rootkits and more system calls as future research goals, with the aim of validating the algorithm in real-world, real-time detection. Furthermore, the team plans to implement anti-tamper countermeasures for the detection software in the client.

III. METHODOLOGY

We evaluate the applicability of the Beholder project's algorithm to a new computational context, namely, that of a VM being invoked by a hypervisor. By collecting system calls from hypervisor commands that have been called on a running VM, we hypothesize that system call traces of sufficient clarity and resolution can be obtained to allow the successful detection of a rootkit infection inside the scanned VM. Our ultimate goal is to assess and potentially improve the effectiveness and utility of nonlinear, phase-space anomaly detection approaches and provide further field-study results that will aid in their real-world implementation.

A. Experimental Design

In the interest of maintaining continuity with the Beholder results, we attempt to reproduce as many features of that project's experimental design as are relevant to our objectives. Toward that end, we have reused the KBeast rootkit and the Ubuntu operating system as the infection environment for the guest VMs. The data collection framework is based on the work of Hubbard [29], who proposed and implemented a two-host network for collecting, transferring, and storing system call data. The first host has VMware's ESXi installed as both a bare-metal operating system and hypervisor; its primary function is to launch the VMs and invoke the commands that are used to collect system calls. This first host is connected via LAN to a second host, which has VMware's vSphere installed and is used to control and monitor the behavior and contents of the first host.

A user runs a script on the hypervisor host to start the system call collection process, specifying a target VM, the length of time that the call collection should run, and the number of times that the test should run. Test length and the number of iterations determine the data corpus size. Because one hour of data collection provided us with 500,000 data points per tested call in each trace, we chose two hours per trace as our data collection period, giving us one million data points per call from each trace. We made one collection of nominal/clean data and one collection of infected data. We ran the tests on a pair of cloned VMs, one of which was infected with the KBeast rootkit. The script collects system calls by invoking Linux's strace command on ESXi's vmsvc power.on, power.off, and get.summary commands. Power.on and power.off are run once per call collection cycle, while get.summary is run in a loop for the length of time specified by the user when the script is first run.
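The reduction of raw strace returns to per-call timing data can be sketched as below. This is a hypothetical Python fragment, not Hubbard's actual script; it assumes strace was run with the -T flag, which appends each call's duration in angle brackets:

```python
import re
from collections import defaultdict

# Matches strace -T output lines such as:
#   open("/etc/passwd", O_RDONLY) = 3 <0.000021>
CALL_RE = re.compile(r'^(\w+)\(.*<(\d+\.\d+)>\s*$')

def sort_trace(lines):
    """Reduce raw strace output to per-syscall timing lists,
    mirroring the two-column (call, timing) files described above."""
    per_call = defaultdict(list)
    for line in lines:
        match = CALL_RE.match(line.strip())
        if match:
            per_call[match.group(1)].append(float(match.group(2)))
    return per_call

raw = [
    'open("/etc/passwd", O_RDONLY) = 3 <0.000021>',
    'read(3, "root:x:0:0"..., 4096) = 1024 <0.000013>',
    'close(3) = 0 <0.000008>',
]
timings = sort_trace(raw)
print(timings['open'])  # [2.1e-05]
```

Each per-call list would then be written out to its own call-specific data file, separated by infection status.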
A custom startup script, which spawns threads and initiates a large file write/copy procedure, stresses the CPU and disk resources of the system during this time, providing a baseline of hardware activity against which rootkit activity can be seen. Once the strace returns have been collected in individual files, they are processed and sorted, reducing the body of data to a single text file with two columns: system calls and their associated timing values. The data is separated by infection status, then sorted into call-specific data files. We then choose system calls that, in our estimation, seem most likely (or least likely) to have been altered by a rootkit infection, and write that data into specially-formatted binary files. These binary files feature data from a single observation file, with two million data points per observation file. Each observation file is split between a first half of nominal data, which is clean, and a second half of test data, which is meant to simulate data from everyday system operation and can be either clean or infected. As in the Beholder research, the test data segment features data from an infected system. We chose six system calls to test: open, close, read, futex, mmap2, and clock_gettime. Open and read were chosen for the possibility of comparison with the Beholder research, which

also used these calls, while the others were chosen to represent a range of possible reactivity to rootkit activity, from the likely (e.g., close) to the unlikely (e.g., clock_gettime). The parameter set was also chosen to copy Beholder's. Like Beholder, we elected not to perform any artifact filtering or Monte Carlo parameter training, and instead used this parameter set: number of symbols, S = 10; time-delay lag, L = 1; inter-symbol lag, M = 1; number of nominal cutsets, B = 10; number of phase-space dimensions, d = 2; and points per cutset, N = 50,000. The value of the threshold and the number of successive occurrences are selected dynamically at runtime.

Our analysis process was divided into two parts. In the first part, we used an implementation of the algorithm developed for [25] to process the binary files and produce the series of dissimilarity values for each system call. We based the dissimilarity values on the following graph-invariant dissimilarity metrics: (1) nodes in graph A that are not in graph B, (2) nodes in graph B that are not in graph A, (3) links in graph A that are not in graph B, and (4) links in graph B that are not in graph A. In the second part, we utilized Beholder's MATLAB script to apply feature extraction to the dissimilarity measure data and to train the threshold and SOAT parameters. The feature extraction step rounds each data point on the dissimilarity curve to a discrete integer value by applying one of four MATLAB functions, represented by the J parameter: ROUND, FLOOR, CEIL, and FIX. All possible values of the threshold (UT), SOAT (K), and feature extraction function (J) are exhaustively searched to produce the quickest meaningful variance between the nominal and test data.
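The second analysis part can be illustrated with a minimal sketch. The four J functions have direct NumPy analogues of the MATLAB originals; the search loop below is our own hypothetical reconstruction of the training step, not Beholder's MATLAB script, and the parameter ranges shown are illustrative:

```python
import numpy as np

# NumPy analogues of the four MATLAB feature-extraction (J) functions
J_FUNCS = {'ROUND': np.round, 'FLOOR': np.floor,
           'CEIL': np.ceil, 'FIX': np.fix}

def train_detection(dissim_curve, ut_range, k_range):
    """Exhaustively search threshold (UT), SOAT (K), and feature
    extraction (J) for the fastest detection on one dissimilarity curve."""
    best = None  # (detection_index, UT, K, J_name)
    for j_name, j_func in J_FUNCS.items():
        curve = j_func(np.asarray(dissim_curve, dtype=float))
        for ut in ut_range:
            for k in k_range:
                run = 0
                for i, value in enumerate(curve):
                    run = run + 1 if value > ut else 0
                    if run >= k:  # K successive crossings -> detection
                        if best is None or i < best[0]:
                            best = (i, ut, k, j_name)
                        break
    return best

# Toy curve: CEIL lifts the early points above a zero threshold first.
best = train_detection([0.2, 0.4, 3.6, 3.7, 3.8],
                       ut_range=range(0, 6), k_range=range(1, 4))
print(best)  # (0, 0, 1, 'CEIL')
```

Because the search favors the earliest detection index, ties between parameter combinations are resolved in favor of the first combination tried.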
In keeping with the algorithm's requirement for simultaneous event detection in multiple dissimilarity measures to produce a final event detection result, the final output comprises four line graphs, one for each dissimilarity measure, with the best detection times for each.

B. Scope

We restrict our analysis to the KBeast rootkit and to the software previously discussed: Ubuntu, ESXi, and VMware vSphere. We also reuse the file format and parameter set used by the Beholder project, and choose not to implement either the artifact filter or the parameter training process of the nonlinear algorithm. Like the Beholder project's, the results from this research are meant to be a proof-of-concept, and not indicative of true real-world operation.

IV. RESULTS

A. Detection Results

As mentioned previously, because of our requirement for one million data points for each chosen system call, we ran the data collection script for two hours. We collected two samples, one from an uninfected VM and one from a clone of that VM that had been infected with KBeast, then sorted the traces by system call. We analyzed the data from six system calls: open, read, close, futex, mmap2, and clock_gettime. Of these six, open and read are directly hooked by KBeast; as such, we

Fig. 2 Comparison of futex call traces

anticipate that event data from these calls would strongly indicate the presence of a KBeast infection. If the data for any of the other system calls proves to be indicative of an infection, we hypothesize that this is due to second-order causal interactions between them and the hooked system calls during normal program execution. Figure 2 shows the traces collected for the futex system call. We utilized the Forewarning software [25] to generate dissimilarity measures for each system call, then used Beholder's MATLAB code to train the threshold (UT), SOAT (K), and feature extraction (J) parameters. The training process achieved optimization through exhaustive search, favoring the values for each parameter that produced the fastest detection speed. Optimization is performed for each dissimilarity measure independently; as such, the final detection parameters for each dissimilarity measure often vary from the others.

Fig. 3 Detection results for open system call

Fig. 4 Detection results for close system call

Fig. 5 Detection results for futex system call

TABLE 3. COMPARISON OF RESULTS FOR OPEN AND READ [12]

Detection time (# cutsets) for each dissimilarity measure:

Context                    | Ours: Open | Ours: Read | Beholder's [12]: Open | Beholder's [12]: Read
Bootup/Diagnostic/Shutdown | 1,1,1,1    | 3,2,3,2    | --                    | --
No app                     | --         | --         | 4,3,4,4               | 5,5,2,2
App 1                      | --         | --         | 4,3,4,4               | 3,1,6,1
App 2                      | --         | --         | 4,2,3,2               | 4,1,5,5
App 3                      | --         | --         | 5,9,3,9               | 2,2,2,2
App 4                      | --         | --         | 2,5,2,3               | 3,2,3,2
App 5                      | --         | --         | 3,3,4,1               | 2,1,1,1

TABLE 4. COMPARISON OF DETECTION TIME, THRESHOLD, AND SOAT VALUES

Infected (I):
Call          | DT (# cutsets) | Threshold | SOAT
open          | 1,1,1,1        | 1,22,1,23 | 1,1,1,1
read          | 3,2,3,2        | 1,3,1,3   | 3,1,3,1
close         | 1,1,1,1        | 1,3,1,3   | 1,1,1,1
futex         | 1,1,1,1        | 1,4,1,4   | 1,1,1,1
mmap2         | 2,1,2,1        | 1,4,1,4   | 2,1,2,1
clock_gettime | 1,1,1,1        | 1,3,1,3   | 1,1,1,1

Uninfected (U):
Call          | DT (# cutsets) | Threshold | SOAT
open          | 1,1,1,1        | 1,29,1,28 | 1,1,1,1
read          | 1,2,1,2        | 1,4,1,4   | 1,1,1,1
close         | 1,1,1,1        | 1,4,1,4   | 1,1,1,1
futex         | 1,2,1,2        | 1,5,1,5   | 1,1,1,1
mmap2         | 2,1,2,1        | 1,4,1,4   | 2,1,2,1
clock_gettime | 1,1,1,1        | 1,4,1,4   | 1,1,1,1

Difference (U - I):
Call          | DT (# cutsets) | Threshold | SOAT
open          | 0,0,0,0        | 0,7,0,5   | 0,0,0,0
read          | -2,0,-2,0      | 0,1,0,1   | -2,0,-2,0
close         | 0,0,0,0        | 0,1,0,1   | 0,0,0,0
futex         | 0,1,0,1        | 0,1,0,1   | 0,0,0,0
mmap2         | 0,0,0,0        | 0,0,0,0   | 0,0,0,0
clock_gettime | 0,0,0,0        | 0,1,0,1   | 0,0,0,0

Figures 3, 4, and 5 show the results for the open, close, and futex system calls, respectively. The four line graphs show the results for each dissimilarity measure in the following order, if cutset graphs A and B are being compared: nodes unique to A, nodes unique to B, links unique to A, links unique to B. The two lines represent the data before and after feature extraction: the blue, before, and the red, after. The algorithm uses the red curve in the training step for the threshold and SOAT values, as well as for event detection. Events, if present, are indicated with a black star. As previously noted, global event detection by the algorithm requires simultaneous "detection states" for all dissimilarity measures.

B. Analysis

These results are broadly comparable to Beholder's, which were acquired by testing two system calls – open and read – in six different simulated consumer contexts. Table 3 compares the Beholder project's results to ours. Each value represents, in order, the detection time value for each tested dissimilarity measure, in this case nodes unique to A, nodes unique to B, links unique to A, links unique to B. Our results show that we have successfully replicated the Beholder results with a different data collection context, namely, system calls collected at the hypervisor level from a running VM. We further conclude that the same insight that the Beholder researchers were able to apply to their data source is applicable to ours, namely, that hypervisor-level system call collection can be used with a nonlinear, phase-space algorithm to distinguish between infected and uninfected system states.

In comparing our results to Beholder's, we find a marked increase in detection speed in our results for the open system call. Because a substantial portion of our collection occurs during bootup, we hypothesize that the speed increase results from heavy use of the open call during a time of heavy system load. This view is further supported by our results for the read call, which are far more in line with Beholder's. We also extended the Beholder results by testing timing data for four additional system calls: close, futex, mmap2, and

clock_gettime. The calls close and futex were chosen for their close causal association with open and read during normal system operation; we thus anticipated a measure of correlation between their results and those of the directly hooked system calls. Mmap2 and clock_gettime, on the other hand, are loosely associated with open and read, and so we anticipated little or no correlation between result sets. Table 4 summarizes our results for all six system calls. Of the six system calls tested, open, close, futex, and clock_gettime show the fastest detection times, which implies that data collected during bootup provides sufficiently KBeastindicative data. The read call, on the other hand requires more time to detect variance. This suggests that read, unlike most of the other calls, requires data from the latter half of the bootup process before detection can be effectively performed. The two system calls that were directly hooked by KBeast – open and read – show the greatest threshold variance between the infected and uninfected samples. This is to be expected, as the execution of extra code during the invocation of a hooked call would, barring an extenuating circumstance like a suspended or trapped call, cause the call to take longer to execute. Of the non-hooked calls, futex shows the most variance. While a definitive explanation for this result would require a finer-grained analysis of the system call trace, we speculate that this is because of the additive effect of the longer open call on the duration of the futex call. This is consistent with the result of close, which is causally associated with open and

read, but wouldn’t have its execution time directly changed by the hooked code. Furthermore, the variance of the close system call is identical to the variance of the clock_gettime system call, which was chosen for its lack of causal linkage to open, which suggests that the close variance is the result of the natural variability of computer execution on a multi-core system.
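The Difference (U - I) rows of Table 4 are element-wise subtractions of the per-measure tuples for the uninfected and infected runs. A minimal sketch of that bookkeeping (the tuple layout is assumed from the table; the function name is our own):

```python
def tuple_diff(uninfected, infected):
    """Element-wise difference of two 4-value dissimilarity tuples
    (nodes unique to A, nodes unique to B,
     links unique to A, links unique to B)."""
    return tuple(u - i for u, i in zip(uninfected, infected))


# Threshold values for the open call, taken from Table 4.
u_open = (1, 29, 1, 28)
i_open = (1, 22, 1, 23)
print(tuple_diff(u_open, i_open))  # (0, 7, 0, 5)
```

A large positive difference in a threshold tuple, as for open above, is the variance signature discussed in the analysis; an all-zero tuple indicates no measurable difference between the infected and uninfected profiles for that call.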

These results are not without their limitations. Because we constrained ourselves to Beholder’s parameter set, we were unable to test whether other parameter sets might return better results. Furthermore, by emphasizing detection on the first local maximum or minimum when training the threshold and SOAT parameters, the MATLAB code has produced results that are harder to generalize to other computational contexts or even subsequent data collections on the test system.
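The generalization concern described above (training the threshold and SOAT parameters on the first local maximum or minimum of a curve) can be illustrated with a short sketch. This is our own simplification of the behavior attributed to the MATLAB code, not that code itself:

```python
def first_local_extremum(series):
    """Return (index, value) of the first interior point that is a
    local maximum or minimum of the series, or None if there is none.
    Anchoring training on this first extremum ties the learned
    parameters to early, possibly unrepresentative, behavior."""
    for i in range(1, len(series) - 1):
        prev, cur, nxt = series[i - 1], series[i], series[i + 1]
        if (cur > prev and cur > nxt) or (cur < prev and cur < nxt):
            return i, cur
    return None


print(first_local_extremum([1, 3, 2, 5, 4]))  # (1, 3)
```

Because the search stops at the first qualifying point, two collections of the same workload can yield different training values whenever their early samples differ, which is the generalization difficulty noted above.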
V. CONCLUSION AND FUTURE WORK

This research has demonstrated that, when the same analysis tools and file format as the Beholder project are applied to VM-derived system calls collected at the hypervisor level, variance between the infected and uninfected traces can be detected with the same accuracy and speed as the events flagged in an in-host architectural context. By replicating the results of the Beholder project with this new data source, we conclude that we have further validated the strength of a nonlinear, phase-space algorithm as a modelling and analysis technique for side-channel cyber event detection.

In the future we plan to further validate this data source and algorithm by testing them with more rootkits and more varied execution environments. The objective of this work is to make the behavior of the nonlinear algorithm more generalizable, and to address the very real risk of model overfitting when testing with a small malware corpus. We would also like to expand our experimental design to follow previous work with the nonlinear algorithm more broadly. In particular, we plan to implement the two-step Monte Carlo/brute-force parameter selection methodology of previous epilepsy forewarning research [25], which we believe will give us deeper insight into the behavior of the nonlinear algorithm.
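As a rough illustration of the planned parameter search, a two-step scheme can randomly sample a parameter range and then sweep a narrow grid around the best sample. This is a generic sketch of our own devising; the actual methodology of the epilepsy forewarning work [25] may differ in its sampling scheme and refinement rules.

```python
import random


def two_step_search(score, lo, hi, n_random=200, n_refine=50, seed=0):
    """Step 1: Monte Carlo sampling of a parameter range to find a
    promising region. Step 2: brute-force sweep of an even grid
    around the best random sample. `score` is lower-is-better."""
    rng = random.Random(seed)
    best = min((rng.uniform(lo, hi) for _ in range(n_random)), key=score)
    # Refine around the Monte Carlo winner with a narrow, even grid.
    half = (hi - lo) / n_random
    grid = [best - half + 2 * half * k / (n_refine - 1)
            for k in range(n_refine)]
    return min(grid, key=score)


# Toy objective with its minimum at x = 2.
result = two_step_search(lambda x: (x - 2) ** 2, 0.0, 10.0)
print(round(result, 2))
```

In the detection setting, `score` would be replaced by a measure of forewarning quality over the training traces, and the searched range by one of the algorithm's tuning parameters.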

REFERENCES
[1] Peter Mell and Tim Grance, "The NIST definition of cloud computing," National Institute of Standards and Technology, Special Publication 800-145, 2011.
[2] Fahmida Rashid, "The dirty dozen: 12 cloud security threats," InfoWorld, March 2016.
[3] Dimitrios Zissis and Dimitrios Lekkas, "Addressing cloud computing security issues," Future Generation Computer Systems, vol. 28, no. 3, pp. 583-592, 2012.
[4] Sean Carlin and Kevin Curran, "Cloud Computing Security," International Journal of Ambient Computing and Intelligence, vol. 3, no. 1, pp. 14-19, January-March 2011.
[5] Nathalie Baracaldo, Elli Androulaki, Joseph Glider, and Alessandro Sorniotti, "Reconciling End-to-End Confidentiality and Data Reduction in Cloud Storage," Scottsdale, AZ, 2014.
[6] S. Subashini and V. Kavitha, "A survey on security issues in service delivery models of cloud computing," Journal of Network and Computer Applications, vol. 34, no. 1, pp. 1-11, January 2011.
[7] Minqi Zhou, Rong Zhang, Wei Xie, Weining Qian, and Aoying Zhou, "Security and Privacy in Cloud Computing: A Survey," Washington, DC, 2010.
[8] Mark Ryan, "Cloud computing security: The scientific challenge, and a survey of solutions," The Journal of Systems and Software, vol. 86, no. 9, pp. 2263-2268, September 2013.
[9] Michele Sgandurra and Emil Lupu, "Evolution of Attacks, Threat Models and Solutions for Virtualized Systems," ACM Computing Surveys (CSUR), vol. 48, no. 3, p. 46, February 2016.
[10] Jarilyn M. Hernández, Aaron Ferber, Stacy Prowell, and Lee Hively, "Phase-Space Detection of Cyber Events," New York, NY, 2015.
[11] L. M. Hively and J. T. McDonald, "Theorem-based, data-driven, cyber event detection," in Proceedings of the Eighth Annual Cyber Security and Information Intelligence Research Workshop, January 2013, p. 58.
[12] J. M. H. Jiménez et al., "Beholder: Phase-Space Detection of Cyber Events," 2013.
[13] Canturk Isci and Margaret Martonosi, "Runtime Power Monitoring in High-End Processors: Methodology and Empirical Data," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36), 2003.
[14] Lei Liu, Guanhua Yan, Xinwen Zhang, and Songqing Chen, "VirusMeter: Preventing Your Cellphone from Spies," in Proceedings of the 12th International Symposium on Recent Advances in Intrusion Detection (RAID '09), Engin Kirda, Somesh Jha, and Davide Balzarotti, Eds. Berlin: Springer-Verlag, 2009.
[15] William Eberle and Lawrence Holder, "Insider Threat Detection Using Graph-Based Approaches," in Proceedings of the 2009 Cybersecurity Applications & Technology Conference for Homeland Security (CATCH '09). Washington, DC: IEEE Computer Society, 2009.
[16] J. L. Hernandez, L. Pouchard, and J. T. McDonald, "Developing a Power Measurement Framework for Cyber Defense," in Proceedings of the 8th Cyber Security and Information Intelligence Workshop. Oak Ridge National Laboratory: ACM Publishing, January 2013.
[17] John Demme et al., "On the feasibility of online malware detection with performance counters," in Proceedings of the 40th Annual International Symposium on Computer Architecture (ISCA '13). New York, NY: ACM, 2013.
[18] N. Borges, G. A. Coppersmith, G. G. L. Meyer, and C. E. Prieve, "Anomaly detection for random graphs using distributions of vertex invariants," in 45th Annual Conference on Information Sciences and Systems, Baltimore, MD, 2011.
[19] M. R. Watson, N. u. h. Shirazi, A. K. Marnerides, A. Mauthe, and D. Hutchison, "Malware Detection in Cloud Computing Infrastructures," IEEE Transactions on Dependable and Secure Computing, vol. 13, no. 2, pp. 192-205, March 2016.
[20] Shun-Wen Hsiao, Yeali S. Sun, and Meng Chang Chen, "Virtual Machine Introspection Based Malware Behavior Profiling and Family Grouping," arXiv preprint arXiv:1705.01697, 2017.
[21] Artem Dinaburg, Paul Royal, Monirul Sharif, and Wenke Lee, "Ether: malware analysis via hardware virtualization extensions," in Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS '08). New York, NY: ACM, 2008.
[22] P. Luckett, J. T. McDonald, and J. Dawson, "Neural Network Analysis of System Call Timing for Rootkit Detection," in Cybersecurity Symposium 2016 (CYBERSEC-2016), Coeur d'Alene, ID, 2016.
[23] T. Mytkowicz, A. Diwan, and E. Bradley, "Computer systems are dynamical systems," Chaos: An Interdisciplinary Journal of Nonlinear Science, vol. 19, no. 3, p. 033124, 2009.
[24] Z. Alexander, T. Mytkowicz, A. Diwan, and E. Bradley, "Measurement and dynamical analysis of computer performance data," in Proceedings of the 9th International Conference on Advances in Intelligent Data Analysis, 2010.
[25] L. M. Hively, J. T. McDonald, N. Munro, and E. Cornelius, "Forewarning of epileptic events from scalp EEG," in Biomedical Sciences and Engineering Conference, pp. 1-4, 2013.
[26] L. M. Hively, "Prognostication of helicopter failure," ORNL, ORNL/TM-2009, 2009.
[27] V. Protopopescu and L. M. Hively, "Phase-space dissimilarity measures of nonlinear dynamics: Industrial and biomedical applications," Recent Res. Devel. Physics, vol. 6, no. 2, pp. 649-688, 2005.
[28] F. Takens, "Detecting strange attractors in turbulence," in Dynamical Systems and Turbulence: Lecture Notes in Mathematics, D. A. Rand and L. S. Young, Eds. Springer-Verlag, 1981.
[29] Charles Hubbard, "Data Collection for Cyber Anomaly Event Detection," Master's Thesis, 2015.