BRIDGING THE SEMANTIC GAP IN VIRTUAL MACHINE INTROSPECTION VIA BINARY CODE REUSE by Yangchun Fu

APPROVED BY SUPERVISORY COMMITTEE:

Zhiqiang Lin, Chair

Kevin W. Hamlen

Latifur Khan

Bhavani Thuraisingham

Copyright © 2016 Yangchun Fu. All rights reserved.

To my family.

BRIDGING THE SEMANTIC GAP IN VIRTUAL MACHINE INTROSPECTION VIA BINARY CODE REUSE

by

YANGCHUN FU, BS, MS

DISSERTATION Presented to the Faculty of The University of Texas at Dallas in Partial Fulfillment of the Requirements for the Degree of

DOCTOR OF PHILOSOPHY IN COMPUTER SCIENCE

THE UNIVERSITY OF TEXAS AT DALLAS May 2016

ACKNOWLEDGMENTS

The completion of this dissertation marks the end of the long journey of my PhD study. I would like to thank those who have supported me during this journey.

First and foremost, I am deeply grateful to my advisor, Professor Zhiqiang Lin, for his incredible guidance, endless support, and encouragement throughout my entire PhD life. He introduced me to the field of systems and software security. He taught me from scratch how to do research, from idea formulation to problem solving and paper presentation. He helped me build the necessary technical skill sets (e.g., debugging and kernel hacking). His passion for research always inspires me to be a better researcher.

My committee members, Professor Kevin W. Hamlen, Professor Latifur Khan, and Professor Bhavani Thuraisingham, have also been extremely helpful for my dissertation with careful reviews and valuable comments. Special thanks go to Professor Hamlen for giving me invaluable advice on paper writing and presentation.

I also would like to express my gratitude to all of my collaborators, Yufei Gu, Kenneth Miller, Alireza Saberi, and Junyuan Zeng in the S3 Lab. I am deeply indebted to them for their tremendous help, especially when deadlines came. I cherish the time I spent working with them. I am also grateful to my mentor outside of UT Dallas, Dr. Junghwan Rhee of NEC Research Lab, for giving me invaluable guidance during my summer internship at Princeton.

My life in Dallas would not have been as easy and happy without my friends. I would like to particularly thank a number of my best friends: Leling Wang, Weisheng Xie, and Junyuan Zeng. The friendship with them made all the difference in the past six years.

Finally, my family members have always been there for me with unconditional love and support. This dissertation is dedicated to my parents (Shuhua Fu and Xixian Wu) and my awesome wife, Yajie Li.

March 2016


PREFACE This dissertation was produced in accordance with guidelines which permit the inclusion as part of the dissertation the text of an original paper or papers submitted for publication. The dissertation must still conform to all other requirements explained in the “Guide for the Preparation of Master’s Theses and Doctoral Dissertations at The University of Texas at Dallas.” It must include a comprehensive abstract, a full introduction and literature review, and a final overall conclusion. Additional material (procedural and design data as well as descriptions of equipment) must be provided in sufficient detail to allow a clear and precise judgment to be made of the importance and originality of the research reported. It is acceptable for this dissertation to include as chapters authentic copies of papers already published, provided these meet type size, margin, and legibility requirements. In such cases, connecting texts which provide logical bridges between different manuscripts are mandatory. Where the student is not the sole author of a manuscript, the student is required to make an explicit statement in the introductory material to that manuscript describing the student’s contribution to the work and acknowledging the contribution of the other author(s). The signatures of the Supervising Committee which precede all other material in the dissertation attest to the accuracy of this statement.


BRIDGING THE SEMANTIC GAP IN VIRTUAL MACHINE INTROSPECTION VIA BINARY CODE REUSE

Publication No.

Yangchun Fu, PhD
The University of Texas at Dallas, 2016

Supervising Professor: Zhiqiang Lin

Virtual Machine Introspection (VMI) has been widely used in many security applications, such as intrusion detection, malware analysis, and memory forensics. However, developing a VMI tool is generally believed to be a tedious, time-consuming, and error-prone process because of the semantic gap. In this dissertation, we present a number of new approaches to bridge the semantic gap via binary code reuse. More specifically, based on different security constraints, we have developed three approaches: Vmst, Hybrid-Bridge, and HyperShell. Vmst makes a first step in bridging the semantic gap via online binary code reuse and enables native inspection programs to automatically become introspection programs. Hybrid-Bridge improves the performance of Vmst by one order of magnitude through training memoization and decoupled execution; it thus becomes feasible for cloud providers to perform real-time monitoring of virtual machine states by using Hybrid-Bridge. Both Vmst and Hybrid-Bridge ensure the code integrity of VMI tools. By trusting the kernel code of the target machine, HyperShell, a hypervisor layer shell for automated guest OS management, redirects syscalls into the target machine for execution to bridge the semantic gap.

We have developed a number of enabling techniques, including system call execution context identification, redirectable data identification, kernel data redirection, training memoization, and reverse system call execution, to realize these approaches. We have obtained the following preliminary results. Vmst was successfully tested with 25 commonly used utilities atop a number of different operating system (OS) kernels, including both Linux and Microsoft Windows. Hybrid-Bridge improves the performance of existing binary code reuse based VMI solutions by at least one order of magnitude for many of the tested benchmark tools. HyperShell has an average 2.73X slowdown for the 101 tested utilities compared to their native in-VM execution and imposes less than 5% overhead on the guest OS kernel.


TABLE OF CONTENTS

ACKNOWLEDGMENTS
PREFACE
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
  1.1 Motivation
  1.2 Why Virtual Machine Introspection is Useful?
  1.3 Why Semantic Gap Problem is Challenging?
  1.4 Why Binary Code Reuse?
  1.5 Contributions
  1.6 Dissertation Overview
CHAPTER 2 BACKGROUND AND OVERVIEW
  2.1 Redirecting Instruction Level Data Access for Virtual Machine Introspection
    2.1.1 Bridging the Semantic Gap via Online Kernel Data Redirection
    2.1.2 Bridging the Semantic Gap via Decoupled Execution and Training Memoization
  2.2 Redirecting Syscall Execution for Virtual Machine Introspection
    2.2.1 Bridging the Semantic Gap via Syscall Execution Redirection
CHAPTER 3 BRIDGING THE SEMANTIC GAP VIA ONLINE KERNEL DATA REDIRECTION
  3.1 Technical Overview
    3.1.1 Problem Statement
    3.1.2 Threat Model
    3.1.3 Scope and Assumptions
    3.1.4 System Overview
  3.2 SysCall Execution Context Identification
  3.3 Redirectable Data Identification
  3.4 Kernel Data Redirection
    3.4.1 Syscall Redirection Policy
    3.4.2 Virtual to Physical Address Translation
    3.4.3 Directing the Access
  3.5 Implementation
  3.6 Empirical Evaluation
    3.6.1 Effectiveness
    3.6.2 Performance Overhead
    3.6.3 Generality
    3.6.4 Security Applications
  3.7 Discussion
  3.8 Summary
CHAPTER 4 BRIDGING THE SEMANTIC GAP VIA DECOUPLED EXECUTION AND TRAINING MEMOIZATION
  4.1 Technical Overview
    4.1.1 Observation
    4.1.2 System Overview
    4.1.3 Threat Model
  4.2 Fast-Bridge
    4.2.1 Variable Redirectability
    4.2.2 Instruction Redirectability
    4.2.3 Data Redirection Using Dynamic Patching
  4.3 Slow-Bridge
    4.3.1 Detecting the System Calls of Interest
    4.3.2 Redirectable Variables Identification
    4.3.3 Inferring Instruction Redirectability
    4.3.4 Data Redirection
  4.4 FallBack
  4.5 Implementation
  4.6 Evaluation
    4.6.1 Correctness
    4.6.2 Performance Evaluation
  4.7 Discussion
  4.8 Summary
CHAPTER 5 BRIDGING THE SEMANTIC GAP VIA SYSCALL EXECUTION REDIRECTION
  5.1 Technical Overview
    5.1.1 Challenges
    5.1.2 Scope and Assumptions
    5.1.3 Overview
  5.2 Host OS Side Design
    5.2.1 Syscall Dispatcher
    5.2.2 Syscall Data Exchanger
  5.3 Guest VM Side Design
    5.3.1 Helper Process Creator
    5.3.2 Reverse Syscall Execution
    5.3.3 Syscall Data Exchanger
  5.4 Implementation
  5.5 Evaluation
    5.5.1 Effectiveness
    5.5.2 Performance Overhead
    5.5.3 Case Studies
  5.6 Discussion
  5.7 Summary
CHAPTER 6 LIMITATIONS AND FUTURE WORK
CHAPTER 7 RELATED WORK
  7.1 Binary Code Reuse
  7.2 Virtual Machine Introspection
  7.3 Kernel Rootkit Detection
  7.4 Dynamic Data Dependency Tracking
  7.5 Memory Forensics
  7.6 Hybrid-Virtualization
  7.7 Training Memoization
CHAPTER 8 CONCLUSION
REFERENCES
VITA

LIST OF FIGURES

3.1 System level behavior (in terms of syscall trace) of a typical user level uname program.
3.2 An architecture overview of our Vmst.
3.3 Typical kernel control flow when serving a syscall.
3.4 Shadow memory state of our working example code.
3.5 Normalized performance overhead when running Vmst.
4.1 Code Snippet of System Call sys_getpid and the Corresponding Data Structures in Linux Kernel 2.6.32.8.
4.2 An overview of Hybrid-Bridge, showing how Fast-Bridge, Slow-Bridge, and the FallBack mechanism interact when trained meta-data is missing.
4.3 Fast-Bridge Slowdown Compared to KVM.
4.4 Fast-Bridge Speedup Compared to Vmst.
4.5 Execution time of inspection tools in Hybrid-Bridge with five different memory snapshots.
5.1 System call trace of utility hostname and one of its sys_uname implementations.
5.2 An Overview of the HyperShell Design.

LIST OF TABLES

3.1 Statistics of context switch when running ps
3.2 Introspected System call in our Vmst.
3.3 Evaluation result with the commonly used Linux utilities without any modification.
3.4 Evaluation result with the commonly used Windows utilities without any modification.
3.5 Evaluation result with the Native customized Linux programs.
3.6 OS-independent testing of Vmst.
3.7 Attack Vector Description of the Tested Rootkits
3.8 Rootkits experiment under Kernel-2.6.32-rc8
4.1 A Code Snippet of sys_getpid and the Corresponding Patched Code for Non-Redirectable and Redirectable Page
4.2 Taint Propagation Rules
4.3 Correctness Evaluation Result of Hybrid-Bridge and the Statistics of the Number of each Instruction Type.
4.4 Performance of each component of Hybrid-Bridge and its comparison with Vmst.
4.5 Performance comparison of Fast-Bridge and Virtuoso
5.1 Syscalls in cp with different execution policy.
5.2 Evaluation Result of the Tested Utility Software.
5.3 Micro-benchmark Test Result of GVM.
5.4 Macro-benchmark Test Result of GVM.

CHAPTER 1 INTRODUCTION

1.1 Motivation

Within just a few years, cloud computing has become part of everyone's daily computing life (e.g., accessing email, surfing web pages, watching YouTube). The main enabling technique for cloud computing is virtualization, which virtualizes hardware resources and allocates them on demand. It has not only significantly increased the utilization of computing capacity, but also enabled a set of new applications.

One of the compelling security applications enabled by virtualization is Virtual Machine Introspection (VMI) (Garfinkel and Rosenblum, 2003). VMI pulls the in-VM OS state out to the virtual machine monitor (VMM) and enables external monitoring of the runtime state of a guest-OS. The introspection can be placed in a VMM, in another VM, or within any other part of the virtualization component, as long as it can inspect the runtime state of the guest-OS, including CPU registers, memory, disk, network, and other hardware-level events. Located one layer below the operating system, VMI has strong isolation from the guest-OS. Because of such strong isolation, VMI has been widely used in many security applications (Bauman et al., 2015), such as intrusion detection (e.g., (Garfinkel and Rosenblum, 2003; Payne et al., 2007, 2008)), malware analysis (e.g., (Jiang et al., 2007; Dinaburg et al., 2008)), process monitoring (e.g., (Srinivasan et al., 2011)), memory forensics (e.g., (Hay and Nance, 2008)), and even network firewalls (e.g., (Srivastava and Giffin, 2008)).

However, there are a number of challenges that limit the use of this mechanism. The most significant of these is the Semantic Gap (Chen and Noble, 2001) problem. Unlike developing in-VM inspection software, where programmers have rich semantics and abstractions such as files, APIs, and system calls provided by the guest-OS, when developing introspection software out-of-VM there are no such abstractions for hypervisor programmers regarding the in-VM state, and they often have to interpret the in-VM hardware-layer state, including processors, physical memory, and devices, at the outside VMM layer.

In the past decade, there have been many proposed approaches for solving the semantic gap problem, operating under different constraints and varying from purely manual to fully automated. An early solution (Garfinkel and Rosenblum, 2003) to bridging the semantic gap leverages the Linux crash utility (a kernel dump analysis tool), but this approach requires the kernel to be recompiled with debugging symbols. The other approach involves locating, traversing, and interpreting known data structures in the in-guest memory. While the latter approach has been widely adopted (e.g., (Petroni et al., 2004; Jiang et al., 2007; Baiardi and Sgandurra, 2007)), many of these systems rely on manual effort or a compiler-assisted approach (e.g., OSck (Hofmann et al., 2011)) to locate the in-guest kernel data and develop the semantically equivalent in-guest code for the introspection. Such a process often has to be repeated for different kernels, which may change frequently due to new releases or patches. Furthermore, it may also introduce vulnerabilities that allow attackers to evade these hand-built introspection tools (Garfinkel, 2003; Dolan-Gavitt et al., 2011).

One of the prominent solutions to the semantic gap is to apply binary code reuse. Dolan-Gavitt et al. (Dolan-Gavitt et al., 2011) presented Virtuoso, which makes a large step in narrowing the semantic gap by generating introspection programs from traces. However, Virtuoso suffers from a coverage problem. In this dissertation, we seek to design new binary code reuse approaches to automatically bridge the semantic gap and devise new applications based on VMI.

1.2 Why Virtual Machine Introspection is Useful?

Virtual machine introspection is an out-of-VM monitoring technique. By moving the monitoring functionality out of the VM, VMI offers tremendous benefits compared to in-VM monitoring. In particular:

• Trustworthiness: It is generally believed to be much harder for attackers to tamper with software running out-of-VM, because attacks from the guest must cross a world switch to reach the hypervisor (unless the VMM itself has vulnerabilities). Therefore, VMI software can gain higher trustworthiness.

• Higher Privilege and Stealthiness: Traditional security software (e.g., anti-virus or host intrusion detection) runs inside the guest-OS, and in-VM malware can often disable its execution. By moving the execution of security software out of the VM, we can achieve higher privilege (the same as the hypervisor) and stealthiness, making it invisible to attackers. For instance, malicious code (e.g., a kernel rootkit) often prevents the ps command from showing the running malicious process and disables the rmmod command needed to remove a kernel module. By enabling the execution of these commands out of the VM, we can achieve higher privilege and stealthiness and prevent rootkits from tampering with the security software.

• Complete View: Another advantage of VMI over in-VM monitors is that the hypervisor has full access to all the memory, register, and disk state of the VM atop which the guest OS runs. Not only can we observe each application's state, but also the kernel state, including state hidden by attackers, which is often challenging to achieve through in-VM approaches.

• Transparent Deployment: To deploy a security monitor at the hypervisor layer, we need neither an account in the guest OS nor any installation of software inside the OS. Instead, everything can run transparently at the hypervisor layer without even disrupting services.

• Less Vulnerability: In-VM systems usually have to trust the entire guest OS kernel, which tends to have a huge code base. However, VMI often only needs to trust the underlying hypervisor, which has a smaller code base than an OS kernel. This smaller attack surface leads to fewer vulnerabilities.

1.3 Why Semantic Gap Problem is Challenging?

The semantic gap problem has hindered the practicality of VMI in two ways:

• Automation. VMI needs to be automated. Interpretation of low-level data typically requires detailed, up-to-date knowledge of the internal OS kernel workings. For example, to introspect the pid of a running process in a Linux kernel, hypervisor programmers often have to traverse the corresponding task_struct to fetch its pid field, whereas in-VM programmers can simply use the getpid system call (a small illustration of this contrast follows this list). Acquiring such knowledge is often tedious and time-consuming even for an open source OS. For a closed source OS, programmers may have to reverse engineer the internal kernel workings for the introspection, which can be error-prone.

• Efficiency. VMI tools should not impose too large an overhead compared to in-VM programs, especially considering that security applications are often time sensitive. For example, when an intrusion is detected, it often requires an immediate response. However, bridging the semantic gap usually involves complex operations that may introduce large overhead. Meanwhile, we do not want a VMI tool to create additional overhead for the hypervisor.
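To make the contrast above concrete, the sketch below shows the in-VM one-liner next to a hypothetical out-of-VM version. The field offset and the guest_read_u32() accessor are illustrative assumptions only (real offsets vary across kernel versions and builds); this is not Vmst code.

    #include <unistd.h>
    #include <stdint.h>

    /* Assumed hypervisor-provided accessor for guest virtual memory. */
    extern uint32_t guest_read_u32(uint64_t guest_virtual_address);

    /* Hypothetical offset of the pid field inside task_struct; it differs
     * across kernel versions and configurations. */
    #define TASK_STRUCT_PID_OFFSET 0x2d4u

    /* In-VM view: the guest-OS abstraction makes this a one-liner. */
    static uint32_t pid_in_vm(void)
    {
        return (uint32_t)getpid();
    }

    /* Out-of-VM view: the hypervisor sees only raw guest memory and must
     * reconstruct the abstraction by hand. */
    static uint32_t pid_out_of_vm(uint64_t task_struct_gva)
    {
        return guest_read_u32(task_struct_gva + TASK_STRUCT_PID_OFFSET);
    }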

1.4 Why Binary Code Reuse?

Since there are a large number of inspection tools running inside the VM for state monitoring, why not just reuse this software and run it directly at the hypervisor layer? These tools also have fewer vulnerabilities compared to manually created VMI tools. Also, earlier efforts such as Virtuoso have demonstrated that it is possible to automatically generate introspection programs with minimal human effort. Note that the key idea of Virtuoso is to create introspection programs from the traces of in-VM trusted programs. More specifically, given an introspection functionality (e.g., listing all the running processes), Virtuoso will train and trace the system-wide execution of the in-VM programs (e.g., ps) by an expert, automatically identify the instructions necessary for accomplishing this functionality, and finally generate the introspection code that reproduces the same behavior as the in-VM programs. Inspired by Virtuoso, we would like to investigate further approaches to bridge the semantic gap by leveraging binary code reuse.

1.5 Contributions

This dissertation addresses the Semantic Gap challenge by developing new binary code reuse techniques. Our contributions can be summarized as follows:

• We present three approaches to bridge the semantic gap via binary code reuse: Vmst (short for VM-Space Traveler), Hybrid-Bridge, and HyperShell (short for hypervisor layer shell). The key idea is binary code reuse, which reuses the code of an inspection program in a secure machine to inspect the state of the untrusted guest-OS in a target VM. There are two types of code reuse: user-level and kernel-level. If we reuse both user-level and kernel-level code, we achieve the strongest security and isolation; our Vmst and Hybrid-Bridge follow this approach. However, if we can trust the kernel code of the target VM and reuse only user-level code, we can enable a large number of legacy programs to become introspection programs; this is the approach of our HyperShell, where we trust the guest OS kernel code.

• We have built the corresponding prototypes and demonstrated with experimental results that our approaches are highly practical and feasible. In particular, Vmst automatically generates a number of VMI tools for both Linux and Windows. Hybrid-Bridge substantially improves the performance of existing VMI solutions by one order of magnitude. HyperShell can be used for timely, uniform, and centralized guest OS management, especially for private clouds.

1.6 Dissertation Overview

An outline of this dissertation is as follows.

• Chapter 1 provides an overview of Virtual Machine Introspection and its challenges, our dissertation motivation, and goals.

• Chapter 2 offers a brief introduction to the approaches we propose, including Vmst, Hybrid-Bridge, and HyperShell.

• Chapter 3 first presents our Vmst, a new technique that can seamlessly bridge the semantic gap and automatically generate VMI tools. The key idea is that, through system-wide instruction monitoring, Vmst automatically identifies the introspection-related data in a secure-VM and online redirects these data accesses to the kernel memory of a product-VM, without any training. Vmst offers a number of new features and capabilities. In particular, it enables an in-VM inspection program to automatically become an out-of-VM introspection program.

• Chapter 4 presents Hybrid-Bridge, a new system that uses an efficient decoupled execution and training memoization approach to automatically bridge the semantic gap. The key idea is to combine the strengths of both the offline-training-based approach and the online kernel data redirection based approach, with a novel training data memoization and fall-back mechanism at the hypervisor layer that decouples the expensive Taint Analysis Engine (TAE) from the execution of hardware-based virtualization and moves the TAE to software-based virtualization.

• Chapter 5 presents HyperShell, a practical hypervisor layer guest OS shell that has all of the functionality of a traditional shell, but offers better automation, uniformity, and centralized management. This will be particularly useful for cloud and data center providers to manage running VMs at a large scale. To overcome the semantic gap challenge, we introduce a reverse system call abstraction, and we show that this abstraction can significantly relieve the painful process of developing software below an OS. More importantly, we also show that this abstraction can be implemented transparently. As such, many of the legacy guest OS management utilities can be directly reused in HyperShell without any modification.

• Chapter 6 discusses the limitations and future work.

• Chapter 7 compares with the related work.

• Chapter 8 concludes this dissertation.

CHAPTER 2 BACKGROUND AND OVERVIEW

Motivated by the existing limitations of binary code reuse based approaches, we present new approaches to bridge the semantic gap for VMI. Based on whether we trust the kernel code of the target OS, we have developed three binary code reuse approaches, namely Vmst, Hybrid-Bridge, and HyperShell, which reuse the same code but transparently feed it data from a different machine.

2.1 Redirecting Instruction Level Data Access for Virtual Machine Introspection

Our first two approaches use a dual-VM based architecture to bridge the semantic gap. In particular, we have a secure VM running a trusted OS for online binary code reuse, and we automatically redirect instruction-level kernel data accesses to the target VM.

2.1.1 Bridging the Semantic Gap via Online Kernel Data Redirection

Our first approach, Vmst, seamlessly bridges the semantic gap and automatically generates VMI tools. It is designed with transparency to the in-VM guest-OS in mind, and it achieves nearly full transparency with respect to the in-VM OS kernel. For example, without any modification, Vmst directly supports a number of recently released Linux kernels. With slight modification, it also supports Microsoft Windows kernels such as Windows XP. Meanwhile, when using Vmst for introspection of a particular OS, end-users only need to install the corresponding trusted version of the guest-OS in the VM shipped with our Vmst, and attach (or mount) the in-guest memory. The in-guest memory could be live memory for VMI or a memory snapshot for memory forensic analysis. Subsequently, end-users can use a variety of OS utilities (e.g., ps, lsmod, tasklist.exe) to inspect the state of the guest-OS.

New features and capabilities. To advance the state of the art, Vmst offers a number of new features and capabilities, especially when compared with Virtuoso. In particular:

• It enables the automatic generation of secure introspection tools. Such security is achieved by the nature of VMI (i.e., the strong isolation) and by our automatic tool generation technique. Similar to Virtuoso, our VMI tools are literally generated from the trusted OS code as well as widely used and tested utilities, without any modification. As such, our VMI tools are more secure than many manually created ones.

• The automatically generated VMI tools are also more reliable than tools generated by Virtuoso, because Virtuoso cannot guarantee path coverage in its training phase, whereas we retain all the code and no training is involved.

• It directly generates a large volume of VMI tools for free, whereas Virtuoso has to train each program one by one to obtain new VMI tools.

• Vmst allows user-level programmers to develop new user-level programs natively to monitor system status for introspection, whereas in Virtuoso programmers have to use special APIs.

• It also allows kernel-level programmers to develop native device drivers for inspecting kernel state for introspection.

In short, Vmst significantly removes the roadblock for VMI software development, and it can automatically enable an in-guest legacy inspection program to become an introspection program without any involvement from end-users.

2.1.2 Bridging the Semantic Gap via Decoupled Execution and Training Memoization

The huge performance overhead of the existing solutions (including our Vmst) for bridging the semantic gap significantly hinders their practicality, especially for critical users such as cloud providers who wish to perform real-time monitoring of VM states at large scale. Therefore, we present Hybrid-Bridge, a hybrid approach that combines the strengths of both Virtuoso (from the perspective of offline training) and Vmst (from the perspective of online taint analysis (Newsome and Song, 2005) and kernel data redirection (Fu and Lin, 2012)). At a high level, Hybrid-Bridge uses an online memoization (Michie, 1968) approach that caches the trained meta-data in an online fashion for a hardware-virtualization based VM (e.g., KVM (Kivity et al., 2007)) to execute native inspection commands such as ps, lsmod, and netstat, and a decoupled execution approach that decouples the expensive taint analysis from the execution engine, with an online fall-back mechanism at the hypervisor layer to remedy the coverage issue when the meta-data is incomplete. With such a design, our experimental results show that Hybrid-Bridge achieves performance one order of magnitude faster than that of similar systems such as Virtuoso and Vmst.

More specifically, Hybrid-Bridge decouples the expensive online dynamic taint analysis from hardware-based virtualization through online memoization of the meta-data; we call this execution component Fast-Bridge. However, we still need a component to perform the slow taint analysis and precisely identify the redirectable instructions (which are part of the meta-data), and this is done by the second component, which we call Slow-Bridge. Therefore, Hybrid-Bridge is a combination of Slow-Bridge, which extracts the meta-data using the online kernel data redirection approach in a software virtualization-based VM (e.g., QEMU (Fabrice, 2005)), and Fast-Bridge, a fast hardware virtualization-based execution engine that memoizes the trained meta-data from Slow-Bridge. End users only need to execute the native inspection utilities in Fast-Bridge to perform VMI, and Slow-Bridge will be automatically invoked by the underlying hypervisor.

Hybrid-Bridge does not have the path coverage issues of Virtuoso because it contains a fall-back mechanism that works similarly to an OS page fault handler (sketched below). That is, whenever meta-data is missing, Hybrid-Bridge will suspend the execution of Fast-Bridge and fall back to Slow-Bridge to identify the missing meta-data for the executing instructions. After Slow-Bridge identifies the missing meta-data, it will update and memoize the trained meta-data, dynamically patch the kernel instructions in Fast-Bridge, and resume its execution. Therefore, Hybrid-Bridge executes instructions natively in Fast-Bridge most of the time; only when the trained meta-data is incomplete does it fall back to Slow-Bridge. This VM-level fall-back, memoization, and synchronization can be realized thanks to the powerful control available at the hypervisor layer.
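The control flow of this fall-back mechanism can be summarized by the following sketch. The types and function names (run_fast_bridge, run_slow_bridge, and so on) are our own placeholders for illustration under the description above, not the actual Hybrid-Bridge interfaces.

    /* Placeholder types for a guest memory snapshot and the memoized meta-data. */
    typedef struct snapshot snapshot_t;
    typedef struct metadata metadata_t;

    typedef enum { FB_DONE, FB_MISSING_METADATA } fb_status_t;

    /* Assumed primitives: run the command natively under hardware
     * virtualization (Fast-Bridge), or re-run it under software
     * virtualization with taint analysis (Slow-Bridge). */
    extern fb_status_t run_fast_bridge(const char *cmd, snapshot_t *snap, metadata_t *meta);
    extern void        run_slow_bridge(const char *cmd, snapshot_t *snap, metadata_t *meta);

    void introspect(const char *cmd, snapshot_t *snap, metadata_t *meta)
    {
        for (;;) {
            fb_status_t st = run_fast_bridge(cmd, snap, meta);
            if (st == FB_DONE)
                break;                    /* inspection command finished */
            /* An instruction with unknown redirectability was reached:
             * derive the missing meta-data, memoize it, then resume. */
            run_slow_bridge(cmd, snap, meta);
        }
    }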

2.2 Redirecting Syscall Execution for Virtual Machine Introspection

Our third approach bridges the semantic gap by redirecting syscalls to the target VM for execution. This leads to HyperShell, a hypervisor shell that can be used for automated in-VM management.

2.2.1 Bridging the Semantic Gap via Syscall Execution Redirection

With the increasing use of cloud computing and data centers today, there is a pressing need to manage a guest operating system (OS) directly from the hypervisor layer. For instance, when migrating a virtual machine (VM) from one place to another, we would like to directly configure its IP address without logging into the system (if that is even possible), and similarly for firewall rule updates; when a malicious process is detected, we would like to kill it directly at the hypervisor layer, and similarly for malicious kernel modules; when there is a need to scan for viruses, we would like to uniformly scan all of the running VMs, regardless of who owns and manages each VM, whether a VM might be using an unknown file system, or whether the file systems might be encrypted.

However, if we use a traditional OS shell, a user interface that is automatically executed when a user successfully logs in to a computer, this would first require an administrator's password. Hypervisor providers may not (always) have the administrator's password for each VM, and even when they do, it is painful to maintain the passwords considering that today's large cloud providers may run millions of VMs. Second, it would also require the installation of the management utilities inside each guest OS, and whenever there are updates for these utilities, it is painstaking to update all of them in each VM. Therefore, the presence of a hypervisor layer shell (HyperShell for brevity) for all guest OSes would allow cloud providers to have an automated, uniform, and centralized service for in-VM management.

Unfortunately, such a layer-below shell is challenging to implement because of the semantic gap. Specifically, the semantic gap exists because at the hypervisor layer we have access only to the zeros and ones of the hardware-level state of a VM, namely its CPU registers and physical memory. But what a hypervisor layer program wants is semantic information about the guest OS, such as the running processes, opened files, live network connections, host names, and IP addresses. Therefore, a layer-below management program must reconstruct the guest OS abstractions in order to obtain meaningful information. A typical approach is to traverse the kernel data structures, but such an approach often requires a significant amount of manual effort.

To advance the state of the art, we introduce a new abstraction called Reverse System Call (R-syscall in short) to bridge the semantic gap for hypervisor layer programs that will be executed in our HyperShell. Unlike traditional system calls that serve as the interface for application programs from a layer below, an R-syscall serves as the interface in the reverse direction, from a layer up (in a way similar to an upcall (Clark, 1985)). While hypervisor programmers can use our R-syscall abstraction to develop new guest OS management utilities, to largely reuse existing legacy software (e.g., ps/lsmod/netstat/ls/cp) we make the system call interface of R-syscall transparent to the legacy software, so that no modification is required when using it in HyperShell (the sketch below illustrates the interface-level idea). In addition, we also make HyperShell transparent to the guest OS, and we do not modify any guest OS code. All of our design and implementation is done at the hypervisor layer.
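As a rough, interface-level illustration of the R-syscall idea, the sketch below forwards a syscall, by number and arguments, into the guest for execution. The guest_vm_t handle and the inject_guest_syscall() primitive are assumptions made for illustration; they are not HyperShell's actual API.

    #include <stdint.h>

    typedef struct guest_vm guest_vm_t;   /* opaque handle to the target VM */

    /* Assumed primitive: execute the given syscall inside the guest (e.g.,
     * via a helper execution context) and return the guest's return value. */
    extern long inject_guest_syscall(guest_vm_t *vm, long nr,
                                     long a0, long a1, long a2);

    /* A reverse system call: the same shape as a normal syscall wrapper, but
     * the request travels downward from the hypervisor and is served by the
     * guest kernel rather than the host kernel. */
    static long r_syscall(guest_vm_t *vm, long nr, long a0, long a1, long a2)
    {
        return inject_guest_syscall(vm, nr, a0, a1, a2);
    }

    /* With the interface kept transparent, a legacy utility's syscalls can be
     * dispatched through r_syscall() without modifying the utility itself. */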

CHAPTER 3 BRIDGING THE SEMANTIC GAP VIA ONLINE KERNEL DATA REDIRECTION

© 2012 IEEE. Reprinted, with permission, from Yangchun Fu and Zhiqiang Lin, "Space Traveling across VM: Automatically Bridging the Semantic Gap in Virtual Machine Introspection via Online Kernel Data Redirection," in Proceedings of the 2012 IEEE Symposium on Security and Privacy (S&P'12), pages 586-600. DOI: http://dx.doi.org/10.1109/SP.2012.39. © 2013 ACM. Reprinted, with permission, from Yangchun Fu and Zhiqiang Lin, "Bridging the Semantic Gap in Virtual Machine Introspection via Online Kernel Data Redirection," ACM Transactions on Information and System Security (TISSEC), Volume 16, Issue 2, September 2013. DOI: http://dx.doi.org/10.1145/2505124.

In this chapter, we present Vmst, our first attempt to bridge the semantic gap via binary code reuse. In particular, we discuss the technical overview in Chapter 3.1 and the three core components in Chapter 3.2, Chapter 3.3, and Chapter 3.4, respectively; we then present the implementation details in Chapter 3.5, the evaluation in Chapter 3.6, a discussion in Chapter 3.7, and finally a summary in Chapter 3.8.

3.1 Technical Overview

We present VM-Space Traveler (Vmst for brevity) (Fu and Lin, 2012), a new system that seamlessly bridges the semantic gap and automatically generates VMI tools. The key insight of our technique is that a program P(x) is composed of code P and data x; for the same program, P is usually identical across different machines, and the only difference is the data x consumed at run time. Normally, on a machine A, its P always consumes the x in A. Thus, if we can make P (say, an inspection program such as ps) in A transparently consume the data y in B (i.e., without the awareness that y comes from B), then we automatically generate an introspection program P′ such that P′(x) = P(y). More specifically, to realize this idea for VMI programs, we can redirect (i.e., hijack and direct the access of) the memory reads of the relevant kernel instructions that are responsible for the introspection, as long as we can automatically identify them.

In the following, we first present the problem statement in Chapter 3.1.1, discuss the threat model in Chapter 3.1.2, and finally describe the scope and assumptions in Chapter 3.1.3.

3.1.1 Problem Statement

Observations. The goal of our Vmst is to bridge the semantic gap and enable automated VMI tool generation. The basic observation is that many introspection tools are mainly used to query the guest-OS state, e.g., listing all the running processes (ps), opened files (lsof), installed drivers (lsmod), and connected sockets (netstat). In fact, these program logics have already been shipped in OS distributions as the corresponding user-level utilities. Thus, instead of building an introspection tool P′ from scratch, we can actually reuse the user-level as well as OS kernel code P to automatically implement P′.

Consider a specific example to better understand our observation. Without introspection, when we normally run a utility program to inspect an OS state, e.g., to get the running Linux kernel version, the OS kernel will execute a series of system calls, as shown in Fig. 3.1: create a new process (execve), set the end of the data segment (brk), check (access) the library (e.g., ld.so.nohwcap), map the standard shared library (open, fstat64, mmap2), execute the uname system call (syscall for short), which is used to get the name and information about the current kernel, output the result (write), and exit the process (exit_group).

With introspection, we can see that in order to fully reuse the OS as well as the user-level program code P, we should redirect only the data reads related to the desired introspection functionality. In our running kernel-version retrieval example, that is the data x within the uname system call. For data in user space and other irrelevant kernel space, there should be no redirection, and we should keep both the kernel and other user processes running correctly.

    execve("/bin/uname", ["uname", "-r"], [/* 12 vars */]) = 0
    brk(0) = 0x9d80000
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77fd000
    access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
    open("/etc/ld.so.cache", O_RDONLY) = 3
    fstat64(3, {st_mode=S_IFREG|0644, st_size=15017, ...}) = 0
    mmap2(NULL, 15017, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77f9000
    close(3) = 0
    access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT (No such file or directory)
    open("/lib/i686/cmov/libc.so.6", O_RDONLY) = 3
    read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\0n\1\0004\0\0\0"..., 512) = 512
    fstat64(3, {st_mode=S_IFREG|0755, st_size=1327556, ...}) = 0
    mmap2(NULL, 1337704, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xb76b2000
    ...
    open("/usr/lib/locale/locale-archive", O_RDONLY|O_LARGEFILE) = 3
    fstat64(3, {st_mode=S_IFREG|0644, st_size=1527584, ...}) = 0
    mmap2(NULL, 1527584, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb753c000
    close(3) = 0
    uname({sys="Linux", node="debian", ...}) = 0
    fstat64(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(4, 1), ...}) = 0
    ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
    mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77fc000
    write(1, "2.6.32.8\n", 9) = 9
    close(1) = 0
    munmap(0xb77fc000, 4096) = 0
    close(2) = 0
    exit_group(0) = ?

Figure 3.1. System level behavior (in terms of syscall trace) of a typical user level uname program.

Problem Definition. To achieve the goal of enabling native trusted utilities to automatically become introspection software and bridging the semantic gap for hypervisor programmers, the central problem in our system is how to automatically (1) identify the introspection execution context, (2) identify the data x in kernel code that is related to the introspection, (3) redirect x, and (4) keep all the processes running, at the VMM layer.

Challenges. However, this is a challenging task in reality. Since the OS kernel is designed to manage computer hardware resources (e.g., memory, disk, I/O) and provide common services (i.e., syscalls) for application software, it has a very complicated control flow and data access pattern. In particular, the kernel usually contains many routines for resource management (e.g., page tables), interrupt and exception handling (e.g., timer, keyboard, page fault), context switches, and syscall service. When serving a system call, an interrupt could occur, a page fault (an exception) could occur, and a context switch could occur. Obviously, we do not want to redirect kernel data accesses in context switches, page fault handlers, or interrupt service routines. Also, we do not want to redirect data accesses in the execution context of any other process.

A memory access may be a code read or a data read. One of the advantages of VMI is that attackers usually cannot modify the introspection code. Thus, we do not want to load any code from an untrusted guest, and we have to differentiate kernel code from data. Also, data could be in kernel global, heap, or stack regions. Obviously, we cannot redirect kernel stack reads; otherwise it will lead to a kernel crash (because of control data such as return addresses on the stack). We thus have to identify where the redirectable data is. Moreover, when redirecting the data, we have to perform virtual to physical address translation; otherwise we will not be able to find the correct in-guest memory data. Further, we have to perform copy-on-write (COW) on the redirected data to ensure there is no side effect on the in-guest memory. A rough sketch of such a redirected read is given below.
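The following is a minimal sketch of what such a per-read redirection hook could look like, assuming helper routines for redirectability checking, guest page-table walking, and a local copy-on-write store. The helper names are placeholders introduced for illustration, not Vmst internals.

    #include <stdint.h>

    /* Assumed helpers (placeholders). */
    extern int      is_redirectable(uint64_t kernel_va);     /* data vs. code/stack */
    extern uint64_t translate_in_guest(uint64_t kernel_va);  /* walk guest page tables */
    extern uint8_t *cow_lookup(uint64_t guest_pa);            /* local copy if previously written */
    extern uint8_t  read_guest_phys(uint64_t guest_pa);       /* read product-VM memory */
    extern uint8_t  read_secure_vm(uint64_t kernel_va);       /* read secure-VM memory */

    /* Called for each kernel memory read in the introspection context. */
    uint8_t redirected_read(uint64_t kernel_va)
    {
        if (!is_redirectable(kernel_va))          /* e.g., kernel code or stack */
            return read_secure_vm(kernel_va);

        uint64_t pa  = translate_in_guest(kernel_va);  /* virtual-to-physical */
        uint8_t *cow = cow_lookup(pa);            /* writes never touch the guest */
        return cow ? *cow : read_guest_phys(pa);
    }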

3.1.2 Threat Model

The goal of Vmst is to seamlessly bridge the semantic gap for hypervisor programmers when developing introspection software, as if they were still developing the software inside the guest-OS and could use all the OS abstractions. More importantly, since there are many native inspection programs (e.g., ps, lsmod, netstat) inside the guest-OS, Vmst can automatically enable them to become introspection programs. Consequently, the trusted computing base (TCB) for Vmst includes (1) the binary code translation based VM (i.e., the secure-VM), (2) the corresponding kernel of the same version as the guest-OS, and (3) the native utilities used to manage the OS. This TCB is maintained by the hypervisor and isolated from the product-VM.

The attacks that Vmst defends against directly are those that tamper with the in-guest native inspection software and OS kernel code. This includes a large volume of OS kernel rootkits, which often tamper with either OS kernel code (e.g., the system call table) or application binary code (e.g., the ps command). The attacks that Vmst cannot defend against directly are direct kernel object manipulation (DKOM) attacks, which modify kernel data structures instead of kernel code to hide their presence. However, Vmst can mitigate them by facilitating the development of DKOM rootkit detection tools (as we demonstrate in §3.6.1) using kernel APIs, without worrying about the semantic gap.

3.1.3 Scope and Assumptions

Being OS-independent (i.e., directly supporting a variety of OS kernels) is one of our design goals. Therefore, when designing Vmst, we use general knowledge from OS design principles (Bovet and Cesati, 2005; Bach, 1986), and we avoid, if possible, hard coding any specific kernel addresses (otherwise the design would be too kernel specific). However, we cannot blindly support all OSes because we still rely on some OS knowledge, particularly the definition of the system call interfaces (i.e., the syscall numbers, arguments, and return values), which are often OS specific. Therefore, we focus on Linux/UNIX and Microsoft Windows, atop the widely used x86 architecture, leaving other OSes such as Minix as future efforts.

Also, we assume our own VMM can intercept system-wide instructions, because we need to dynamically instrument the instructions and redirect the data accesses if the executing instructions are introspection related. In addition, similar to Virtuoso (Dolan-Gavitt et al., 2011), we assume end-users have a trusted copy of the in-guest OS kernel. This trusted copy is installed in the secure-VM of Vmst as part of our TCB and executed along with the utilities to provide the introspection. The reason is that, without an identical trusted guest-OS kernel copy, redirecting the introspection data would most likely lead to wrong in-guest memory addresses.

[Figure 3.2. An architecture overview of our Vmst. Common utilities (e.g., ps, lsmod, netstat) run as introspection applications on top of the secure-VM kernel; the VM-Space Traveler layer, consisting of syscall execution context identification, redirectable data identification, and kernel data redirection, redirects their kernel data reads (copy-on-write, read-only) to the kernel data of the product-VM, while kernel code is taken from the secure-VM.]

3.1.4 System Overview

An overview of Vmst is presented in Fig. 3.2. For an untrusted OS running in a product-VM, if end-users want to perform its introspection using off-the-shelf tools, they only need to install the corresponding trusted version of the in-guest OS on top of our own secure-VM and invoke the commonly used standard utilities without any modification. They do not have to make any manual effort to understand (or reverse engineer) the OS kernel and write the introspection program. However, if they do want to customize an introspection program, they can develop these programs natively (e.g., invoking native APIs/system calls) without worrying about any OS kernel internals or the semantic gap. Note that the hypervisors of the product-VM and our secure-VM in Fig. 3.2 can be completely different. Vmst is bound only to our own secure-VM and is transparent to the guest product-VM, which can be a VM running on top of Xen/KVM/VMware/Hyper-V/VirtualBox/VirtualPC/QEMU.

There are three key components inside Vmst: (1) syscall execution context identification, (2) redirectable data identification, and (3) kernel data redirection. Syscall execution context identification is used to identify the system call execution context relevant to the introspection and to ensure that kernel data redirection only redirects the data x in the context of the introspection. The second component, redirectable data identification, pinpoints exactly the x (and its closure) that needs to be redirected within the context reported by the syscall execution context identification. The last component, kernel data redirection, performs the data redirection of x. Copy-on-write (COW) is performed if there is any data write to x. Note that our goal is to inspect (read-only) the state of a product-VM without changing its state; therefore, we need COW to avoid any side effect on the product-VM, especially when there is a data write to x. After finishing the introspection, we throw away the copied pages.

Next, we present the detailed design of each component of our Vmst. We first describe how we identify the system call execution context at the VMM layer in Chapter 3.2, then show how we track the redirectable data in Chapter 3.3, and finally how we perform the kernel data redirection in Chapter 3.4.

3.2 SysCall Execution Context Identification

Observation and Insight. The first step of Vmst is to identify the precise syscall execution context in which to perform the data redirection for the relevant syscalls. When an introspection program is running, there are two spaces: user space and kernel space. In the x86 architecture, each process has a unique CR3 value that locates its page global directory (i.e., pgd). Therefore, we could isolate the corresponding kernel as well as user space by using the CR3 value. This observation has been widely used in many introspection systems (e.g., (Jones et al., 2006, 2008; Payne et al., 2007, 2008)).

However, there is still a caveat. A process can run with multiple threads, and control flow is often thread-specific. While CR3 can identify a process execution context, it cannot precisely isolate a specific syscall context, because all threads of the same process can execute syscalls. As such, we have to differentiate the thread context of the same process at the VMM layer. Fortunately, we have an observation: while multiple threads of the same process share the same CR3 (threads share the same virtual address space), each thread (task) at the kernel level has a unique, dynamically allocated kernel stack, which can be used to isolate the thread execution context at the kernel level. Therefore, we propose to use the CR3 and the kernel esp value (with the lower 12 bits masked off) together to uniquely differentiate and isolate the fine-grained thread execution context (a minimal sketch of this context key follows this discussion). Using CR3 quickly isolates the process of our interest, though the masked esp alone is adequate for our purpose. The next question is then how to acquire the right CR3 value of a monitored process, given only the name of the introspected process. Note that our secure-VM needs to be transparent to the in-VM OS, and we should not traverse any specific task_struct to get the process name field even though we could. This turns out to be a challenging task.

Before diving into our solution, let us first describe what we can do at the VMM layer. It is trivial to identify the syscall entry point. In Linux, user-level programs invoke int 0x80 or sysenter/syscall instructions to enter kernel space (Windows XP just uses sysenter/syscall). Therefore, by intercepting these two kinds of instructions at the VMM layer, it suffices to identify the beginning of a syscall execution context. However, the real difficulty lies in how to identify the exit point of a syscall. A naive approach would directly intercept the sysexit or iret instruction to determine the exit point. However, this approach would not work because of interrupt and exception handling, as well as the possibility of a context switch during the execution of a syscall. As illustrated in Fig. 3.3, at a high level, when serving a syscall, an interrupt could occur and kernel control flow may go to the interrupt handler. An exception such as a page fault (when a syscall routine accesses an unmapped memory region of the process) could also occur. In addition, at the syscall exit point or during the service of a syscall, a context switch could occur (e.g., when a process has used up its time slice). A context switch could also occur in the interrupt and exception handlers.
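Concretely, the thread-level context key amounts to the pair below. This is a minimal sketch assuming 32-bit x86 and a 4 KB-aligned kernel stack (hence the 12-bit mask, following the text); kernels configured with larger kernel stacks would need a wider mask.

    #include <stdint.h>

    /* (CR3, masked kernel ESP) uniquely identifies the thread-level
     * kernel execution context of the introspection process. */
    struct exec_ctx {
        uint32_t cr3;      /* page global directory: identifies the process   */
        uint32_t kstack;   /* kernel esp with low 12 bits cleared: the thread */
    };

    static struct exec_ctx make_ctx(uint32_t cr3, uint32_t esp)
    {
        struct exec_ctx c = { cr3, esp & ~0xFFFu };
        return c;
    }

    static int same_ctx(struct exec_ctx a, struct exec_ctx b)
    {
        return a.cr3 == b.cr3 && a.kstack == b.kstack;
    }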

[Figure 3.3 depicts this flow: entry via sysenter/int 0x80 into the syscall service routine, possible transfers to the interrupt handler, the exception handler, and a context switch, and exit via sysexit/iret.]

Figure 3.3. Typical kernel control flow when serving a syscall.

Fortunately, we have another observation: since our secure-VM virtualizes all hardware resources (e.g., through emulation), we can easily observe and control these hardware states, including the interrupt and the timer, at the VMM layer, as long as we can keep our own introspection process and kernel running correctly. More specifically, we use the following approaches to handle interrupts, exceptions, and context switches.

Handling Interrupt and Exception. First, we need to exclude the interrupt execution context. Generally, there are two kinds of interrupts: synchronous interrupts generated by the CPU while executing instructions, and asynchronous interrupts generated by other hardware devices at arbitrary times. In the x86 architecture, synchronous interrupts are designated as exceptions, and asynchronous interrupts as interrupts. When an interrupt occurs, whether it is an exception or a hardware interrupt, an interrupt vector number is first issued to the hardware interrupt controller. This controller picks the corresponding interrupt handler, to which the kernel control flow will transfer. By monitoring this controller and tracking the interrupt number, we can differentiate syscalls (e.g., int 0x80) from other interrupt or exception handlers, and we can track the beginning of an interrupt service. In our design, right before the interrupt handler gets executed, we will set a global flag to indicate that data in the current execution context is not redirectable (as the kernel control path

will be in the interrupt context). Also, as an interrupt always ends with an iret instruction, we are able to track the end of an interrupt. However, interrupts can be nested. That is, when serving an interrupt, the kernel can suspend the execution of the current interrupt in response to a higher-priority one. Therefore, we use a stack data structure to track the execution status of the interrupt handler. In particular, we use a counter to simulate the stack. Whenever an interrupt other than a syscall occurs, we increase the counter; when an iret instruction executes, we decrease the counter. When the counter becomes zero, all interrupt services have finished. Note that the counter is only updated when the execution context is within the introspection process, and initially it is zero. Another possible design is to track the program counter (PC) in the syscall routine to determine the end of an interrupt, since after an interrupt handler finishes, it transfers the kernel control flow back to the syscall routine (the next PC). However, we observe that such a design has a problem. For example, in Linux kernel 2.6.32-rc8 (the working testing kernel we used during the design of Vmst), the syscall routine calls the cond resched function to determine whether a context switch is needed (in particular, checking the TIF NEED RESCHED flag in the kernel stack), and this function is also called in the interrupt and exception handler routines (Bovet and Cesati, 2005). If an interrupt occurs during the execution of cond resched in the syscall context, this approach will mistakenly identify the end of an interrupt handler. Using the kernel call stack and the PC together would be a viable approach, as they precisely define the execution context; however, we did not take this approach because it needs to traverse the kernel call stack, which tends to be expensive. The stack-based approach above allows us to determine the interrupt handler context, or more specifically, the top half of an interrupt. However, one may worry about how we identify the bottom half of an interrupt, as most UNIX systems, including Linux, divide the work of processing interrupts into two halves (Bovet and Cesati, 2005). Fortunately, bottom halves are run in separate kernel threads, so they are not a problem for Vmst.
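The counter-based bookkeeping can be sketched as follows (an assumed, simplified rendering of the logic described above, not Vmst's actual code); the VMM would call these hooks from its interrupt-dispatch and iret paths while the introspection process is executing.

#include <stdbool.h>

/* Nesting depth of interrupt/exception handlers in the introspection
 * process's context; 0 means we are not inside any handler. */
static int irq_depth = 0;

/* Called when the emulated interrupt controller dispatches a vector that
 * is not a syscall entry (i.e., not int 0x80 / sysenter). */
void on_interrupt_entry(void) { irq_depth++; }

/* Called whenever an iret instruction retires. */
void on_iret(void) { if (irq_depth > 0) irq_depth--; }

/* Kernel data accesses are candidates for redirection only outside
 * interrupt context. */
bool in_interrupt_context(void) { return irq_depth > 0; }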

Table 3.1. Statistics of context switch when running ps

Case   Where Context Switch Occurs    #times   Percentage
I      Arbitrary Places               108      71.5
II     Voluntarily relinquishing      1        0.7
III    Syscall Return                 3        2.0
IV     Syscall Subroutine             23       15.2
V      Exception                      16       10.6
VI     Syscall Blocked                0        0.0

Controlling Context Switch. Our approach is to disable the timer interrupt to prevent context switches. Context switching is one of the key techniques that allow multiple processes to share a single CPU. Basically, it is a procedure of storing and restoring the context state (including CPU registers, CR3, etc.) of a process (or a kernel thread) such that its execution can be resumed from the same point at a later time (Bovet and Cesati, 2005; Bach, 1986). A context switch can occur in a variety of cases in Linux/UNIX and Windows, including:

• Case-I: arbitrary places, when an asynchronous interrupt (e.g., a timer interrupt) happens and the process has used up its CPU time slice (preemption);

• Case-II: when a process voluntarily relinquishes its time on the CPU (e.g., invoking the sleep, waitpid, or exit syscall);

• Case-III: when a syscall is about to return;

• Case-IV: other places in syscall subroutines, besides the syscall return point, in which the kernel pro-actively checks whether a context switch is needed;

• Case-V: in an exception (e.g., page fault) handler; or

• Case-VI: when a syscall gets blocked.

During our design, we traced the execution of the ps command and collected statistics on where context switches happen. We report these statistics in Table 3.1. Among these six cases, four (Case-I, Case-III, Case-IV, and Case-V, which account for 99.3% in our profile) are

triggered due to the time slice. Case-II (0.7%, because of the exit syscall) is not of concern because the entry of the sleep or waitpid syscall can be detected, and the redirection in these syscalls' execution context, including any other possible context switches, can be disabled. After context switching to other processes, the kernel will eventually switch back to these syscalls, and we are able to detect it by just looking at the CR3. Also, an introspection program typically will not invoke blocking-mode syscalls (Case-VI). Meanwhile, Case-V can be detected by our exception handler. Therefore, one of the key ideas of Vmst is that as long as we can keep the running introspection process always owning the CPU, we can prevent a context switch from happening until the monitored process exits, or we can allow a context switch as long as we can pro-actively detect it (such as in the case of the sleep syscall). Note that at the VMM layer we own the hardware, and we can modify the timer such that the process will not perceive that it has gone beyond its time slice. Also, Virtuoso mentioned an approach to disable context switches by running the training programs with a higher priority (e.g., using start /realtime on Windows and chrt on Linux) (Dolan-Gavitt et al., 2011). However, this approach alone will not work in Vmst, as the code that determines whether a context switch is needed will still be executed, and the data access in that determination context would hence be redirected (which is not desirable). Also, it might not always be true that a user-level process can get the highest priority.

Acquiring the CR3 value of the introspection process. Now we are ready to describe how we acquire the correct CR3 value when only given a to-be-executed process name. Notice in Fig. 3.1 that when a process is executed, it will first call the execve syscall. By inspecting the value in ebx at the moment this syscall gets executed, we are able to determine the process name. However, the value of CR3 at the moment this syscall is executed belongs to its parent process. During the execution of this syscall, the kernel will release almost all resources belonging to the parent process (through the flush old exec function in the kernel source code) and update the CR3, which is the right moment to acquire the CR3 for our monitored

process. Therefore, by monitoring the update of CR3 (a mov instruction) in the execve syscall context, we are able to get our desired value, because there is no other CR3 update since we have disabled context switching. The reason why the new pgd is updated in the context of execve is the following: in general, a child process is created using the fork syscall. The child is assigned a pgd whose content is copied from its parent. The child will then call execve to load the new binary, and it will have a new executable image. Consequently, during the execution of execve, the kernel will create a new pgd; release the memory descriptor, all memory regions, and all page frames (including page table entries) assigned to the process; and finally update the CR3, which is the moment to get the correct pgd of the child process. The above CR3-acquiring approach tends to be Linux-kernel specific and requires detailed knowledge of the execve syscall. In fact, there is an alternative approach, and we actually use it for the Windows OS. That is, we can monitor all CR3 values from the boot of our secure-VM and detect a newly used CR3, since a new CR3 certainly belongs to a new process. For this alternative approach, we need to track the lifetime of a pgd. Our instrumentation maintains a map between the CR3 and the process. Whenever a process dies (detected through, e.g., the exit group syscall, as noticed in Fig. 3.1), we remove its CR3 from our map. As such, we are able to determine whether a given CR3 belongs to a new process.
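A minimal sketch of the CR3 bookkeeping behind this alternative approach (illustrative only; a fixed-size array stands in for whatever map the real instrumentation uses):

#include <stdint.h>

#define MAX_CR3 1024

static uint32_t known_cr3[MAX_CR3];
static int n_cr3 = 0;

static int cr3_index(uint32_t cr3)
{
    for (int i = 0; i < n_cr3; i++)
        if (known_cr3[i] == cr3)
            return i;
    return -1;
}

/* Called on every CR3 write observed by the VMM; returns 1 if this CR3 has
 * never been seen before, i.e., it belongs to a newly created process. */
int on_cr3_write(uint32_t cr3)
{
    if (cr3_index(cr3) >= 0)
        return 0;
    if (n_cr3 < MAX_CR3)
        known_cr3[n_cr3++] = cr3;
    return 1;
}

/* Called when a process exit is detected (e.g., via the exit group syscall),
 * so a reused CR3 can later be recognized as belonging to a new process. */
void on_process_exit(uint32_t cr3)
{
    int i = cr3_index(cr3);
    if (i >= 0)
        known_cr3[i] = known_cr3[--n_cr3];
}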

Summary. By tracking the interrupt service routine and disabling the timer for context switches, our syscall context identification is able to largely identify the syscall execution context of a monitored process. However, it still does not fully isolate all the syscall service routines. For example, the cond resched function is called in many places to determine whether a context switch is needed, including all of the syscall exit points. Obviously, we would redirect the data access of this function if we did not have any other technique to remedy this. We cannot whitelist this specific function, as that would be too kernel-specific (though it is a viable option). Fortunately, our second technique, redirectable data identification, solves this problem and automatically determines that the data in this function is not redirectable (because cond resched accesses data in thread info, which is always allocated on the kernel stack, and data on the kernel stack is not redirectable).

3.3 Redirectable Data Identification

Observation and Insight. The goal of our redirectable data identification is to identify the kernel data x that can be redirected to the in-guest memory of the product-VM. Thus, we have to first determine what kind of data should be redirected. Normally, when writing an introspection program manually, we traverse the kernel memory starting from a global memory location (exported in the System.map file, for instance) to reach other memory locations, including the kernel heap, by following pointers. As such, one of the intuitive approaches would be to track and redirect the data x that comes from kernel global variables or is derived from global variables through pointer dereference and pointer arithmetic instructions. Note that at the instruction level we can easily identify kernel global variables, which are usually literal values after the kernel is compiled and are identical for the same OS version at a given global address. By dynamically instrumenting each kernel instruction and checking whether any data is transitively derived from global variables (a variant of widely used taint analysis (Chow et al., 2004; Newsome and Song, 2005; Egele et al., 2007; Yin et al., 2007)), we are able to identify them. Our early design naturally adopted this approach. However, we found another design that is simpler and saves more shadow memory space for the data dependency tracking. Since determining whether a piece of data is redirectable is a boolean decision, instead of tracking all the redirectable data, we can track which data is unredirectable. That is the data dereferenced from stack variables or derived from them, because kernel stack

variables manage the kernel control path, and they vary from machine to machine, and from run to run, for an identical OS. Though our redirectable data identification is a variant of taint analysis, there are still significant differences. Below we sketch our design and highlight them.

Shadow Memory. Similar to all other taint analyses, we need a shadow memory to store the taint bits for memory and CPU registers. We keep taint information for both memory and registers at byte granularity, using only one bit to indicate whether they are redirectable (with value 1) or not (with value 0). However, we have to use three pieces of shadow memory: shadow S and shadow V for memory data, and shadow R for registers. S is used to track the unredirectable stack addresses, and V and R are used to track whether the value stored in the corresponding memory address or register needs to be redirected when it is used as a memory address. Consider our working example shown in Fig. 3.4. If we only had S, then for the instruction at line 17, mov 0xc(%ebp),%ecx, we would move the taint bit 0 to ecx. When the kernel subsequently dereferences a memory address pointed to by ecx, we would not redirect it. However, we should redirect it, as this address is actually a global memory address. Therefore, we keep taint information for both the stack address and its value, because of pointers.

Taint Source. Right before the introspection process enters the first monitored system call, we initialize the taint bits for the shadow state. For R, all bits are initialized with 0, as the parameters passed from user space are local to our secure-VM. For S and V, the taint bits are allocated and updated on demand when the kernel uses the corresponding addresses. The taint bit for the esp register is always 0. When a global memory address (a literal value that falls into the kernel memory address space) is loaded, the taint information for the corresponding register or memory is set to 1. Some special instructions (e.g., xor eax,eax, sub eax,eax) reset the register value, and consequently we set their taint bits to 0.

 1: c1001178:  a1 24 32 77 c1     mov    0xc1773224,%eax
 2: ...
 3: c1001196:  50                 push   %eax
 4: c1001197:  68 21 a4 5c c1     push   $0xc15ca421
 5: c100119c:  68 20 30 77 c1     push   $0xc1773020
 6: c10011a1:  e8 9d 78 18 00     call   c1188a43
 7: ...
 8: c100295c:  bd 00 e0 ff ff     mov    $0xffffe000,%ebp
 9: c1002961:  21 e5              and    %esp,%ebp
10: ...
11: c100297d:  8b 4d 08           mov    0x8(%ebp),%ecx
12: ...
13: c1188a43:  55                 push   %ebp
14: c1188a44:  ba ff ff ff 7f     mov    $0x7fffffff,%edx
15: c1188a49:  89 e5              mov    %esp,%ebp
16: c1188a4b:  8d 45 10           lea    0x10(%ebp),%eax
17: c1188a4e:  8b 4d 0c           mov    0xc(%ebp),%ecx
18: c1188a51:  50                 push   %eax
19: c1188a52:  8b 45 08           mov    0x8(%ebp),%eax
20: c1188a55:  e8 c5 fc ff ff     call   c118871f
21: ...
22: c118871f:  55                 push   %ebp
23: c1188720:  89 e5              mov    %esp,%ebp
24: ...
25: c118880d:  8b 4d 08           mov    0x8(%ebp),%ecx

[The right-hand side of the figure (not reproduced here) shows the shadow state, one taint bit per register (eax, ebx, ecx, edx, esp, ebp) and per kernel-stack byte (addresses ce657000-ce657024), captured at several points during this execution (at eip = c10011a1, c1188a4b, c1188a55, and c118880d), with 1 marking redirectable and 0 marking unredirectable entries.]

Figure 3.4. Shadow memory state of our working example code.

Propagation Policy. The propagation policy determines how we update the shadow state. At a high level, similar to all other taint analyses, we update it based on the instruction semantics. However, as we have three pieces of shadow memory (i.e., S, V, and R), we have significantly different policies. In particular, for S, we always update its shadow bit with 0 whenever we encounter a stack address. We can regard S as a bookkeeping of all the exercised stack addresses. Later on, when a memory address is dereferenced, we query S about whether we have seen that address before. The taint-bit value in S (which is always 0) is not involved in any data propagation. In fact, we could eliminate S because, in practice, nearly all the stack addresses (involved in an x86 instruction) are computed (directly or indirectly) from esp. For example, as shown in the last two instructions (lines 23-25) of our working example, we can easily infer that 0x8(%ebp) is a stack address, and we do not need to query S. The main reason we keep it is to track stack-address usage and make sure that a stack address will never be redirected. For V and R, we use the following policies.

• Data Movement Instruction. For one-directional data movement A → B, such as mov/movsb/movsd, push, and pop, we update the corresponding R(B) or V(B) with the taint bit in R(A) or V(A). For data exchange instructions A ↔ B, such as xchg and xadd (exchange and add), we update the shadow state for both operands. Note that lea is a special case of "data movement": it does not actually load any data from memory, but it may load a memory address. Therefore, we check whether the source operand generates a stack address, and if so we update V or R of the destination operand with 0. For example, at line 16, the instruction loads a stack address into eax, and we consequently update R(eax) with 0.

• Data Arithmetic Instruction. As usual, for data arithmetic instructions such as add, sub, and or, we should update the shadow state by ORing the taint bits of the two operands. However, this is only true for operands that are global or heap addresses, as well as their propagations. Note that if one of the operands in these instructions is a literal value that is not within the kernel address space, there is no need to update any shadow state. If either of the operands is stack-address related, we update the taint bit with 0. Consider the instructions in lines 8-11: we would first taint ebp with 1, as 0xffffe000 is a literal within the kernel address space; at line 9, when we execute and %esp,%ebp, because the taint bit for esp is always 0, ORing would give a new taint bit of 1 for ebp; then at line 11, when we dereference memory 0x8(%ebp), we would redirect it, which is wrong. Therefore, the stack address hijacks the normal propagation policy and clears the operand taint.

• Other Instructions. A large body of instructions do not contribute to taint propagation, such as nop, jmp, jcc, test, etc. For them, we only check whether any memory address involved in these instructions needs to be redirected.

One may wonder whether we could get rid of this complicated data dependence analysis scheme, given that our secure-VM already intercepts each instruction and knows the stack base address when entering the kernel; it should then be straightforward to check whether a memory access is on the secure-VM kernel stack. However, the issue is that an in-guest kernel heap address of the product-VM could also be in the range of the secure-VM stack addresses, because there is no explicit boundary between the kernel stack and the kernel heap. In fact, a process's kernel stack is dynamically allocated when the process is created. As such, we cannot differentiate whether a given address is in the secure-VM kernel stack or the product-VM kernel heap using a simple address range, without tracking the data dependence.
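The following self-contained C sketch (illustrative only; it models the rules above with simplified types, a tiny stand-in for shadow S, and an assumed 3G/1G kernel split) shows how the V/R propagation and the stack "hijack" rule could look.

#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define REDIR   1  /* derived from a kernel global/heap address */
#define NOREDIR 0  /* stack-derived or local                    */

/* Tiny stand-in for shadow S: remembers exercised stack addresses. */
#define S_SLOTS 4096
static uint32_t s_addr[S_SLOTS];
static bool     s_used[S_SLOTS];

static void mark_stack_addr(uint32_t a) {
    uint32_t i = a % S_SLOTS; s_addr[i] = a; s_used[i] = true;
}
static bool is_stack_addr(uint32_t a) {
    uint32_t i = a % S_SLOTS; return s_used[i] && s_addr[i] == a;
}

/* Loading a literal: taint 1 only if it falls into kernel address space
 * (assuming the usual 3G/1G split, i.e., >= 0xc0000000). */
static uint8_t taint_of_literal(uint32_t imm) {
    return imm >= 0xc0000000u ? REDIR : NOREDIR;
}

/* mov/push/pop-style movement A -> B simply copies the taint bit. */
static uint8_t propagate_move(uint8_t t_src) { return t_src; }

/* add/sub/or: OR the taints, unless a stack address is involved, in which
 * case the stack hijacks the result and clears the taint. */
static uint8_t propagate_arith(uint8_t t_a, uint8_t t_b, bool stack_involved) {
    return stack_involved ? NOREDIR : (uint8_t)(t_a | t_b);
}

int main(void) {
    /* Lines 8-9 of the working example: mov $0xffffe000,%ebp; and %esp,%ebp. */
    uint8_t t_ebp = taint_of_literal(0xffffe000u);  /* 1 under the naive rule   */
    uint8_t t_esp = NOREDIR;                        /* esp is always untainted  */
    mark_stack_addr(0xce657010u);                   /* an exercised stack slot  */
    t_ebp = propagate_arith(t_ebp, t_esp, true);    /* stack hijack clears it   */
    printf("taint(ebp) after and %%esp,%%ebp: %d\n", t_ebp);
    printf("0xce657010 known stack address? %d\n", is_stack_addr(0xce657010u));
    (void)propagate_move; /* referenced only to avoid an unused warning */
    return 0;
}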

3.4 Kernel Data Redirection

Having been able to identify the syscall execution context and pinpoint the data x that needs to be redirected, we now present how we redirect the kernel memory access. As discussed in Chapter 3.1, not all syscalls need to be redirected, so we first describe our syscall redirection policy in Chapter 3.4.1. Next, we discuss how we perform the virtual-to-physical address translation, including how to handle copy-on-write (COW), in Chapter 3.4.2. Finally, we present our redirection algorithm in Chapter 3.4.3.

3.4.1 Syscall Redirection Policy

Recall that in the syscall trace of our uname example (Fig. 3.1), we discussed that our syscall redirection policy has to be syscall-specific. That is, based on the semantics of each syscall, we decide whether we should redirect its data access. As such, we have to systematically examine all the syscalls. In fact, syscall knowledge has been widely studied in the security literature, especially in intrusion detection (c.f., (Forrest et al., 1996; Provos, 2003; Garfinkel, 2003)). Syscalls of Linux/UNIX can in general be classified into the following categories according to a comprehensive study by Sekar (Sekar): file access (e.g., open, read, write), network access (e.g., send/recv), message queues (e.g., msgctl), shared memory (e.g., shmat), file descriptor operations (e.g., dup, fcntl), time-related (e.g., getitimer/setitimer), process control related (e.g., execve, brk, getpid), and other system-wide functionality, including accounting and quota (e.g., acct). In our introspection setting, as we are interested in pulling the guest OS state outside, (1) syscalls dealing with retrieving the status of the system (i.e., syscalls starting with get in Linux/UNIX, or with NtQuery in Windows) and (2) syscalls related to file access are of particular interest. Specifically, our introspected syscalls for Linux/UNIX are summarized in the 2nd column of Table 3.2.

Table 3.2. Introspected system calls in our Vmst.

Category: State Query
  Linux/UNIX: get(p|t|u|g|eu|eg|pp|pg|resu|resg)id, getrusage, getrlimit, sgetmask, capget, gettimeofday, getgroups, getpriority, getitimer, get kernel syms, getdents, getcwd, ugetrlimit, timer gettime, timer getoverrun, clock gettime, uname, clock getres, get mempolicy, getcpu
  Microsoft Windows: NtOpenKey, NtQueryKey, NtQueryValueKey, NtEnumerateValueKey, NtQueryInformationProcess, NtQueryPerformanceCounter, NtQuerySystemTime, NtQueryObject, NtQuerySystemInformation, NtQueryInformationToken

Category: File System
  Linux/UNIX: open, fstat, stat, lstat, statfs, fstatfs, oldlstat, ustat, lseek, llseek, read, readlink, readv, readdir
  Microsoft Windows: NtOpenFile, NtCreateFile, NtReadFile, NtWriteFile, NtQueryInformationFile, NtDeviceIoControlFile

In particular, we are interested in the file-access-related syscalls because of the proc files in Linux/UNIX. Note that proc is a special file system which provides a more standardized way of dynamically accessing kernel state, as opposed to tracing methods or direct access of kernel memory. In fact, standard, important utility programs such as ps, lsmod, and netstat all read proc files to inspect the kernel state. Also, for disk files there is no redirection (because VMI largely deals with memory), and we have to differentiate them by tracking the file descriptors. To this end, we maintain a file descriptor mapping whenever the introspected process opens a file, and we determine whether the opened file is a proc file by checking the parameters of the open syscall. Note that nearly all of our key techniques in Vmst are OS-independent (our design goal). However, the syscall redirection policy, as described, appears to be kernel-specific, and it requires detailed knowledge of each syscall convention and its semantics for a particular OS. Therefore, to support other kernels such as Microsoft Windows, we need to scrutinize each Windows syscall to determine whether it is redirectable. For instance, as presented in the 3rd column of Table 3.2, we are interested in roughly 16 syscalls for process introspection (such as running tasklist.exe), which include 8 NtQuery syscalls, and for networking inspection (such as running ipconfig.exe), which includes NtDeviceIoControlFile.
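A simplified sketch of the file-descriptor bookkeeping described above (assumed logic, not Vmst's code): descriptors returned by open on a /proc path are marked so that only reads on proc files become redirection candidates.

#include <stdbool.h>
#include <string.h>

#define MAX_FD 1024
static bool is_proc_fd[MAX_FD];

/* Called when the introspected process's open syscall returns. */
void on_open_return(const char *pathname, int fd)
{
    if (fd >= 0 && fd < MAX_FD)
        is_proc_fd[fd] = (strncmp(pathname, "/proc", 5) == 0);
}

/* Called on the close syscall. */
void on_close(int fd)
{
    if (fd >= 0 && fd < MAX_FD)
        is_proc_fd[fd] = false;
}

/* A read/fstat/lseek on fd is a redirection candidate only for proc files. */
bool redirect_file_syscall(int fd)
{
    return fd >= 0 && fd < MAX_FD && is_proc_fd[fd];
}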

3.4.2 Virtual to Physical Address Translation

When dynamically instrumenting each kernel instruction, we are only able to observe virtual addresses. If a given address is redirectable, we have to identify its in-guest physical address in the product-VM. Consequently, we have to perform the MMU-level virtual-to-physical (V2P) address translation. To this end, we design a shadow TLB (STLB) and a shadow CR3 (SCR3) in our secure-VM, which are used by our introspection process during address translation if we need to redirect a given address α from our secure-VM to the product-VM. SCR3 is initialized with the product-VM's CR3 value at the moment of introspection (if we take a snapshot of the guest memory, e.g., for forensics, we log its CR3, and this value is loaded into our SCR3), and it is used for a kernel memory address iff α needs to be redirected; similarly for the STLB. Meanwhile, as discussed earlier in §3.1, we have to perform COW to avoid any side effect on the in-guest OS running in the product-VM if there is a data write on the redirected data. Our design here is to extend one of the reserved bits in the page table entries to indicate whether a page is dirty (has been copied), and to add one such bit to each of our software STLB entries. Note that this is one of the advantages of instrumenting the VMM: we can add whatever we want to the emulated software, such as our STLB, even though the original hardware does not contain such an extension. For the page table entry, we just extend one of the reserved bits to achieve our goal; certainly, we could also make a shadow page table and extend it with a dirty bit per entry if no reserved bit existed.

Detailed Address Translation Procedure. Before the start of an introspection process, the STLB is initialized with zeros. When a kernel address α needs to be redirected and it is a data read (i.e., memory load) operation, we first check whether the STLB misses. If not, we

directly get the physical address PA(α) from the STLB. Otherwise, we obtain the in-guest physical address PA(α) by querying SCR3 and performing the three-layer address translation. At the same time, we fill our STLB for address α with PA(α) such that future references to addresses sharing the same page as α can be quickly located (the essential idea of a TLB). Also, the STLB only gets flushed when its entries are full and we have to replace one, because we only have one SCR3 value; this is unlike a regular TLB, whose entries all have to be flushed whenever there is a context switch. If there is a memory write on α, similarly to the read operation we check whether the STLB hits, and we also check whether the target page is dirty by querying the dirty bit in our STLB entry. If it is not dirty or the STLB misses, we perform the three-layer address translation by querying SCR3 and the page tables, from which we further check whether the target page is dirty. If not (i.e., this is the first data write on this page), we set the dirty bit of the target page table entry as well as of our STLB entry, copy the target page, and redirect future accesses of the original page to our new copy, such that any subsequent data write to this page is unproblematic since the whole page has already been copied. Otherwise, we just set the dirty bit in the STLB entry.
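For concreteness, here is a sketch of the page-table walk that backs this translation (assumptions: 32-bit non-PAE paging with 4KB pages and no large-page entries; the product-VM's physical memory is assumed to be reachable through a flat buffer named guest_mem; this is not Vmst's code):

#include <stdint.h>
#include <string.h>

#define PAGE_PRESENT 0x1u

/* Assumed: the product-VM's physical memory (snapshot or live mapping). */
extern uint8_t *guest_mem;

static uint32_t read_guest_phys32(uint32_t pa)
{
    uint32_t v;
    memcpy(&v, guest_mem + pa, sizeof v);
    return v;
}

/* Translate a guest virtual address using the saved SCR3.
 * Returns the guest physical address, or 0 if the mapping is not present. */
uint32_t v2p(uint32_t scr3, uint32_t va)
{
    uint32_t pde = read_guest_phys32((scr3 & ~0xfffu) + ((va >> 22) & 0x3ffu) * 4u);
    if (!(pde & PAGE_PRESENT))
        return 0;
    uint32_t pte = read_guest_phys32((pde & ~0xfffu) + ((va >> 12) & 0x3ffu) * 4u);
    if (!(pte & PAGE_PRESENT))
        return 0;
    return (pte & ~0xfffu) | (va & 0xfffu);
}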

3.4.3 Directing the Access

Having described all of the necessary enabling techniques, we now turn to the details of our final data redirection procedure. Specifically, as shown in Algorithm 1, for each kernel instruction i, we check whether its execution is in a syscall execution context (line 3). If so, we check whether the data access of the current syscall context needs to be redirected (line 4); if not, there is no instrumentation for i. Next, we perform the redirectable data tracking for the operands of i (line 5) by updating our shadow state according to the instruction semantics. After that, for each memory address used in i (line 6), if it is a data read (line 7), we invoke the V2P address translation


Algorithm 1: Kernel data redirection
1:  Require: SysExecContext(s) returns true if syscall s is executed in a syscall execution context; SysRedirect(s) returns true if data access in s needs to be redirected; RedirectableDataTracking(i) performs our redirectable data identification and flow tracking for instruction i; MemoryAddress(i) returns the set of memory addresses that need to be accessed by instruction i; NotDirty(α) queries the STLB, or SCR3 and the page table, to check whether the physical page located by α is dirty; V2P(α) translates the virtual address α into its physical address by querying the STLB, or SCR3 and the page tables, and updates the STLB if necessary.
2:  DynamicInstInstrument(i):
3:    if SysExecContext(s):
4:      if SysRedirect(s):
5:        RedirectableDataTracking(i)
6:        for α in MemoryAddress(i):
7:          if DataRead(α):
8:            PA(α) ← V2P(α)
9:            Load(PA(α))
10:         else:
11:           if NotDirty(α):
12:             CopyOnWritePage(α)
13:             UpdatePageEntryInSTLB(α)
14:           PA(α) ← V2P(α)
15:           Store(PA(α))

function to get the corresponding physical address (line 8) and load the data (line 9). Otherwise (line 10), we check whether the target page is dirty (line 11). If not, we perform the COW operation, copying the page (line 12), and update the page-entry dirty bit as well as the STLB (line 13). After that, we get the physical address (line 14) and perform the write operation (line 15). From Algorithm 1, we can see that our data redirection engine (lines 5-15) will actually work in any other kernel execution context, as long as that context can be made known to it. For instance, we can inspect and redirect the kernel data access in a particular kernel function, e.g., in a regular kernel module routine or a user-developed device driver routine. This is one of the distinctive benefits of our Vmst: it allows end-users to customize the introspection to a specific chunk of kernel code. We will demonstrate this feature in §3.6.

3.5 Implementation

We have implemented Vmst based on a recent version of QEMU (0.15.50) (Bellard, 2005), with over 7,400 lines of C/C++ code (LOC). Next, we share the implementation details of

interest, especially how we dynamically instrument each instruction in a recent QEMU, how we intercept the interrupt execution context, and how we manage the MMU with respect to our new STLB.

Dynamic Binary Instrumentation. There are quite a few open-source dynamic binary instrumentation frameworks built on top of QEMU (e.g., TEMU (Yin and Song, 2010) and Argos (Portokalidis et al., 2006)). However, their implementations are scattered across the entire QEMU instruction translation, and our redirectable data identification can be implemented more simply. In particular, we take a more general and portable approach and leverage the XED library (Intel, 2005) (which is widely used in the Pin tool (Luk et al., 2005)) for our dynamic instrumentation. Upon the execution of each instruction, we invoke the XED decoder to disassemble it and dynamically jump to its specific instrumentation logic to perform our redirectable data tracking. The benefit is that such an approach allows us to largely reuse our previous code base of Pin-based dynamic data flow analysis (Lin et al., 2010a).
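As a rough illustration of this decode-and-dispatch step (a sketch only, not Vmst's implementation; it assumes a 32-bit guest and the public XED decoding API, whose header name may differ by installation):

#include "xed-interface.h"

/* Decode one guest instruction and branch to per-class instrumentation. */
void instrument_insn(const unsigned char *bytes, unsigned int len)
{
    static int xed_ready = 0;
    if (!xed_ready) { xed_tables_init(); xed_ready = 1; }

    xed_decoded_inst_t xedd;
    xed_decoded_inst_zero(&xedd);
    xed_decoded_inst_set_mode(&xedd, XED_MACHINE_MODE_LEGACY_32,
                              XED_ADDRESS_WIDTH_32b);
    if (xed_decode(&xedd, bytes, len) != XED_ERROR_NONE)
        return;

    switch (xed_decoded_inst_get_iclass(&xedd)) {
    case XED_ICLASS_MOV:
    case XED_ICLASS_PUSH:
    case XED_ICLASS_POP:
        /* data-movement policy: copy taint from source to destination */
        break;
    case XED_ICLASS_ADD:
    case XED_ICLASS_SUB:
    case XED_ICLASS_OR:
        /* arithmetic policy: OR the operand taints (stack hijack applies) */
        break;
    case XED_ICLASS_LEA:
        /* lea policy: clear taint if the source computes a stack address */
        break;
    default:
        /* other instructions: only check whether their memory operands
         * need to be redirected */
        break;
    }
}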

Interrupt Context Interception. The beginning of an interrupt's execution for the x86 architecture in QEMU is mainly processed in the function do interrupt all. We instrument this function to acquire the interrupt number and to determine whether it is a hardware or software interrupt. After QEMU executes this function, it passes control flow to the OS kernel, which subsequently invokes the interrupt handler to process the specific interrupt. As discussed in §3.2, the interrupt handler returns using an iret instruction. Thus, by capturing the beginning and the end of an interrupt (as a pair), we identify the interrupt execution context.

MMU Management with STLB. In QEMU, the MMU is emulated in the i386-softmmu module for our x86 architecture. To implement our STLB, we largely mirrored and extended the original TLB handling code and data structures (e.g., tlb fill, tlb set page, tlb table). For load and store, QEMU actually differentiates code and data when translating the code (e.g., generating ldub code for instruction fetches). Therefore, we only instrument the data load and store helper functions in QEMU.

3.6 Empirical Evaluation

We have performed an empirical evaluation of Vmst. Next, we report our experimental results. We first tested its effectiveness in Chapter 3.6.1 with a number of commonly used utilities and natively developed programs atop Linux kernel 2.6.32-rc8 and Microsoft Windows XP (SP3), followed by the performance overhead of these programs in Chapter 3.6.2. Next, we show the generality of our system by testing with a diversity of other Linux and Windows kernels in Chapter 3.6.3. Finally, we demonstrate its security applications in Chapter 3.6.4. All of our experiments were carried out on an Intel Core i7 CPU with 8GB of memory, running Ubuntu 11.04 with Linux kernel 2.6.38-8.

3.6.1 Effectiveness

Automatic VMI Tool Generation. Most VMI functionality can be achieved by running administrative utility programs. Using Vmst, these native utilities automatically become introspection programs. In this experiment, we took 15 commonly used administrative utilities on the Linux platform and tested them with the options shown in the 1st column of Table 3.3. To measure whether we get the correct result, we take a cross-view comparison approach on our testing kernel 2.6.32-rc8. Right before we take the snapshot, we run these commands and save their results to a file. Then we attach the snapshot, run these utilities in our secure-VM, and syntactically compare (diff) the two output files. Note that we did not install any rootkit in this experiment; the security application is tested in §3.6.4.

Table 3.3. Evaluation result with the commonly used Linux utilities without any modification.

Utilities w/ options   Description                                     SyE   SeE
ps -A                  Reports a snapshot of all processes              ✗     ✓
lsmod                  Shows the status of modules                      ✓     ✓
lsof -c p              Lists opened files by a process p                ✓     ✓
ipcs                   Displays IPC facility status                     ✓     ✓
netstat -s             Displays network statistics                      ✓     ✓
uptime                 Reports how long the system has been running     ✗     ✓
ifconfig               Reports network interface parameters             ✓     ✓
uname -a               Displays system information                      ✓     ✓
arp                    Displays ARP tables                              ✓     ✓
free                   Displays amount of free memory                   ✗     ✓
date                   Prints the system date and time                  ✗     ✓
pidstat                Reports statistics for Linux tasks               ✗     ✓
mpstat                 Reports CPU related statistics                   ✗     ✓
iostat                 Displays I/O statistics                          ✗     ✓
vmstat                 Displays VM statistics                           ✗     ✓

Interestingly, as shown in the 3rd column of Table 3.3, 8 out of 15 utilities, including ps and date, have a slight syntax discrepancy, and all other commands returned syntax-equivalent (SyE) results. We examined the reasons and found the root cause to be the timing of when we took the snapshot. For the ps command, our introspected version found one fewer process, and this process is actually ps itself when running in the product-VM; it did not exist in the snapshot. For all other commands, such as uptime and date, the slight difference is also due to the timing field in the output (reflecting the time difference between running the command and taking the snapshot), but our introspected version always outputs the same result every time, which precisely shows that we introspected the correct state of the in-guest OS of the product-VM and that the state never changed. That is, our automatically generated introspection programs return semantic-equivalent (SeE) results, and we summarize this in the last column of Table 3.3. We also tested five Windows utilities on top of Windows XP (SP3). The results are presented in Table 3.4. These commands have the same semantics as when running inside the guest OS of the product-VM. The syntax differences have reasons similar to those of the Linux utilities.

Table 3.4. Evaluation result with the commonly used Windows utilities without any modification.

Utilities w/ options   Description                                         SyE   SeE
getpid                 Get Process Identification                           ✗     ✓
tasklist               List running applications and services               ✗     ✓
net user               List user accounts for the computer                  ✓     ✓
ipconfig               Displays all current TCP/IP network configuration    ✓     ✓
time /t                Display the system time                              ✗     ✓

Table 3.5. Evaluation result with the native customized Linux programs.

Customized Program   LOC   Description                       SyE   SeE
ugetpid              5     Reports the current process pid    ✗     ✓
kps                  52    Reports all the processes          ✗     ✓
klsmod               65    Displays all the modules           ✓     ✓

Native Customized VMI Tool Development. Sometimes, end-users may develop their own customized VMI tools by either invoking syscalls or programming against the kernel directly. For example, a utility command may not be able to see some advanced rootkits, and end-users could either write a native kernel module to retrieve the state, or write a user-level program that invokes special system calls (e.g., by developing a new system call). Recall that our system allows customized kernel code inspection as long as the end-user informs it of the execution context (as discussed in Algorithm 1). To show such a scenario, we developed three programs. One is a very simple user-level getpid program (with only 5 LOC) to demonstrate that end-users can invoke system calls to inspect a kernel state, and the other two are kernel-level programs (i.e., device drivers) that list all the running processes and installed kernel modules by traversing the task struct.tasks and module list data structures. Note that it took us less than one hour to develop kps (with 52 LOC) and klsmod (with 65 LOC). As presented in Table 3.5, interestingly, ugetpid always returns a constant pid value. This is because the getpid system call is by default redirected to the guest memory of the product-VM, and every time it returns the pid of the guest's "current" process at the moment of taking the snapshot. Therefore, it becomes a constant value for a particular snapshot.

For kps, it traverses the kernel task struct.tasks list and outputs all the PIDs. Similar to our automatically generated introspection ps, kps also produces almost syntax-equivalent PIDs compared with the user-level ps, except that there is one fewer PID, that of ps itself. For klsmod, we extracted all kernel modules; the module list result is both syntax- and semantic-equivalent to the state lsmod reads from the proc file system. Also, note that when we run the two kernel modules, we inform our secure-VM of the start and end addresses of kps and klsmod. To this end, we inserted two consecutive cpuid instructions at the beginning and at the end of our kernel modules. Our secure-VM automatically detects these during the execution and thereby senses the start and end addresses of the monitored context. In addition, in this case the execution context monitoring is not based on any CR3 or system call, and we also controlled our secure-VM to ensure that there is no context switch during our module execution.

Summary. The above two sets of experiments have demonstrated that we can automatically generate VMI tools, and also enable end-users to develop their own VMI tools natively without worrying about the semantic gap. Also, the slightly different results of the two views do not mean we get the wrong result; rather, all of our above experiments have faithfully introspected the guest OS of the product-VM and reported its precise state.
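For illustration, here is a minimal sketch of what a kps-like module could look like (hypothetical code, not the dissertation's kps; written against a 2.6.32-era kernel API, so it must be built with the kernel's module build system):

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#include <linux/sched.h>

/* Walk the task list (task_struct.tasks) and print every PID. */
static int __init kps_init(void)
{
    struct task_struct *p;

    for_each_process(p)
        printk(KERN_INFO "kps: pid=%d comm=%s\n", p->pid, p->comm);
    return 0;
}

static void __exit kps_exit(void)
{
}

module_init(kps_init);
module_exit(kps_exit);
MODULE_LICENSE("GPL");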

3.6.2 Performance Overhead

We also measured the performance overhead of the programs in Table 3.3, Table 3.4, and Table 3.5 by running each of them 100 times and computing the average. The results are summarized in Fig. 3.5. We found that Vmst introduces 9.3X overhead in Linux and 19.6X overhead in Windows on average for the user-level introspection programs, compared with the corresponding non-introspected process running in the VM. For the two Linux kernel modules, it introduces a very large performance overhead of up to 500X, but we have to emphasize that the absolute time cost is very short.

[Figure 3.5 is a bar chart comparing No-VMI and VMI runs: the y-axis shows the normalized performance overhead (0%-100%) and the x-axis lists the benchmark programs (ps, lsmod, ipcs, uptime, ifconfig, uname, date, arp, pidstat, mpstat, vmstat, iostat, ugetpid, netstat, klsmod, kps, tasklist, getpid, ipconfig, net, time).]

Figure 3.5. Normalized performance overhead when running Vmst.

For example, for kps, it only takes 0.06s to dump all the pids in the kernel. There are a number of reasons why we have a small performance overhead at the user level. First, all user-level code runs normally, without any instruction interception, data flow tracking, or redirection. Second, not all kernel-level system calls get redirected, only the introspection-related ones. Therefore, if a program frequently opens and reads proc files, it tends to have a larger overhead; for instance, there is 20X overhead for ps and pidstat, and 30X for lsof. However, for the kernel-level modules we have a huge performance overhead. The primary reason is that there is no user-level code, and we intercept each instruction of our kernel modules. Also, these kernel modules run very fast (their native runtime is almost negligible), so the 500X figure is an over-approximation. We can see that Vmst has a larger overhead for Windows programs than for Linux ones. The reason is that Vmst has to instrument certain kernel functions and monitor more data flow than for Linux programs (to be explained in Chapter 3.6.3). In particular, many of these utilities collaborate with other programs through remote procedure calls (RPC) to fulfill their tasks. For example, tasklist retrieves process run-time information from

wmiprvse.exe, and net user retrieves user accounts from lsass.exe. Note that lsass.exe is responsible for the enforcement of security policies in the Windows system. Tracing these programs therefore incurs a much larger overhead. When comparing with Virtuoso on performance, since the two systems have entirely different software environments, it is unfair to compare the absolute time cost (for instance, its pslist took around 6s to inspect all the running processes in their experiment (Dolan-Gavitt et al., 2011), while our kps only took 0.06s). Our 9.3X average overhead is relative to the tool running in a VM, and we did not include the performance overhead incurred by the VM itself.

3.6.3 Generality

Testing with Linux kernels. Next, we tested how general (OS-independent) our design is with regard to different OS kernels. We selected a wide range of Linux distributions, including Fedora, OpenSUSE, Debian, and Ubuntu (presented in the 1st column of Table 3.6), with 20 different kernel versions (the 2nd column). We tested whether these kernels run correctly by using our benchmark programs from Table 3.3 and our ugetpid. Note that the two kernel modules are not transparent to all kernels, and we did not test them. One of the basic metrics is whether our design presented in §3.2 to §3.4 is fully transparent to the guest OS (the 3rd column) without any modification; if it is not fully transparent, we measure the code size of our hard-coded part (the 4th column) needed to make that kernel work. Surprisingly, our design is truly transparent to Linux kernels starting from 2.6.20, and for earlier kernels we only have to add 53 LOC to support them. The reason why we have to introduce these 53 LOC for the old kernels is the way the kernel obtains the current process. More specifically, in the Linux kernel, each process has a task struct which stores

Table 3.6. OS-independent testing of Vmst.

OS Distribution    Kernel Version          Release Date   OS-independent   LOC
Redhat-9           2.4.20-31               11/28/2002     ✗                53
Fedora-6           2.6.18-1.2798.fc6       10/14/2006     ✗                53
Fedora-15          2.6.38.6-26.rc1.fc15    05/09/2011     ✓                0
OpenSUSE-11.3      2.6.34-12-default       09/13/2010     ✓                0
                   2.6.35                  08/10/2010     ✓                0
OpenSUSE-11.4      2.6.37.1-1.2-default    02/17/2011     ✓                0
                   2.6.39.4                08/03/2011     ✓                0
Debian 3.0         2.4.27-3                08/07/2004     ✗                53
Debian 4.0         2.6.18-6                12/17/2006     ✗                53
Debian 6.0         2.6.32-5                01/22/2010     ✓                0
                   2.6.32-rc8              02/09/2010     ✓                0
Ubuntu-4.10        2.6.8.1-3               08/14/2004     ✗                53
Ubuntu-5.10        2.6.12-9                08/29/2005     ✗                53
Ubuntu-10.04       2.6.32.27               12/09/2010     ✓                0
                   2.6.33                  03/15/2010     ✓                0
                   2.6.34                  07/05/2010     ✓                0
                   2.6.36                  11/22/2010     ✓                0
                   2.6.37.6                03/27/2010     ✓                0
Ubuntu-11.04       2.6.38-8-generic        06/03/2011     ✓                0
Ubuntu-11.10       3.0.0-12-generic        08/05/2011     ✓                0
Windows-XP         -                       08/24/2001     ✗                215
Windows-XP         SP2                     08/25/2004     ✗                215
Windows-XP         SP3                     04/21/2008     ✗                215

all the process execution and management information (Bovet and Cesati, 2005). During a system call's execution, the kernel itself usually first fetches the current task and then performs the specific system call for this task. Thus, when we perform our kernel data redirection, it is also crucial for us to find the current task of the in-guest OS, from which we can learn, for example, where the file system (the fs field in task struct) is, and then carry out our introspection. Starting from kernel 2.6.20, Linux uses a global variable to store the current task, and in our secure-VM it automatically gets redirected. That explains why our system is fully transparent to Linux kernels starting from 2.6.20. However, in older versions, Linux acquires the current task from the kernel stack. In particular, each task struct has a pointer to a thread info, which is usually allocated at the bottom of the kernel stack, and thread info in turn has a pointer back to the task struct. Therefore, each time the kernel fetches the current task, it first obtains the thread info by masking esp with the size of the stack. That is why we need the 53 LOC of additional code.
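The idiom used by these older kernels can be sketched as follows (simplified and hedged: it assumes an 8KB kernel stack and mirrors the standard current lookup of that era, rather than quoting the dissertation's 53 LOC):

#include <linux/thread_info.h>
#include <linux/sched.h>

/* Derive the current task from the kernel stack pointer: thread_info sits at
 * the bottom of the (here assumed 8KB) kernel stack, so masking esp with the
 * stack size locates it; thread_info->task then points to the task_struct. */
#ifndef THREAD_SIZE
#define THREAD_SIZE 8192
#endif

static inline struct thread_info *my_current_thread_info(void)
{
    unsigned long esp;

    __asm__("movl %%esp, %0" : "=r"(esp));
    return (struct thread_info *)(esp & ~(THREAD_SIZE - 1));
}

static inline struct task_struct *my_current(void)
{
    return my_current_thread_info()->task;
}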

Testing with Windows kernels. We also tested our system with different Windows kernels. As shown in Table 3.6, we had to add 215 LOC to make Vmst work with Windows. This additional code is mainly responsible for handling special non-redirectable Windows kernel data, such as kernel stack addresses and paged pool memory.

3.6.4 Security Applications

Vmst has many security applications. It can be naturally used in intrusion detection, malware analysis, and memory forensics. In the following, we demonstrate one particular application, kernel malware (i.e., rootkit) detection, and show its distinctions. We also briefly describe how to use Vmst for memory forensics.

Kernel Rootkit Detection. A kernel rootkit is a special kernel-level malware that hides important kernel objects by hijacking syscalls or other kernel functions, or by direct kernel object manipulation (DKOM). In this experiment, we selected 9 publicly available rootkits from packetstormsecurity.org that have been widely tested in the literature (e.g., (Riley et al., 2008)); they are listed in the 1st column of Table 3.7. We tested how our Vmst detects them in our working kernel 2.6.32-rc8. Note that we had to slightly modify the source code of the outdated rootkits to make them work in the 2.6.32 kernel. Most rootkits aim to hide task structs or kernel modules. To detect them, we again take a cross-view comparison approach. As summarized in Table 3.8, for rootkits that hijack kernel control flow by patching the syscall table or other kernel hooks, Vmst trivially detects them by running either the native ps or kps, because the OS kernel in our secure-VM is not contaminated. Rootkits that directly modify kernel objects (i.e., the DKOM attack) cannot be detected by normal VMI tools (or administrative inspection commands). However, our system enables end-users to develop kernel modules in-VM to inspect the kernel

Table 3.7. Attack vector description of the tested rootkits.

Rootkit Name       Attack Vector
adore-ng-0.53      Patching system call table and kernel function pointers
adore-ng-0.56      Patching kernel function pointers
hide process-2.6   Patching task list pointer (DKOM)
linuxfu-2.6        Patching task list pointer (DKOM)
sucKIT1.3b         Patching system call table and kernel function pointers
override           Patching system call table and kernel function pointers
synapsys           Patching system call table and kernel function pointers
enyelkm-1.3        Patching system call table and kernel function pointers, and DKOM
modhide            Patching module list (DKOM)

Table 3.8. Rootkits experiment under kernel 2.6.32-rc8 (a blank cell means not applicable).

                                             Detected by
Rootkit Name       Target Data Structure     ps   kps   kps'   lsmod   klsmod
adore-ng-0.53      task struct               ✓    ✓     ✓
adore-ng-0.56      task struct               ✓    ✓     ✓
hide process-2.6   task struct               ✗    ✗     ✓
linuxfu-2.6        task struct               ✗    ✗     ✓
sucKIT1.3b         task struct               ✓    ✓     ✓
override           task struct               ✓    ✓     ✓
synapsys           task struct, module       ✓    ✓     ✓      ✓       ✓
enyelkm-1.3        task struct, module       ✓    ✓     ✓      ✗       ✗
modhide            module                                       ✗       ✗

objects out-of-VM, and we are able to detect some of them, but not all. Specifically, the ps and lsmod commands visit proc files to inspect the running processes and device drivers. The data in the proc files comes from the task struct list and the module list. If a rootkit removes a node from these two lists through DKOM, neither the original ps and lsmod nor our kps and klsmod is able to identify it. As such, we have to develop new kernel modules to detect them. In particular, we developed kps' (with 130 LOC), which periodically traverses the runqueue inside the kernel and compares it with the tasks list (note that in this case we have to periodically take the memory snapshot or attach to the guest memory live). It is an opportunistic approach; namely, it detects a hidden DKOM process at a moment when the process is in the runqueue but not in the tasks list. We attempted to identify all the wait queues in the kernel, but they are scattered across many places, and it is very hard to traverse them in a centralized way. Our kps' is able to detect the "hide process-2.6" and "linuxfu-2.6" DKOM rootkits, as shown in Table 3.8. But for the last two module-hiding rootkits, we cannot detect them through traversal-based approaches.
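The cross-view check behind kps' reduces to a set difference between the two traversals; a self-contained, user-space illustration (with made-up PID arrays standing in for the runqueue and tasks-list views) is:

#include <stdio.h>
#include <stdbool.h>

static bool contains(const int *set, int n, int pid)
{
    for (int i = 0; i < n; i++)
        if (set[i] == pid)
            return true;
    return false;
}

int main(void)
{
    /* Hypothetical views: PIDs found by walking the runqueue vs. the tasks list. */
    int runqueue_pids[] = { 1, 1203, 1207, 4242 };
    int tasklist_pids[] = { 1, 1203, 1207 };
    int nr = 4, nt = 3;

    for (int i = 0; i < nr; i++)
        if (!contains(tasklist_pids, nt, runqueue_pids[i]))
            printf("possible DKOM-hidden process: pid %d\n", runqueue_pids[i]);
    return 0;
}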

One may have to use signature-based approaches (e.g., (Dolan-Gavitt et al., 2009; Lin et al., 2011)) to detect them. For instance, it has been shown that SigGraph can detect the last two rootkits (Lin et al., 2011).

Memory Forensics. Our automatically generated VMI tools can also be used in memory forensics. The only requirement is that end-users need to provide a guest CR3 value and assign it to our SCR3 for the V2P translation when "mounting" the memory. If it is a hibernation file, they have to recover a CR3 or, more generally, a pgd. In fact, this is not a problem because (1) a pgd has a strong two-layer SigGraph (Lin et al., 2011) signature, as demonstrated in our recent work OS-Sommelier (Gu et al., 2012): more specifically, each entry in the pgd is either NULL or a pointer to a page table (pte) page, and each entry in a pte is either NULL or a pointer to a page frame; and (2) all processes largely share an identical kernel address mapping, except for a little process-specific private data. In addition, there are several other approaches to recover a pgd in a physical memory snapshot, such as using kernel symbols (Walters). Due to space limitations, we omit the details of our memory forensics experiment.

3.7 Discussion

While Vmst has seamlessly bridged the semantic gap and enabled automated generation of VMI tools from binary code, its current design and implementation have a number of limitations. Next, we examine each limitation and outline our future work.

Transparency against arbitrary OSes. It is obvious that Vmst is not entirely transparent to arbitrary OS kernels. It is still bound to certain OS kernel knowledge, such as the system call interface, interrupt handling, and context switching, though such knowledge (particularly from the design of UNIX (Bach, 1986)) is general. For example, as demonstrated in our experiments, Vmst directly supports a variety of Linux kernels without any

modifications. However, an entirely different OS may have different system-call interfaces and semantics. Therefore, for an OS other than Linux/UNIX, we must inspect how its kernel handles system calls, interrupts, and context switches, and examine its specific system call semantics. But we do believe our design is general (OS-independent), and we suspect it should work for other kernels. For instance, as demonstrated in Chapter 3.4.1 and Chapter 5.5, with a slight modification of the "system call redirection policy" as well as special treatment of how Windows handles kernel data and context switches in Vmst, we directly enable a number of native Windows programs, such as getpid.exe, tasklist.exe, and ipconfig.exe, to introspect the in-guest Windows OS. This experimentally shows that our interrupt handling, context switch control, redirectable data identification, and kernel data redirection are indeed largely OS-independent.

Handling asynchronous system calls. While most system calls on the UNIX platform are synchronous, i.e., during the execution of a system call other executions (except context switches) are blocked until the system call finishes, some system calls can be asynchronous ("non-blocking"). Vmst does not support asynchronous system calls unless we can detect the precise execution context in which the kernel notifies the caller that a system call has completed. So far, we have not encountered such asynchronous system calls in our VMI tool generation.

Handling swapped kernel data. Vmst cannot read in-guest data that has been swapped out. To our surprise, we have not encountered this situation in our experiments. Our explanation is that the kernel tends to swap user-space memory instead of kernel memory, as kernel space is shared and swapping a kernel page in and out may require updating all processes' page tables. Moreover, for the Linux/UNIX kernel, we actually confirmed that kernel pages never get swapped out (Bovet and Cesati, 2005).

Other issues. Finally, for other architectural issues, such as a guest OS running on multiple cores or multiple CPUs, our secure-VM (a single CPU) may encounter some issues when reading its memory. Also, attacks targeting our secure-VM, or DKSM-based attacks (Bahram et al., 2010), are out of scope for our current work. Another avenue of future work will investigate how to address these problems.

3.8 Summary

In this chapter, we have presented the design, implementation, and evaluation of Vmst, a novel system to seamlessly bridge the semantic gap and automatically generate VMI tools from binary code. The key idea is that, through system-wide instruction monitoring at the VMM layer, we can automatically identify the introspection-related kernel data and redirect its access to the in-guest OS memory (which could be directly attached, or come from a snapshot). We have shown that such an idea is practical and truly feasible by devising and implementing a number of OS-independent enabling techniques, including syscall execution context identification, redirectable data identification, and kernel data redirection. Our experimental results have demonstrated that Vmst offers a number of new features and capabilities. In particular, it automatically turns an in-guest inspection program into an introspection program and largely relieves end-users of the procedure of developing customized VMI tools. Finally, we believe Vmst will significantly remove the hurdles in virtualization-based security, including but not limited to VMI, malware analysis, and memory forensics.

CHAPTER 4
BRIDGING THE SEMANTIC GAP VIA DECOUPLED EXECUTION AND TRAINING MEMOIZATION


In this chapter, we present the details of Hybrid-Bridge, which improves the performance of Vmst by an order of magnitude. In particular, we discuss the technical overview in Chapter 4.1, and the detailed design of Fast-Bridge in Chapter 4.2, Slow-Bridge in Chapter 4.3, and FallBack in Chapter 4.4. We then present the implementation details in Chapter 4.5, the evaluation in Chapter 4.6, a discussion in Chapter 4.7, and finally a summary in Chapter 4.8.

4.1 Technical Overview

4.1.1 Observation

Similar to Vmst (Fu and Lin, 2012), the main goal of Hybrid-Bridge is to enable native inspection utilities (e.g., ps, lsmod) to transparently investigate a remote system out-of-VM. This goal is achieved by forwarding special kernel data from a remote system (i.e., an untrusted VM) to a local system (i.e., a trusted VM). We use a simple inspection program, GetPid, to illustrate the basic idea behind Hybrid-Bridge. GetPid invokes the sys getpid system call to retrieve a running process's ID. Fig. 4.1 (a) shows a code snippet of sys getpid from Linux kernel 2.6.32.8. In particular, sys getpid kicks off by accessing current task, a global pointer which points to the currently running task, at line 5, and then dereferences the group leader field to access the group leader task structure at line 6.

© 2014 Internet Society. Reprinted, with permission, from Alireza Saberi, Yangchun Fu and Zhiqiang Lin. "HYBRID-BRIDGE: Efficiently Bridging the Semantic Gap in Virtual Machine Introspection via Decoupled Execution and Training Memoization", In Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS'14). DOI: http://dx.doi.org/10.14722/ndss.2014.23226

Figure 4.1. Code Snippet of System Call sys_getpid and the Corresponding Data Structures in Linux Kernel 2.6.32.8. [Part (a) of the figure shows the disassembled instructions of sys_getpid (addresses c10583e0-c105840f); part (b) shows the data structures they traverse: the global struct task_struct current_task (line 5), its group_leader field at offset 0x220 (line 6), pids[3] with struct pid *pid at offset 0x23c (line 7), struct pid with level at offset 0x4 and numbers[1] at offset 0x1c, and the int nr field of struct upid at offset 0x0.]

Next, it dereferences the pointer to the group_leader of task_struct at line 7 to access the pid field. Note that the real PID value is stored in the int nr field of struct upid. For the sake of brevity, we only show the partial code and the data structures accessed during sys_getpid, as illustrated in Fig. 4.1. It is important to notice that all of these data structures are accessed by dereferencing a global variable, current_task, and traversing the subsequent data structures. This observation, as first discovered by Vmst (Fu and Lin, 2012), lays one of the foundations of Hybrid-Bridge; namely, by fetching specific kernel global variables (e.g., current_task) and all of their derived data structures from the OS kernel of a remote VM, a commodity inspection tool can automatically achieve introspection capabilities. We refer to this technique as data redirection.
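To make the traversal in Fig. 4.1 concrete, the following simplified C sketch mirrors the pointer chain dereferenced at lines 5-7. The struct layouts are abbreviated to the fields shown in Fig. 4.1 (b), and the function name is illustrative; this is a sketch of the access pattern, not the actual kernel source.

    /* Simplified sketch of the structures traversed in Fig. 4.1 (Linux 2.6.32.8);
       only the fields relevant to the figure are kept. 0x220 and 0x23c are the
       compiled offsets shown in Fig. 4.1 (b). */
    struct upid        { int nr; };                      /* the real PID value */
    struct pid         { unsigned int level; struct upid numbers[1]; };
    struct pid_link    { struct pid *pid; };
    struct task_struct {
        struct task_struct *group_leader;                /* offset 0x220 */
        struct pid_link     pids[3];                     /* pids[0].pid at 0x23c */
    };

    /* The chain followed by sys_getpid: every dereference starts from the global
       current_task, so all of the touched data is redirectable. */
    static int sketch_getpid(struct task_struct *current_task)
    {
        struct pid *pid = current_task->group_leader->pids[0].pid;
        return pid->numbers[0].nr;
    }

If current_task (and every structure reached from it) is fetched from the remote VM's memory, this same code reports the remote process's PID, which is exactly the data redirection idea.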

4.1.2 System Overview

At a high level, Hybrid-Bridge enables inspection tools in a trusted VM to investigate an untrusted system's memory using native system calls and APIs as if they were investigating the trusted VM. Hybrid-Bridge achieves this goal by using data redirection (or forwarding) at the kernel level. As shown in Fig. 4.2, there are three key components inside Hybrid-Bridge: Slow-Bridge, Fast-Bridge, and FallBack. Slow-Bridge and Fast-Bridge are both capable of redirecting kernel data and enabling commodity inspection tools to investigate the untrusted system's memory. The main difference, as indicated by their names, is the lower performance overhead of Fast-Bridge compared to Slow-Bridge.


Figure 4.2. An overview of Hybrid-Bridge. To use our system, assume end users use ps to perform the introspection of a memory snapshot from an untrusted OS. If the meta-data is sufficient (provided in step 5), there will be no fall-back and Fast-Bridge executes normally as in step 1. Otherwise, Fast-Bridge will be suspended, and FallBack will be invoked (step 2) along with the snapshot of the guest OS and the command log (that is, ps). Next, in step 3, Slow-Bridge will be started with the guest snapshot and the inspection command (namely ps in this case) to produce the missing meta-data. After Slow-Bridge finishes (step 4), it will send the meta-data for training memoization and inform FallBack to resume the execution of Fast-Bridge with the new meta-data (step 5). Steps 2 to 5 will be repeated whenever the meta-data is missing in Fast-Bridge. Except for Fast-Bridge, the Slow-Bridge and FallBack components are both invisible to end users.

Given an introspection tool T, as illustrated in Fig. 4.2, Hybrid-Bridge executes it in Fast-Bridge. With the Meta-Data provided by Slow-Bridge and memoized by FallBack, Fast-Bridge enables T to investigate the untrusted system memory with low overhead.

In case the Meta-Data is not rich enough to guide Fast-Bridge, Fast-Bridge will suspend its VM execution and, through the FallBack component, request the trusted VM inside Slow-Bridge to execute T with the same untrusted memory snapshot as input. Similar to Vmst, Slow-Bridge monitors the execution of T and uses a taint analysis engine to infer the data redirection policy for each instruction. This inferred information, as part of the Meta-Data, is shared with Fast-Bridge. As soon as Fast-Bridge receives the Meta-Data from Slow-Bridge, it resumes the execution of T.

As a concrete example, assume end users use ps to perform the introspection of a memory snapshot from an untrusted OS. As illustrated in Fig. 4.2, if the Meta-Data is sufficient (provided in step 5), there will be no fall-back and Fast-Bridge executes normally as in step 1. Otherwise, Fast-Bridge will be suspended, and FallBack will be invoked (step 2) along with the snapshot of the guest VM and the command log (that is, ps). Next, in step 3, Slow-Bridge will be started with the guest snapshot and the inspection command (namely ps in this case) to produce the missing Meta-Data. After Slow-Bridge finishes (step 4), it will send the Meta-Data for training memoization and inform FallBack to resume the execution of Fast-Bridge with the new Meta-Data (step 5). Steps 2 to 5 will be repeated whenever the Meta-Data is missing in Fast-Bridge. Except for Fast-Bridge, the Slow-Bridge and FallBack components are both invisible to end users.

Hybrid-Bridge requires that both trusted VMs, in Fast-Bridge and Slow-Bridge, deploy the same OS version as the untrusted VM. The specific OS version can be identified through guest OS fingerprinting techniques (e.g., (Quynh, 2010; Gu et al., 2012)). In order to efficiently bridge the semantic gap and turn commodity monitoring tools into introspection tools, Hybrid-Bridge faces two new challenges: (1) how to pass the control flow to the hypervisor and orchestrate Fast-Bridge, Slow-Bridge, and FallBack in a seamless way, and (2) how to identify both the data and the instructions that should be redirected. We will present how these two challenges are addressed by Fast-Bridge and Slow-Bridge in Chapter 4.2 and Chapter 4.3, respectively.

4.1.3 Threat Model

Hybrid-Bridge shares the same threat model with both Virtuoso and Vmst; namely, it directly defeats those attacks that tamper with the in-guest native inspection software and the guest kernel code. Note that there are three types of VMs involved in Hybrid-Bridge: a guest VM that runs a guest OS for a particular application (e.g., a web or database service), a secure VM that runs in Fast-Bridge, and another secure VM that runs in Slow-Bridge. We distinguish between trusted and untrusted VMs. The VM directly facing attackers is the guest VM, and we call it the untrusted VM. The other two VMs are maintained by the hypervisor and are invisible to attackers, and we call them trusted VMs. While Hybrid-Bridge can guarantee that no untrusted code is redirected from the untrusted VM to the local trusted VM, it will not defend against attacks that subvert the hypervisors through other means (e.g., exploiting hypervisor vulnerabilities).

Also note that, in the rest of the chapter, we refer to the trusted VMs or secure VMs as those (1) maintained by cloud providers, (2) installed with a clean OS (the same version as the guest OS), and (3) invisible to attackers. This can be achieved because cloud providers can physically isolate Hybrid-Bridge from guest VMs. The untrusted VMs could be any type of production VM (including KVM/Xen/Hyper-V, etc.) that offers services to cloud users.

4.2 Fast-Bridge

Fast-Bridge is designed with performance in mind and runs in hardware-based virtualization (e.g., KVM) to offer a VMI solution. It is built on the key insight that each kernel instruction executed in a specific system call invocation S shows consistent data redirectability behavior across all invocations of S (which forms the basis of the memoization (Michie, 1968)). For example, sys_getpid in Linux kernel 2.6.32.8 has 14 instructions that need to be redirected by Fast-Bridge. These 14 instructions, which will always touch redirectable data, are called redirectable instructions. To this end, Fast-Bridge needs to address two challenges:

• Performing the data redirection. For example, for these 14 instructions in sys_getpid, Fast-Bridge needs to redirect their memory access from the untrusted VM to the trusted VM. While there is no dynamic binary instrumentation engine in KVM, Fast-Bridge is still capable of transparently redirecting the data access for these instructions at the hypervisor layer. This capability is achieved by manipulating the virtual-to-physical address translation and by dynamic code patching.

• Identifying the redirectable instructions. Identifying a redirectable instruction often requires a taint analysis engine (Fu and Lin, 2012), which is heavy and slow. Therefore, we propose decoupling the dynamic taint tracking engine, the primary contributor to the performance overhead of Vmst, from Fast-Bridge and implanting it into Slow-Bridge. As a result, Slow-Bridge executes the expensive taint analysis and provides the list of redirectable instructions for Fast-Bridge to bridge the semantic gap efficiently.

Fast-Bridge is depicted in the right-hand side of Fig. 4.2. Next, we present the detailed design of Fast-Bridge.

4.2.1 Variable Redirectability

A redirectable variable is defined as data in a kernel data structure that is accessed by inspection tools to reveal the system status. These data are redirectable because, if a monitoring tool in a secure VM is fed with redirectable data from the untrusted VM, it will report the internal status of the untrusted VM as if it were the secure VM's.

The most intuitive way to identify redirectable variables is by monitoring the behavior of introspection tools. As discussed in Chapter 4.1.2, an introspection tool usually starts an investigation by first accessing specific kernel global variables and then following them to traverse the kernel's internal data structures. These specific global variables and the internal data structures traversed through pointer dereferences belong to the redirectable variables. We will describe how Slow-Bridge uses a taint tracking engine to identify redirectable variables in greater detail in Chapter 4.3.2.

4.2.2 Instruction Redirectability

An instruction that accesses redirectable variables is defined as a redirectable instruction. In general, kernel instructions are divided into six categories based on how they interact with redirectable variables. Since Slow-Bridge contains a taint analysis engine, it is able to categorize the instructions. The details on how Slow-Bridge categorizes them are presented in §4.3.3. In the following, we describe what these categories are and why we have them (a small illustrative sketch of how such per-instruction categories can be memoized follows this list):

1. Redirectable: An instruction whose operands always access redirectable variables is called a redirectable instruction. The instructions at lines 5, 6, and 7 in Fig. 4.1 (a) are samples of such instructions, and the corresponding redirectable variables for these instructions are depicted in Fig. 4.1 (b). Fast-Bridge forwards all memory accesses of redirectable instructions to the untrusted memory snapshot from the secure VM.

2. Non-Redirectable: An instruction that never interacts with redirectable variables is categorized as non-redirectable. For example, the instructions at lines 1, 3, and 8 in Fig. 4.1 (a) fall into this category. Fast-Bridge confines these instructions to the memory of the local secure VM only.

3. Semi-Redirectable: Semi-redirectable instructions have two memory references, and they copy data values between redirectable variables and non-redirectable variables. For instance, push %fs:0xc17f34cc is a sample of such an instruction, because this push instruction reads a global redirectable variable (of interest) and copies it to the stack, which is non-redirectable. Fast-Bridge forwards the redirectable memory access to the untrusted memory snapshot and confines the non-redirectable memory access to the local secure VM. For push %fs:0xc17f34cc, Fast-Bridge reads the global variable (a redirectable variable) from the untrusted memory snapshot and saves it on top of the secure VM's stack, which is non-redirectable.

4. Bi-Redirectable: If an instruction shows both redirectable and non-redirectable behavior in different execution contexts, it is labeled bi-redirectable. For example, the function strlen, which returns the length of a string, can be invoked to return the length of either redirectable or non-redirectable strings in different kernel execution contexts. As such, for each invocation of a bi-redirectable instruction, Fast-Bridge must determine whether to redirect the data access (e.g., the argument of strlen) to the untrusted memory snapshot or confine it to the local secure VM based on the execution context, which is defined as the kernel code path from the system call entry to the point of the bi-redirectable instruction's execution. One of the key observations in Hybrid-Bridge is that, for a specific execution context, a bi-redirectable instruction always shows the same redirection policy (otherwise the program behavior would be non-deterministic). An introspection program is deterministic: given the same snapshot, it should always give the same output. Therefore, we can determine the correct data redirection policy of a bi-redirectable instruction based on its execution context. To this end, Hybrid-Bridge first trains the data redirection policy for each bi-redirectable instruction (using Slow-Bridge), and then memoizes the same data redirection policy in the next execution of the same kernel code path in Fast-Bridge.

5. Neutral: Instructions in this category do not reference memory. The instructions at lines 2 and 4 of Fig. 4.1 (a) are labeled neutral. Since these instructions do not access memory, Fast-Bridge does not impose any memory restriction on them.

6. Unknown: All the instructions that do not fall into any of the above categories are called unknown. This category is crucial for the synchronization and training data memoization between Fast-Bridge and Slow-Bridge. Specifically, just before an unknown instruction gets executed, Fast-Bridge passes control to the FallBack component to ask Slow-Bridge to provide detailed instruction categorization information for the same snapshot. Chapter 4.4 will describe the fall-back mechanism in greater detail.
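As referenced before the list, the following is a minimal, self-contained sketch of how such per-instruction categories could be memoized and looked up by the program counter. The type and function names are illustrative assumptions, not Hybrid-Bridge's actual data structures (the real Meta-Data is kept in a hash table, as noted in Chapter 4.4).

    /* Illustrative sketch only: a memoized Meta-Data table mapping each kernel
       instruction (by its address) to the redirection category inferred by
       Slow-Bridge. Unseen instructions default to UNKNOWN, which triggers FallBack. */
    #include <stddef.h>

    enum inst_type {
        REDIRECTABLE, NON_REDIRECTABLE, SEMI_REDIRECTABLE,
        BI_REDIRECTABLE, NEUTRAL, UNKNOWN
    };

    struct meta_entry {
        unsigned long  pc;     /* kernel virtual address of the instruction */
        enum inst_type type;   /* category trained by Slow-Bridge           */
    };

    static enum inst_type lookup_inst_type(const struct meta_entry *table,
                                           size_t n, unsigned long pc)
    {
        for (size_t i = 0; i < n; i++)
            if (table[i].pc == pc)
                return table[i].type;
        return UNKNOWN;        /* not trained yet: fall back to Slow-Bridge */
    }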

4.2.3 Data Redirection Using Dynamic Patching

Observation. Having identified the redirectable instructions, we must inform the CPU and let it redirect the data access from the secure VM to the untrusted VM for these instructions. We could possibly use static kernel binary rewriting, but this approach faces serious challenges such as accurate disassembly and sophisticated kernel control flow analysis (Rajagopalan et al., 2006). An appealing alternative would be dynamic binary instrumentation through emulation-based virtualization such as QEMU (Bellard, 2005), but this approach suffers from high performance overhead (Fu and Lin, 2012). In contrast, we would like to run hardware-assisted virtualization such as KVM, and thus we must explore new approaches.

Fortunately, we have a new observation, and we propose hijacking the virtual-to-physical address translation to achieve data redirection in Fast-Bridge. In general, the CPU accesses data using virtual addresses, and the memory management unit (MMU) is responsible for translating a virtual address to a physical address using page tables. By manipulating page table entries, we are able to make a virtual address translate to a different physical address. Therefore, Fast-Bridge can redirect a memory access by manipulating the page table in such a way that a redirectable virtual address is translated to the physical address of the untrusted memory snapshot. Fast-Bridge chooses this novel approach because it neither requires any static binary rewriting of kernel code nor suffers from the high overhead of dynamic binary instrumentation. To the best of our knowledge, we are the first to propose such a technique for transparent data redirection as an alternative to static binary code rewriting or dynamic binary instrumentation.

Our Approach. More specifically, after loading an untrusted memory snapshot, Fast-Bridge controls data redirection by manipulating the physical page numbers in page tables. In order to redirect memory access for a redirectable variable v, Fast-Bridge updates the physical page number of the page containing v with the physical page number of the page in the untrusted snapshot which contains the same variable v. Then Fast-Bridge flushes the TLB. From that point on, any memory access to v is redirected to the untrusted memory snapshot, because all the virtual-to-physical address translations for variable v point to the desired physical page in the untrusted snapshot. Fast-Bridge employs a similar technique to confine memory accesses within the secure VM. Note that changing the page table for each single instruction would introduce performance overhead; in fact, Fast-Bridge can avoid most of this overhead due to instruction locality. In particular, instructions of a similar type are usually located next to each other (this can be witnessed from Table 4.1), and Fast-Bridge leverages this feature to avoid frequent page table updates for each instruction, setting the page table once for all the adjacent instructions of the same type. Fast-Bridge uses the code patching described below to inform KVM when the page table should be updated.

Table 4.1. A Code Snippet of sys_getpid and the Corresponding Patched Code for the Non-Redirectable and Redirectable Pages

Line  Type  Original Code Page (a)               Non-Redirectable Code Page (b)  Redirectable Code Page (c)
  1   NR    c10583e0: push %ebp                  push %ebp                       int 3
  2   N     c10583e1: mov  %esp,%ebp             mov  %esp,%ebp                  mov  %esp,%ebp
  3   NR    c10583e3: push %ebx                  push %ebx                       int 3
  4   N     c10583e4: sub  $0x14,%esp            sub  $0x14,%esp                 sub  $0x14,%esp
  5   R     c10583e7: mov  %fs:0xc17f34cc,%ebx   int 3  (VMexit)                 mov  %fs:0xc17f34cc,%ebx
            (c10583ea: R_386_32 current_task)
  6   R     c10583fe: mov  0x220(%ebx),%eax      int 3                           mov  0x220(%ebx),%eax
  7   R     c1058404: mov  0x23c(%eax),%eax      int 3                           mov  0x23c(%eax),%eax
  8   NR    c105840a: call c1065660              call c1065660                   int 3  (VMexit)
  9   N     c105840f: add  $0x14,%esp            add  $0x14,%esp                 add  $0x14,%esp
 10   NR    c1058412: pop  %ebx                  pop  %ebx                       int 3
 11   NR    c1058413: pop  %ebp                  pop  %ebp                       int 3
 12   NR    c1058414: ret                        ret                             int 3
 ...
 36   U     c106551a: xor  %eax,%eax             int 3                           int 3
 37   U     c106551c: add  $0x1c,%esp            int 3                           int 3

Instruction types: R: Redirectable, NR: Non-Redirectable, N: Neutral, U: Unknown.

Dynamic Code Patching. As mentioned before, Fast-Bridge switches the data redirection policy by manipulating the page tables. An important question that pops up is "how does Fast-Bridge inform KVM that it is time to change the data redirection policy?" In our design, Fast-Bridge employs int3, a common technique used by debuggers to set a break point. Fast-Bridge overwrites with int3 the first instruction that has a different redirection policy from the previous instructions. In this way, when an int3 is executed, KVM catches the software trap and knows that this is the time to change the data redirection policy by manipulating the page table.
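To illustrate the page-table manipulation described in the "Our Approach" paragraph above, the following self-contained sketch rewrites a 32-bit (non-PAE) x86 page-table entry so that one guest virtual page translates to a frame of the untrusted snapshot instead of the secure VM's own frame. It only illustrates the idea; how Fast-Bridge locates the entries and frames inside KVM is not shown, and the function name is an assumption.

    /* Sketch: point the PTE of a redirectable page at the corresponding frame of
       the untrusted memory snapshot; flag bits (present, writable, ...) are kept. */
    #include <stdint.h>

    #define PTE_FRAME_MASK 0xFFFFF000u   /* bits 31:12 hold the physical frame number */

    static void redirect_pte(uint32_t *pte, uint32_t snapshot_frame_pa)
    {
        *pte = (*pte & ~PTE_FRAME_MASK) | (snapshot_frame_pa & PTE_FRAME_MASK);
        /* A TLB flush must follow (e.g., reloading CR3), as described above,
           so that the stale translation is dropped. */
    }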

For instance, the instructions in lines 1-4 of sys_getpid in column (a) of Table 4.1 are non-redirectable or neutral, and they are executed with no data redirection policy. But the instruction at line 5 is redirectable and has a different data redirection policy from the previous instructions, and thus Fast-Bridge patches the instruction at line 5 as shown in column (b) of Table 4.1. Next, when Fast-Bridge executes the code page of column (b) of Table 4.1, the int3 at line 5 will cause a software trap and notify KVM to change the data redirection policy.

The next question is "what should happen to the instructions that are overwritten by int3?" Fast-Bridge actually makes several copies of the kernel code to make sure the kernel control flow is not affected in spite of the dynamic code patching. More precisely, Fast-Bridge makes four copies of each kernel code page, namely the redirectable code page, the non-redirectable code page, the semi-redirectable code page, and the bi-redirectable code page. Each code page keeps some part of the original kernel code page, as well as int3 patches for the unknown instructions, if there are any.

Fast-Bridge constructs the non-redirectable code page by copying all the non-redirectable and neutral kernel instructions and patching all the remaining instructions (redirectable, semi-redirectable, bi-redirectable, and unknown instructions) with int3. Column (b) in Table 4.1 depicts a non-redirectable code page which is derived from the code snippet of sys_getpid in column (a) of Table 4.1. The instructions at lines 5, 6, and 7 in column (b) of Table 4.1 are patched because they are redirectable, and the instructions at lines 36 and 37 are patched since they are unknown instructions. More generally, Fast-Bridge constructs each code page with the kernel instructions of the corresponding category as well as the neutral kernel instructions; all other instructions are patched with int3 to make sure KVM always takes control and changes the data redirection for different categories of instructions. Fast-Bridge constructs the redirectable, semi-redirectable, and bi-redirectable code pages by following this rule. For example, column (c) of Table 4.1 shows the redirectable code page, which contains the redirectable and neutral instructions. A sketch of this page-construction step is given below.
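The following minimal sketch shows how a non-redirectable code page could be derived from the original kernel code page under the rule just described. The instruction-boundary and category information is assumed to come from Slow-Bridge's Meta-Data, and the structure and function names are illustrative, not Fast-Bridge's actual code.

    /* Sketch: copy the original page, then overwrite the first byte of every
       instruction whose category does not belong to this page with int3 (0xCC). */
    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE   4096
    #define INT3_OPCODE 0xCC

    enum inst_type { REDIRECTABLE, NON_REDIRECTABLE, SEMI_REDIRECTABLE,
                     BI_REDIRECTABLE, NEUTRAL, UNKNOWN };

    struct kinsn {
        uint16_t       offset;   /* offset of the instruction within the page */
        enum inst_type type;     /* category from the memoized Meta-Data      */
    };

    static void build_nonredirectable_page(uint8_t dst[PAGE_SIZE],
                                           const uint8_t src[PAGE_SIZE],
                                           const struct kinsn *insns, int n)
    {
        memcpy(dst, src, PAGE_SIZE);          /* start from the original kernel code */
        for (int i = 0; i < n; i++)
            if (insns[i].type != NON_REDIRECTABLE && insns[i].type != NEUTRAL)
                dst[insns[i].offset] = INT3_OPCODE;   /* trap on a policy change */
    }

The redirectable, semi-redirectable, and bi-redirectable pages would be built analogously, each keeping its own category plus the neutral instructions.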

As we mentioned earlier, each of the four kernel code pages has a specific data redirection policy, and Fast-Bridge overwrites with an int3 every instruction whose data redirection policy does not match the code page's policy. Such a simple technique notifies KVM of the right moment to change the data redirection policy. An important advantage of using four different kernel code pages embedded with int3 is that Fast-Bridge preserves the original kernel control flow as it should be, and changes the data redirection policy without the need for any sophisticated kernel control flow analysis. Considering the virtual addresses of the instructions in Table 4.1 (b) and Table 4.1 (c), we notice that Fast-Bridge maps the redirectable and non-redirectable code pages at the same virtual address to preserve the kernel control flow. In other words, each time Fast-Bridge changes the data redirection, it also re-maps the appropriate kernel code page. Next, we describe how Fast-Bridge uses Algorithm 1 at the right moment to map the appropriate code page for each data redirection policy.

Algorithm 1: SetPageTable(MD, pc, stack): Construct the Page Table to Enforce Data Redirection Based on the Instruction Type
Input : Meta-Data MD shared by Slow-Bridge, program counter pc of the instruction under investigation, and kernel stack stack of the secure VM
Output: The appropriate PageTable to enforce data redirection for the instruction located at address pc

 1:  InstructionType it <- MD.InstructionType[pc];
 2:  if IsBi-Redirectable(it) then
 3:      CallSiteChains CSCs <- MD.CallSiteChain[pc];
 4:      it <- Match&FindType(stack, pc, CSCs);
 5:  end
 6:  switch it do
 7:      case Non-Redirectable:
             return [Original Secure VM PageTable + Non-Redirectable Code Page];
 8:      end
 9:      case Redirectable:
             return [Untrusted Snapshot PageTable + Redirectable Code Page];
10:      end
11:      case Semi-Redirectable:
             return [Untrusted Snapshot PageTable + Original Secure VM PageTable[stack] + Semi-Redirectable Code Page];
12:      end
13:      case Unknown:
14:          call FallBack
15:      end
16:  endsw

The Control Transfers of the Patched Code. While instructions in Fast-Bridge have six different categories, control flow proceeds as usual for neutral instructions. As such, in the following we focus on the other five categories and describe how Fast-Bridge uses Algorithm 1 to choose the appropriate kernel code page and map it for each category during kernel instruction execution:

1. Non-Redirectable: As described earlier, non-redirectable instructions should be restricted to the secure VM's memory. Line 7 of Algorithm 1 restricts the memory access to the local secure VM by using the original secure VM's page table and the non-redirectable code page through manipulating the page table entries. Table 4.1 (b) shows a non-redirectable code page which is mapped to the same virtual address as the original kernel code page in Table 4.1 (a). We can see that the original program semantics is still preserved. For instance, the instructions in lines 1-4 of the non-redirectable code page in Table 4.1 (b) would be executed just like the original kernel code page, but instruction 5, an int3, would cause a software trap and a VM exit. KVM then looks up the data redirection policy for this instruction and, by querying the memoized Meta-Data, finds out that instruction 5 is a redirectable instruction; consequently, the redirectable code page will be mapped and executed.

2. Redirectable: All the data accesses of redirectable instructions should be forwarded to the untrusted memory snapshot, and the redirectable code page should be mapped instead of the original kernel code. To this end, line 9 of Algorithm 1 manipulates the page table to point to the untrusted VM snapshot. During the virtual-to-physical address

translation in the secure VM using the manipulated page table entries, virtual addresses of the secure VM are translated into physical addresses of the untrusted memory snapshot. In this way the secure VM (i.e., our KVM) can access the snapshot memory of the untrusted VM transparently. Line 9 of Algorithm 1 also maps the redirectable kernel code page. Table 4.1 (c) illustrates the redirectable kernel code page of the sys_getpid shown in Table 4.1 (a). Using Algorithm 1, KVM changes the page table and maps the redirectable code page; then instructions 5-7 are executed while accessing the untrusted memory snapshot. Instruction 8 of the redirectable code page, an int3, informs KVM to change the data redirection policy to non-redirectable by raising a software trap.

3. Semi-Redirectable: By the definition of semi-redirectable instructions, these instructions are allowed to reference non-redirectable data (i.e., the stack) in the trusted VM and the redirectable data of the untrusted VM. Line 11 of Algorithm 1 manipulates the page table to map the memory of the untrusted VM snapshot, the trusted VM's kernel stack, and the semi-redirectable code page.

4. Bi-Redirectable: For bi-redirectable instructions, whether they are redirectable depends on the execution context. Ideally, we should use the kernel code execution path to precisely represent the execution context. To that end, we would have to instrument all the branch and call instructions to track the path, which is very expensive and contradicts our design. Therefore, we use a lightweight approach to approximate the kernel execution path. Specifically, we use a combination of the instruction address (i.e., the PC) and the Call-Site-Chain (CSC), which is defined as a concatenation of all return addresses on the kernel stack, as the representation of a unique execution context. Slow-Bridge

provides a set of CSCs that are stored in the Meta-Data for each bi-redirectable instruction bi. While this approximation is less precise, our experimental results (§4.6) reveal that for each bi, the CSC and the PC uniquely distinguish the execution context. If it happens that the CSC and PC are not sufficient to distinguish the correct execution context, then Hybrid-Bridge will fail and we have to warn the user, though we have not encountered such a case. Note that Hybrid-Bridge is able to detect this by simply checking the Meta-Data.

In particular, before a bi gets executed, as illustrated in line 3 of Algorithm 1, Fast-Bridge retrieves the CSCs for the current instruction pointed to by the PC from the Meta-Data. Line 4 of Algorithm 1 then matches the CSCs against the current kernel stack. If any of the CSCs matches the current stack, then Fast-Bridge picks the correct data redirection policy, redirectable or non-redirectable, and resumes the execution. Otherwise, the instruction type is treated as unknown and FallBack is invoked to find the appropriate policy.

To retrieve the CSC from the current kernel stack, Fast-Bridge reads each specific return address from the offset location information provided by the memoized Meta-Data (see the sketch after this list). This offset location of a return address inside a stack frame is acquired by Slow-Bridge through dynamic binary instrumentation. In other words, we do not actually need the guest kernel to be compiled with frame pointers.

5. Unknown: If an unknown instruction (e.g., line 36 in Table 4.1) gets executed, KVM catches the software trap and queries Algorithm 1. Since Fast-Bridge does not know the corresponding data redirection policy for unknown instructions, lines 13 and 14 of Algorithm 1 fall back to Slow-Bridge to find out the correct data redirection policy. This is the only moment when Slow-Bridge gets invoked in Hybrid-Bridge.
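As referenced in the discussion of bi-redirectable instructions above, the following self-contained sketch shows one way a Call-Site-Chain signature could be formed from the PC and the return addresses read from the kernel stack at the memoized offsets. Folding the chain into a hash is an illustrative choice, not necessarily how Hybrid-Bridge encodes its training data.

    /* Sketch: name an execution context by the PC plus the concatenation of the
       return addresses on the kernel stack (here folded into an FNV-1a hash). */
    #include <stdint.h>
    #include <stddef.h>

    static uint64_t csc_signature(uint32_t pc, const uint32_t *ret_addrs, size_t n)
    {
        uint64_t sig = 1469598103934665603ull;        /* FNV-1a offset basis   */
        sig = (sig ^ pc) * 1099511628211ull;
        for (size_t i = 0; i < n; i++)                /* concatenate the chain */
            sig = (sig ^ ret_addrs[i]) * 1099511628211ull;
        return sig;
    }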

4.3 Slow-Bridge

Slow-Bridge, as depicted in the left-hand side of Fig. 4.2, consists of (1) a trusted VM that is installed with the same version of the guest OS kernel as that of Fast-Bridge, and (2) an untrusted guest OS memory snapshot forwarded by FallBack from Fast-Bridge. Slow-Bridge provides two important services for Fast-Bridge:

• Instruction Type Inference. As discussed in Chapter 4.2.2, instructions are classified into six different categories, and the classification is done by Slow-Bridge.

• Fall-Back Mechanism. When Fast-Bridge faces a new code path and does not know the appropriate data redirection policy, Slow-Bridge provides a vital fall-back mechanism to deal with this issue.

At a high level, Slow-Bridge works as follows: when an inspection tool inside the trusted VM of Slow-Bridge invokes a system call, Slow-Bridge identifies the system call of interest (Chapter 4.3.1), pinpoints the redirectable variables (Chapter 4.3.2), infers the corresponding redirectable instruction types (Chapter 4.3.3), performs data redirection (Chapter 4.3.4), and shares the Meta-Data with Fast-Bridge. In the following, we present how Slow-Bridge carries out each of these steps.

4.3.1 Detecting the System Calls of Interest

Slow-Bridge is interested in system calls that reveal the internal states of an OS. In terms of identifying the system calls of interest, Slow-Bridge is no different from Vmst. Specifically, Slow-Bridge is interested in two types of system calls: (1) state query system calls (e.g., getpid) and (2) file-system-related system calls which inspect the kernel by reading the proc files. Slow-Bridge follows a similar approach to Vmst and inspects 14 file system and 28 state query system calls (cf. (Fu and Lin, 2012)).

4.3.2 Redirectable Variables Identification

Redirectable variables, described in §4.2.1, are kernel data accessed by inspection tools to reveal the system status. There are two approaches to identify redirectable variables. The first approach follows the typical introspection footsteps by reading interesting kernel global variables, which are exported in System.map. Following the global variables, introspection tools reach out to kernel data structures in the heap and extract the system status. Finding the relevant set of global variables for each system call is a challenging task, especially considering the fact that this list has to be tailored for different versions of OSes. As such, the second approach focuses on non-redirectable variables and redirects the rest of the kernel data in the specific system call execution context. A simple definition of a non-redirectable variable is any variable derived from the kernel stack pointer (i.e., esp), which is tied to the local trusted system. Slow-Bridge follows the second approach and embodies a taint analysis engine to find all the data derived from esp. Note that this approach was proposed in Vmst (Fu and Lin, 2012), and Slow-Bridge makes no new technical contribution in this regard.

The taint analysis engine maps each register and memory variable to a boolean value called its taint value (TV). All TVs are initialized to zero except the taint value of esp (TV[esp]), which is set to one. The initial taint values indicate that, at the start of a system call, esp is the only data that is considered non-redirectable. A concise description of the rules for taint propagation is presented in Table 4.2, though more detailed rules and design can be found in (Fu and Lin, 2012). Any access to a variable R with TV[R] equal to zero is redirected to the untrusted memory snapshot. If TV[R] is equal to one, Fast-Bridge uses the local value of the variable R from the trusted VM.

Table 4.2. Taint Propagation Rules

E            TV[E]               Comments
c            0                   Constants are always untainted
esp          1                   The stack pointer is always tainted
R            TV[R]               Taint value of register or memory R
R := R'      TV[R] := TV[R']
R := *(R')   TV[R] := TV[*R']    * is the dereference operator
(*R) := R'   TV[*R] := TV[R']
R' op R''    TV[R'] || TV[R'']   op represents a binary arithmetic or bitwise operation
op R'        TV[R']              op represents a unary arithmetic or bitwise operation
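As a concrete illustration of the rules in Table 4.2, the following self-contained sketch applies them to a toy register/shadow-memory model. It is only meant to show the propagation logic; Vmst's and Slow-Bridge's actual engine operates on QEMU's instruction stream, and all names here are illustrative.

    /* Toy taint propagation following Table 4.2: TV = 1 means non-redirectable
       (derived from esp), TV = 0 means redirectable. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stddef.h>

    #define NUM_REGS  8
    #define SHADOW_SZ (1u << 20)              /* toy shadow memory, one flag per byte */

    enum { REG_ESP = 4 };

    static bool reg_tv[NUM_REGS];             /* taint value per register             */
    static bool mem_tv[SHADOW_SZ];            /* taint value per (toy) memory byte    */

    static void taint_init(void)
    {
        for (size_t i = 0; i < NUM_REGS; i++)
            reg_tv[i] = false;                /* constants and plain data: untainted  */
        reg_tv[REG_ESP] = true;               /* rule: esp is always tainted          */
    }

    /* R := R'     ->  TV[R]  := TV[R']   */
    static void mov_reg_reg(int dst, int src)   { reg_tv[dst] = reg_tv[src]; }
    /* R := *(R')  ->  TV[R]  := TV[*R']  */
    static void load(int dst, uint32_t addr)    { reg_tv[dst] = mem_tv[addr % SHADOW_SZ]; }
    /* (*R) := R'  ->  TV[*R] := TV[R']   */
    static void store(uint32_t addr, int src)   { mem_tv[addr % SHADOW_SZ] = reg_tv[src]; }
    /* R' op R''   ->  TV[R'] || TV[R'']  */
    static void binop(int dst, int src)         { reg_tv[dst] = reg_tv[dst] || reg_tv[src]; }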

4.3.3 Inferring Instruction Redirectability

Tracking redirectable variables also enables Slow-Bridge to infer kernel instructions' data redirection types based on their interaction with the redirectable variables. To this end, Slow-Bridge logs every instruction executed, along with all of its memory references, in the context of a monitored system call. Slow-Bridge then traverses the log file to classify each instruction into one of the categories mentioned in §4.2.2. More specifically, Slow-Bridge uses the following rules to infer the instruction redirection type:

1. Redirectable: If all execution records of an instruction always access redirectable variables, this instruction is categorized as a redirectable instruction.

2. Non-Redirectable: If all execution records of an instruction always access non-redirectable variables, this instruction is a non-redirectable instruction.

3. Semi-Redirectable: If an instruction accesses two variables, one redirectable and the other non-redirectable, in a single record, this instruction is called semi-redirectable.

4. Bi-Redirectable: If several execution records show that an instruction accesses redirectable and non-redirectable variables, always in different execution contexts, then this instruction is categorized as bi-redirectable. Note that, having the taint tracking engine, Slow-Bridge can infer whether a bi-redirectable instruction is referencing a redirectable or a non-redirectable variable in each execution. But Fast-Bridge needs a mechanism to differentiate between different invocations of bi-redirectable instructions, and it relies on the Meta-Data information provided by Slow-Bridge to enforce the correct data redirection for each execution of a bi-redirectable instruction. In particular, before each bi-redirectable instruction gets executed, Slow-Bridge extracts the values of all return addresses on the stack, as well as their location offsets with respect to the base address of the stack, and stores them in the Meta-Data. The return addresses are used to form the Call-Site-Chain (CSC) as a signature in the training data, and the offset list is used to facilitate Fast-Bridge's retrieval of these return addresses at run time from the Fast-Bridge kernel stack.

5. Neutral: An instruction with no record of memory access in the log is categorized as a neutral instruction.

6. Unknown: All the instructions that are not executed in the context of a system call are labeled as unknown instructions, which is crucial for FallBack to take over control and invoke Slow-Bridge to infer their redirection type.

4.3.4 Data Redirection

Slow-Bridge enables the trusted VM to access the untrusted memory snapshot transparently by forwarding every access to a redirectable variable to the untrusted snapshot memory. Unlike Fast-Bridge, which uses a page table manipulation technique to redirect the data, Slow-Bridge uses memory emulation at the VMM level. More details on how to use an emulation-based VM for data redirection can be found in Vmst (Fu and Lin, 2012). A simplified sketch of this emulation-based redirection is shown below.
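The following self-contained sketch illustrates the emulation-based redirection idea: on every emulated guest load, untainted (redirectable) addresses are served from the untrusted snapshot while tainted addresses are served from the trusted VM's own memory. The two backing buffers and the taint query stand in for QEMU's softmmu and Vmst's engine; they are assumptions for illustration only.

    /* Sketch of emulation-level data redirection: choose the backing memory for
       each guest load based on the taint value of the access. */
    #include <stdint.h>
    #include <stdbool.h>

    #define GUEST_RAM_SZ (1u << 20)                /* toy guest physical memory  */

    static uint8_t trusted_ram[GUEST_RAM_SZ];      /* memory of the trusted VM   */
    static uint8_t snapshot_ram[GUEST_RAM_SZ];     /* untrusted memory snapshot  */

    /* Placeholder hook: true if the access is derived from esp (non-redirectable). */
    static bool access_is_tainted(uint32_t addr) { (void)addr; return false; }

    static uint8_t emulated_load_u8(uint32_t addr)
    {
        uint32_t off = addr % GUEST_RAM_SZ;
        return access_is_tainted(addr) ? trusted_ram[off]    /* confine locally */
                                       : snapshot_ram[off];  /* redirect        */
    }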

4.4 FallBack

The key component connecting Fast-Bridge and Slow-Bridge is FallBack, which is shown in the middle of Fig. 4.2. Since Fast-Bridge uses the Meta-Data provided by Slow-Bridge through dynamic analysis, there might exist instructions that have not been trained by Slow-Bridge, and we call them unknown instructions. At a high level, if Fast-Bridge faces an unknown instruction (ui in short) during the execution, it suspends its execution and falls back to Slow-Bridge through FallBack.

The rationale for such an OS-page-fault-style fall-back mechanism is based on the observation that if FallBack passes the same untrusted memory snapshot and introspection command to Slow-Bridge, then the trusted VM (i.e., the QEMU emulator) in Slow-Bridge will invoke the same command and eventually execute the same ui. Because we run the same code to examine the same state of the untrusted memory snapshot, the program should follow the same path and finally touch the same ui in both trusted VMs of Fast-Bridge and Slow-Bridge (the deterministic property of the introspection program).

In order to execute the same introspection command in the trusted VM inside Slow-Bridge, there are several approaches: one is to use network communication to connect to the trusted VM from the hypervisor and invoke the command; another is to use a process implanting approach (Gu et al., 2011) to inject the introspection process into the trusted VM; or we can use an in-VM assisted approach that installs a certain agent inside the trusted VM to invoke the command. After the introspection command finishes its execution in the trusted VM, Slow-Bridge will update the Meta-Data, which is implemented using a hash table for the memoization, and then inform FallBack to resume the execution of the trusted VM in Fast-Bridge for further introspection.
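To make the fall-back flow concrete, here is a minimal sketch of the hypervisor-side steps: pause the Fast-Bridge VM, replay the logged command inside Slow-Bridge's trusted VM, and resume once the new Meta-Data is available. The helper functions, the host name, and the path are placeholders, not Hybrid-Bridge's real interfaces (the actual implementation invokes the command over ssh from a dynamically created shell, as described in Chapter 4.5).

    /* Sketch of the fall-back orchestration; the VM-control helpers are stubs. */
    #include <stdio.h>
    #include <stdlib.h>

    static void pause_fast_bridge_vm(void)            { /* stand-in for VM control */ }
    static void resume_fast_bridge_vm(const char *md) { (void)md; /* reload Meta-Data */ }

    static void fall_back(const char *command_log)    /* e.g., "ps" */
    {
        char cmd[512];

        pause_fast_bridge_vm();
        /* Re-run the same inspection command in Slow-Bridge's trusted VM; system()
           blocks until the command, and thus the training run, has finished. */
        snprintf(cmd, sizeof(cmd), "ssh trusted-vm '%s'", command_log);
        if (system(cmd) != 0)
            fprintf(stderr, "fall-back training run failed\n");
        resume_fast_bridge_vm("/tmp/hybrid-bridge-metadata");
    }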

4.5 Implementation

We have developed a proof-of-concept prototype of Hybrid-Bridge. Basically, we instrument KVM (Kivity et al., 2007) to implement the Fast-Bridge and FallBack components, and modify Vmst (Fu and Lin, 2012) to implement Slow-Bridge. Specifically:

Fast-Bridge. Fast-Bridge provides three main functionalities, and they are implemented in the following way:

• Guest VM system call interception: To activate the data redirection policy on the system calls of interest, Fast-Bridge needs to intercept all the guest VM system calls. We implement the system call interception feature atop a recent KVM-based system call interception system, Nitro (Pfoh et al., 2011).

• Data redirection: As described in Chapter 4.2.3, Fast-Bridge manipulates the guest OS page table to achieve transparent data redirection.

• Finding the exact time to change the data redirection policy: As mentioned earlier, Fast-Bridge changes the data redirection policy only when the instruction type of the next-to-be-executed instruction differs from that of the current one. To this end, Fast-Bridge needs an efficient mechanism to be notified when the current data redirection policy should be changed. As described in Chapter 4.2.3, Fast-Bridge uses a software trap technique to notify KVM to change the data redirection policy. In particular, Fast-Bridge employs the Exception Bitmap, a 32-bit VM-execution control field that contains one bit for each exception. If the fourth bit of the Exception Bitmap is set, then an int3 execution in the guest VM will cause a VMExit. Using this technique, KVM is notified to take control and change the data redirection policy accordingly.

In total, we added 3.5K LOC to implement Fast-Bridge in the KVM code base.

Slow-Bridge. We reused our prior Vmst code base (especially the taint analysis component) to implement Slow-Bridge. Additionally, we developed over 1K LOC atop Vmst to infer each instruction's data redirection type (described in Chapter 4.3.3) and memoize it in the Meta-Data.

FallBack. We did not adopt the process implanting or in-VM agent assisted approach to implement FallBack; instead, we use a network communication approach. In particular, in order to run the same introspection command in the trusted VM inside Slow-Bridge, FallBack dynamically creates a shell which uses ssh to invoke the command through the system API, such that FallBack can precisely know when the command finishes. This ssh shell does not introduce any side effect for our introspection purpose with respect to the untrusted memory snapshot. Also, it is straightforward to implement the logic for parsing the command log, managing the Meta-Data, and controlling the VM states. In total, we developed 300 LOC for FallBack.

4.6 Evaluation

Next, we present our evaluation results. We use 15 native inspection tools to examine the correctness of Hybrid-Bridge, and we report this set of experiments in §4.6.1. Then we evaluate the performance overhead of Hybrid-Bridge and compare it with both Virtuoso and Vmst in §4.6.2. Note that we have access to the source code of Virtuoso (as it is publicly available (Dolan-Gavitt et al., 2011)) as well as our own Vmst source code. We ran Vmst, Virtuoso, and Hybrid-Bridge on a machine with an Intel Core i7 and 8GB of physical memory to collect the performance results. Ubuntu 12.04 (kernel 2.6.37) and Debian 6.04 (kernel 2.6.32.8) were our host and guest OS, respectively.

4.6.1 Correctness

To evaluate the correctness of Hybrid-Bridge, we use a cross-view comparison approach as in Vmst. Specifically, we first execute the native inspection tools shown in the first column of Table 4.3 on an untrusted VM and save their outputs. Then we take a memory snapshot of the untrusted VM, use Hybrid-Bridge to execute the same set of inspection tools inside the trusted VM, and compare the two outputs.

Table 4.3. Correctness Evaluation Result of Hybrid-Bridge and the Statistics of the Number of Each Instruction Type

App. Name   Neutral   Non-Red.   Red.   Semi-Red.   Bi-Red.   Syntax Equal   Semantics Equal
getpid           32         28     10           1         0        no             yes
gettime          31         17      1           1         0        no             yes
hostname         92         53     26           1         0        yes            yes
uname            92         53     26           1         0        yes            yes
arp            4649       3383   1852          55        34        yes            yes
uptime         2339       1781    908          24         0        no             yes
free           2497       1958    987          28         0        no             yes
lsmod          2418       1752    923          26        19        yes            yes
netstat        2884       2020   1106          31         7        yes            yes
vmstat         2865       2432   1086          32         0        no             yes
iostat         3472       2793   1299          30        26        no             yes
dmesg           106         54     13           2         0        yes            yes
mpstat         3219       2650   1205          44        21        no             yes
ps             5181       4185   1825          51        14        no             yes
pidstat        4325       3678   1630          44        28        no             yes

The seventh column of Table 4.3 shows that six inspection tools have exactly the same output across the two rounds of execution. Manual investigation of the remaining nine tools shows that the slight differences in outputs are due to timing. For example, date and uptime have different output because there is a time difference between running them on the untrusted OS and taking the snapshot; if we account for this time difference, the outputs are the same. Another example is ps, which also has a small difference in its output. The ps command in the untrusted OS shows itself in the list of processes, but when we take the snapshot right after the ps execution, ps is no longer running; thus the output of Hybrid-Bridge shows one process fewer compared to the untrusted OS output. The last column of Table 4.3 shows that, taking timing differences into account, the outputs of all 15 tools are semantically equivalent.

In addition, Table 4.3 also presents the statistics of the different instruction types categorized by Slow-Bridge during the execution of each command. These are shown in the third to sixth columns. An interesting observation that can be drawn from these statistics is that semi-redirectable and bi-redirectable instructions tend to be rare compared to the other instruction categories, and the majority of the instructions are either neutral or non-redirectable. Also, note that Hybrid-Bridge's cost does not have a direct correspondence with the size of the user-level program: all of our instrumented execution occurs at the kernel level for the system calls of interest, which is the primary factor for the scalability of our system. For instance, the first three programs in Table 4.3 have fewer monitored instructions even though their user-level code size is as big as the others'. In our experiment, the ps command has the largest number of trapped instructions according to Table 4.3. More specifically, we dynamically observed over four million instruction executions at the kernel side, covering in total 10,244 unique instructions according to the sum of the second to the sixth columns of Table 4.3.

Table 4.4. Performance of Each Component of Hybrid-Bridge and Its Comparison with Vmst. Columns: (1) application name; (2) KVM (sec.); (3) Vmst (sec.); (4) Hybrid-Bridge without any Meta-Data (sec.); (5) Hybrid-Bridge with full Meta-Data, i.e., Fast-Bridge (sec.); (6) number of VMExits in Hybrid-Bridge.

 (1)           (2)      (3)       (4)       (5)        (6)
getpid        0.004    0.423     1.976     0.005          2
gettime       0.004    0.392     1.985     0.005          4
hostname      0.004    0.488     2.199     0.005         10
uname         0.003    0.389     2.211     0.005         10
arp           0.086    0.739     2.360     0.094       1852
uptime        0.005    0.591     1.810     0.012       1892
free          0.007    0.627     2.755     0.017       3927
lsmod         0.018    1.034     2.329     0.048      11875
netstat       0.014    1.454     1.719     0.107      23165
vmstat        0.007    2.195     4.186     0.109      86578
iostat        0.01     2.323     5.047     0.120      97390
dmesg         0.155    8.622     4.845     0.295      11663
mpstat        0.008    1.635     4.460     0.153     124525
ps            0.009    6.623    10.047     0.481     418124
pidstat       0.016    8.095    12.585     0.598     490713

4.6.2 Performance Evaluation

Hybrid-Bridge is designed to significantly improve the performance of existing VMI solutions. Next, we present how and why Hybrid-Bridge advances the state of the art and meets our design goals. Table 4.4 shows the execution time of the inspection tools tested in §4.6.1. The second and fifth columns of Table 4.4 display the execution time of the inspection tools on a vanilla KVM and on Fast-Bridge, respectively. Comparing these two columns reveals that Fast-Bridge has on average a 10X slowdown compared to the vanilla KVM. Fig. 4.3 illustrates the details of the performance evaluation for each inspection tool in Fast-Bridge compared to KVM.

Figure 4.3. Fast-Bridge Slowdown Compared to KVM. [Bar chart; y-axis: slowdown (times), 0-60; the tools are sorted by their number of VMExits.]

The fourth column of Table 4.4 displays the execution time of the inspection tools in Slow-Bridge. The taint analysis engine and the full emulation architecture of QEMU are the two main contributors to the 150X slowdown of Slow-Bridge compared to Fast-Bridge. The third column of Table 4.4 shows the running time of Vmst. The speedup of Fast-Bridge compared to Vmst is illustrated in Fig. 4.4; it is important to notice that Fast-Bridge has on average a 38X speedup compared to Vmst.

Speedup and Slowdown Gap. After examining the performance data, a natural question that pops up is why there is such a large gap between the speedups of the inspection tools in Table 4.4. The same question applies to the gap between their slowdowns. While there are several reasons behind these gaps, we believe the main contributor is the number of VMExits. As we mentioned in Chapter 4.2.3, Fast-Bridge notifies KVM to change the data redirection policy by using a code patching technique. The software trap raised by the code patching causes a VMExit and transfers the execution to KVM, as illustrated in Table 4.1. The sixth column of Table 4.4 shows the number of VMExits during the corresponding inspection tools' execution. We also illustrate this fact in Fig. 4.3, which sorts the inspection tools based on the number of VMExits. We can observe from Fig. 4.3 that, as the number of VMExits increases from left to right, the Fast-Bridge slowdown compared to vanilla KVM jumps from 25% to more than 20X. This trend clearly illustrates that VMExits are the main contributor to the Fast-Bridge overhead.

Figure 4.4. Fast-Bridge Speedup Compared to Vmst. [Bar chart; y-axis: speedup (times), 0-100.]

The Fast-Bridge speedup illustrated in Fig. 4.4 also indicates the negative effect of VMExits on Fast-Bridge. In particular, Fig. 4.4 shows that, as the number of VMExits increases from left to right, the speedup factor drops dramatically. For example, getpid achieves an 84X speedup because it needs only two VMExits, but ps cannot achieve better than a 13X speedup because it causes more than 418,000 VMExits.

Comparison with Virtuoso. In addition, we use the four inspection tools shipped with the Virtuoso source code to compare the performance of Fast-Bridge and Virtuoso. The detailed results are presented in Table 4.5. We can see that Fast-Bridge achieves a 4X-23X speedup (13X on average) compared to Virtuoso. The fourth column of Table 4.5 shows the number of x86 instructions extracted by Virtuoso for each tool. Considering the fourth and the last columns of Table 4.5, we can see that as the size of the inspection tool increases, Fast-Bridge achieves a better speedup compared to Virtuoso.

Table 4.5. Performance Comparison of Fast-Bridge and Virtuoso

App. Name      Native (sec.)   Virtuoso (sec.)   #X86 Inst. in Virtuoso   Fast-Bridge (sec.)   Fast-Bridge vs. Virtuoso
gettime                0.004             0.023                      482                0.005                      4.60X
getpid                 0.004             0.024                      516                0.005                      4.80X
tinyps                 0.020             1.501                   140843                0.064                     23.45X
getprocname            0.006             2.716                   294797                0.132                     20.57X

We have verified that the two primary reasons for Virtuoso's slowdown are: (1) micro-operation code explosion, as the number of micro operations often increases by 3X to 4X, and (2) executing the translated micro operations in Python (which is very slow).

Number of Fall-Backs. Hybrid-Bridge outperforms Vmst and Virtuoso if the inspection tools primarily get executed in Fast-Bridge. It is therefore important to find out how many times Fast-Bridge has to fall back to Slow-Bridge before it can execute an inspection tool completely in Fast-Bridge. In order to answer this question, we take five different snapshots of the untrusted VM and execute the inspection tools using Hybrid-Bridge. As shown in Fig. 4.5, in the first round, when no Meta-Data is available, all the tools fall back to Slow-Bridge and they all have a very high overhead. Fig. 4.5 also shows that the first round of Meta-Data provides instruction category information that is rich enough for Fast-Bridge: 11 out of the 15 inspection tools require no more memoization (no more fall-backs) to Slow-Bridge. The remaining four inspection tools face new code paths in their second executions and fall back to Slow-Bridge a second time. After two rounds of execution on two different memory snapshots, according to Fig. 4.5, Fast-Bridge is able to execute all the inspection tools on new memory snapshots without any support from Slow-Bridge. In other words, after a few runs, all the inspection tools are executed with very low overhead in Fast-Bridge.

Figure 4.5. Execution time of inspection tools in Hybrid-Bridge with five different memory snapshots. [Grouped bar chart; x-axis: N-th snapshot (1st through 5th); y-axis: seconds (0-14); one bar per inspection tool (getpid, gettime, uname, hostname, arp, uptime, free, lsmod, netstat, vmstat, iostat, dmesg, mpstat, ps, pidstat).]

4.7 Discussion

Homogeneity of Guest OS Kernel. As discussed in §4.1.2, Hybrid-Bridge requires that both trusted VMs, in Fast-Bridge and Slow-Bridge, deploy the same OS version as the untrusted VMs. Note that we only require the same version of the guest OS kernel, not the same set of kernel modules. For instance, lsmod can certainly return different sets of running kernel modules for different running instances, because end users might have different customizations of kernel modules.

Memory-only Introspections. Similar to Virtuoso and Vmst, Hybrid-Bridge supports introspection tools that investigate only memory, not files on disk. It might be an option to directly mount the disk and inspect it, but for an encrypted file system we would have to seek other techniques. We leave the introspection of disk files to future work.

Also, if a memory page is swapped out, Hybrid-Bridge, like Vmst and Virtuoso, cannot perform introspection on that page. However, we may argue that OSes usually tend not to swap out kernel pages, since they are shared between applications. In fact, kernel memory pages are never swapped out in the Linux kernel (Bovet and Cesati, 2005).

Attacking Hybrid-Bridge. Since Hybrid-Bridge is built atop KVM and QEMU, any successful exploit against KVM or QEMU might be able to compromise Hybrid-Bridge, if our infrastructure is not completely isolated from attackers. Moreover, it might appear possible to launch a return-oriented programming (ROP) attack, or other control flow hijacking attacks, against our trusted VM by manipulating the non-executable data in the untrusted VM kernel, because Hybrid-Bridge consumes data from the untrusted memory snapshot. However, it is important to mention that Hybrid-Bridge monitors all instruction execution (including the data flow), and it never fetches a return address from the untrusted VM (recall that stack data is never redirected). Therefore, the only way for an attacker to mislead the control flow of our trusted VM is to manipulate function pointers. This can also be detected because we check all instruction execution: whenever a function pointer value is loaded from the untrusted VM and later gets called, we can raise a flag (because we can observe this data flow) and stop the function call, though this will lead to a denial-of-service attack.

Evading Our Introspection. Hybrid-Bridge assists VMI developers in reusing inspection tools for introspection purposes. However, if the system calls and well-defined APIs used in the inspection tools are not rich enough to perform an introspection task, then Hybrid-Bridge cannot help further. For example, if a Linux rootkit removes a malicious task from the task linked list, then an inspection tool that relies on the task linked list to enumerate all the running processes

would fail to detect the malicious task. Note that both Virtuoso and Vmst also face this limitation.

More Precise Execution Context Identification for Bi-Redirectable Instructions. Fast-Bridge depends on the execution context to determine the correct data redirection policy for bi-redirectable instructions. While our current approximation with the CSC and PC has not generated any conflict yet, if Slow-Bridge ever detects such a case, we will have to resort to other means, such as instrumenting kernel code to add certain wrappers to further differentiate the context, or developing a kernel path encoding technique. We leave this as another part of our future work, should such a case exist.

Part of our future efforts will also address performance. For instance, a possible way to improve the performance of Fast-Bridge is not to catch int3 at the hypervisor level (avoiding the VM exit). Instead, we can introduce an in-guest kernel module and patch the int3 interrupt handler to switch the page table entries.

Supporting Kernel ASLR. Hybrid-Bridge currently works with the Linux kernel, which so far has not deployed kernel-space address space layout randomization (ASLR) (Edge, 2013). Addressing kernel ASLR for recent Windows-like systems is another avenue of future work.

4.8 Summary

In this chapter, we have presented Hybrid-Bridge, a fast virtual machine introspection system that allows the reuse of existing binary code to automatically bridge the semantic gap. Hybrid-Bridge combines the strengths of both the training-based scheme from Virtuoso, which is fast but incomplete, and the online kernel data redirection scheme from Vmst, which is slow but complete. By using a novel fall-back mechanism with decoupled execution and training memoization at the hypervisor layer, Hybrid-Bridge decouples the expensive execution of the taint analysis engine from hardware-based virtualization such as KVM and moves it to software-based virtualization such as QEMU. By doing so, Hybrid-Bridge improves the performance of existing solutions by one order of magnitude, as demonstrated in our experimental results.

CHAPTER 5 BRIDGING THE SEMANTIC GAP VIA SYSCALL EXECUTION REDIRECTION


In this chapter, we present the details of HyperShell, a practical hypervisor-layer shell for automated, uniform, and centralized guest OS management. In particular, we discuss the technical overview in Chapter 5.1, the host OS side design in Chapter 5.2, the guest VM side design in Chapter 5.3, the implementation in Chapter 5.4, the evaluation in Chapter 5.5, a discussion in Chapter 5.6, and a summary in Chapter 5.7.

5.1 Technical Overview

We introduce a new abstraction called the Reverse System Call (R-syscall in short) to bridge the semantic gap for hypervisor-layer programs that will be executed in our HyperShell. Unlike traditional system calls that serve as the interface for application programs from a layer below, an R-syscall serves as the interface in the reverse direction, from a layer up (in a way similar to an upcall (Clark, 1985)). While hypervisor programmers can use our R-syscall abstraction to develop new guest OS management utilities, to largely reuse the existing legacy software (e.g., ps/lsmod/netstat/ls/cp) we make the system call interface of R-syscall transparent to the legacy software, resulting in no modification when using it in HyperShell. In addition, we also make HyperShell transparent to the guest OS, and we do not modify any guest OS code. All of our design and implementation is done at the hypervisor layer.

© 2015 USENIX Association. Reprinted, with permission, from Yangchun Fu, Junyuan Zeng and Zhiqiang Lin. "HYPERSHELL: A Practical Hypervisor Layer Guest OS Shell for Automated In-VM Management", In Proceedings of 2014 USENIX Annual Technical Conference (USENIX ATC14), pages 85-96.


5.1.1

Challenges

HyperShell aims at executing guest OS management utilities at the hypervisor layer with the same effect as executing them inside an OS. To this end, we are facing two major challenges:

• How to bridge the semantic gap. In HyperShell, guest OS management utilities execute below an OS kernel. However, for software running below the OS, there are no OS abstractions. For example, there is no pid, no FILE, and no socket. Therefore, we have to reconstruct these abstractions such that the utility software understands the guest OS states and can perform the management.

• How to develop the utilities. Even supposing we have a perfect approach to bridging the semantic gap, we still have to develop the guest OS management software. Should we develop the software from scratch, or can we reuse any legacy (binary or source) code? Ideally, we would like to reuse the existing binary code as there are already lots of OS management utilities, and we show that this approach is feasible.

Key Insights. Before describing how we solve these challenges, we would like to first revisit how an in-VM management utility executes. Suppose we want to know the host name of a running OS; we can use utility software such as hostname to fulfill this task. In particular, as illustrated in Fig. 5.1(a), it will execute 41 system calls (syscall for short) in Linux kernel 2.6.32.8, our testing guest kernel. Among these syscalls, sys_uname is the one that really returns the host name. Also, as shown in Fig. 5.1(b), this syscall will traverse the current task structure and dereference the field current->nsproxy->uts_ns->name to eventually retrieve the machine name. If we implement the same hostname utility and execute it in HyperShell, and if we use a manual approach to bridging the semantic gap, we have to traverse the data structure

84

 1. execve("/bin/hostname", ["hostname"], ...) = 0
 2. brk(0)                                     = 0x8113000
 3. access("/etc/ld.so.nohwcap", F_OK)         = -1 ENOENT
 4. mmap2(NULL, 8192, ..., -1, 0)              = 0xb7795000
    ...
36. uname({sys="Linux", node="debian", ...})   = 0
    ...
40. write(1, "debian\n", 7)                    = 7
41. exit_group(0)

(a) System call trace of command "hostname"

c103c305:
 1. 0xc103c420   push %ebx
 2. 0xc103c421   mov  $0xc137ad34,%eax
 3. 0xc103c426   call 0xc125ee10
    ...
    // get the current task structure
19. 0xc103c430   mov  %fs:0xc13f9454,%eax
    // point to current->nsproxy
20. 0xc103c436   mov  0x2c4(%eax),%eax
    // point to current->nsproxy->uts_ns
21. 0xc103c43c   mov  0x4(%eax),%edx
22. 0xc103c43f   mov  0x8(%esp),%eax
    // point to current->nsproxy->uts_ns->name
23. 0xc103c443   add  $0x4,%edx
    // copy to user space buffer
24. 0xc103c446   call copy_to_user

(b) Disassembled instructions for system call sys_uname

Figure 5.1. System call trace of utility hostname and its sys_uname implementation.

again, in the same way as sys_uname does. Since the only interface for user level programs to request OS kernel services is through syscall, and the execution of a syscall is often trusted, then why not let hypervisor programs directly use the syscall abstractions provided by the guest OS? As such, we do not have to develop any code to reconstruct the guest OS abstractions. This is one of the key insights of designing HyperShell.

Another key insight is that not all the syscalls should be executed inside the guest OS. One example is the write syscall that prints the "host name" to the screen. If we execute it inside the guest OS, we would not be able to observe the output from HyperShell. Therefore, we introduce an R-syscall abstraction that is used by hypervisor programmers to annotate the syscalls that need to be redirected and executed inside the guest OS.
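To make the hostname example above concrete, the following minimal user-space sketch retrieves the host name through the same uname syscall shown in Fig. 5.1; this is exactly the call that HyperShell would treat as an R-syscall and redirect to the guest. (The sketch is illustrative and not part of HyperShell itself.)

#include <stdio.h>
#include <sys/utsname.h>

int main(void) {
    struct utsname u;
    if (uname(&u) != 0) {        /* issues the sys_uname syscall of Fig. 5.1(b) */
        perror("uname");
        return 1;
    }
    printf("%s\n", u.nodename);  /* the "node" field, i.e., the host name */
    return 0;
}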

In addition, while hypervisor programmers can use our R-syscall abstraction to develop new software to manage the guest OS, there are already lots of legacy utilities running inside a VM for the same purposes. For instance, there are hundreds of tools in core-utility, util-linux, and net-tools for Linux OS. If we can make our R-syscall transparent to the legacy software, then there is no need to annotate the R-syscall and we can directly execute the legacy software in HyperShell. For instance, in the hostname example, only the uname syscall needs this abstraction. We can thus hook the execution of the syscalls and use a transparent policy to determine whether a given syscall is an R-syscall.

5.1.2

Scope and Assumptions

As HyperShell is executed at the hypervisor layer and will also invoke the syscalls from the guest OS, we assume everything below the management utilities is trusted. This includes the guest OS kernel, the host OS, and the hypervisor code. Ensuring the hypervisor and guest kernel integrity is an independent problem and out of the scope of our approach. In fact, recently there have been many efforts aiming at ensuring the guest kernel and hypervisor integrity (e.g., SecVisor (Seshadri et al., 2007) and HyperSafe (Wang and Jiang, 2010)). Also note that HyperShell is designed mainly for automated guest OS management and not for security. While it could defeat certain attacks such as guest user level viruses, it cannot defend against any guest kernel level attacks. To make our discussion more focused, we assume a guest OS running with a 32-bit Linux kernel atop the x86 architecture. For the hypervisor, we focus on the design and implementation of HyperShell using KVM.

5.1.3

Overview

An overview of our HyperShell is presented in Fig. 5.2. For a KVM based virtualization system, there are two kinds of OSes: one is the guest OS that is executed atop a KVM

[Figure 5.2. An Overview of the HyperShell Design. The figure shows the master process (e.g., ps, ls) and HyperShell, with the Syscall Dispatcher and a Syscall Data Exchanger in the library space of the host OS; a peer Syscall Data Exchanger, the Helper Process Creator, and the Reverse Syscall Execution component at the KVM layer of the Guest VM (GVM); a helper process inside the GVM; and the shared memory connecting the two sides, annotated with the steps 0-5 of an R-syscall execution.]

hypervisor, and the other is the host OS that hides the underlying hardware resources and provides the virtualized resources to KVM. The goal of HyperShell is to execute the guest OS management utilities from the host OS to manage the guest OS. To this end, there are five key components: two located inside the library space of the host OS, and three located at the hypervisor layer of the GVM.

To use HyperShell, assume hypervisor managers use ls (or other utilities such as ps or hostname) to list the guest files in a given directory. To get started, they will launch ls in our host OS. The real execution of ls will be divided into a master process that is executed inside the host OS, and a helper process that is executed in the GVM. Only when an R-syscall gets executed will we forward the execution of this syscall to a helper process in the GVM and map the execution result (e.g., the directory entries) back such that ls can continue its execution in the master process. There are five key steps involved during the execution of an R-syscall:

• Step 1: Right after a syscall enters the library space in the host OS, our Syscall Dispatcher intercepts it. If it is not an R-syscall, it directly traps to the host OS kernel

for the execution. Otherwise it fetches the syscall number and arguments, and invokes our Syscall Data Exchanger at the host OS side, which communicates with its peer at the GVM side with the detailed syscall execution information. Next, our master process gets paused and will be resumed at Step 5 when the redirected R-syscall finishes the execution. At the GVM side, according to each specific syscall specification, the Syscall Data Exchanger will set up the corresponding memory state for the to-be-executed R-syscall.

• Step 2: Our Reverse Syscall Execution will wait until the helper process traps to the kernel. The helper process is created at Step 0 right after the execution of the management utilities in HyperShell, or can be executed as a daemon depending on the settings.

• Step 3: Our Reverse Syscall Execution directly injects the execution of the R-syscall with the corresponding arguments and memory mapping, and makes the R-syscall be executed under the helper process kernel context. Note that such an R-syscall injection and execution mechanism works similarly to function call injection from debuggers, but with a more powerful capability because of the layer-below control from the hypervisor.

• Step 4: During the execution of the R-syscall, if there is any kernel state update to the guest OS, this syscall will directly update the kernel memory as usual (e.g., sysctl that changes the kernel configuration). If there is any user space update (such as the buffer in the read syscall), it directly updates the shared memory created by the Syscall Data Exchanger in Step 1.

• Step 5: Right after the execution of the syscall exit of the R-syscall, we notify the Syscall Dispatcher and the Syscall Data Exchanger at the host OS side. We also copy the data from the shared memory to the user space of the master process, if the R-syscall has

any memory update. We also resume the execution of the master process and directly return to its user space for continued execution. Regarding the helper process state in the GVM: if the master process terminates, it will also be terminated (in non-daemon mode); otherwise, it will keep executing int3 (an interrupt that is often used by debuggers to set up breakpoints) such that our Reverse Syscall Execution can always take control of the helper process from the hypervisor layer.

5.2

Host OS Side Design

5.2.1

Syscall Dispatcher

The key idea of HyperShell in bridging the semantic gap is to selectively redirect and execute a syscall in the guest OS (the selected one is called an R-syscall). As shown in Fig. 5.1(a), not all the syscalls belong to R-syscalls. Therefore, the first step in our Syscall Dispatcher design is to systematically examine all of the Linux syscalls and define our reverse execution policy for each syscall.

Syscall Execution Policy. In our testing guest kernel Linux 2.6.32.8, there are 336 syscalls in total. Among them, we find that technically, nearly all of them can be redirected to execute in a guest OS. However, for process creation (e.g., execve, fork, exit_group), dynamic loading (e.g., open, stat, read when loading a shared library), memory allocation (e.g., brk, mmap2), and screen output (e.g., write), we would like them to be executed in the master process created in our HyperShell. Unfortunately, for the rest of the syscalls, it is also not always clear which syscalls need to be executed in the helper process. For instance, as shown in Table 5.1, suppose we want to copy /etc/shadow from the guest OS to the host OS; in this case, some of the file system related syscalls (e.g., open/stat64/read/close) are executed in the GVM, and some (e.g.,

Table 5.1. Syscalls in cp with different execution policy. Each syscall in the trace of "cp /etc/shadow /outside/shadow" is listed under the side (Host OS or GVM) where it is executed.

  Host OS   execve("/bin/cp", ["cp","/etc/shadow","/tmp/shadow"], ...) = 0
  Host OS   brk(0) = 0x8824000
  Host OS   access("/etc/ld.so.nohwcap", F_OK) = -1 ENOENT
            ...
  GVM       stat64("/etc/shadow", {st_mode=S_IFREG|0640, st_size=713, ...}) = 0
  Host OS   stat64("/outside/shadow", 0xbf9bad78) = -1 ENOENT
  GVM       open("/etc/shadow", O_RDONLY|O_LARGEFILE) = 0
  GVM       fstat64(0, {st_mode=S_IFREG|0640, st_size=713, ...}) = 0
  Host OS   open("/outside/shadow", O_WRONLY|O_CREAT|...|O_LARGEFILE, 0640) = 3
  Host OS   fstat64(3, {st_mode=S_IFREG|0640, st_size=0, ...}) = 0
  GVM       read(0, "root::15799:0:99999:7:::\ndaemon:"..., 32768) = 713
  Host OS   write(3, "root::15799:0:99999:7:::\ndaemon:"..., 713) = 713
  GVM       read(0, "", 32768) = 0
  GVM       close(0)
  Host OS   close(3)

open/stat64/write/close) are executed in HyperShell. Even though we could leave the solution to hypervisor programmers, where they would specify which syscall needs to be executed in the master process or the helper process, we would prefer to make an automated policy for these syscalls in order to allow for transparent reuse of the legacy binary code. In general, syscalls are relatively independent of each other (e.g., getpid will just return a process ID, and uname will just return the host name). After having examined all of the 336 syscalls, we realize that the syscalls that have connections are often file system and socket related (e.g., open/stat64/read/write/close), and these syscalls have dependences through the file descriptors. For instance, as illustrated in Table 5.1, if we can differentiate the file descriptors from the GVM and the host OS automatically, we can then transparently execute the existing legacy utility in HyperShell without any modification.

Intuitively, we would use dynamic taint analysis (Newsome and Song, 2005) to differentiate the file descriptors that are accessed inside the GVM or the host OS. However, such a design would require instruction level instrumentation, which is often very slow. In fact, our earlier design adopted such a taint analysis approach by running HyperShell in an emulator. Surprisingly, we have a new observation and we can actually eliminate the expensive dynamic taint analysis. In particular, as a file descriptor is just an index (a 32-bit unsigned integer) to the opened files (and network sockets) inside the OS kernel for each process, it has a limited maximum value (due to resource constraints). In our testing Linux kernel, it is 1023 (which means a process can only open 1024 files at the same time). Also, it is extremely rare to perform data arithmetic operations on a file descriptor. Therefore, we can in fact add a distinctive value (e.g., 4096 or 8192) to the file descriptor returned by the GVM. Whenever such a descriptor is used again in a later syscall destined for the GVM, we subtract our added value. As such, we can differentiate whether a file descriptor is from the host OS or the GVM by simply looking at its value. Whether a file descriptor should be returned from the GVM or the host OS depends on the semantics of open. Specifically, if it is opening the guest OS files (we can differentiate this based on the parameters, and internally we add a prefix associated with the guest files), it is executed in the GVM; otherwise it is executed in the host OS. For instance, we know "/etc/shadow" is in the GVM, and "/outside/shadow" is in the host OS while executing "cp /etc/shadow /outside/shadow". Similarly, we can also infer the files involved in "cp -R " by their names and their opening mode.

Syscalls in Dynamic Loader. To intercept the syscall, we use dynamic library interposition (Curry, 1994) (a technique that has been widely used in many applications such as LibSafe (Tsai and Singh, 2002)). Interestingly, we notice that the syscalls executed in the dynamic loader cannot be trapped by our library interposition. Therefore, syscalls executed while loading a dynamic library (e.g., access/open/stat64/read/close) will not be checked against our policy, and they will be executed directly on the host OS side, which is exactly what we want.
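The following is a minimal sketch of the file-descriptor offsetting policy described above, not the actual HyperShell code; the constant FD_OFFSET and the redirect_to_gvm() helper are illustrative assumptions.

#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

#define FD_OFFSET 4096   /* assumed distinctive value; per-process fds stay below 1024 */

/* Hypothetical helper that forwards a syscall to the helper process in the GVM. */
extern long redirect_to_gvm(long nr, long fd, void *buf, size_t len);

static int is_guest_fd(int fd)     { return fd >= FD_OFFSET; }
/* encode_guest_fd() is applied to the fd returned by an open() that was
 * redirected to the GVM, before handing it back to the master process.   */
static int encode_guest_fd(int fd) { return fd + FD_OFFSET; }
static int decode_guest_fd(int fd) { return fd - FD_OFFSET; }

/* Dispatch sketch for a later fd-based syscall such as read(2). */
static long dispatch_read(int fd, void *buf, size_t len)
{
    if (is_guest_fd(fd))    /* the fd came from the GVM: redirect as an R-syscall */
        return redirect_to_gvm(SYS_read, decode_guest_fd(fd), buf, len);
    return syscall(SYS_read, fd, buf, len);   /* ordinary execution in the host OS */
}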

Summary. By default, the majority of the syscalls will be treated as redirectable and they will be executed in the GVM, except for process execution and memory management related syscalls, which will be executed in the host OS. All file system and network connection related syscalls will be checked against the file descriptor. Whether a file descriptor needs to be checked is determined by the semantics of the corresponding file operations.

5.2.2

Syscall Data Exchanger

Since we need to make an R-syscall execute in the GVM, we must inform the GVM of the corresponding context and also update the corresponding memory state at the host OS side to reflect the R-syscall's execution. Our Syscall Data Exchanger is designed for this goal. Specifically, right after an R-syscall enters the library space (Step 1), we will retrieve the syscall arguments (e.g., the buffer address and size information) based on the corresponding syscall's specification. Then, we will inform its peer (to be discussed in §5.3.3) to prepare the necessary arguments at the GVM side. Once an R-syscall finishes the execution (Step 5), we will pull the data back from the GVM to the host OS. All of these operations are quite straightforward.
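As an illustration only (the concrete layout is not dictated by the design), the information handed from the host-side Syscall Data Exchanger to its GVM-side peer could be packaged as a small record in the shared memory region; all field names below are hypothetical.

#include <stdint.h>

/* Hypothetical shared-memory record describing one R-syscall to execute in the GVM. */
struct rsyscall_request {
    uint32_t syscall_nr;    /* e.g., __NR_getdents for ls                           */
    uint32_t nr_args;       /* number of meaningful entries in args[]               */
    uint64_t args[6];       /* raw syscall arguments (x86 syscalls take up to six)  */
    uint64_t buf_offset;    /* offset of the data buffer inside the shared region   */
    uint64_t buf_size;      /* size of that buffer, per the syscall specification   */
    uint64_t retval;        /* filled in by the GVM side once the R-syscall returns */
    uint8_t  done;          /* completion flag polled by the host side              */
};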

5.3

Guest VM Side Design

5.3.1

Helper Process Creator

An R-syscall must be executed under a certain process execution context in the GVM. While we could hijack an existing process to execute an R-syscall, such an approach is too intrusive to the hijacked process. Therefore, we choose to create a helper process dedicated

to executing our R-syscall in the guest OS. Regarding the permission of this helper process, it should have the highest privilege; otherwise an R-syscall may fail due to certain permissions. Also, it would terminate when the master process terminates (to minimize the impact on the guest OS workloads). To have better performance while executing the management utilities in HyperShell, we can also have the option of creating a daemon process, as the creation of a helper process takes additional time. There are only three instructions for this helper process, as shown below:

00000001   cd 80      int  0x80
_loop:
00000003   cc         int  3
00000004   eb fd      jmp  _loop

Basically, it keeps executing int3 (i.e., while(1) int3) with a prefix of int 0x80. We will explain why we use such an instruction sequence in §5.3.2. Then the challenge lies in how to select a high privilege process to fork the helper process. Since all Linux kernels have an init process with PID 1, one option is to traverse the pid field of the task_struct for each process. But such a design would make HyperShell too OS-specific. Fortunately, since we are able to inject an R-syscall (discussed in §5.3.2), we are certainly able to inject getpid to inspect the return values. If it is 1, we can therefore infer that the current execution context is the init process, and we can then inject a fork syscall to create our helper process. Meanwhile, we will retrieve the child PID from the return value of fork, and then use getpid again to identify the helper process. Once we have identified it, we will pull its CR3 such that the hypervisor knows it is the int3 that occurs in our helper process, not others (e.g., gdb), by looking at the CR3 value. Consequently, we must design a mechanism to intercept the entry point and exit point of the syscall execution for each process in order to select the init process. Once we have created our helper process, we will not need this interception. We call the selection of the init process the redirection initialization phase (i.e., the RI-Phase, which only occurs at Step 0) in the GVM. With hardware-assisted virtualization, we can rely on hardware mechanisms to

intercept the execution of the syscall instructions. Ether (Dinaburg et al., 2008), built atop the Xen hypervisor, leverages a page fault exception to capture syscall entry and syscall exit points. Nitro (Pfoh et al., 2011), based on the KVM hypervisor, leverages invalid segment exceptions to intercept the pair of sysenter/int0x80 and sysexit syscalls for a single process. In our design, we extend Nitro to intercept the system-wide syscall entry and exit pairs (for all processes).
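A hedged sketch of the RI-Phase state machine described above is shown below; the vcpu context type and the inject_*/read_guest_cr3 helpers are hypothetical placeholders for the corresponding KVM-side plumbing.

struct vcpu_ctx;                                   /* opaque guest CPU context        */
extern long inject_getpid(struct vcpu_ctx *c);     /* inject getpid, return its value */
extern long inject_fork(struct vcpu_ctx *c);       /* inject fork, return child pid   */
extern unsigned long read_guest_cr3(struct vcpu_ctx *c);

enum ri_state { FIND_INIT, WAIT_CHILD, DONE };
static enum ri_state state = FIND_INIT;
static long child_pid;
static unsigned long helper_cr3;

/* Called at every intercepted syscall entry/exit pair during the RI-Phase. */
void ri_phase_step(struct vcpu_ctx *c)
{
    if (state == FIND_INIT && inject_getpid(c) == 1) {
        child_pid = inject_fork(c);     /* current context is init (pid 1): fork the helper */
        state = WAIT_CHILD;
    } else if (state == WAIT_CHILD && inject_getpid(c) == child_pid) {
        helper_cr3 = read_guest_cr3(c); /* remember the helper's CR3 to recognize its int3  */
        state = DONE;                   /* RI-Phase over; stop system-wide interception     */
    }
}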

5.3.2

Reverse Syscall Execution

After we have passed the RI-Phase, we are then ready to execute an R-syscall if there is any. Yet we have to solve two additional challenges: when and how to execute an R-syscall under our helper process context.

When to Inject a Syscall. At a given time, a process either executes in user space or kernel space. To trap to the kernel, a process must use a syscall or an interrupt (including exceptions). As an interrupt or exception can occur at an arbitrary time, the OS must be designed in such a way that it is safe to trap to the OS kernel and execute syscall or interrupt handler services at any time in user space. However, we cannot inject a syscall execution at an arbitrary time in kernel space. This is because: (1) The injected syscall might make the kernel state inconsistent. For instance, we might inject a syscall when the kernel is handling an interrupt, and there might be some synchronization primitives involved (e.g., a spin lock). After we inject a new syscall, if this syscall execution also happens to lock some data or release certain locks, it may cause inconsistency among these locks. (2) Similarly, we might make non-interruptible code interruptible. For instance, if the kernel is executing a cli code block and has not executed sti yet, and if we inject a new syscall, this may make the non-interruptible code interruptible. (3) We might also overflow the kernel stack of a running process if it already has a large amount of data.

Therefore, to inject the execution of a syscall, we use the approach that right before entering the kernel space (e.g., sysenter/int0x80), or right after exiting to the user space of a running process, we will save the current execution context (namely all the CPU registers), and then execute the injected syscall (such as our getpid case in §5.3.1). Regarding our helper process, we have a slightly different strategy to inject the R-syscall. In particular, when the int3 traps to the hypervisor, we change the current user level EIP (pointing to cc at this moment) to EIP-2, which points to "int 0x80"; meanwhile, we prepare the necessary arguments, such as setting up the corresponding registers. Then when control returns to the user space of the helper process, it will automatically execute the syscall we prepared because we have changed its EIP. The use of int3 is to make the control flow of the helper process trap to the hypervisor. There are also alternative approaches such as using a cpuid instruction.

How to Execute an R-syscall. To execute an R-syscall, we have to set up the syscall arguments and map the memory that will be used during the R-syscall execution. This is done by our Syscall Data Exchanger (§5.3.3) at Step 1. After that, the syscall will be executed as usual in the GVM. If there is any memory update to the user space, it will directly (Step 4) update the shared memory that is allocated by our Syscall Data Exchanger. For kernel space, it directly updates the guest kernel. Once an R-syscall finishes, we inform the Syscall Dispatcher at Step 5, and push the updated memory back to the master process. At the GVM side, the helper process continues its execution of int3. When the master process exits, we terminate the helper process if it is not executed in the daemon mode.
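The EIP adjustment at the int3 trap can be illustrated with the following hedged sketch; the vcpu context type and the register accessor names are hypothetical stand-ins for the KVM-side code.

struct vcpu_ctx;
extern unsigned long get_guest_eip(struct vcpu_ctx *c);
extern void set_guest_eip(struct vcpu_ctx *c, unsigned long eip);
extern void set_guest_reg(struct vcpu_ctx *c, int reg, unsigned long val);

enum { REG_EAX, REG_EBX, REG_ECX, REG_EDX };

/* Called when the helper process's int3 causes a VM exit. */
void inject_rsyscall(struct vcpu_ctx *c, unsigned long nr,
                     unsigned long a1, unsigned long a2, unsigned long a3)
{
    /* EIP currently points at the 0xcc (int3); EIP-2 points at "int 0x80". */
    set_guest_eip(c, get_guest_eip(c) - 2);
    set_guest_reg(c, REG_EAX, nr);   /* syscall number                                 */
    set_guest_reg(c, REG_EBX, a1);   /* first three arguments (more go in ESI/EDI/EBP) */
    set_guest_reg(c, REG_ECX, a2);
    set_guest_reg(c, REG_EDX, a3);
    /* On the next VM entry the helper resumes at "int 0x80" and runs the R-syscall. */
}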

5.3.3

Syscall Data Exchanger

As discussed in §5.2.2, we need to pass the corresponding syscall parameters to the GVM. Also, we need to map the data back to the host OS if there is any memory update. The Syscall Data Exchanger at the GVM side is exactly designed to achieve these goals.

One issue we have to solve is the virtual address relocation. This is because the same virtual addresses used by the host OS may not be available for the helper process in the GVM, and we have to relocate the virtual addresses used in the syscalls of the master process to the available addresses of the helper process. To this end, before the execution of the first R-syscall, we will first allocate a large buffer (as a cache) with a default size of 64K bytes by injecting an mmap syscall and recording the mapped virtual address of this buffer, denoted as Vg, and its size, denoted as Sg. (Certainly, the guest OS will automatically munmap this allocated space once the helper process terminates.) Then whenever there is an R-syscall (e.g., read) that has an argument with virtual address Vh and size Sh, we will use Vg as the buffer starting address instead of Vh, and if Sh is greater than Sg, we will inject mmap to map a larger cache. Also, to avoid too many data transmissions between the host OS and the GVM, we allocate a shared memory region between them. Right after the execution of the mmap syscall that allocates new pages for the redirected syscall, at the hypervisor layer we map the pages of the shared memory to the virtual address of the mmap-returned page by traversing the page tables (rooted at the captured CR3) of the helper process, such that we do not have to perform an additional memory copy from the GVM to the shared memory. To prevent these pages from being swapped out by the guest OS, we inject the mlock syscall to lock the mmap-allocated memory.
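A minimal sketch of the relocation logic just described is shown below; inject_mmap() is a hypothetical helper standing in for the injected mmap R-syscall.

#include <stddef.h>
#include <stdint.h>

extern uintptr_t inject_mmap(size_t len);  /* inject mmap into the helper; return guest address */

static uintptr_t Vg;                       /* guest address of the injected cache buffer */
static size_t    Sg;                       /* its current size                           */

/* Called once before the first R-syscall: allocate the default 64 KB cache. */
static void init_guest_cache(void)
{
    Sg = 64 * 1024;
    Vg = inject_mmap(Sg);
}

/* Relocate a buffer argument of the master process (address Vh, size Sh)
 * to an address that is valid inside the helper process.                 */
static uintptr_t relocate_buffer(uintptr_t Vh, size_t Sh)
{
    (void)Vh;                              /* the host address is meaningless in the guest */
    if (Sh > Sg) {                         /* cache too small: inject mmap for more space  */
        Vg = inject_mmap(Sh);
        Sg = Sh;
    }
    return Vg;                             /* use the guest-side cache as the buffer       */
}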

5.4

Implementation

We have developed a proof-of-concept prototype of HyperShell (the source code of our prototype is publicly available at github.com). The implementation is scattered across both the host OS side, which is atop Linux kernel 3.0.0-31, and the KVM side. While we have used KVM to build HyperShell, we believe our design can be applied to other types of hypervisors such as VMware, Xen, and VirtualBox. Below we briefly describe how we implement our system.

Host OS. As described in §5.2.1, we use dynamic library interposition to hook and dispatch the system call execution to either the host OS or the guest OS. We have developed a shared library and use LD_PRELOAD to hijack all syscall executions from the master process (a minimal sketch of such an interposer, under assumed names, appears at the end of this section). The most time-consuming part is the argument handling and data transfer between the host OS and the GVM for each syscall. In total, we developed around 2,700 lines of C code (LOC) for this library.

GVM. Our GVM is atop KVM-3.9. KVM consists of two components: a kernel module (kvm-kmod) to implement the hardware virtualization, and a user level program (qemu-kvm) to emulate other virtual devices. The part to trap syscalls is done by kvm-kmod. Again, to intercept the syscall entry and exit in the RI-phase, we extended the Nitro (Pfoh et al., 2011) approach. Our host OS communicates with the GVM using a shared memory and semaphores. Specifically, when a management process gets started, on the host OS side we will first create the shared memory and set up all resources. After the GVM finishes the RI-phase, its hypervisor will poll the semaphores to check whether there is any redirected R-syscall to execute. If so, the GVM patches the EIP to EIP-2 at the hypervisor layer, sets up the necessary parameters of the to-be-executed syscall, and returns to user space. There are certainly other alternative designs, such as using socket communication if the host OS and the GVM are not within the same physical machine. The entire additional implementation for KVM is around 1,000 LOC.
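The following is a minimal, self-contained interposition sketch in the spirit of our LD_PRELOAD library, not HyperShell's actual code: it wraps open(2) and decides whether the call should be redirected; the "/guest/" prefix and the redirect_to_gvm_open() helper are illustrative assumptions.

#define _GNU_SOURCE
#include <dlfcn.h>
#include <fcntl.h>
#include <stdarg.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical hook that forwards the open to the helper process in the GVM. */
extern int redirect_to_gvm_open(const char *path, int flags, mode_t mode);

int open(const char *path, int flags, ...)
{
    va_list ap;
    mode_t mode = 0;
    if (flags & O_CREAT) { va_start(ap, flags); mode = va_arg(ap, mode_t); va_end(ap); }

    /* Illustrative policy: paths marked with an assumed guest prefix go to the GVM. */
    if (strncmp(path, "/guest/", 7) == 0)
        return redirect_to_gvm_open(path + 6, flags, mode);

    /* Everything else is executed in the host OS through the real libc open(). */
    int (*real_open)(const char *, int, ...) =
        (int (*)(const char *, int, ...))dlsym(RTLD_NEXT, "open");
    return real_open(path, flags, mode);
}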

5.5

Evaluation

As described in §5.4, our proof-of-concept prototype of HyperShell consists of around 3,700 lines of C code in total, scattered across both the host OS side, which is atop Linux kernel 3.0.0-31, and the KVM side; while we have used KVM to build HyperShell, we believe our design can be applied to other types of hypervisors such as VMware, Xen, and VirtualBox.

Next, we present our evaluation results. All of our experiments were carried out on a host machine configured with an Intel Core i7 CPU and 8 GB of memory, running Ubuntu 12.04 with Linux kernel 3.0.0-31; the guest OS is Debian 6.04 with kernel 2.6.32.8.

5.5.1

Effectiveness

Benchmark Software. Recall that the goal of HyperShell is to enable the execution of native management utilities at the hypervisor layer to manage a guest OS, and also to enable the fast development of such software by using the R-syscall abstraction. Since the software development with HyperShell is very simple (a hypervisor programmer just needs to annotate the syscall and inform HyperShell which one is an R-syscall), we skip this evaluation. In the following, we describe how we automatically execute the native utilities in HyperShell to transparently manage a guest OS.

Today, there are a large number of administrative utilities to manage an OS. To test HyperShell, we systematically examined all of the utilities (198 in total) from six packages including core-utility, util-linux, procps, module-init-tools, sysstat, and net-tools, and eventually we selected 101 utilities, as presented in Table 5.2, though technically we can execute all of them. The selection criteria are the following: if a utility is a purely user-level program (e.g., hash computation such as md5sum), or is not system management related (e.g., tr), or can be executed in an alternative way (e.g., poweroff, halt), or is not supported by the kernel any more (e.g., rarp), we ignore it.

Experimental Result. Without any surprise, through our automated system call reverse execution policy, all of these utilities can be successfully executed in HyperShell. To verify the correctness of these utilities, we use a cross-view comparison approach in a similar way as when we tested our prior systems such as Vmst (Fu and Lin, 2013a, 2012) and Exterior (Fu and Lin, 2013b). Basically, to test a given utility such as ps, we first execute it inside

[Table 5.2. Evaluation Result of the Tested Utility Software. For each tested utility, S stands for whether there is any syntax difference, B(ms) stands for the average time of the base execution, D(ms) stands for the average execution time of the utility in HyperShell when using the daemon mode in the GVM, and T(X) stands for the result of D/B (i.e., the times). The 101 tested utilities, grouped by category, are:
Process: ps, pidstat, nice, getpid, mpstat, pstree, chrt, renice, top, nproc, sleep, pgrep, pkill, snice, echo, pwdx, pmap, kill, killall;
Memory: free, vmstat, slabtop;
Modules: rmmod, modinfo, lsmod;
Environment: who, env, printenv, whoami, stty, users, uname, id, date, w, hostname, groups, hostid, locale, getconf;
System Utils: uptime, sysctl, arch, dmesg, lscpu, mcookie;
Disk/Devices: blkid, badblocks, lspci, iostat, du, df;
Filesystem: sync, getcap, lsof, pwd;
Files: chgrp, chmod, chown, cp, uniq, file, find, grep, ln, ls, mkdir, mkfifo, mknod, mv, rm, od, cat, link, comm, shred, truncate, head, vdir, nl, tail, namei, whereis, stat, readlink, unlink, cut, dir, mktemp, rmdir, ptx, chcon;
Network: ifconfig, ip, route, ipmaddr, iptunnel, nameif, netstat, arp, ping.
The per-utility timings are omitted here; on average, B = 7.27 ms, D = 8.45 ms, and T = 2.73X.]

the GVM and save the output, which is called the in-VM view; then we execute it inside HyperShell to manage the GVM and also save the output, which is called the out-of-VM view. Then we compare the syntax (through diff) and semantics (with a manual verification) of the in-VM and out-of-VM views, which leads to two sets of effectiveness test results: one is the syntax comparison, and the other is the semantic (i.e., the meaning) comparison. We notice that while there are 16 utilities that have syntax differences (as shown in the S column in Table 5.2), all other utilities have the same screen output. A further investigation shows that the syntax differences among them are actually caused by the different location (host OS vs. GVM) and timing of performing our in-VM and out-of-VM experiments. Regarding the semantics, we notice that all of the utilities have the same semantics as the original in-VM programs through our manual verification.

Testing w/ More Guest Kernels. Working at the syscall level gives HyperShell fewer constraints and wider applicability because of POSIX compatibility. For instance, we can now use a single host OS to manage a large number of syscall-compatible OSes. To validate this, we selected five other recently released Linux kernels of versions 2.6.32, 2.6.38, 3.0.10, 3.2.0, and 3.4.0, and executed them in our GVM. Our benchmark utilities were all correctly executed with these kernels.
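As a concrete illustration of the cross-view comparison, the workflow for a single utility such as ps might look like the following shell sketch; the hypershell launcher name is hypothetical.

# in-VM view: run the utility inside the guest and save its output
(guest)$ ps aux > ps.in-vm

# out-of-VM view: run the same utility from HyperShell on the host
(host)$ hypershell ps aux > ps.out-of-vm

# syntax comparison of the two views
(host)$ diff ps.in-vm ps.out-of-vm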

5.5.2

Performance Overhead

When executing a program in HyperShell, there are two processes to fulfill the execution: the master process executed in the host OS, and the helper process executed in the guest OS. Consequently, we have to measure two sets of performance. One is how slow an end-user would feel when executing a utility in HyperShell. The other is the impact with respect to the guest OS kernel due to our syscall capturing and helper process execution at the GVM. Below we report these two types of overhead.

Performance Impact to the Native Utilities

With different settings of the helper process (daemon or non-daemon), we could also have two sets of performance overhead for the utility software. However, the performance differences between these two settings mainly come from the creation of the helper process, which is almost a constant factor (the time interval between two scheduled executions of the init process). Our evaluation shows that the init process is scheduled every 5 seconds. Therefore, the creation of a helper process takes at most 5 seconds, which is the worst-case delay if we want to use a non-daemon helper process to execute the R-syscall. All other latency is the same compared to the daemon mode execution. Therefore, in the following, we present our results with the daemon mode execution of the helper process.

Again, we used these 101 utilities from the effectiveness tests to measure this overhead. Specifically, we executed each utility 100 times and computed the average. First, we ran all of them in a native-KVM and got the average execution time for each of them as the base. This result is presented in the B-column of Table 5.2. Then we collected the average run time of these utilities in HyperShell with a daemon helper process in the GVM. This result is presented in the D-column. We computed the overhead of this test against the base one, and we report it in the T-column. We compare with the execution running in native-KVM instead of the native host OS because we are comparing our out-of-VM approach with an in-VM approach. We notice that on average, with a daemon mode helper process, HyperShell has a 2.73X slowdown compared to the executions running in a native-KVM. This overhead mainly comes from the data exchange and synchronization between the host OS and the GVM during the R-syscall execution.

Performance Impact to the GVM

The performance impact to the GVM also falls into two scenarios: one is the system-wide sysenter/sysexit interception that is used to capture the init process (recall we name it

the RI-Phase), and the other is the R-syscall execution that occurs in the helper process (we call this the RE-Phase). These two phases inevitably introduce a performance penalty to the running workloads/processes in the GVM. Note that if the GVM is running in neither the RI- nor the RE-Phase, there is no performance overhead. To quantify the overhead from these two scenarios, we used standard benchmark programs (e.g., LMBench and ApacheBench) that are used in other work (e.g., (Wang and Jiang, 2010; Zhang et al., 2011)) to measure the runtime overhead of the guest OS execution at both the micro and macro level for these two phases. Also, according to the results from Table 5.2, the execution of the RE-Phase is very short (on average 8.45 milliseconds). In addition, our RI-Phase will never be executed once the helper process has been created. Therefore, we have to create an environment that keeps executing the RI- and RE-Phases such that we can measure the impact on the long-running benchmark programs. That is, we keep polling the init process to measure the impact from the RI-Phase, and keep executing the int3 loop of the helper process to measure the impact from the RE-Phase. These results are the worst-case performance impact on running processes in the GVM.

Table 5.3. Micro-benchmark Test Result of GVM.

Tested Item             Native-KVM   GVM-RI-Phase   Slowdown(%)   GVM-RE-Phase   Slowdown(%)
stat (µs)                     0.39           2.28         82.89           0.41          4.88
fork proc (µs)               47.20         147.26         67.95          47.54          0.72
exec proc (µs)              158.20         480.00         67.04         161.30          1.92
sh proc (µs)                384.90        1088.10         64.63         386.30          0.36
ctxsw (µs)                    0.59           1.23         52.03           0.73         19.18
10K File Create (µs)         17.80          40.67         56.23          17.96          0.89
10K File Delete (µs)          4.64           7.16         35.20           4.65          0.22
Bcopy (MB/s)               5689.17        5647.71          0.73        5605.40          1.47
Rand mem (ns)                72.20          72.65          0.62          73.24          1.42
Mem read (MB/s)           10150.00       10000.00          1.48       10000.00          1.48
Mem write (MB/s)           8567.70        8543.00          0.29        8540.40          0.32

Micro-benchmarks. To evaluate the primitive level performance slowdown, we used the LMBench suite. In particular, we focused on the overhead of the stat syscall, process creation (fork proc), process execution (exec proc), shell process creation (sh proc), context switches (ctxsw), memory-related operations (e.g., Bcopy, Mem read, Mem write), and IO-related operations (e.g., 10K File Create and 10K File Delete). The detailed result is presented in Table 5.3. The RI-Phase tends to have a large overhead on tests which contain syscalls, as we intercept the system-wide syscall entry and exit points. While we do not intercept context switches, our system still has a large overhead on the ctxsw test. The reason is that LMBench tests the time of context switches on a number of processes, and these processes are connected using pipes. Therefore, the measurement still contains syscalls. In contrast, during the RE-Phase, our syscall interception is only within the helper process, and it has significantly less overhead except in the ctxsw case, for reasons similar to the RI-Phase.

Macro Benchmarks. We used four real world workloads to quantify the performance slowdown at the macro level. In particular, we decompressed a source tarball of Linux 2.6.32.8 using bzip, and then compiled the kernel using kbuild. We recorded the process time. In the test of Apache, we used ApacheBench to issue 100,000 requests for a 4k-byte file from a client machine and got the throughput (#request/s). For memcached, we recorded the time of processing 1,000 requests. The performance overhead is presented in Table 5.4. For the RI-Phase, the overhead comes from the VMexit of trapping syscall entry and exit. Hence, the workloads that have large portions of IO operations will incur large overhead, e.g., as in kbuild, Apache, and memcached. The worst case is memcached, which is also sensitive to IO latency. In contrast, computation intensive workloads have small overhead (as in the bzip case). Regarding the RE-Phase, all the workloads have small overhead because our system only introduces a user mode int3 loop. The VMexit only occurs in the helper process execution context. The only side effect is that the helper process takes some CPU time slices from them.

5.5.3

Case Studies

Once we have enabled the execution of native utilities in HyperShell to manage the guest OS, many new use cases appear. For instance, we can now kill malicious processes, remove malicious drivers, change the guest IP address, update the firewall rules, etc., directly from the hypervisor layer. In the following, we demonstrate an interesting use case of our system: full disk encryption (FDE) protected virus scanning from the hypervisor.

Today, because of privacy and data-breach concerns, a growing practice for outsourced VMs is to deploy FDE. Unfortunately, this has brought challenges for disk introspection, forensics, and management. With HyperShell, we can actually use off-the-shelf anti-virus software from the host OS to transparently scan files in the guest OS even though the GVM disk might have been encrypted by FDE. To validate this, we installed dm-crypt, a transparent FDE subsystem in the Linux kernel (since version 2.6), in our GVM. Under a test user home directory, we copied a large volume of files including the source code of Linux-2.6.32.8, gcc, glibc, QEMU, Apache, and LMBench, as well as two viruses from offensivecomputing.com, resulting in a total of 101,415 files adding up to 1336.09 megabytes in size. In the host OS, we installed ClamAV-0.98 and used it (in particular its clamscan) to scan the files in /home/test in the GVM. We tried two different approaches in this testing:

• The first is to directly allow clamscan running in HyperShell to scan the files in the GVM by redirecting the R-syscalls, and in this case it took 188.35 seconds to scan the entire 1336.09 megabytes of files and find the two viruses.

• The second is to copy (i.e., cp) the files in /home/test to our host OS, and then scan them natively. In this case, it took 59 seconds to copy these files, with another 120.91 seconds scanning them, resulting in a total of 179.91 seconds.
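For the first approach, the invocation is simply the stock ClamAV command run from HyperShell on the host; this is a hypothetical illustration, with the path interpreted as the guest's file system through R-syscall redirection:

# run on the host inside HyperShell; the path refers to the guest's files
(host)$ clamscan -r /home/test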

It is worth noting that, very interestingly, if we installed ClamAV inside the GVM and scanned these files, it would take 271.58 seconds. Therefore, by moving certain management software into HyperShell, it can in fact speed up certain computations (188.35 vs. 271.58 seconds), as shown in our clamscan case. There are two primary sources for this speedup: one is that there is no additional VMexit when processing the disk IO at the host OS side (i.e., IO in the host OS is usually faster than in the guest OS), and the other is that there is no need to decrypt the signature database of ClamAV when running on the host OS.

Table 5.4. Macro-benchmark Test Result of GVM.

Benchmark Program       Native-KVM   GVM-RI-Phase   Slowdown(%)   GVM-RE-Phase   Slowdown(%)
bzip (s)                     16.83          18.35          8.28          17.04          1.23
kbuild (s)                 1799.00        2270.25         20.76        1889.97          4.81
memcached (s)                 1.57           3.11         49.52           1.64          4.27
Apache (#request/s)        1104.60         904.12         18.15        1065.28          3.56

5.6

Discussion

While HyperShell offers better automation (e.g., no need to log in), uniformity (e.g., all of the VMs can be checked by anti-virus), and centralized management (e.g., using only one copy of the software running at the hypervisor to manage a large number of VMs, and only the copy at the hypervisor layer needs to be updated), it comes with a price. In particular, it will circumvent all of the existing user login and system auditing for each managed VM. For instance, syslog in each individual VM will not be able to capture all the executed events inside the guest OS. To fix this, we need to add a new log record at the hypervisor layer for each activity executed in HyperShell, such that the entire cloud can still be audited. One avenue of our future research will address this.

Second, as normal utility software does, HyperShell requires trust in the guest OS kernel as well as the init process. Consequently, it cannot be used for security critical

applications, especially when the kernel has been compromised. Also, unlike introspection, which aims to achieve stealthiness, HyperShell is not designed with this goal in mind, since its primary goal is to manage the guest OS (which definitely introduces footprints) from out-of-VM in the same way as we manage in-VM, but in a more centralized and automated manner.

Third, our current prototype requires both OSes running in the host OS and the GVM to have a compatible syscall interface. If a guest OS uses a randomized system call interface (e.g., RandSys (Jiang et al., 2007)), it could thwart the execution of the management utilities in HyperShell. In fact, we can design certain logic in our Syscall Dispatcher and Reverse Syscall Execution components to perform syscall translations even though the syscalls are not fully compatible or are randomized (e.g., with different syscall numbers). We leave this as another piece of future work. Again, we would like to emphasize that working at the syscall boundary gives HyperShell fewer constraints when compared to other alternative approaches. For instance, it is possible to directly inject shell commands into the guest OS to achieve the same goal (e.g., configure the guest OS), or directly inject the file system updates. However, command-line interfaces or configuration file interfaces are less stable when compared to the syscall interface. That is why we eventually settled on our R-syscall based approach.

Finally, our Syscall Dispatcher uses dynamic library interposition, and it skips the syscall policy checking in the dynamic loader. Therefore, statically linked native utilities cannot be executed in HyperShell. Also, if there is a different loader whose syscalls can be captured by library interposition, we have to design new techniques to differentiate the syscall policy for these syscalls. One possible solution is to add the call stack context to our policy check. In addition, while most of our design is OS-agnostic, we currently only demonstrate HyperShell with the Linux kernel, and we would like to test it with other OSes such as Microsoft Windows. We leave these to our other future efforts.

5.7

Summary

In this chapter, we have presented the design, implementation, and evaluation of HyperShell, a practical hypervisor layer shell for automated, uniformed, and centralized guest OS management. To overcome the semantic gap challenge, we introduce a reverse system call abstraction, and we show that this abstraction can be transparently implemented. As a result, many of the legacy guest OS management utilities can be directly executed in HyperShell. Our empirical evaluation with 101 native Linux utilities shows that we can use HyperShell to manage a guest OS directly from the hypervisor layer without requiring any access to the administrator's account. Regarding the performance, it has on average a 2.73X slowdown for the tested utilities compared to their native in-VM execution, and less than 5% overhead to the guest OS kernel.

CHAPTER 6 LIMITATIONS AND FUTURE WORK

Overall, we demonstrate that the new approaches we have developed in this dissertation are highly practical and feasible for VMI. However, there are still some limitations that need to be addressed in future work. Before concluding this dissertation, we would like to discuss these future directions in this chapter.

Handling disk data introspection. Currently, Vmst and Hybrid-Bridge only support the introspected tool examining the memory data. If a VMI tool needs to open an in-guest disk file in the product-VM, it will not be redirected by our current scheme, though end-users could directly copy these files outside and then inspect them. Supporting disk data redirection is one avenue of future work.

Developing more customized VMI tools. While Vmst and Hybrid-Bridge provide a framework to enable native, off-the-shelf in-VM programs to automatically become out-of-VM introspection programs, they cannot be directly used to defend against arbitrary threats, and it still requires programmers' efforts to develop the security software based on their needs, if there is no such tool available (for instance, as in the DKOM rootkit detection example). The advantage of using Vmst and Hybrid-Bridge is that hypervisor programmers do not have to worry about the semantic gap, and they can use native system calls or kernel APIs to develop such software. Developing more customized VMI tools to systematically handle new threats is another avenue of future work.

Reducing VMExit Overhead. VMExit is the main contributor to the performance overhead of Hybrid-Bridge and HyperShell. Reducing VMExits would be an important

immediate task. Part of our future efforts will address this problem. For instance, a possible way to improve the performance of Hybrid-Bridge is not to catch int3 (thus avoiding the VM exit) at the hypervisor level. Instead, we can introduce an in-guest kernel module and patch the int3 interrupt handler to switch the page table entries.

CHAPTER 7 RELATED WORK

Our work is closely related to virtual machine introspection and binary code reuse. It is also related to dynamic data dependency tracking, kernel rootkit detection, memory forensics, hybrid-virtualization, and training memoization. In this chapter, we compare our work with them.

7.1

Binary Code Reuse

Recently, there has been great attention towards binary code reuse for security analysis (Caballero et al., 2010; Kolbitsch et al., 2010; Dolan-Gavitt et al., 2011) and the creation of malicious code (Lin et al., 2010b). In particular, BCR (Caballero et al., 2010) made a systematic study of automated binary code reuse, and demonstrated its effectiveness in extracting encryption and decryption components from malware. Similarly, through dynamic slicing, Inspector Gadget (Kolbitsch et al., 2010) also focuses on extracting and reusing features inside malware programs. In a different application, the ROC attack (Lin et al., 2010b) shows that we can also reuse legitimate binary code to create stealthy trojans by directly patching benign software. Dolan-Gavitt et al. proposed Virtuoso (Dolan-Gavitt et al., 2011), a technique for narrowing the semantic gap in VMI. The idea is to acquire traces of an inspection command (e.g., ps) on a clean guest-OS through dynamic slicing. Such clean slices are then translated and executed at the VMM layer to introspect the identical version of the guest-OS that may be compromised. Our Vmst is directly inspired and motivated by Virtuoso. Most recently, TOP (Zeng et al., 2013) demonstrates that we can dynamically decompile malware code, unpack and transplant malware functions.

Compared to all of the existing techniques, Vmst distinguishes itself by its exploration of other settings of binary code reuse. Specifically, instead of extracting the code from the binary (kernel or user-level code) for reuse, we retain these pieces of code in place. Through automatically identifying the specific execution context, we dynamically instrument the code and achieve our reuse.

7.2

Virtual Machine Introspection

Due to the nature of strong isolation, VMI has largely been used in many security applications, including intrusion detection (e.g., (Garfinkel and Rosenblum, 2003; Payne et al., 2007, 2008)), malware analysis (e.g., (Jiang et al., 2007; Dinaburg et al., 2008)), process monitoring (e.g., (Srinivasan et al., 2011)), and memory forensics (e.g., (Hay and Nance, 2008)). Again, similar to Virtuoso, Vmst complements these works by enabling automated VMI tool generation.

There is also a VMI framework called VProbe. By using VProbe scripts, security analysts can develop tools to collect data about the state of a guest-OS. However, VProbe still does not hide the low-level kernel details, and developers must understand details such as kernel data structures before developing the VMI tools. For example, to retrieve the current process name, developers must find out where the current task is and which field stores the process name.

Meanwhile, there is another work (Inoue et al., 2011) aiming to automatically bridge the semantic gap in VMI. Their technique involves using a C interpreter facilitated by the OS kernel data structure information and the XenAccess library (Payne et al., 2007) to interpret the introspection code. Such a technique is entirely different from Vmst. For example, users have to write the introspection code running in their interpreter. In contrast, Vmst directly uses the common utilities without any code development from users.

Also, regarding the recent process out-grafting (POG) (Srinivasan et al., 2011): although Vmst shares the general idea of using a secure VM to do the monitoring and to redirect "some" data during execution, Vmst still substantially differs from this approach in a number of aspects. In particular, we have entirely different goals. POG aims to monitor an untrusted process, but we aim to inspect the whole OS. POG only intercepts kernel execution at the system call granularity (which explains why they can implement it using KVM), whereas we have to monitor all the instructions. Consequently, their data redirection covers only system call arguments and return values, whereas we have to automatically identify the redirectable data on-the-fly, and redirect only the introspection related data. Our recent work, Hybrid-Bridge (Saberi et al., 2014), has improved the performance of Vmst by one order of magnitude. HyperShell (Fu et al., 2014) makes the performance even faster.

7.3

Kernel Rootkit Detection

Our rootkit detection uses a cross-view comparison approach that was initially proposed in Ghostbuster (Yi-Min Wang and Verbowski, 2005). We compared the in-VM view and the out-of-VM view to detect the hidden process or module. Other closely related work includes VMWatcher (Jiang et al., 2007), SBCFI (Petroni and Hicks, 2007), and KOP (Carbone et al., 2009). But they require detailed kernel knowledge, such as traversing specific kernel data structures, to detect the rootkit. With a drop-in memory controller, mGuard (Liu et al., 2013) can detect hypervisor rootkits. Recently, HUKO (Xi Xiong, 2011) and OSck (Hofmann et al., 2011) infer the possible presence of rootkits by detecting kernel integrity violations, including kernel code, data, and control flow. HUKO addresses the semantic gap problem by introducing an in-guest module to label the dynamic content, whereas OSck extracts a kernel type graph from source code and traverses

this type graph starting from a set of kernel global variables, following pointers to examine all allocated kernel objects.

7.4

Dynamic Data Dependency Tracking

Vmst employs a generic technique of dynamic data dependency tracking (i.e., taint analysis) in determining the redirectable data. Such techniques have been widely investigated, and there exists a large body of recent work in this area, such as data lifetime tracking (e.g., (Chow et al., 2004)), exploit detection (e.g., (Newsome and Song, 2005)), vulnerability fuzzing (e.g., (Cadar et al., 2006; Godefroid et al., 2008)), protocol reverse engineering (e.g., (Caballero and Song, 2007; Cui et al., 2008; Lin et al., 2008; Wondracek et al., 2008)), and malware analysis (e.g., (Egele et al., 2007; Yin et al., 2007)). There are also some open-source dynamic instrumentation frameworks (e.g., PEMU (Zeng et al., 2015) and TEMU (Yin and Song, 2010)) that can be used for dynamic data dependency tracking.

7.5

Memory Forensics

Technically, memory forensic analysis shares a large similarity with VMI in that both techniques have to interpret and inspect memory. For example, forensic tools can actually facilitate VMI (Dolan-Gavitt et al., 2011). The basic memory forensic techniques are object traversal and signature scanning. Thus, many existing techniques focus on how to build the object map (e.g., (Carbone et al., 2009), (Zeng and Lin, 2015)) or generate robust signatures (e.g., (Dolan-Gavitt et al., 2009; Lin et al., 2011)). Again, Vmst complements these techniques by offering a new set of automatically generated VMI-based tools (Hay and Nance, 2008) to analyze memory.

7.6

Hybrid-Virtualization

While recently there have been a number of systems that combine both hardware virtualization and software virtualization (e.g., TBP (Ho et al., 2006), Aftersight (Jim Chow, 2008), and V2E (Yan et al., 2012)), they have different goals and different techniques. In particular, TBP detects malicious code injection attacks by using taint tracking to prevent the execution of network data. The protected OS runs on Xen and uses page faults to switch execution to QEMU for taint-tracking when tainted data is being processed by the CPU. Aimed at heavyweight analysis of production workloads, Aftersight decouples analysis from execution by recording all VM inputs on a VMware Workstation and replaying them on QEMU. Designed for malware analysis, V2E uses hardware virtualization to record the malware execution trace at page level, and uses page faults to transfer control to software virtualization; whereas Hybrid-Bridge uses an int3 patch to cause a VMExit and control the transitions between redirectable and non-redirectable instructions at instruction level, as well as the transitions to software virtualization.

7.7 Training Memoization

Memoization (Michie, 1968) is an optimization technique that remembers the results corresponding to a set of specific inputs, thus avoiding recalculation when these inputs are encountered again. It has been used in many applications, such as deterministic multithreading (via schedule memoization (Cui et al., 2010)) and taint optimization (e.g., FlexiTaint (Venkataramani et al., 2008) and DDFT (Jee et al., 2012)). While Hybrid-Bridge and FlexiTaint (Venkataramani et al., 2008) may seem similar at a very high level with respect to taint memoization, they operate in different worlds and face different challenges. FlexiTaint is an instruction-level CPU cache (very similar to a Translation Lookaside Buffer) that accelerates taint operations with low overhead in the CPU, whereas Hybrid-Bridge is based on the idea of decoupling taint analysis from the main execution engine (i.e., Fast-Bridge), which itself performs no taint analysis. For DDFT (Jee et al., 2012), the substantial difference is that its taint memoization works at the user level, much like a compiler optimization that speeds up the taint analysis, whereas Hybrid-Bridge works at the hypervisor level with no intention of speeding up the taint analysis itself. Also, our memoization not only remembers the tainted data but also remembers other types of metadata, such as the offset of each return address for bi-redirection instructions.
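The underlying mechanism of memoization is simple; the sketch below caches the result of a toy function for each input so that repeated queries skip the recomputation (this is only an analogy, since Hybrid-Bridge memoizes taint metadata rather than function values):

    #include <stdio.h>
    #include <stdint.h>

    #define MAXN 64
    static uint64_t cache[MAXN];
    static int      seen[MAXN];

    /* Memoized Fibonacci: each input n is computed at most once. */
    static uint64_t fib(int n) {
        if (n < 2) return (uint64_t)n;
        if (seen[n]) return cache[n];        /* hit: reuse remembered result */
        cache[n] = fib(n - 1) + fib(n - 2);  /* miss: compute, then remember */
        seen[n]  = 1;
        return cache[n];
    }

    int main(void) {
        printf("%llu\n", (unsigned long long)fib(50));  /* fast due to the cache */
        return 0;
    }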

CHAPTER 8
CONCLUSION

In this dissertation, we have presented three different approaches that bridge the semantic gap in virtual machine introspection by leveraging binary code reuse. Vmst takes the first step in bridging the semantic gap via online binary code and data redirection. With a hybrid-virtualization approach, Hybrid-Bridge improves the performance of Vmst by one order of magnitude. Finally, HyperShell demonstrates a more practical approach for VMI by redirecting syscall execution.

Vmst can seamlessly bridge the semantic gap and automatically generate VMI tools. Through system-wide instruction monitoring, Vmst automatically identifies the introspection-related data in a secure-VM and redirects these data accesses online to the kernel memory of a product-VM, without any training. Vmst offers a number of new features and capabilities; in particular, it enables an in-VM inspection program to automatically become an out-of-VM introspection program. We have tested Vmst with over 25 commonly used utilities on top of a number of different OS kernels, including Linux and Microsoft Windows. The experimental results show that our technique is general (largely OS-independent) and that it introduces, on average, 9.3X overhead for Linux utilities and 19.6X overhead for Windows utilities for the introspected program compared to native in-VM execution without data redirection.

Vmst, however, is not fast enough for real-time monitoring of a VM. Hence, we presented Hybrid-Bridge, a new system that uses an efficient decoupled execution and training memoization approach to automatically bridge the semantic gap. The key idea is to combine the strengths of the offline training based approach and the online kernel data redirection based approach with a novel training data memoization and fallback mechanism at the hypervisor layer, which decouples the expensive Taint Analysis Engine (TAE) from the execution of hardware-based virtualization and moves the TAE to software-based virtualization. The experimental results show that Hybrid-Bridge reduces the performance overhead of existing binary code reuse based VMI solutions by at least one order of magnitude for many of the tested benchmark tools.

HyperShell redirects syscall execution to bridge the semantic gap and achieves fast performance. It is designed as a practical hypervisor-layer guest OS shell that has all of the functionality of a traditional shell but offers better automation, uniformity, and centralized management. This is particularly useful for cloud and data center providers that manage running VMs at a large scale. To overcome the semantic gap challenge, we introduce a reverse system call abstraction, and we show that this abstraction can significantly relieve the painful process of developing software below an OS. More importantly, we also show that this abstraction can be implemented transparently. As such, many legacy guest OS management utilities can be directly reused in HyperShell without any modification. Our evaluation with over one hundred management utilities demonstrates that HyperShell incurs a 2.73X slowdown on average compared to native in-VM execution and adds less than 5% overhead to the guest OS kernel.

Finally, we believe this dissertation has significantly removed the roadblocks in virtualization-based security, including but not limited to VMI, malware analysis, and memory forensics. It has the potential to largely change their daily practice.

REFERENCES

Bach, M. J. (1986). The Design of the UNIX Operating System. Prentice Hall.
Bahram, S., X. Jiang, Z. Wang, M. Grace, J. Li, D. Srinivasan, J. Rhee, and D. Xu (2010). Dksm: Subverting virtual machine introspection for fun and profit. In Proceedings of the 29th IEEE Symposium on Reliable Distributed Systems (SRDS 2010).
Baiardi, F. and D. Sgandurra (2007). Building trustworthy intrusion detection through vm introspection. In Proceedings of the Third International Symposium on Information Assurance and Security, pp. 209–214. IEEE Computer Society.
Bauman, E., G. Ayoade, and Z. Lin (2015). A survey on hypervisor based monitoring: Approaches, applications, and evolutions. ACM Computing Surveys.
Bellard, F. (2005). QEMU: an open source processor emulator. http://www.qemu.org/.
Bovet, D. and M. Cesati (2005). Understanding The Linux Kernel. O'Reilly & Associates Inc.
Caballero, J., N. M. Johnson, S. McCamant, and D. Song (2010, February). Binary code extraction and interface identification for security applications. In Proceedings of the 17th Annual Network and Distributed System Security Symposium (NDSS’10), San Diego, CA.
Caballero, J. and D. Song (2007). Polyglot: Automatic extraction of protocol format using dynamic binary analysis. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS’07), Alexandria, Virginia, USA, pp. 317–329.
Cadar, C., V. Ganesh, P. M. Pawlowski, D. L. Dill, and D. R. Engler (2006). Exe: automatically generating inputs of death. In CCS ’06: Proceedings of the 13th ACM conference on Computer and communications security, New York, NY, USA, pp. 322–335. ACM.
Carbone, M., W. Cui, L. Lu, W. Lee, M. Peinado, and X. Jiang (2009). Mapping kernel objects to enable systematic integrity checking. In The 16th ACM Conference on Computer and Communications Security (CCS’09), Chicago, IL, USA, pp. 555–565.
Chen, P. M. and B. D. Noble (2001). When virtual is better than real. In Proceedings of the Eighth Workshop on Hot Topics in Operating Systems.
Chow, J., B. Pfaff, K. Christopher, and M. Rosenblum (2004). Understanding data lifetime via whole-system simulation. In Proceedings of the 13th USENIX Security Symposium.

Clark, D. D. (1985). The structuring of systems using upcalls. In Proceedings of the Tenth ACM Symposium on Operating Systems Principles, SOSP ’85, Orcas Island, Washington, USA, pp. 171–180.
Cui, H., J. Wu, C.-C. Tsai, and J. Yang (2010, October). Stable deterministic multithreading through schedule memoization. In Proceedings of the Ninth Symposium on Operating Systems Design and Implementation (OSDI ’10).
Cui, W., M. Peinado, K. Chen, H. J. Wang, and L. Irun-Briz (2008, October). Tupni: Automatic reverse engineering of input formats. In Proceedings of the 15th ACM Conference on Computer and Communications Security (CCS’08), Alexandria, Virginia, USA, pp. 391–402.
Curry, T. W. (1994). Profiling and tracing dynamic library usage via interposition. In Proceedings of the USENIX Summer 1994 Technical Conference on USENIX Summer 1994 Technical Conference - Volume 1, Boston, Massachusetts.
Dinaburg, A., P. Royal, M. Sharif, and W. Lee (2008). Ether: malware analysis via hardware virtualization extensions. In Proceedings of the 15th ACM conference on Computer and communications security (CCS’08), Alexandria, Virginia, USA, pp. 51–62.
Dolan-Gavitt, B. Virtuoso: Whole-system binary code extraction for introspection. https://code.google.com/p/virtuoso/.
Dolan-Gavitt, B., T. Leek, M. Zhivich, J. Giffin, and W. Lee (2011). Virtuoso: Narrowing the semantic gap in virtual machine introspection. In Proceedings of the 32nd IEEE Symposium on Security and Privacy, Oakland, CA, USA, pp. 297–312.
Dolan-Gavitt, B., B. Payne, and W. Lee (2011). Leveraging forensic tools for virtual machine introspection. Technical Report GT-CS-11-05.
Dolan-Gavitt, B., A. Srivastava, P. Traynor, and J. Giffin (2009). Robust signatures for kernel data structures. In Proceedings of the 16th ACM Conference on Computer and Communications Security (CCS’09), Chicago, Illinois, USA, pp. 566–577. ACM.
Edge, J. (2013). Randomizing the kernel. http://lwn.net/Articles/546686/.
Egele, M., C. Kruegel, E. Kirda, H. Yin, and D. Song (2007, June). Dynamic spyware analysis. In Proceedings of the 2007 USENIX Annual Technical Conference (Usenix’07).
Fabrice, B. (2005). Qemu, a fast and portable dynamic translator. In Proceedings of the 2005 USENIX Annual Technical Conference (ATC’05), Anaheim, CA, USA.
Forrest, S., S. A. Hofmeyr, A. Somayaji, and T. A. Longstaff (1996). A sense of self for unix processes. In Proceedings of the 1996 IEEE Symposium on Security and Privacy.

Fu, Y. and Z. Lin (2012). Space traveling across vm: Automatically bridging the semantic gap in virtual machine introspection via online kernel data redirection. In Proceedings of the 2012 IEEE Symposium on Security and Privacy (SP’12), San Francisco, CA, USA, pp. 586–600.
Fu, Y. and Z. Lin (2013a). Bridging the semantic gap in virtual machine introspection via online kernel data redirection. ACM Trans. Inf. Syst. Secur. 16 (2).
Fu, Y. and Z. Lin (2013b, March). Exterior: Using a dual-vm based external shell for guest-os introspection, configuration, and recovery. In Proceedings of the Ninth Annual International Conference on Virtual Execution Environments, Houston, TX.
Fu, Y., J. Zeng, and Z. Lin (2014). Hypershell: A practical hypervisor layer guest os shell for automated in-vm management. In Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC’14, Berkeley, CA, USA, pp. 85–96. USENIX Association.
Garfinkel, T. (2003). Traps and pitfalls: Practical problems in system call interposition based security tools. In Proceedings of Network and Distributed Systems Security Symposium (NDSS’03), San Diego, CA, pp. 163–176.
Garfinkel, T. and M. Rosenblum (2003, February). A virtual machine introspection based architecture for intrusion detection. In Proceedings Network and Distributed Systems Security Symposium (NDSS’03).
Godefroid, P., M. Levin, and D. Molnar (2008, February). Automated whitebox fuzz testing. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), San Diego, CA.
Gu, Y., Y. Fu, A. Prakash, Z. Lin, and H. Yin (2012, October). Os-sommelier: Memory-only operating system fingerprinting in the cloud. In Proceedings of the 3rd ACM Symposium on Cloud Computing (SOCC’12), San Jose, CA.
Gu, Z., Z. Deng, D. Xu, and X. Jiang (2011). Process implanting: A new active introspection framework for virtualization. In Proceedings of the 30th IEEE Symposium on Reliable Distributed Systems (SRDS 2011), Madrid, Spain, October 4-7, pp. 147–156.
Hay, B. and K. Nance (2008, April). Forensics examination of volatile system data using virtual introspection. SIGOPS Operating System Review 42, 74–82.
Ho, A., M. Fetterman, C. Clark, A. Warfield, and S. Hand (2006). Practical taint-based protection using demand emulation. In Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys’06), pp. 29–41.

Hofmann, O. S., A. M. Dunn, S. Kim, I. Roy, and E. Witchel (2011). Ensuring operating system kernel integrity with osck. In Proceedings of the sixteenth international conference on Architectural support for programming languages and operating systems, ASPLOS ’11, Newport Beach, California, USA, pp. 279–290.
Inoue, H., F. Adelstein, M. Donovan, and S. Brueckner (2011, June). Automatically bridging the semantic gap using a c interpreter. In Proceedings of the 2011 Annual Symposium on Information Assurance, Albany, NY.
Intel (2005). Xed: X86 encoder decoder. http://www.pintool.org/docs/24110/Xed/html/.
Jee, K., G. Portokalidis, V. P. Kemerlis, S. Ghosh, D. I. August, and A. D. Keromytis (2012). A general approach for efficiently accelerating software-based dynamic data flow tracking on commodity hardware. In Proceedings Network and Distributed Systems Security Symposium (NDSS’12).
Jiang, X., X. Wang, and D. Xu (2007). Stealthy malware detection through vmm-based out-of-the-box semantic view reconstruction. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS’07), Alexandria, Virginia, USA, pp. 128–138. ACM.
Jiang, X., H. J. Wang, D. Xu, and Y.-M. Wang (2007). Randsys: Thwarting code injection attacks with system service interface randomization. In Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems, pp. 209–218.
Jim Chow, Tal Garfinkel, P. M. C. (2008). Decoupling dynamic program analysis from execution in virtual environments. In USENIX 2008 Annual Technical Conference on Annual Technical Conference (ATC’08), pp. 1–14.
Jones, S. T., A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau (2006). Antfarm: tracking processes in a virtual machine environment. In Proceedings of the annual conference on USENIX ’06 Annual Technical Conference, Boston, MA. USENIX Association.
Jones, S. T., A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau (2008). Vmm-based hidden process detection and identification using lycosid. In Proceedings of the fourth ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, VEE ’08, Seattle, WA, USA, pp. 91–100. ACM.
Kivity, A., Y. Kamay, D. Laor, U. Lublin, and A. Liguori (2007). kvm: the linux virtual machine monitor. In Proceedings of the Linux Symposium, Volume 1, pp. 225–230.
Kolbitsch, C., T. Holz, C. Kruegel, and E. Kirda (2010, May). Inspector gadget: Automated extraction of proprietary gadgets from malware binaries. In Proceedings of 2010 IEEE Security and Privacy, Oakland, CA.

Lin, Z., X. Jiang, D. Xu, and X. Zhang (2008, February). Automatic protocol format reverse engineering through context-aware monitored execution. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), San Diego, CA.
Lin, Z., J. Rhee, X. Zhang, D. Xu, and X. Jiang (2011, February). Siggraph: Brute force scanning of kernel data structure instances using graph-based signatures. In Proceedings of the 18th Annual Network and Distributed System Security Symposium (NDSS’11), San Diego, CA.
Lin, Z., X. Zhang, and D. Xu (2010a, February). Automatic reverse engineering of data structures from binary execution. In Proceedings of the 17th Annual Network and Distributed System Security Symposium (NDSS’10), San Diego, CA.
Lin, Z., X. Zhang, and D. Xu (2010b, June). Reuse-oriented camouflaging trojan: Vulnerability detection and attack construction. In Proceedings of the 40th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN-DCCS 2010), Chicago, IL, USA.
Liu, Z., J. Lee, J. Zeng, Y. Wen, Z. Lin, and W. Shi (2013). Cpu transparent protection of os kernel and hypervisor integrity with programmable dram. In Proceedings of the 40th International Symposium on Computer Architecture. ACM.
Luk, C.-K., R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood (2005). Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’05), Chicago, IL, USA, pp. 190–200.
Michie, D. (1968, April). "Memo" Functions and Machine Learning. Nature 218 (5136), 19–22.
Newsome, J. and D. Song (2005, February). Dynamic taint analysis for automatic detection, analysis, and signature generation of exploits on commodity software. In Proceedings of the 14th Annual Network and Distributed System Security Symposium (NDSS’05), San Diego, CA.
Payne, B. D., M. Carbone, and W. Lee (2007, December). Secure and flexible monitoring of virtual machines. In Proceedings of the 23rd Annual Computer Security Applications Conference (ACSAC 2007).
Payne, B. D., M. Carbone, M. I. Sharif, and W. Lee (2008, May). Lares: An architecture for secure active monitoring using virtualization. In Proceedings of 2008 IEEE Symposium on Security and Privacy, Oakland, CA, pp. 233–247.
Petroni, N. L., Jr., T. Fraser, J. Molina, and W. A. Arbaugh (2004, August). Copilot - A coprocessor-based kernel runtime integrity monitor. In Proceedings of the 13th USENIX Security Symposium, San Diego, CA, pp. 179–194.

Petroni, Jr., N. L. and M. Hicks (2007, October). Automated detection of persistent kernel control-flow attacks. In Proceedings of the 14th ACM Conference on Computer and Communications Security (CCS’07), Alexandria, Virginia, USA, pp. 103–115. ACM.
Pfoh, J., C. Schneider, and C. Eckert (2011, November). Nitro: Hardware-based system call tracing for virtual machines. In Advances in Information and Computer Security, Volume 7038 of Lecture Notes in Computer Science, pp. 96–112. Springer.
Portokalidis, G., A. Slowinska, and H. Bos (2006, April). Argos: an emulator for fingerprinting zero-day attacks. In Proc. ACM SIGOPS EUROSYS’2006, Leuven, Belgium.
Provos, N. (2003). Improving host security with system call policies. In Proceedings of the 12th USENIX Security Symposium, Washington, DC, pp. 18–18.
Quynh, N. A. (2010). Operating system fingerprinting for virtual machines. In DEFCON 18.
Rajagopalan, M., S. Perianayagam, H. He, G. Andrews, and S. Debray (2006). Binary rewriting of an operating system kernel. In Proc. Workshop on Binary Instrumentation and Applications.
Riley, R., X. Jiang, and D. Xu (2008, September). Guest-Transparent Prevention of Kernel Rootkits with VMM-Based Memory Shadowing. In Proceedings of Recent Advances in Intrusion Detection (RAID’08), pp. 1–20.
Saberi, A., Y. Fu, and Z. Lin (2014, February). Hybrid-bridge: Efficiently bridging the semantic-gap in virtual machine introspection via decoupled execution and training memoization. In Proceedings of the 21st Annual Network and Distributed System Security Symposium (NDSS’14), San Diego, CA.
Sekar, R. Classification and grouping of linux system calls. http://seclab.cs.sunysb.edu/sekar/papers/syscallclassif.htm.
Seshadri, A., M. Luk, N. Qu, and A. Perrig (2007, October). SecVisor: A Tiny Hypervisor to Guarantee Lifetime Kernel Code Integrity for Commodity OSes. In Proceedings of the ACM Symposium on Operating Systems Principles (SOSP’07).
Srinivasan, D., Z. Wang, X. Jiang, and D. Xu (2011). Process out-grafting: an efficient "out-of-vm" approach for fine-grained process execution monitoring. In Proceedings of the 18th ACM conference on Computer and communications security (CCS’11), Chicago, Illinois, USA, pp. 363–374.
Srivastava, A. and J. Giffin (2008). Tamper-resistant, application-aware blocking of malicious network connections. In Proceedings of the 11th international symposium on Recent Advances in Intrusion Detection (RAID’08), Cambridge, MA, USA, pp. 39–58.

Tsai, T. K. and N. Singh (2002). Libsafe: Transparent system-wide protection against buffer overflow attacks. In Proceedings of the 2002 International Conference on Dependable Systems and Networks (DSN’02), Washington, DC, USA, p. 541. IEEE Computer Society.
Venkataramani, G., I. Doudalis, Y. Solihin, and M. Prvulovic (2008). Flexitaint: A programmable accelerator for dynamic taint propagation. In Proceedings of the 4th International Symposium on High Performance Computer Architecture (HPCA’08), Salt Lake City, UT.
Walters, A. The volatility framework: Volatile memory artifact extraction utility framework. https://www.volatilesystems.com/default/volatility.
Wang, Z. and X. Jiang (2010, May). Hypersafe: A lightweight approach to provide lifetime hypervisor control-flow integrity. In 2010 IEEE Symposium on Security and Privacy, pp. 380–395.
Wondracek, G., P. Milani, C. Kruegel, and E. Kirda (2008, February). Automatic network protocol analysis. In Proceedings of the 15th Annual Network and Distributed System Security Symposium (NDSS’08), San Diego, CA.
Xi Xiong, Donghai Tian, P. L. (2011, February). Practical protection of kernel integrity for commodity os from untrusted extensions. In Proceedings of the 18th Annual Network and Distributed System Security Symposium (NDSS’11), San Diego, CA.
Yan, L.-K., M. Jayachandra, M. Zhang, and H. Yin (2012). V2e: Combining hardware virtualization and software emulation for transparent and extensible malware analysis. In Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments (VEE’12), London, UK, pp. 227–238.
Yi-Min Wang, Doug Beck, B. V. R. R. and C. Verbowsk (2005, June). Detecting stealth software with strider ghostbuster. In Proceedings of International Conference on Dependable System and Networks.
Yin, H. and D. Song (2010, January). Temu: Binary code analysis via whole-system layered annotative execution. Technical Report UCB/EECS-2010-3, EECS Department, University of California, Berkeley.
Yin, H., D. Song, E. Manuel, C. Kruegel, and E. Kirda (2007, October). Panorama: Capturing system-wide information flow for malware detection and analysis. In Proceedings of the 14th ACM Conferences on Computer and Communication Security (CCS’07).
Zeng, J., Y. Fu, and Z. Lin (2015). Pemu: a pin highly compatible out-of-vm dynamic binary instrumentation framework. In Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, pp. 147–160. ACM.

Zeng, J., Y. Fu, K. Miller, Z. Lin, X. Zhang, and D. Xu (2013, November). Obfuscation-resilient binary code reuse through trace-oriented programming. In Proceedings of the 20th ACM Conference on Computer and Communications Security (CCS’13), Berlin, Germany.
Zeng, J. and Z. Lin (2015). Research in Attacks, Intrusions, and Defenses: 18th International Symposium, RAID 2015, Kyoto, Japan, November 2-4, 2015. Proceedings, Chapter Towards Automatic Inference of Kernel Object Semantics from Binary Code, pp. 538–561. Cham: Springer International Publishing.
Zhang, F., J. Chen, H. Chen, and B. Zang (2011). Cloudvisor: retrofitting protection of virtual machines in multi-tenant cloud with nested virtualization. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, Cascais, Portugal, pp. 203–216. ACM.

VITA

Yangchun Fu received his BS degree in Computer Science from Sun Yat-sen University in 2007, his MS degree in Computer Science from South China University of Technology in 2010, and his PhD degree in Computer Science from The University of Texas at Dallas in 2016. His research interests are in system security, with a focus on the development of program analysis and virtualization techniques and their applications to virtual machine introspection, cloud management, and computer forensics. After graduation, he will join the Cloud infrastructure team at Google as a Software Engineer.