Cloud Computing Data Capsules for Non-Consumptive Use of Texts

Jiaan Zeng, School of Informatics and Computing, Indiana University, [email protected]
Guangchen Ruan, School of Informatics and Computing, Indiana University, [email protected]
Alexander Crowell, Computer Science and Engineering Division, University of Michigan, [email protected]
Atul Prakash, Computer Science and Engineering Division, University of Michigan, [email protected]
Beth Plale, School of Informatics and Computing, Indiana University, [email protected]

ABSTRACT
As digital data sources grow in number and size, they pose an opportunity for computational investigation by means of text mining, natural language processing (NLP), and other text analysis techniques. In this paper we propose a virtual machine (VM) framework and methodology for non-consumptive text analysis. Using a remote VM model, the VM is configured with software and tooling for text analysis. When the analysis is complete, the VM is wiped and its resources are released for other users to share. Our approach extends the VM by turning it into a data capsule that prevents leakage of copyrighted content in the event that the VM is compromised. The HathiTrust Research Center Data Capsules has seen early use against the HathiTrust repository of digitized books from university libraries nationwide.
Categories and Subject Descriptors
H.3.4 [Systems and Software]: Distributed systems—Cloud Computing, Data Capsules; K.6.m [Miscellaneous]: Security—Non-consumptive Use

General Terms
Architecture, Design, Implementation

Keywords
Cloud computing, Large-scale Text Mining, Non-consumptive Use, Data Capsules
1. INTRODUCTION
As increasing amounts of digitized text become available from sources such as university libraries, researchers have access to more material than they could ever hope to read in a lifetime, creating opportunities for new forms of research that employ automated analytical techniques. As an example, automated analytical techniques have been used to check literary critic Ian Watt's claims against 3,600 novels written in the eighteenth and nineteenth centuries [7]. While such a task could in principle be carried out by hand, it would be prohibitively tedious; automated analytical techniques applied to large collections of digitized texts enable new forms of discovery. One can apply topic modeling such as Latent Dirichlet Allocation (LDA) [16], for instance, at the book level to narrow from millions of texts to a few thousand, then at the page level to zero in on relevant pages, effectively pinpointing concepts down to the place where they appear on the page.

As of 2014, HathiTrust has digitized nearly 12 million volumes (books) from research libraries across the country. The HathiTrust Research Center (HTRC) [9] was recently established to provision automated analytical techniques over the text data and images of the HathiTrust digital repository. The software and services of HTRC bear a strong resemblance to a SaaS cloud data service, with the digital repository at the center as the asset of value and services provisioned for investigation. As per [12], SaaS applications are accessible from various client devices through either a thin client interface or a program interface, and the consumer does not manage or control the underlying cloud infrastructure. HTRC v2.0 already supports automated analytical techniques through a set of text mining and analysis routines that have been vetted in advance by HTRC staff. These routines run over the nearly 3 million volumes of the HathiTrust digital repository that are in the public domain. However, most texts digitized from research libraries, particularly those published after 1923, are restricted by copyright. In extending HTRC to include an additional 7 million volumes of copyrighted content, we need stricter protections than exist presently, while still allowing the community of users to apply their own custom text analysis tools to the copyrighted content.

What constrains the architectural solution? Whether it is permissible to do automated analysis on copyrighted digitized texts from university libraries has been hotly debated ever since the Authors Guild filed a class action lawsuit against Google [8] in 2008. A number of stakeholders have
argued that data and text mining should be permitted, drawing on the principle of "non-expressive" use, that is, uses that do not trade on the underlying expressive purpose of the work. In other words, text mining creates a new use and does not gain, monetarily or otherwise, from the original work. HTRC interprets "non-expressive use" as the right to carry out text analysis on the copyrighted content of the HathiTrust. The challenge is to strike the right balance between ensuring that the text analysis carried out does not violate non-expressive use, and keeping the HTRC services as flexible as possible by not overly limiting the kinds of use. HTRC does not want to have to walk through the code of its users; nor are code walkthroughs an effective process for spotting malicious code. But no user's text mining actions can be permitted, for instance, to leak even a chapter to the Internet.

Non-expressive use is also called "non-consumptive" use, where uses do not enable the human consumption of a text through reading. We use the term non-consumptive use in this paper to refer to computational analysis of copyrighted content carried out in such a way that human consumption of the texts is prohibited. Non-consumptive analysis of a large corpus of sensitive data poses several challenges. First, text mining tools vary in their computational demand: analysis could take the form of simple access to a database of features, could involve a linear traversal through digitized text, or could be a computationally intensive natural language processing task like LDA. Second, while high performance computing (HPC) allocates resources dynamically based on workload, HPC systems remain largely batch-oriented, a poor fit for text analysis, which is frequently highly interactive. The cloud computing SaaS model overcomes some of the problems of limited interactivity. Third, allowing researcher-supplied algorithms interactive access directly over the copyrighted texts opens a vulnerable channel that can violate the non-consumptive use constraint.

This paper introduces the HTRC Data Capsules, the research questions that motivate it, its design, and the open questions. We trust that a text mining researcher will not deliberately leak repository data. We do aim, however, to prevent malware acting on the user's behalf from leaking data. HTRC Data Capsules is motivated by four related research questions:

• Non-consumptive use: can the framework provide safe handling of large volumes of protected data?
• Openness: can the framework support user-contributed analysis tools (that is, not limit uses to a known set of algorithms)?
• Efficiency: can the framework support user-contributed analysis tools without resorting to code walkthroughs prior to acceptance?
• Large-scale and low cost: can the protections be extended to utilize large-scale national (public) computational resources?

In this paper, we propose a framework and methodology for non-consumptive text analysis. The framework draws on the data capsules approach [2] to preventing misuse. Using a remote virtual machine (VM) model, researchers can build a VM configured with software and tooling based on their
needs. Once an analysis is finished, the VM is wiped and its resources are released for other users to share. VMs offer no inherent protection, however. Our approach extends the virtual machine by turning it into a data capsule [2] that prevents leakage of copyrighted content in the event that the VM is compromised or the data analysis routines are malicious. HTRC Data Capsules provides the virtual machine with two modes: a maintenance mode, during which a user can access the network and install software freely but cannot access copyrighted data; and a secure mode, in which copyrighted texts become accessible to the user while network and file system access is highly constrained. In the latter mode, users are allowed access only to a predefined set of network addresses and may write only to a specific volume that is visible only in secure mode. Any other changes made to the system in secure mode, except those made to the special volume, are lost when the mode is switched from secure back to maintenance. This guards against copyrighted texts being saved in the VM in secure mode and then copied out across the network in maintenance mode.

The remainder of the paper is organized as follows: Section 2 summarizes related work; Section 3 presents an overview of HTRC and data capsules and identifies the threat model for our system; Section 4 describes the design and implementation details of our system; and Section 5 identifies future work and open questions.
2. RELATED WORK
Research that leverages the power of cloud computing for text analysis includes Teregowda et al., who employ cloud computing as the underlying infrastructure for CiteSeerX, a public search engine and digital library for scientific and academic papers [19]. Rosenthal and Vargas apply a cloud model to the LOCKSS digital preservation system [17]. Our system differs from both because of the protection requirements that non-expressive use places on the cloud platform. Numerous cloud platforms support IaaS, either commercially or in open source, including Amazon EC2 [5], Eucalyptus [13], and OpenStack [15], and researchers have characterized the differences among cloud platforms [18, 20]. We do not build our system on an existing open source cloud platform for two reasons. First, existing cloud platforms introduce numerous complexities for HTRC Data Capsules, while the data capsule notion is needed to enforce non-consumptive use policies in the VMs. The complexities stem from the need to avoid imposing unnecessary burdens on digital humanities researchers; for example, researchers should not need to configure a software-defined network to run their text mining algorithms. Second, existing cloud platforms are designed for general purpose use and expose many vulnerable channels; allowing a user to configure a network, for example, would pose a significant threat to the data capsule system. Therefore, instead of using an existing cloud platform, we built a cloud framework around data capsules that satisfies researchers' needs as well as the non-expressive use requirement. Our research also serves to demonstrate that data capsules could potentially be applied to more general cloud platforms.

With respect to analysis performed on sensitive data, the Census Bureau has established Census Research Data Centers at 18 locations across the country [3]. Potential researchers can apply to gain access, and are then allowed
to conduct research only on machines within the physical research data center. Upon completion of their approved research, the results are manually reviewed before release. Although this procedure arguably has very strong security against data leaks, it is also very restrictive: a researcher must physically travel to one of the 18 locations. The Secure Medical Research Workspace (SMRW) [6] and the CMS Virtual Research Data Center (CMS) [4] provide virtual workspaces to researchers for work with clinical data. A virtual workspace is backed by a virtual machine equipped with security software that prevents data leakage over certain channels. Both systems trust the virtual workspace; users are thus forced to use pre-installed software and are not allowed to configure the software environment of the workspace to their needs. Cloud Terminal runs a thin terminal application on an untrusted operating system to display information, while the actual computation over the data is deployed to a secure cloud environment [11]. Like SMRW and CMS, Cloud Terminal does not provide options for users to customize the software environment; moreover, it requires users to install the terminal application on their machines, which incurs an extra usability burden.

Data capsules [2] is a mechanism for containing sensitive data within a virtualized environment with the goal of minimizing the channels available to leak that data. Although data capsules provides no guarantees on the integrity of the stored data (which could be modified or deleted by an attacker), this poses little issue when data capsules is applied to non-expressive use, where the copyrighted source data can simply be made read-only. Additionally, the data capsules design is well suited to non-expressive use in that it maintains strong security guarantees on the sensitive data while being highly permissive to users, who may need to install custom software or perform other actions that would require granting many privileges in a normal environment. The application of data capsules we present in this paper does not yet provide the same level of protection against covert channels as the original data capsules system, but implementation improvements are under way that will eventually provide greater protection than the original work.
3. OVERVIEW
This research builds on the HathiTrust Research Center software and services and on data capsules, both of which are described in some detail below, along with the threat model for the HTRC Data Capsules system.
3.1 HathiTrust Research Center Background
The HathiTrust Research Center uses a set of software and services to carry out computational analysis of digitized books from the HathiTrust Digital Library for research and educational use. Figure 1 is an illustrative diagram of the system. A researcher accesses the system through one or more front ends (shown on the right). In its simplest form, the system satisfies a user's need by constructing for the user a Workset representing selected text mining tools; selected subsets of data from the HathiTrust digital repository or from feature sets extracted in advance; and data from other sources. This Workset bundle is executed on a compute cluster. As shown by the arrow pointing to both the compute resource and the front end (complexity-hiding interface), text analysis is a highly interactive process, which challenges the use of the HPC resources shown, even when, like Big Red II at Indiana University, they are architected to handle data-intensive computations.

The HTRC system is modularly architected using a web services paradigm (i.e., REST interfaces). It utilizes a Solr index and a Cassandra NoSQL store, both of which are sharded across a half-dozen machines for greater replication and availability. Analytical tools include the SEASR suite of text mining tools [1]. The HTRC architecture uses OAuth 2.0 authentication [14] and, for auditing purposes, tracks the actions of users entering through the portal. While this architecture is adequate for analysis algorithms that have been vetted by HTRC, it supports neither analysis algorithms written by external users nor adequate protection of copyrighted content. Both issues are the purview of the HTRC Data Capsules.
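As a concrete illustration of the retrieval layer, Solr exposes its standard /select HTTP query API. The sketch below shows what a volume-level query might look like; the host, collection name, and field names are assumptions for illustration, not HTRC's actual endpoints.

    # Hypothetical query against a sharded Solr index of volume metadata.
    # Host, collection, and field names are illustrative, not HTRC's actual API.
    curl -s "http://solr.example.org:8983/solr/volumes/select" \
         --data-urlencode "q=title:whale AND publishDate:[1800 TO 1899]" \
         --data-urlencode "fl=id,title,author" \
         --data-urlencode "rows=10" \
         --data-urlencode "wt=json"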
Figure 1: Functional diagram of HathiTrust Research Center software and services
3.2 Data Capsules
Data capsules is a system developed by Borders et al. [2] that allows privileged access to sensitive data while restricting the channels through which that data can be released. Capsules uses virtual machine snapshotting and simple policy-based restriction mechanisms to let a user enter a "secure" mode, in which network and other data channels are restricted in exchange for access to the protected data. When the user later returns to normal use of the system, which we call "maintenance" mode, all changes to the system except those made to the designated secure data are forgotten: the system returns to the state it was in before the switch to secure mode, thereby regaining full Internet access and the ability to make persistent changes to the virtual machine, such as installing software. In this way, network and storage channels cannot be used to leak the sensitive data. This usage model is readily applicable to non-expressive use, in that researchers can administer their system (e.g., install any required software tools) and then switch to "secure" mode to perform their analysis. Because data capsules was originally developed with use on a personal machine in mind, it had to be adapted and extended to work within the HTRC cloud environment. In HTRC, the sensitive data is copyrighted text exposed through a RESTful web service API rather than residing within a
machine. Data capsules manages the network channel between the virtual machines and the web services serving data to them. In consideration of potential future HTRC services, data capsules has a flexible network control mechanism that allows network policies for the different modes to be written by administrators. Additionally, in order to provide a cloud environment across multiple machines and to a large number of researchers, a layer is needed on top of data capsules to coordinate the different machines. We therefore implemented a web service layer above the data capsules layer that is responsible for virtual machine management (e.g., resource allocation, request scheduling, status maintenance). Section 4 gives details of the design and implementation of data capsules and the web service layer in the context of HTRC.
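The policy mechanism is described here only at a high level. The following is a minimal sketch of what an administrator-written secure-mode policy might look like as iptables rules; the data-service address, port, and interface name are placeholder assumptions, not the deployed configuration.

    # Sketch of a secure-mode network policy for one capsule's tap interface.
    # The data-service address/port and interface name are placeholders.
    DATA_SERVICE=10.0.0.5     # hypothetical address of the HTRC data service
    VM_IF=tap0                # tap interface of the capsule being restricted

    # Allow outbound traffic from the VM only to the HTRC data service (HTTPS).
    iptables -A FORWARD -i "$VM_IF" -d "$DATA_SERVICE" -p tcp --dport 443 -j ACCEPT
    # Allow return traffic for connections the VM initiated.
    iptables -A FORWARD -o "$VM_IF" -m state --state ESTABLISHED,RELATED -j ACCEPT
    # Deny everything else originating from the VM.
    iptables -A FORWARD -i "$VM_IF" -j DROP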
3.3 Threat Model
Under the HTRC Data Capsules usage model, users access copyrighted data through remotely accessed virtual machines that read the data from a network-accessed data service. The virtual machine presented to the user is not part of the trusted computing base (TCB). We assume the possibility of malware being installed, as well as other remotely initiated attacks on the VM. Such attacks could compromise the entire operating system and install a rootkit, both of which would be undetectable to the end user. The end users themselves are assumed to act in good faith, though this does not preclude the possibility of them unwittingly allowing the system to be compromised. We believe this is a reasonable assumption since users are required to sign a use agreement before using our system.

Beyond the virtual machine itself, the virtual machine manager (VMM) and the host it runs on are both trusted and fall within the TCB. This includes the system services that enforce network and data access policies for the virtual machines. Finally, the HTRC data service itself is also part of the TCB.

Users have VNC access to their virtual machines so that they have a graphical interface to the machine. This access admittedly provides a channel for potential data leakage. For now, we apply our assumption that the end user acts in good faith, and further assume that the user is the only one accessing the virtual machine. In future work, we could monitor channel traffic to automatically detect potential abuses that leak data. Additionally, a channel is provided to the user for releasing results when their research is complete. Although we currently intend for released results to be downloaded via a link sent to the end user's email inbox, the released data could also be subjected to manual or automated review to detect potential abuses; we likewise leave this for future work.

Another potential threat is that of covert channels between virtual machines running on the same host. For instance, a virtual machine running in secure mode could use such a channel to leak data to a co-resident virtual machine running in maintenance mode, which could in turn leak the data anywhere it pleases. We acknowledge that this could pose a serious threat, but leave addressing it for future work.
4. DESIGN AND IMPLEMENTATION
The high-level architecture of the HTRC Data Capsules (see Figure 2) consists of three layers, from bottom to top: a back end, a web service layer, and a web front end.
The back end consists of the physical machines and hypervisor software that run the virtual machines of the data capsules. It also includes the data capsules implementation itself, a set of scripts that wrap hypervisor commands to perform its various tasks. The database, image store, and volume store are also part of this layer. The database is used by the web service layer to maintain persistent state for virtual machines and operations. Virtual machine images and secure volumes are stored on NFS, a distributed file system. The web service layer is the central controller of the system. It is responsible for resource allocation, request scheduling, state maintenance, and failover at the physical layer. It also has an audit component that logs user activity. Finally, we expose the system through the web front end layer, which consists of a web UI through which users interact with the system and a user authentication server that validates users' identities using OAuth 2.0 [14].
Figure 2: HTRC Data Capsules architecture
4.1 Workflows
There are two major control flows within the HTRC Data Capsules system: virtual machine operations and virtual machine access. Each is described in turn.
4.1.1 Virtual machine operations
The control flow of the system, as shown in Figure 3, starts at the web UI, where a user request is authenticated through an OAuth 2.0 server. The request and authentication token are forwarded to the web service. Taking a VM create request as an example: upon request arrival, the web service validates the token against the OAuth 2.0 server a second time, in case the web UI has been compromised. Upon successful validation, the web service prepares the invocation of the corresponding hypervisor script. It makes several scheduling decisions, including selecting the host on which to launch the VM, allocating ports for the VM, and retrieving VM image information from the database. Once preparation is complete, it calls the hypervisor scripts remotely to launch a VM on the chosen host, monitors the response, and updates the VM state in the database. Meanwhile, the web service returns the information required to log into the VM (e.g., its VNC access port) to the user through the web UI. Section 4.1.2 gives VM access details. Other VM operations (shutdown, delete, launch, and capsule mode switch) follow the same path as described here.
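The REST API itself is not specified in this paper. As a client-side illustration only, a create request under this flow might look like the sketch below, with a hypothetical endpoint and parameter names and the OAuth 2.0 token carried as a Bearer header.

    # Hypothetical client-side view of a VM create request. The endpoint path
    # and parameter names are illustrative, not HTRC's published API.
    TOKEN="..."   # OAuth 2.0 access token obtained from the authentication server

    curl -s -X POST "https://htrc.example.org/api/vm/create" \
         -H "Authorization: Bearer $TOKEN" \
         -d "imageName=ubuntu-textmining" \
         -d "memory=4096" \
         -d "vcpus=2"
    # A successful response carries what is needed to log in, e.g. the
    # VNC host and port of the newly created capsule.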
Figure 3: System control flow
4.1.2 Virtual machine access
An HTRC data capsule is in essence a workspace within which a user executes analysis tasks against copyrighted data. In this first version the workspace is stateless, in that it retains no state between sessions. The data flowing in and out of the VM in support of analysis activity requires scrutiny for the security issues it may expose. Copyrighted digitized texts, indexes, and features are exposed through the HTRC data subsystem as read-only data. Tools running within a VM can access the data subsystem only while the VM is in secure mode. The read-only property prevents malicious modification of the data and guarantees data integrity, while the network constraints imposed in secure mode guard against data leakage. Legitimate results of analysis carried out within the workspace need to be free to leave the system. For this, HTRC Data Capsules forces such results to be written to a secure volume, over which the HTRC Data Capsules system has complete control of how and when data is transferred out to the user. For instance, it could require scrutiny of the output data by a human reviewer before results are released to the user.

There are two situations during user access of the data subsystem in which copyrighted data may leak from the HTRC Data Capsules system: one is through the network; the other is when data is copied from secure mode to maintenance mode via the local disk and leaked through the network in maintenance mode. Each is described in turn below.

For the former, considering a single VM, we block arbitrary network access while a user accesses copyrighted data, but keep the network open for the user to download and configure the software environment when access to copyrighted data is not needed. Data capsules provides two VM modes, maintenance mode and secure mode, to support different network controls for the same VM. In maintenance mode, users can access the network without constraint, although the HTRC data service is blocked by firewall policy. In secure mode, users can access only the HTRC data service; all other network access is denied.

Figure 4 delineates a typical flow when a user accesses a VM, and Figure 5 shows screenshots from both the web UI and VNC sessions. First, the user creates and starts a VM through the web UI as described in Section 4.1.1. When the VM is up and running, the user can log into it through any VNC client using information provided by the web UI (Step 1 in Figure 4), e.g., host name and port number. Note that the VNC session is the only channel through which the user can access the VM in secure mode. By default, users log into the VM in maintenance mode, where they can install software from the Internet, upload programs from their desktop, and so on (Step 2 in Figure 4). Figure 5(a) shows the web UI displaying the VM status as "running" and mode as "maintenance", and Figure 5(b) shows a screenshot of the VNC session in maintenance mode. If the user wants to access the HTRC data service, she needs to switch the VM mode from maintenance to secure through the web UI (Steps 3 and 4 in Figure 4). In practice, the VNC screen freezes for a short time (usually 2 to 5 seconds) during the mode switch. In secure mode, the user has no network access except to the HTRC data service. Figure 5(c) shows the web UI displaying the VM status as "running" and mode as "secure", and Figure 5(d) shows the VNC session screenshot. Compared with Figure 5(b), the secure volume has been mounted to the VM in Figure 5(d).

The second leak scenario is copying data from secure mode to maintenance mode via the local disk and leaking it through the network in maintenance mode. In secure mode, copyrighted data may be copied from the HTRC data service to the VM's local disk. When the VM is switched to maintenance mode, the copyrighted data on the local disk would be visible and could be copied out through the network. To prevent such a leak, data capsules checkpoints the VM image before it enters secure mode; that is, it takes a snapshot of the VM using the qemu command. When the VM switches from secure mode back to maintenance mode, data capsules restores the checkpoint, i.e., the snapshot image, through the qemu command. Therefore, all data written to the VM's local disk in secure mode is wiped out when the VM switches from secure to maintenance mode. Meanwhile, to let users persist data from secure mode, data capsules provides a secure volume to which the user can write; it is visible only in secure mode and appears as an external drive to the VM. The secure volume stores data persistently and is detached from the VM while the VM is in maintenance mode.
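The paper states only that snapshot and restore are done "using the qemu command." One plausible realization, sketched below, drives the QEMU human monitor over a Unix socket; the socket path, snapshot tag, and drive/device identifiers are assumptions, and savevm/loadvm additionally assume a qcow2-backed disk image.

    # Sketch of the mode switch at the hypervisor, assuming the VM exposes a
    # QEMU human monitor on a Unix socket. Paths and identifiers are hypothetical.
    MONITOR=/var/run/capsules/vm42.monitor
    mon() { echo "$1" | socat - "UNIX-CONNECT:$MONITOR"; }

    # Entering secure mode: checkpoint the full VM state, then attach the
    # secure volume so it appears to the guest as an external drive.
    mon "savevm maintenance_checkpoint"
    mon "drive_add 0 if=none,file=/nfs/volumes/vm42-secure.img,id=securevol"
    mon "device_add usb-storage,drive=securevol,id=securedev"

    # Returning to maintenance mode: detach the volume and roll the VM back,
    # discarding anything written to the local disk while in secure mode.
    mon "device_del securedev"
    mon "loadvm maintenance_checkpoint"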
Figure 4: Virtual machine access flow

When a user finishes an analysis and saves the final result to the secure volume, they notify HTRC Data Capsules by executing a special script within the VM. The script notifies the back end, which copies the result out of the secure volume and uploads it through the web service; the web service stores the result in an external storage system (a MySQL database in
(a) Web UI screenshot of the VM in maintenance mode. The mode box at bottom right shows "Mode: MAINTENANCE".
(b) VNC screenshot of the VM in maintenance mode. The secure volume is not available in this mode.
(c) Web UI screenshot of the VM in secure mode. The mode box at bottom right shows "Mode: SECURE".
(d) VNC screenshot of the VM in secure mode. The secure volume is mounted.
Figure 5: Screenshots of the web UI and VNC sessions

the current implementation) and notifies the user by emailing a download link for the results to the user's registered email address.
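The release script described before Figure 5 is specified only functionally in this paper. Below is a minimal sketch of what such an in-VM script might look like, under our assumption (not the paper's stated mechanism) that a marker file on the secure volume is the signal the trusted host side watches for.

    #!/bin/sh
    # Sketch of the in-VM result-release script. The mount point and the
    # RELEASE_REQUESTED marker-file convention are assumptions.
    SECURE_VOL=/mnt/secure_volume

    if [ ! -d "$SECURE_VOL" ]; then
        echo "Secure volume not mounted; switch the capsule to secure mode first." >&2
        exit 1
    fi

    # Results the user wants released are expected under $SECURE_VOL/release.
    mkdir -p "$SECURE_VOL/release"
    # Writing the marker signals the trusted back end, which copies the results
    # out, uploads them through the web service, and emails a download link.
    date -u +"%Y-%m-%dT%H:%M:%SZ" > "$SECURE_VOL/RELEASE_REQUESTED"
    echo "Release requested; a download link will be sent to your registered email."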
4.2 Implementation Details

4.2.1 Backend layer implementation
The back end implementing the functionality of data capsules is a set of Linux shell scripts that manage the qemu VMM. These scripts are invoked and run remotely by the web service over a secure shell (SSH) connection. qemu provides the live snapshotting used when a user transitions to secure mode: a snapshot is taken of the entire virtual machine state, after which any further use of the system is recorded in memory and does not persist across shutdown or any other operation. Upon return to maintenance mode, the earlier snapshot is restored and all system progress that followed it is forgotten. The virtual machines are isolated from the rest of the host's network through the use of a tap interface, a virtual network device operating at the data link layer; their connections are then bridged to the host network. Network policies for both modes are implemented using iptables rules. These rules can be written in custom policy files specified at mode switch, allowing any policy the administrator desires. In addition to the policy rules, some basic rules are always enforced, for example preventing communication between virtual machines residing on the same host.
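As one concrete reading of this setup, the sketch below creates a tap device, bridges it to the host network, and installs the always-enforced base rule blocking traffic between co-resident capsules. Interface and bridge names are placeholders, and bridged traffic is assumed to traverse iptables (e.g., via br_netfilter).

    # Sketch of per-VM host networking; names are placeholders.
    BRIDGE=br0
    TAP=tap-vm42

    # Create a tap device and attach it to the host bridge; qemu is later
    # started with this tap as its network backend.
    ip tuntap add dev "$TAP" mode tap
    ip link set "$TAP" master "$BRIDGE"
    ip link set "$TAP" up

    # Base rule, enforced regardless of mode: drop direct VM-to-VM traffic
    # between capsules bridged on this host (tap-vm+ is an interface wildcard).
    iptables -A FORWARD -i tap-vm+ -o tap-vm+ -j DROP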
4.2.2 Web service layer implementation
The web service is implemented as an asynchronous RESTful service using the Jersey framework [10]. Since hypervisor operations take some time to finish, in this early version we decouple the response from the actual execution for better responsiveness. For instance, when a start-VM request arrives, the web service immediately returns "start-pending" as the VM state rather than waiting until VM startup finishes to return the "start" state. The web service also accounts for failures in the back-end layer by automatically retrying failed connections and expiring long-running SSH sessions.
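From the client side, this asynchronous pattern looks like the sketch below: the start call returns a pending state immediately, and the caller polls until the hypervisor-side operation completes. Endpoint paths, field names, and state strings are hypothetical.

    # Hypothetical client view of the asynchronous start operation.
    TOKEN="..."
    BASE="https://htrc.example.org/api"

    curl -s -X POST "$BASE/vm/start" -H "Authorization: Bearer $TOKEN" -d "vmId=vm42"
    # => {"vmId": "vm42", "state": "start-pending"}   (returned immediately)

    # Poll the VM state until the back-end scripts report the VM as running.
    until curl -s "$BASE/vm/show?vmId=vm42" -H "Authorization: Bearer $TOKEN" \
            | grep -q '"state": *"running"'; do
        sleep 2
    done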
5. DISCUSSION AND FUTURE WORK
This paper proposes a cloud framework that enforces non-consumptive restrictions by extending the data capsules of Borders et al. [2]. Early results are promising. Under the assumption that users are trustworthy and do not cooperate with one another, the current architecture meets the low-cost requirement by offering a cloud computing solution to users; it satisfies the non-consumptive use restriction by restricting user network activity in VMs in secure mode, while preserving openness by allowing users free network access in maintenance mode. The proposed approach precludes the need to carry out code walkthroughs of users' analysis tools that run against the HTRC data service; instead, we rely on data capsules to prevent leaks through the network.

Initial users of HTRC Data Capsules report that they are able to access the Internet and make persistent changes to their VMs in maintenance mode. They also report that in secure mode they can access the HTRC data service and the secure volume but not the Internet, and that they can neither make persistent changes to their VMs nor reach other users' VMs over SSH. This early feedback suggests that the system is able to enforce the non-consumptive use constraint over copyrighted text while preserving as much flexibility as possible for users. Studies under heavier workloads are needed. The transition between maintenance mode and secure mode takes about 2 to 5 seconds from the user's perspective; this wait interval is an area of further study.

There remain a number of important open issues and questions. In this first version of the system, we assume that the user is trustworthy, an assumption whose removal raises three important open questions. First, users could intentionally deploy algorithms that leak copyrighted data through the VNC channel: how are protections enforced under intentional maliciousness at the application level? Second, an algorithm could intentionally obscure a final result by encoding copyrighted text within it. Third, covert channels may exist between VMs running on the same host. To prevent leaks from a single VM, a solution might be to add an analysis component to both the VNC channel and the final result to automatically detect potential data leaks. To prevent covert channels, one solution would be to run VMs of different modes on different hosts in order to guarantee stronger isolation.

In the existing version of the system, once an analysis is finished, the VM is wiped and its resources are released for other users to share. Yet it is highly likely that a user will develop a research investigation over months. Their interaction with the HathiTrust volumes they have identified for their work may involve building up a carefully annotated corpus over time, and this information may reside, for instance, in a database stored on the VM. How does HTRC Data Capsules support such stateful VMs?

Furthermore, the current solution limits users to running their analysis on a single VM. While we think this limitation is viable for 80% of the anticipated uses, based on experience with HTRC users, the solution needs extending to accommodate analyses that require a distributed cluster, such as running MapReduce on a university cluster. More work needs to be done on both data capsules and the web service to securely support execution of large-scale applications that run across multiple VMs.

Finally, the HTRC Data Capsules framework has not been applied in settings other than protecting copyrighted content in HTRC. An interesting study would be the use of cloud computing with data capsules in other domains of protected data.
6. ACKNOWLEDGMENTS
This research is funded through a grant from the Alfred P. Sloan Foundation (grant #2011-6-27) and is based in part on work supported by the National Science Foundation under grant #0903629 and by Samsung. Special thanks to Samitha Liyanage, Milinda Pathirage, and Zong Peng at Indiana University, and Earlence Fernandes and Ajit Aluri at the University of Michigan for discussions contributing to this work. Any opinions, findings, conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
7. REFERENCES
[1] B. Ács et al. Meandre data-intensive application infrastructure: Extreme scalability for cloud and/or grid computing. In 2011 Int'l Conf. on New Frontiers in Artificial Intelligence, pages 233–242. Springer-Verlag, 2011.
[2] K. Borders, E. V. Weele, B. Lau, and A. Prakash. Protecting confidential data on personal computers with storage capsules. In 18th USENIX Security Symposium, SSYM'09, pages 367–382. USENIX Association, 2009.
[3] Census Research Data Centers. https://www.census.gov/ces/rdcresearch/.
[4] CMS Virtual Research Data Center. http://www.resdac.org/cms-data/request/cms-virtual-research-data-center/.
[5] Amazon EC2. http://aws.amazon.com/ec2/.
[6] M. Shoffner et al. The Secure Medical Research Workspace: An IT infrastructure to enable secure research on clinical data. Clinical and Translational Science, 6(4):222–225, April 2013.
[7] http://news.stanford.edu/news/2010/december/jockers-digitize-texts-120110.html.
[8] Google book settlement. http://en.wikipedia.org/wiki/Authors_Guild_v._Google.
[9] HathiTrust Research Center. http://www.hathitrust.org/htrc/.
[10] Jersey. https://jersey.java.net/.
[11] L. Martignoni et al. Cloud Terminal: Secure access to sensitive applications from untrusted systems. In USENIX Annual Technical Conf., USENIX ATC'12. USENIX Association, 2012.
[12] P. M. Mell and T. Grance. The NIST definition of cloud computing. Technical Report SP 800-145, National Institute of Standards and Technology, 2011.
[13] D. Nurmi et al. The Eucalyptus open-source cloud-computing system. In IEEE/ACM Int'l Symp. on Cluster Computing and the Grid, CCGRID '09, pages 124–131. IEEE Computer Society, May 2009.
[14] OAuth 2.0. http://oauth.net/2/.
[15] OpenStack. http://www.openstack.org/.
[16] B. Plale. Opportunities and challenges of text mining the HathiTrust Digital Library. November 2013.
[17] D. Rosenthal and D. Vargas. Distributed digital preservation in the cloud. Int'l Journal of Digital Curation, 8(1):107–119, 2013.
[18] P. Sempolinski and D. Thain. A comparison and critique of Eucalyptus, OpenNebula and Nimbus. In IEEE 2nd Int'l Conference on Cloud Computing Technology and Science, CloudCom '10, pages 417–426. IEEE Computer Society, November 2010.
[19] P. Teregowda, B. Urgaonkar, and C. Giles. Cloud computing: A digital libraries perspective. In IEEE 3rd Int'l Conference on Cloud Computing, CLOUD '10, pages 115–122. IEEE Computer Society, July 2010.
[20] G. von Laszewski, J. Diaz, F. Wang, and G. C. Fox. Comparison of multiple cloud frameworks. In IEEE 5th Int'l Conference on Cloud Computing, CLOUD '12, pages 734–741. IEEE Computer Society, June 2012.