Computational Drug Screening in the Cloud Using HierVLS/PSVLS

Thomas Sitter (1), Darryl L. Willick (2,3), and Wely B. Floriano (4,5,*)

(1) Bioinformatics Program, Lakehead University, Thunder Bay, ON P7B 5E1, Canada
(2) SciReal LLC, Grand Marais, MN 55604, USA
(3) Technology Services Centre, Lakehead University, Thunder Bay, ON P7B 5E1, Canada
(4) Thunder Bay Regional Research Institute, Thunder Bay, ON P7A 7T1, Canada
(5) Department of Chemistry, Lakehead University, Thunder Bay, ON P7B 5E1, Canada
(*) corresponding author

Abstract - Cloud computing is a rapidly growing platform for business and scientific software applications. It has key advantages over traditional high performance computers, particularly for smaller institutions without specialized staff and resources. This paper describes the transfer of scientific software used to calculate the binding interactions of small molecules to proteins from a high performance computing environment to the OpenStack cloud computing environment.

Keywords: cloud computing, virtual ligand screening, molecular docking, HPC, HierVLS, PSVLS.

1. Introduction

The number of genetic disorders with a known molecular basis is growing at an ever-quickening pace [1]. This growth in knowledge has allowed researchers to focus on personalized medicine and disease-specific molecular imaging agents (also known as molecular probes), both of which involve small molecules that bind selectively to proteins associated with a disease of interest. Currently, the discovery of new drugs and molecular probes is largely performed experimentally using high-throughput screening, where millions of compounds are characterized in pharmacological tests looking for a desired activity. This method is prohibitively expensive for many research institutions and requires advanced robotics and high-speed computers, and thus is normally found only in industry [2]. For small to medium sized institutions, it is therefore necessary to develop low-cost methods for discovering drugs and molecular probes.

A dramatic increase in the number of known protein structures over the last decade has enabled the development of entirely computational methods for high-throughput ligand screening. These methods are usually referred to as Virtual Ligand Screening (VLS) and involve molecular docking of libraries of chemical compounds into the three-dimensional (3D) structure of a protein target. The virtual libraries usually range from hundreds to hundreds of thousands of compounds. Although much cheaper than experimental high-throughput screening, VLS methods are computationally demanding and typically require high performance computing clusters (HPCCs) to run complex simulations. The costs associated with HPCC hardware purchase, maintenance, and system administration are often out of reach for small institutions and research groups not focused on high performance computing. The recent boom in online services for on-demand computational infrastructure, called 'cloud computing', may present an affordable way for many research groups to overcome the need for expensive high performance computers in virtual screening. In this paper, we report on the creation of a testbed cloud system using OpenStack and on the porting of a working implementation of the virtual screening software suite HierVLS/PSVLS onto it.

2. The Virtual Ligand Screening Protocol HierVLS and Its Multiple Binding Site Version PSVLS

HierVLS is a software suite that uses computational simulations to discover new molecular probes and medicinal drugs [3]. Originally designed to target a single binding site within the target protein, HierVLS was later expanded to screen the ligand library against all available binding pockets in the structure of the target protein, an approach referred to as Protein Scanning with Virtual Ligand Screening (PSVLS) [4-5]. PSVLS consists of three main software components used for finding potential ligand binding sites within the structure of a target protein, docking a ligand into a binding site, and calculating binding energy. Potential binding sites are found automatically using the experimentally determined model of the protein of interest. Using progressively more complex calculations, 3D models of ligands are docked into the binding pockets in a variety of conformations and orientations. The binding energy and buried surface area are calculated, and the least promising conformations and orientations are discarded prior to the next set of calculations. This minimizes the computational cost of virtual ligand screening while still ensuring that realistic binding scores are calculated for the most promising ligands. Because the simulations use a variety of individual software components, PSVLS must format and pass data files between programs to calculate the binding potential of the library of small compounds. Since PSVLS can be complicated to configure and run, a graphical user interface (GUI) called Cassandra was developed to provide a user-friendly way to set up and launch the calculations and to simplify data handling and analysis [6]. PSVLS provides binding affinities and bound structures for each ligand in the virtual screening library in each of the potential binding sites in the target protein.
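The hierarchical filter-and-refine strategy described above can be sketched in a few lines. The sketch below is purely illustrative: the stage names, scoring functions, and keep fractions are hypothetical placeholders, not the actual HierVLS parameters or code.

```python
# Illustrative sketch of hierarchical filtering (not actual HierVLS code).
# Each stage scores the surviving poses with a costlier method, then
# discards the least promising fraction before the next stage.

def run_stage(poses, score_fn, keep_fraction):
    """Score every pose and keep only the best-scoring fraction."""
    scored = sorted(poses, key=score_fn)  # lower score = better binding
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]

# Hypothetical stages of increasing cost; real HierVLS uses docking,
# buried-surface-area, and binding-energy calculations at these steps.
cheap_score = lambda pose: pose["rough_energy"]
costly_score = lambda pose: pose["refined_energy"]

# Fabricated example poses, standing in for docked conformations.
poses = [{"id": i, "rough_energy": (i * 7) % 13, "refined_energy": (i * 3) % 11}
         for i in range(100)]

survivors = run_stage(poses, cheap_score, keep_fraction=0.25)    # 25 remain
finalists = run_stage(survivors, costly_score, keep_fraction=0.2)  # 5 remain
print(len(survivors), len(finalists))
```

The point of the structure is that the expensive scoring function is only ever applied to the small set of poses that survived the cheap one.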

3. High Performance Computing Clusters

A high performance computing cluster (HPCC) is a set of relatively inexpensive commodity computers, called "compute nodes", connected together. The user interacts with a "head" node, which controls the compute nodes. The more compute nodes an HPCC has, the more calculations it can perform at the same time. A head node includes resource management software which, among other things, monitors the status of compute nodes, dispatches (schedules) jobs, and retrieves and returns results to users. HierVLS/PSVLS was originally designed to work with the open source Torque/PBS resource manager (http://www.adaptivecomputing.com/products/open-source/torque/). A discrete set of calculations submitted through the management software to the HPCC is often called a 'job'. HierVLS/PSVLS is massively parallel, with the number of independent jobs equal to the number of identified binding sites in the 3D structure of the protein target multiplied by the number of ligands to be screened.
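This fan-out of independent jobs can be illustrated with a short sketch that generates one Torque/PBS job script per (site, ligand) pair. The #PBS directives are standard Torque syntax, but the script template and the driver name `psvls_dock.pl` are hypothetical stand-ins, not the actual PSVLS job scripts.

```python
# Sketch of how PSVLS-style jobs fan out: one independent job per
# (binding site, ligand) pair. The program name "psvls_dock.pl" and the
# template are illustrative; the #PBS lines are standard Torque/PBS.

sites = ["site1", "site2"]
ligands = ["ethyl_lactate", "ethylene", "ethyl_acetate", "acetic_acid"]

PBS_TEMPLATE = """#!/bin/sh
#PBS -N dock_{site}_{ligand}
#PBS -l nodes=1:ppn=1
#PBS -j oe
cd $PBS_O_WORKDIR
perl psvls_dock.pl --site {site} --ligand {ligand}
"""

# Number of independent jobs = binding sites x ligands.
jobs = [PBS_TEMPLATE.format(site=s, ligand=l) for s in sites for l in ligands]
print(len(jobs))
```

Each generated script would normally be handed to the resource manager with `qsub`, which queues it for execution on any free compute node.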

4. Cloud Computing

Cloud computing has been growing in popularity as an alternative to HPCCs. The usage demand for an HPCC can vary significantly over time, especially for online businesses. This has created a market for dynamic and highly available computational resources. By renting computational resources on demand, companies use only as much computing as they need at any particular time. This contrasts with traditional HPCCs, where businesses have to buy a system sized for peak usage and then leave the extra resources sitting unused during off hours. This need for highly available resources gave rise to "cloud computing": many commodity computers linked together with software so that they can be rented on an hourly basis as a service [7].

Virtualization software can be used to simulate multiple computers on one physical computer, transforming many commodity computers into flexible computing resources. Computers with very different physical hardware can simulate identical virtual machines (VMs). VMs are assigned virtual memory and virtual CPUs from the physical computer, and can be provided with preinstalled software and settings. Complex software and configurations can be bundled in a virtual machine and then deployed many times over to emulate hundreds or thousands of physical computers. This architecture provides many unique advantages over high performance computers in terms of distribution and scalability.

Cloud services come in three major varieties: Software as a Service (SaaS), Platform as a Service (PaaS), and Infrastructure as a Service (IaaS). In SaaS, the software and data are stored on cloud computers and accessed over the internet. The virtual machines running on the rented cloud computers have all the necessary software installed, so when demand increases more virtual systems are launched. This can save considerable effort and does not require specialized in-house staff to install and configure new systems. Some common examples of SaaS applications are Google Drive (drive.google.com), Apple iCloud (www.icloud.com), and DropBox (www.dropbox.com). IaaS allows customers to rent the physical computers so that they can deploy their own virtual machines and, if desired, offer their own Software as a Service. PaaS provides an interface for customers to design and run software.

Some businesses construct their own cloud systems to meet their computational needs, a model called a "private cloud", in contrast to "public clouds", which are supplied by cloud providers. Others have constructed hybrid clouds, mixing private clouds for most of their needs with rented public clouds at peak demand. One reason for constructing a private cloud is data privacy. Since the security of data stored in public clouds is not under the control of the business, the data may be vulnerable to untrusted access [8]. Another important aspect to consider is the dependency that may be created by developing software for a specific cloud vendor, which can make it difficult to transfer the software to a different cloud provider if the original one goes out of business or is no longer competitive [8]. To avoid this difficulty we chose a cloud provider, Rackspace, that uses open source cloud software built on open standards and used by other cloud providers such as HP Cloud Services (https://www.hpcloud.com) and Cloudscaling (www.cloudscaling.com).

Scientists around the world are now moving from the planning and testing phase to mature cloud systems [9-14]. For example, using IaaS, scientists at the Large Hadron Collider deploy approximately 500 virtual machines in parallel, with peaks of up to 1,000 virtual machines, to perform calculations [15]. This cloud system has been in operation for two years and has completed over 500,000 jobs [15]. Virtualization allows them to create custom Linux images with older operating systems and software that are insulated from changing technology by the virtualization layer. Virtual machine images like these can also be used by other researchers who would like to perform the same calculations, following the software as a service cloud model. This is the model we have adopted for HierVLS/PSVLS.

5. Porting Software to Cloud Computers

As cloud computing becomes increasingly common, it is necessary to design scientific computational software that can use these resources instead of traditional HPCCs. In this context, methods for porting HPCC software over to cloud computers are becoming increasingly important. Because cloud computers use virtual machines, the first step in a port is the transfer of the software to a VM using virtualization software. VMs can be launched on local computers using virtualization software and shown to produce consistent results within a small virtual HPCC environment before being launched on a cloud test environment. The final step is to launch a preconfigured VM onto an IaaS cloud. This allows the scientific software application to be offered as SaaS for academic and industrial researchers. Currently, two major cloud providers specializing in IaaS are Amazon (http://aws.amazon.com/ec2/) and Rackspace (www.rackspace.com). Both providers have comparable prices, but Rackspace uses open source cloud software (OpenStack) and provides the Rackspace Private Cloud software for constructing local (or "private") clouds. This was another deciding factor in porting HierVLS/PSVLS to OpenStack. The overall strategy adopted in this project is presented in Figure 1.

Figure 1. Schematic diagram of the various steps required to build a test cloud and deploy preconfigured instances into a virtual HPCC. VM images must be constructed of the Head and Compute nodes, and then they must be loaded in the Cloud image repository so that instances can be launched.

6. Strategy to Port HierVLS/PSVLS to a Cloud Environment

Prebuilt virtual machine images tested on OpenStack are freely available online from rackerjoe (https://github.com/rackerjoe/oz-image-build). The HPCC in our lab uses CentOS v5.3 (www.centos.org) to run HierVLS/PSVLS. CentOS is an open source Linux distribution based on Red Hat Enterprise Linux (www.redhat.com). Therefore, a Red Hat 5 update 6 image was chosen because it was the most closely related. To configure and test our virtual machines without having to rent cloud computers, we constructed a private cloud using two nodes from an HPCC available in our lab. Our test cloud consisted of one control node (Intel Xeon E5520, 2.27 GHz, 12GB RAM) and one compute node (Intel Xeon X5550, 2.67 GHz, 24GB RAM). The cloud software used was Rackspace Private Cloud Alamo v2.0, an easy to install and configure cloud software package distributed freely by Rackspace (http://www.rackspace.com). Included are services for managing virtual images, creating and launching virtual machine instances, and a graphical web interface for managing the OpenStack cloud services, called the Dashboard.

The virtual machines must be configured to run the HierVLS/PSVLS calculations and emulate the behavior of an HPCC. On our lab HPCC, the Operating System (OS) image is stored on the head node and pushed to the compute nodes when they are connected to it. Torque is the distributed resource management software used by HierVLS/PSVLS to manage the compute nodes. It is an open-source job scheduler based on the Portable Batch System (PBS) originally developed by NASA. The head node runs a Torque server daemon for submitting and managing jobs, as well as the Torque scheduler, which implements a simple First In, First Out (FIFO) protocol for jobs. Also available is a more advanced scheduler called Maui v3.2.5 (http://www.adaptivecomputing.com/products/open-source/maui/), which integrates with Torque v2.3.6 to queue jobs so that computing resources are used efficiently and fairly among multiple HPCC users.

7. Procedures

7.1. Determining Components and Dependencies

HierVLS uses a hierarchical approach to virtual ligand screening, i.e. there are multiple calculations of increasing complexity separated by filtering steps. HierVLS and Cassandra are not a bundled software package, but rather many software components linked together. A series of Perl and other scripts drive the necessary computations for virtual ligand screening. Perl is a common programming language with powerful tools for processing text, ideal for analyzing the data generated by each program that comprises HierVLS and transferring it to the next component. Besides passing the data files between programs in the appropriate order, other tasks include converting file formats using an open source chemical data file converter called OpenBabel [16], scanning through ligand files to remove the ones with poor binding scores, and consolidating the results at each step. The Cassandra GUI [6] allows users to input the chemical compound and protein data files, which are then organized into a "Project" so that subsequent data analysis is simplified. Other components of HierVLS include compiled executable files, some of which rely on environment variables, runtime libraries, and other files stored at specific locations (Figure 2). The function of each component is listed in Table 1. The Perl script PSVLS launches the individual components and manages the application of HierVLS to each binding site available in the protein. Jobs are submitted to the HPCC via Torque/PBS.

Table 1. Main software components of HierVLS.

Component  | Reference | Function
PASS       | [17]      | Identifies putative binding sites
Dock 4.0   | [18]      | Generates multiple ligand-protein docked configurations (>10,000) without scoring or selection
Dms        | [19]      | Calculates the molecular surface of a molecule
HBPlus     | [20]      | Calculates hydrogen bonds
Connolly   | [21]      | Calculates accessible surface area
OpenBabel  | [16]      | Converts chemical file formats

7.2. Creating an OpenStack Private Cloud

Figure 2. HierVLS/PSVLS components. Rectangles indicate Perl script components, ovals are compiled software components, and the hexagon represents specific libraries that must be included in the installation.

As a testbed, we used two nodes from our in-house HPCC to construct an OpenStack cloud system. The Rackspace Private Cloud software Alamo v2.0 was used to install the OpenStack cloud software. Rackspace Alamo was installed with Ubuntu 12.04 as the host operating system, Chef for network configuration, and the OpenStack Folsom release (http://www.openstack.org/software/folsom/). Using an Alamo install DVD allowed for a relatively easy installation process on bare hardware, creating a compute node with image and object storage, block storage, and computing services, as well as the Dashboard, a web interface for interacting with the control node to handle tasks such as launching and terminating virtual machine instances and creating virtual machine images. Each node in an OpenStack cloud must have a connection to the internet during installation; one DVD is used to install both compute and control nodes, so it is necessary to download the Ubuntu 12.04 image during installation. Due to our network setup, the easiest way to accomplish this was to connect each node to a network switch that then went through the head node of the existing HPCC, which acted as a router, and then out to the internet. More simply, the nodes can be connected directly to a router. Intel VT-d hardware virtualization was enabled in the BIOS settings of the compute node to allow the OpenStack KVM hypervisor (http://www.linux-kvm.org/page/Main_Page) to run virtual machines.

7.3. Creating and configuring the virtual nodes and connecting them into a virtual HPCC A Red Hat 5 Update 6 image was downloaded from rackerjoe (https://github.com/rackerjoe/oz-image-build) and configured to act as either a head or compute node. The HierVLS software was installed on the image. Essential library files and software components were installed/transferred to the virtual machine nodes. Environmental variables and symbolic links were set. Torque v4.1.3 was installed on the head and compute nodes. Users and passwords were created and configured for password-less SSH, which is required for Torque/PBS to transfer files between nodes. A user was created for running Cassandra on the compute nodes and the home directory from the head node was mounted on all compute nodes when they launch. A directory containing computational software was also mounted from the head node onto the compute nodes. This way, only the head node needs to have a complete copy of Cassandra, HierVLS, and its components. However, environmental variables and libraries were still configured independently on the compute node image. Because HierVLS is controlled by the Cassandra GUI, X11 Forwarding had to be configured on the head node so that users can access the graphical interface.
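The shared-directory arrangement described above amounts to a few lines of NFS configuration on each node. The sketch below generates hypothetical /etc/exports entries (head node) and /etc/fstab entries (compute nodes); the 10.0.0.x addresses and directory paths are placeholders, not the actual values from our deployment.

```python
# Illustrative generation of the NFS configuration for a virtual HPCC:
# the head node exports the home and software directories, and every
# compute node mounts them at the same paths. Addresses and paths are
# hypothetical placeholders.

head_node = "10.0.0.1"            # placeholder head-node address
subnet = "10.0.0.0/24"            # placeholder cluster subnet
shared_dirs = ["/home", "/opt/hiervls"]

# Lines for /etc/exports on the head node.
exports_lines = [f"{d} {subnet}(rw,sync,no_root_squash)" for d in shared_dirs]

# Lines for /etc/fstab on each compute node.
fstab_lines = [f"{head_node}:{d} {d} nfs defaults 0 0" for d in shared_dirs]

print("\n".join(exports_lines))
print("\n".join(fstab_lines))
```

Because only the head node holds the real copy of Cassandra, HierVLS, and its components, updating the software in one place updates it for every compute node that mounts the share.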

7.4. Quality Control Data Set

A test data set was constructed using a relatively small protein, the N-terminal domain of the Human Papillomavirus 16 E6 oncoprotein (HPV16 E6), which has only two binding regions identified by PASS [17] in its experimentally determined structure (PDB code 2LJX) [22]. Four ligands were used in the test set: ethyl lactate, ethylene, ethyl acetate, and acetic acid. This data set was used to generate a quality control standard set of results used to validate HierVLS/PSVLS on each system. Running the quality control data set gave results consistent with those from the physical HPCC, indicating that HierVLS was installed correctly in the virtual test environment.

8. Discussion

HierVLS and Cassandra are not a bundled software package, but rather many software components linked together by scripts and controlled by a graphical user interface. Because of this, it can be difficult to ensure that all the required components are transferred to a new system. A successful installation must include all the necessary programs, libraries, and settings on both head and compute nodes.

Originally, the cloud environment chosen for this project was Amazon EC2, because it is the most popular cloud provider and has extensive documentation. Amazon also offers compute instances with many powerful CPUs, which would be ideal for high performance computing. Some papers, however, have noted that running multiple VMs with fewer CPUs can be just as effective as a single VM with many CPUs [23]. It was later decided that Rackspace would make a better cloud provider because they use open source cloud software, OpenStack, and also provide support and APIs for working with Amazon EC2 virtual machine images. Furthermore, Rackspace has a software bundle, the Rackspace Private Cloud, that allows users to quickly install a Rackspace OpenStack cloud running Ubuntu 12.04 on bare hardware, as long as they have an internet connection.

VirtualBox (www.virtualbox.org) machine images were constructed to test the transfer of the HierVLS system onto a virtual machine. Unfortunately, we could not launch the VirtualBox vdi images directly onto the OpenStack cloud, even though OpenStack has support for that format. The vdi images were then converted to qcow2 images using qemu. OpenStack was able to launch instances of the converted machine images, but connecting to the running instances was not successful, although there may be a way to do this. Instead, we adopted the command-line-only Red Hat 5.6 OpenStack images, which were launched and used without issue. This machine image was chosen because it was the most similar to the native CentOS 5.3 environment of HierVLS/PSVLS (the VirtualBox virtual machine images had used CentOS 5.6). Because the Red Hat images were command-line only, X11 forwarding had to be configured to give users access to the Cassandra GUI from their own computers, remotely connected to the cloud instance. We considered installing the Cassandra GUI directly on the host computer and modifying it to connect to the cloud instance. However, we decided to keep Cassandra on the cloud and connect to it remotely, which allowed it to be used without any modifications. In the future, it would be convenient to create a web portal for users to interact with HierVLS and view results.
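The vdi-to-qcow2 conversion mentioned above uses qemu-img's standard convert subcommand. The sketch below only composes the command line rather than executing it, and the image filenames are placeholders.

```python
import shlex

# Composes (but does not run) the qemu-img command used to convert a
# VirtualBox .vdi image into a qcow2 image that OpenStack can launch.
# -f gives the input format, -O the output format; filenames are
# placeholders.
src, dst = "headnode.vdi", "headnode.qcow2"
cmd = ["qemu-img", "convert", "-f", "vdi", "-O", "qcow2", src, dst]
print(shlex.join(cmd))
```

The resulting qcow2 file can then be uploaded to the cloud's image repository and launched as an instance.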

9. Conclusions

We have demonstrated that HierVLS/PSVLS can run in an OpenStack cloud environment. OpenStack networks virtual machine instances in a manner amenable to the HierVLS/PSVLS software, and the Torque resource manager performed well in the cloud environment. No substantial modifications to the HierVLS/PSVLS source code were needed, and the ported software remains fully backward-compatible with HPCCs. Although there is still much to be done before HierVLS/PSVLS can be released as a SaaS product, we have demonstrated the feasibility of this endeavor. Next steps should be the development of a web interface for user interaction, a more user-friendly analysis and presentation of binding results, and automatic creation and termination of compute nodes. Although no upper limit on the number of compute nodes was determined, we anticipate that the massively parallel nature of HierVLS/PSVLS is well complemented by the massive scale that cloud computing can provide.

10. Acknowledgements

This work was supported in part by a grant from the National Institutes of Health under a subcontract award (DC010105) to SciReal, LLC. Benchmarking calculations were performed using resources from SHARCNET under the auspices of Compute Canada. The authors would like to acknowledge Dr. Sabah Mohammed and Dr. Aicheng Chen for helpful discussions.

11. References

[1] A. Hamosh, et al., "Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders," Nucleic Acids Res, vol. 33, pp. D514-7, Jan 2005.
[2] M. M. Hann and T. I. Oprea, "Pursuing the leadlikeness concept in pharmaceutical research," Curr Opin Chem Biol, vol. 8, pp. 255-63, Jun 2004.
[3] W. B. Floriano, et al., "HierVLS hierarchical docking protocol for virtual ligand screening of large-molecule databases," J Med Chem, vol. 47, pp. 56-71, Jan 2004.
[4] S. Dadgar, et al., "Paclitaxel Is an Inhibitor and Its Boron Dipyrromethene Derivative Is a Fluorescent Recognition Agent for Botulinum Neurotoxin Subtype A," J Med Chem, Mar 2013.
[5] X. Li, et al., "Sweet taste receptor gene variation and aspartame taste in primates and other species," Chem Senses, vol. 36, pp. 453-75, Jun 2011.
[6] Z. H. Ramjan, et al., "A cluster-aware graphical user interface for a virtual ligand screening tool," Conf Proc IEEE Eng Med Biol Soc, vol. 2008, pp. 4102-5, 2008.
[7] M. Armbrust, et al., "Above the Clouds: A Berkeley View of Cloud Computing," EECS Department, University of California, Berkeley, Tech. Rep. UCB/EECS-2009-28, Feb 2009.
[8] M. D. Dikaiakos, et al., "Cloud Computing: Distributed Internet Computing for IT and Scientific Research," IEEE Internet Computing, vol. 13, pp. 10-13, Sep-Oct 2009.
[9] S. Ostermann, et al., "A Performance Analysis of EC2 Cloud Computing Services for Scientific Computing," Cloud Computing, vol. 34, pp. 115-131, 2010.
[10] H. Y. Chen, et al., "An Investigation on Applications of Cloud Computing in Scientific Computing," Information and Management Engineering, Pt V, vol. 235, pp. 201-206, 2011.
[11] J. J. Rehr, et al., "Scientific Computing in the Cloud," Computing in Science & Engineering, vol. 12, pp. 34-43, May-Jun 2010.
[12] J. Cohen, et al., "RAPPORT: running scientific high-performance computing applications on the cloud," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 371, Jan 2013.
[13] K. Chine, "Scientific Computing Environments in the Age of Virtualization: Toward a Universal Platform for the Cloud," Proceedings 2009 IEEE International Workshop on Open-Source Software for Scientific Computation, pp. 44-48, 2009.
[14] K. Jorissen, et al., "A high performance scientific cloud computing environment for materials simulations," Computer Physics Communications, vol. 183, pp. 1911-1919, Sep 2012.
[15] R. J. Sobie, I. Gable, C. Leavett-Brown, M. Paterson, R. Taylor, A. Charbonneau, R. Impey, W. Podiama, "HTC Scientific Computing in a Distributed Cloud Environment," arXiv:1302.1939 [cs.DC], Feb 2013. Available: http://arxiv.org/abs/1302.1939
[16] N. M. O'Boyle, et al., "Open Babel: An open chemical toolbox," J Cheminform, vol. 3, p. 33, 2011.
[17] G. P. Brady, Jr. and P. F. Stouten, "Fast prediction and visualization of protein binding pockets with PASS," J Comput Aided Mol Des, vol. 14, pp. 383-401, May 2000.
[18] T. J. Ewing, et al., "DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases," J Comput Aided Mol Des, vol. 15, pp. 411-28, May 2001.
[19] F. M. Richards, "Areas, volumes, packing and protein structure," Annu Rev Biophys Bioeng, vol. 6, pp. 151-76, 1977.
[20] I. K. McDonald and J. M. Thornton, "Satisfying hydrogen bonding potential in proteins," Journal of Molecular Biology, vol. 238, pp. 777-93, May 1994.
[21] M. L. Connolly, "Solvent-accessible surfaces of proteins and nucleic acids," Science, vol. 221, pp. 709-13, Aug 1983.
[22] K. Zanier, et al., "Solution Structure Analysis of the HPV16 E6 Oncoprotein Reveals a Self-Association Mechanism Required for E6-Mediated Degradation of p53," Structure, vol. 20, pp. 604-617, Apr 2012.
[23] M. Alef and I. Gable, "HEP Specific Benchmarks of Virtual Machines on Multi-core CPU Architectures," 17th International Conference on Computing in High Energy and Nuclear Physics (CHEP09), vol. 219, 2010.
