Molecular Docking on FPGA and GPU Platforms
Imre Pechan and Béla Fehér
Department of Measurement and Information Systems, Budapest University of Technology and Economics, Budapest, Hungary
[email protected]

Abstract—Molecular docking is an important problem of bioinformatics aiming at the prediction of binding poses of molecules. AutoDock is a popular, open-source docking software applying a computationally expensive but parallelizable algorithm. This paper introduces an FPGA-based and a GPU-based implementation of AutoDock and shows how the original algorithm can be effectively accelerated on the two different platforms. According to test runs, both implementations achieve significant speedups over AutoDock running on a single CPU core and on a quad-core system. Comparison of the two implementations shows that many-core graphics processing units can be a real alternative to FPGAs in the field of high performance computing.

Keywords—FPGA vs. GPU; hardware acceleration; molecular docking; AutoDock; bioinformatics
I. INTRODUCTION
Accelerating computationally intensive algorithms with custom hardware is an important application area of FPGAs. In recent years, however, several new technologies have emerged and spread that are changing the role of FPGAs in high performance computing (HPC). Multi-core CPUs have become commonplace, cloud computing makes CPU clusters easily accessible, and graphics processing units can be programmed conveniently in standard languages such as C. FPGAs and GPUs differ in hardware architecture and programming methodology, and economic aspects such as hardware cost and power consumption must also be taken into account when choosing a suitable accelerator platform. Making a good decision is usually not easy, which makes the comparison of FPGAs and GPUs interesting. The goal of this paper is to describe and compare the implementations of a bioinformatics algorithm, molecular docking, on an FPGA-based platform and on a GPU.

The aim of molecular docking is to predict the possible binding position and binding energy of two given molecules whose initial 3D structure is known. The two molecules are usually a large protein and a smaller ligand molecule (typically, the ligand consists of tens of atoms and the protein of thousands). Molecular docking is used by the pharmaceutical industry for finding competitive inhibitor drugs. Inhibitors are ligands that can bind to and block the activity of a given target protein, called the receptor. Docking is extremely time-consuming, and in practice it is often performed for hundreds of molecules when looking for a
suitable inhibitor, which creates a strong demand for hardware acceleration. There are dozens of different docking software tools, and some FPGA- or GPU-based implementations have been reported as well [1, 2]. This paper introduces accelerated versions of the free and popular docking software AutoDock. To date, no other FPGA-based implementation of AutoDock has been published. GPU-based implementations are known, but they focus only on certain parts of the algorithm and do not exploit the full parallelism available [3].
II. AUTODOCK
AutoDock [4, 5] is a widespread, open-source docking software released under the GNU General Public License. It was developed by the Scripps Research Institute; its newest version is currently AutoDock 4.2. AutoDock should not be confused with AutoDock Vina, developed at the same institute, which is an entirely different software rather than a successor of AutoDock. A brief survey of the docking algorithm of AutoDock is presented below; a detailed description can be found in references [4, 5, 6].

A. The Algorithm

The docking algorithm applied by AutoDock consists of a scoring function and an optimization method. The scoring function models chemical interactions and determines the free energy for a given geometrical arrangement of the molecules. The optimization method tries to find the global minimum of the scoring function, which corresponds to the energetically most favorable position of the molecules; this is the predicted binding pose. The degrees of freedom of the problem are the variables describing the position and orientation of the molecules. When calculating the scoring function for a given arrangement, two terms have to be determined: the intermolecular energy of the molecules and the internal energy of the ligand. In the former case the scoring function is calculated indirectly, based on pre-calculated constant energy grids describing the protein molecule; a trilinear interpolation formula has to be evaluated for every ligand atom. In the case of the ligand's internal energy, the scoring function has to be calculated directly for every ligand atom pair whose distance can change during docking.
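For illustration, the sketch below shows how such an intermolecular term can be read out of a precomputed grid by trilinear interpolation, and how a pairwise internal energy term might be accumulated. It is a minimal C-style sketch; the grid layout, the generic 12-6 pair potential and all function and variable names are assumptions made for the example and do not reproduce AutoDock's actual scoring function.

typedef struct { float x, y, z; } Vec3;

/* Assumed dense row-major grid layout: value at grid point (ix, iy, iz). */
static float grid_at(const float *grid, int nx, int ny, int nz,
                     int ix, int iy, int iz)
{
    (void)nz;                                   /* extent kept for clarity */
    return grid[(iz * ny + iy) * nx + ix];
}

/* Trilinear interpolation of one atom's intermolecular energy from the grid.
   'h' is the grid spacing; the position is assumed to lie inside the grid. */
float intermol_energy_of_atom(const float *grid, int nx, int ny, int nz,
                              float h, Vec3 pos)
{
    float gx = pos.x / h, gy = pos.y / h, gz = pos.z / h;
    int ix = (int)gx, iy = (int)gy, iz = (int)gz;     /* lower grid corner  */
    float fx = gx - ix, fy = gy - iy, fz = gz - iz;   /* fractional offsets */

    float e = 0.0f;
    for (int dz = 0; dz <= 1; ++dz)
        for (int dy = 0; dy <= 1; ++dy)
            for (int dx = 0; dx <= 1; ++dx) {
                float w = (dx ? fx : 1.0f - fx) *
                          (dy ? fy : 1.0f - fy) *
                          (dz ? fz : 1.0f - fz);      /* corner weight      */
                e += w * grid_at(grid, nx, ny, nz, ix + dx, iy + dy, iz + dz);
            }
    return e;
}

/* Placeholder pairwise internal energy term: a generic 12-6 potential with
   coefficients A and B stands in for the real, more detailed model. */
float internal_energy_of_pair(Vec3 a, Vec3 b, float A, float B)
{
    float dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
    float r2 = dx * dx + dy * dy + dz * dz;
    float r6 = r2 * r2 * r2;
    return A / (r6 * r6) - B / r6;
}

The total score of one arrangement is then the sum of the grid term over all ligand atoms (using one grid per atom type) plus the pair term over the flexible atom pairs.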
The applied optimization method is a genetic algorithm, which iteratively creates sets of potential solutions (populations of entities) with operators such as selection, crossover and mutation. After evaluating the current generation, a few randomly selected entities are subjected to a local search process similar to hill climbing.

B. Parallelization

The pseudo-code of the docking algorithm is shown in Fig. 1 (HLP, MLP and LLP stand for high, medium and low level parallelization, respectively). The simplest, high level parallelization of AutoDock is based on the fact that usually several (10-100) distinct docking runs are performed for a given receptor-ligand pair. This is required not only for obtaining reliable results, but also because more than one valid docked pose often exists. Since these docking runs are completely independent of each other, they can be performed simultaneously, which can easily be exploited on a multi-core platform.

The structure of the optimization algorithm suggests a medium level parallelization method. In the case of the genetic algorithm, the entities of the next generation can be generated and evaluated in parallel. Similarly, the local search processes can be executed on the selected entities simultaneously.

At the lowest level, the algorithm consists of four basic steps. First, new gene values have to be generated, either according to the rules of the genetic algorithm or to those of the local search method. Then the coordinates of the atoms have to be calculated; rotations around the rotatable bonds must be performed, and the whole ligand must be moved and rotated. Finally, the energy terms must be determined based on the calculated coordinates. All these steps allow low level, fine-grained parallelization: the different genes of a new entity can be generated simultaneously; rotations of different atoms can be performed in parallel; and the intermolecular energy contributions of different ligand atoms as well as the internal energy contributions of different atom pairs can be calculated at the same time.

As can be seen, there are several ways of parallelizing the algorithm, which makes both the FPGA and the GPU a promising platform for accelerating AutoDock. The algorithm mainly consists of floating point arithmetic operations, which are the strength of GPUs. On the other hand, the applied scoring function corresponds to a rather inaccurate chemical model; as a consequence, arithmetic precision is not of primary importance, and fixed point arithmetic, which fits the capabilities of FPGAs better, does not decrease the accuracy of the algorithm. Another important property of the docking algorithm is that the amount of its input and output data is quite low (a few tens of Mbytes at most), while the algorithm itself is complex and time-consuming. Therefore the speed of data movement between the main memory and the memory of an accelerator platform does not influence the performance, since its time is negligible compared to the execution of the algorithm.
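As an aside to the precision argument above, the snippet below illustrates the kind of fixed-point representation that could replace floating point in the energy calculation; the Q16.16 format and the helper names are arbitrary choices made for the sketch, not the word lengths used in the actual FPGA design.

#include <stdint.h>

/* Illustrative Q16.16 fixed-point helpers (format chosen arbitrarily for the
   sketch). On an FPGA, such operations map to plain integer adders and
   multipliers, which are far cheaper than floating-point units. */
typedef int32_t fix_t;                    /* 16 integer bits, 16 fractional  */
#define FIX_FRAC_BITS 16

static inline fix_t fix_from_float(float f)
{
    return (fix_t)(f * (float)(1 << FIX_FRAC_BITS));
}

static inline float fix_to_float(fix_t a)
{
    return (float)a / (float)(1 << FIX_FRAC_BITS);
}

static inline fix_t fix_add(fix_t a, fix_t b) { return a + b; }

static inline fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * (int64_t)b) >> FIX_FRAC_BITS);
}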
III. FPGA-BASED IMPLEMENTATION
AutoDock was implemented on the SGI RASC RC100 blade, which includes two Xilinx Virtex-4 LX200 FPGAs and five 8 Mbyte external SRAM memory modules per FPGA.
for every docking run                         //HLP
    while stop condition is false
        //perform genetic algorithm
        for every new entity                  //MLP
            for every degree of freedom
                generate_DOF();               //LLP
            for every required rotation
                rotate_atom();                //LLP
            for every ligand atom
                calculate_intermol_E();       //LLP
            for certain atom pairs
                calculate_internal_E();       //LLP
        //perform local search
        for every selected entity             //MLP
            while LS stop condition is false
                ... same steps as above ...   //LLP

Figure 1. Pseudo-code of the docking algorithm
Only one of the FPGAs was used for the implementation; the code was written in Verilog. The FPGA-based implementation was described in detail in an earlier paper [6], thus only a brief survey is presented here.

Fig. 2 shows the architecture of the FPGA-based implementation. It consists of four main modules executing the four main steps of the algorithm. The structure can be considered a three-stage pipeline: while Module 1 generates the genes of entity i, Module 2 calculates the atomic coordinates of entity i-1, and Modules 3 and 4 evaluate the scoring function for entity i-2. Thus three different entities are always processed at the same time, which means that the possible medium level parallelization is applied only partially. Low level parallelization, on the other hand, is exploited fully by the internal pipelines of the four main modules. When the internal pipeline of a module is full, it executes the corresponding operation in every clock cycle. That is, in each clock cycle Module 1 generates a new gene value, Module 2 performs the rotation of an atom, Module 3 calculates the intermolecular energy contribution of a ligand atom, and Module 4 calculates the internal energy contribution of an atom pair. The FPGA-based implementation executes only one docking run at a time, so high level parallelization is not exploited. The system includes a host CPU, which calculates the initialization data and loads it into the SRAM memories of the FPGA before docking, and processes the results afterwards.

The implemented algorithm is slightly different from the one used in AutoDock. The original genetic algorithm applies proportional selection, which requires calculating the relative fitness of every entity compared to the average fitness of the whole population. Since this functionality is quite hard to implement in an FPGA, a much simpler binary tournament selection method was used instead. AutoDock represents the orientation of the ligand with a quaternion, which has to be normalized after some of its elements are changed randomly during the genetic algorithm or the local search process. To avoid this, the orientation is represented by three angles: two describing the direction of the rotation axis and one giving the angle of rotation around it. Another simplification is that uniformly distributed random variables are used in the FPGA instead of the Cauchy distribution applied in AutoDock.
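For illustration, the sketch below applies such an axis-angle orientation to an atom position with Rodrigues' rotation formula; the spherical-angle convention for the axis and the function names are assumptions made for this example, not a description of the Verilog modules.

#include <math.h>

typedef struct { float x, y, z; } Vec3;

/* Rotate point p around an axis through the origin. The axis direction is
   given by two angles (theta, phi; spherical convention assumed here) and the
   rotation angle by 'ang' -- three values in total, so no quaternion
   normalization is needed. */
Vec3 rotate_axis_angle(Vec3 p, float theta, float phi, float ang)
{
    Vec3 u = { sinf(theta) * cosf(phi),           /* unit rotation axis      */
               sinf(theta) * sinf(phi),
               cosf(theta) };

    float c = cosf(ang), s = sinf(ang);
    float d = u.x * p.x + u.y * p.y + u.z * p.z;  /* u . p                   */

    Vec3 uxp = { u.y * p.z - u.z * p.y,           /* u x p                   */
                 u.z * p.x - u.x * p.z,
                 u.x * p.y - u.y * p.x };

    /* Rodrigues' formula: p' = p*cos(a) + (u x p)*sin(a) + u*(u.p)*(1-cos(a)) */
    Vec3 r = { p.x * c + uxp.x * s + u.x * d * (1.0f - c),
               p.y * c + uxp.y * s + u.y * d * (1.0f - c),
               p.z * c + uxp.z * s + u.z * d * (1.0f - c) };
    return r;
}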
Figure 2. FPGA-based implementation
All these modifications were required to save resources in the FPGA and to achieve a higher speedup, but according to test runs they do not decrease the effectiveness of the algorithm significantly [6]. It should be noted that applying these modifications in the original AutoDock code would leave its speed virtually unchanged, since they do not affect the time-consuming part of the algorithm, the energy evaluation.
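A minimal sketch of the binary tournament selection mentioned above is shown below; the energy array layout and the use of rand() are assumptions made for the example.

#include <stdlib.h>

/* Binary tournament selection: pick two entities at random and return the
   fitter one (lower energy). Unlike proportional selection, no population-wide
   average fitness is needed, which makes the operator cheap to implement. */
int tournament_select(const float *energy, int pop_size)
{
    int a = rand() % pop_size;
    int b = rand() % pop_size;
    return (energy[a] < energy[b]) ? a : b;
}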
IV. GPU-BASED IMPLEMENTATION
Graphics processing units are parallel, multithreaded, many-core processors optimized for data-parallel applications, which perform the same operations on many independent data elements. The GPU-based implementation was written in the CUDA C language, which allows it to run on any CUDA-capable NVIDIA GPU. CUDA C adds minimal extensions to the standard C language and provides an API, which together enable the user to write a C program consisting of serial code and parallel functions called kernels. The former runs on the host CPU; the latter are executed in parallel by K different CUDA threads on the GPU. Threads of a kernel are grouped into thread blocks. Threads within the same block can communicate and synchronize with each other; this is not possible between different thread blocks, which are scheduled and executed in a non-deterministic order based on run-time decisions. An NVIDIA GPU consists of multiprocessors, which manage, schedule and execute the thread blocks of the launched kernel in groups of 32 threads called warps. A full description of the CUDA architecture can be found in the CUDA C manual [7].

A. The Implementation

The basic idea behind the GPU-based implementation is to exploit the high and medium level parallelization possibilities with different thread blocks and the low level parallelization with threads of the same block. The code includes two important kernels. Kernel A realizes the genetic algorithm: it generates and evaluates a single new entity of the current population. Kernel B performs the iterative local search process for a given entity. Both kernels execute the same code for scoring function evaluation; they differ only in how the degrees of freedom are generated. The other difference is that Kernel A performs only one fitness evaluation, while Kernel B performs many due to the iterative process.
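A heavily simplified structural sketch of the two kernels is given below. One thread block processes one entity and the threads of the block cooperate on the evaluation; all names, the block size, the placeholder arithmetic standing in for the real genetic operators and scoring code, and the fixed iteration count are assumptions made for the sketch, not the actual implementation.

#define BLOCK_SIZE 64          /* assumed block size (power of two); the
                                  kernels are assumed to be launched with
                                  exactly BLOCK_SIZE threads per block       */

/* Placeholder for the shared scoring code: each thread contributes one term,
   then the block reduces the partial sums (the real energy model is omitted). */
__device__ float evaluate_entity(const float *genes, int num_atoms)
{
    __shared__ float partial[BLOCK_SIZE];
    int tid = threadIdx.x;

    partial[tid] = (tid < num_atoms) ? genes[0] * (float)tid : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   /* block-wide reduction */
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    return partial[0];
}

/* Kernel A: generate one new entity per block and evaluate it once. */
__global__ void kernel_A_sketch(float *genes, float *energy,
                                int num_genes, int num_atoms)
{
    float *g = genes + blockIdx.x * num_genes;
    if (threadIdx.x < num_genes)
        g[threadIdx.x] += 0.01f;                     /* placeholder GA step   */
    __syncthreads();

    float e = evaluate_entity(g, num_atoms);
    if (threadIdx.x == 0) energy[blockIdx.x] = e;
}

/* Kernel B: iterative local search, i.e. many evaluations of one entity. */
__global__ void kernel_B_sketch(float *genes, float *energy,
                                int num_genes, int num_atoms, int max_iters)
{
    float *g = genes + blockIdx.x * num_genes;
    float best = evaluate_entity(g, num_atoms);

    for (int it = 0; it < max_iters; ++it) {         /* assumed stop rule     */
        if (threadIdx.x < num_genes)
            g[threadIdx.x] += 0.001f;                /* placeholder LS step   */
        __syncthreads();

        float e = evaluate_entity(g, num_atoms);
        best = (e < best) ? e : best;
    }
    if (threadIdx.x == 0) energy[blockIdx.x] = best;
}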
The host code generates the initialization data, copies it to the GPU memory, and then starts the first generation cycle. In each cycle, Kernel A is first launched in parallel for each new entity that has to be generated in every independent docking run; then Kernel B is launched for each entity of every docking run that was selected for local search. In this way the GPU takes full advantage of both the high and the medium level parallelization possibilities. Supposing that the default values are used (population size = 150, local search probability = 6%), the number of thread blocks is 150*N for Kernel A and about 9*N for Kernel B, where N denotes the number of docking runs. A low value of N results in a low number of Kernel B blocks; in this case Kernel B will probably not be able to keep the multiprocessors of the GPU occupied. This means that the local search process can be parallelized much less effectively than the genetic algorithm. Unfortunately, Kernel B accounts for a larger share of the total run time than Kernel A. Given the capabilities and parameters of current high-end NVIDIA GPUs, these facts suggest that the speedup achieved by the implementation will strongly depend on the number of requested docking runs.

The low level parallelization of the algorithm is exploited at thread level. Both Kernel A and Kernel B execute the four basic steps described earlier; they assign new values to the degrees of freedom, perform the rotation of atoms and then calculate the intermolecular and internal energy terms. In the first step, each new gene value is generated by a different thread. In the second step, independent rotations are executed by different threads simultaneously. Similarly, the intermolecular energy contributions of different ligand atoms as well as the internal energy contributions of different atom pairs are calculated by different threads of the block in parallel. However, if the number of genes, the number of independent rotations, the number of ligand atoms or the number of internal energy contributors is not an integer multiple of 32 (the warp size), some threads remain idle. The intermolecular and internal energy calculations are independent of each other but require different operations, thus they can only be executed sequentially on the GPU. This means that the low level parallelization possibilities could be exploited less effectively on the GPU than on the FPGA.
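A sketch of the corresponding host-side generation loop is shown below, assuming the default parameters and the kernel sketches above; the kernel arguments, the way local-search entities are chosen and the iteration limit are simplified assumptions, and error handling and data setup are omitted.

#include <cuda_runtime.h>

__global__ void kernel_A_sketch(float *genes, float *energy,
                                int num_genes, int num_atoms);
__global__ void kernel_B_sketch(float *genes, float *energy,
                                int num_genes, int num_atoms, int max_iters);

/* Host-side generation loop (sketch). d_genes and d_energy are assumed to be
   device buffers already filled with the initialization data. */
void run_docking(float *d_genes, float *d_energy, int N /* docking runs */,
                 int num_genes, int num_atoms, int generations)
{
    const int POP     = 150;        /* default population size                */
    const int threads = 64;         /* assumed block size                     */

    for (int gen = 0; gen < generations; ++gen) {
        /* One block per new entity of every independent run: 150*N blocks.   */
        kernel_A_sketch<<<POP * N, threads>>>(d_genes, d_energy,
                                              num_genes, num_atoms);

        /* About 6% of the entities undergo local search, i.e. roughly 9*N
           blocks; with a small N this may not keep all multiprocessors busy. */
        int ls_blocks = (POP * N * 6) / 100;
        kernel_B_sketch<<<ls_blocks, threads>>>(d_genes, d_energy,
                                                num_genes, num_atoms,
                                                300 /* assumed iteration cap */);

        cudaDeviceSynchronize();    /* finish the cycle before the next one   */
    }
}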
V. RESULTS
Test runs consisting of 2,500,000 energy evaluations were performed for about 60 receptor-ligand molecule pairs on the SGI RASC platform and on two different GPUs, the GeForce GT220 with 6 multiprocessors and the GeForce GTX260 with 24 multiprocessors. The same dockings were executed with the original AutoDock code running on one core of a quad-core CPU system consisting of two dual-core 3.2 GHz Intel Xeon CPUs; these run times were used as the reference for comparison. In addition, the dockings were also performed using all four CPU cores of the system. AutoDock does not inherently support multi-core CPUs. A trivial way of utilizing four cores is to distribute the requested number of docking runs evenly among them and start four independent AutoDock processes in parallel. Since the time taken by different runs is virtually always the same (it depends only on the types of the molecules and the number of energy evaluations), this method fully exploits the multi-core capabilities without the need for modifying the original AutoDock code.
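As an illustration, the sketch below launches four independent AutoDock processes in this way; the parameter and log file names are hypothetical, each parameter file is assumed to request a quarter of the docking runs, and the autodock4 invocation shown is the usual '-p parameter-file -l log-file' form.

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run one independent AutoDock process per CPU core. Each core gets its own
   parameter file (assumed to request an equal share of the docking runs) and
   its own log file; the file names used here are hypothetical. */
int main(void)
{
    const int cores = 4;

    for (int c = 0; c < cores; ++c) {
        if (fork() == 0) {                       /* child: one AutoDock job   */
            char dpf[64], dlg[64];
            snprintf(dpf, sizeof dpf, "job_core%d.dpf", c);
            snprintf(dlg, sizeof dlg, "job_core%d.dlg", c);
            execlp("autodock4", "autodock4", "-p", dpf, "-l", dlg, (char *)NULL);
            perror("execlp");                    /* reached only on failure   */
            _exit(1);
        }
    }
    for (int c = 0; c < cores; ++c)
        wait(NULL);                              /* wait for all four jobs    */
    return 0;
}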
TABLE I. Total run times (in seconds) of dockings consisting of 10 and 100 runs on the different platforms for the 2cpp and 1hvr receptor-ligand pairs
Figure 3. FPGA and GPU speedups
The average speedup achieved when using all four CPU cores of the system was about ×2.36. Fig. 3 shows the speedups achieved by the FPGA platform and the two GPUs over the single-core CPU version as a function of the number of requested docking runs (N) for two specific, quite different molecule pairs (Protein Data Bank codes 1hvr and 2cpp). Since the FPGA executes the different runs sequentially, its speedup is a constant ×35 for 1hvr and ×12 for 2cpp. The speedups of the GPUs, on the other hand, depend strongly on the number of docking runs, as expected. If N=1, both GPUs are only 3-5 times faster than the single CPU core. As N increases, the speedups grow and eventually saturate; this happens when the number of parallel blocks is high enough to keep all resources of the GPU busy. The maximal speedup is ×14 and ×17 for the GT220, and ×57 and ×75 for the GTX260, for 1hvr and 2cpp, respectively. Table I shows the total run times of the dockings consisting of 10 and 100 runs on the different platforms (values are given in seconds). The average speedup for the whole set of 60 molecules is ×23 for the FPGA-based implementation regardless of N, ×12 and ×15 for the GT220, and ×30 and ×65 for the GTX260, for N=10 and N=100, respectively. If the number of docking runs is low (N