A Parallel Implementation of Molecular Packing using Xeon Phi Support

Leandro Zanotto¹, Leandro Martínez² and Guido Araújo³

¹Center for Computational Engineering & Sciences
²Institute of Chemistry
³Institute of Computing

University of Campinas, Campinas – SP – Brazil

[email protected], [email protected], [email protected]

Abstract. The Molecular Packing (MP) problem consists in determining a molecular arrangement that satisfies a given set of spatial constraints related to the geometry and the distances between atoms of different molecules. The result of a solution to the MP problem is a box filled with molecules at valid coordinates, satisfying the energy distribution constraints and ready for Molecular Dynamics simulation. Solving MP problems for large and complex molecular mixtures typically requires long computational times. This paper presents a hybrid parallelization strategy that combines OpenMP and MPI to parallelize Packmol, a well-known MP solver. Experimental results on an Intel Xeon multicore platform, with Xeon Phi co-processor support, reveal speedups of the order of 13x to 100x with respect to the sequential code on each processor.

Categories and Subject Descriptors: C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]: Array and vector processors; Single-instruction-stream, multiple-data-stream processors (SIMD); D.1.3 [Programming Techniques]: Parallel Programming

1. Introduction

Molecular Packing (MP) is a relevant problem that seeks to determine a valid spatial arrangement of molecules to establish an initial condition for Molecular Dynamics (MD) simulation [L. Martinez 2003]. When molecules are packed, strong repulsive van der Waals forces can increase abruptly at short atom-to-atom distances, frequently resulting in non-differentiability of the potential energy due to atom overlap. Hence the distances between molecules must be large enough that the repulsive potentials do not disrupt the simulations. For a simple system, such as a water box, one can find an adequate solution to MP by ordering the molecules in a regular lattice. Nevertheless, as the complexity of the system increases, such an approach becomes very tedious: building ordered molecular systems by hand turns the very first steps of MD into a quite cumbersome task. Moreover, for such complex systems, regular configurations would almost certainly contain overlapping atoms. To avoid that, a more automated solution to MP needs to be provided [L. Martinez 2003].

Figure 1. Examples of MP simulation boxes built with Packmol: (a) a mixture of water and urea; (b) a carbon nanotube containing water inside and carbon tetrachloride outside; (c) water and carbon tetrachloride with a hormone at the interface. Reproduced from [L. Martinez 2009].

MP simulation is a computationally intensive task. At a small problem scale, such as a mixture of 400 water and 400 urea molecules, an MP solver execution is fairly fast. Nevertheless, as the scale of the problem increases, finding a solution can take hours, thus restricting the size of the systems that can be simulated. Packmol is a well-known (>5,000 downloads) MP solver that uses an optimization function to minimize the overall system energy [L. Martinez 2009]. It is based on the idea that random sets of appropriately packed molecules, with no intermolecular clashes, can be rapidly equilibrated to the thermodynamic energy using standard MD integration algorithms and energy minimization. In previous work, Packmol's authors [L. Martinez 2009] have shown that speedups of the order of 5x can be achieved when using OpenMP on a Xeon quad-core architecture. This paper takes that research further by proposing and evaluating different offloading and parallelization strategies for Packmol. Experimental results reveal that a strategy combining OpenMP and MPI on an Intel Xeon processor, with the support of an Intel Xeon Phi co-processor, can result in speedups of the order of 100x.

This paper is organized as follows. Section 2 discusses the main structure of the Packmol code and identifies its major computation hot spots. Section 3 describes the experimental setup and the parallelization strategies. Section 4 presents the experimental results, and Section 5 concludes the work.

2. Packmol Profiling

Packmol was first profiled using Intel VTune to find the most time-consuming parts of the code. The hottest functions in Packmol were identified to be the Objective and Gradient functions, which together account for 80% of the total program execution time, with the Objective function (OF) being the more relevant of the two.
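For reference, a hotspot profile of this kind can be collected from the VTune command line. The sketch below assumes the amplxe-cl driver that shipped with VTune Amplifier XE; the exact binary name and flags vary across VTune versions, and input.inp is a hypothetical Packmol input file.

    # Collect and report a hotspot profile of a sequential Packmol run
    amplxe-cl -collect hotspots -result-dir vtune_out -- ./packmol < input.inp
    amplxe-cl -report hotspots -result-dir vtune_out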

Figure 2. The Objective function seeks to determine the minimum distance between pairs of atoms.

This is achieved by checking the distances between faces, axes and vertices of a 3D bounding box, which corresponds to 14 different checks between neighbor atoms at each iteration. Most of the computation within the OF is performed by the function fparcs, which stores the distance contribution between atoms in the variable local_f. While computing the distances between atoms the code does not update the data matrices; hence the loop has no loop-carried dependencies, making the outer loop of the OF an embarrassingly parallel DOALL loop [Pacheco 2011]. The local_f values are accumulated at each iteration and, at the end of the outer loop, the accumulated result is returned in the variable f.
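The structure of this computation can be sketched as follows. This is a minimal C sketch of the DOALL pattern described above, not Packmol's actual Fortran code: objective, pair_penalty and the quadratic-overlap formula are hypothetical stand-ins.

    #include <stddef.h>

    /* Hypothetical per-pair penalty: nonzero only when two atoms are
       closer than the minimum allowed distance dmin (an illustrative
       stand-in for fparcs, not Packmol's actual formula). */
    static double pair_penalty(const double *a, const double *b, double dmin) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        double diff = dmin * dmin - (dx * dx + dy * dy + dz * dz);
        return diff > 0.0 ? diff * diff : 0.0;
    }

    /* Sketch of the OF outer loop: every iteration only reads the
       coordinate array, so the i-loop has no loop-carried dependency
       and is an embarrassingly parallel DOALL loop. */
    double objective(const double *coords, size_t natoms, double dmin) {
        double f = 0.0;
        for (size_t i = 0; i < natoms; i++) {
            double local_f = 0.0;            /* per-iteration partial sum */
            for (size_t j = i + 1; j < natoms; j++)
                local_f += pair_penalty(&coords[3 * i], &coords[3 * j], dmin);
            f += local_f;                    /* accumulate into f */
        }
        return f;
    }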

3. Parallelization Strategies

In principle, a careful analysis of the code in Figure 3 suggests that the loop can be parallelized using OpenMP and MPI [J. Reinders 2012]. In order to evaluate the best approach to parallelizing the OF, the following four parallelization scenarios have been evaluated. The scenarios (Si, i = 1-4) start with OpenMP parallelization on the host multicore and move toward offloading the computation to a Xeon Phi co-processor with the support of MPI communication primitives (a sketch of the offload pattern used in S2 is shown after the list).

1. S1: OpenMP is used to parallelize the code on the Xeon host only.
2. S2: OpenMP is used to parallelize the code on the Xeon Phi co-processor only.
3. S3: Both the MPI communication code and the OpenMP computation are executed on the Xeon Phi co-processor.
4. S4: MPI is used on the host to broadcast data to the Xeon Phi and to sum up the parallel values of the local_f variables into the reduced global_f variable.
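For S2, an OpenMP region can be shipped to the co-processor with the Intel compiler's offload pragmas. The following is a minimal sketch assuming Intel's Language Extensions for Offload (the target(mic) clauses); objective_on_phi is a hypothetical name and pair_penalty is the illustrative kernel sketched earlier.

    #include <stddef.h>

    /* Compile the kernel for both host and co-processor (Intel LEO). */
    __attribute__((target(mic)))
    static double pair_penalty(const double *a, const double *b, double dmin) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        double diff = dmin * dmin - (dx * dx + dy * dy + dz * dz);
        return diff > 0.0 ? diff * diff : 0.0;
    }

    /* Scenario S2 sketch: the whole loop runs on the Xeon Phi; coords
       is copied in over PCIe and the reduced scalar f is copied back. */
    double objective_on_phi(const double *coords, size_t natoms, double dmin) {
        double f = 0.0;
        #pragma offload target(mic) in(coords : length(3 * natoms)) out(f)
        #pragma omp parallel for reduction(+ : f) schedule(dynamic)
        for (size_t i = 0; i < natoms; i++)
            for (size_t j = i + 1; j < natoms; j++)
                f += pair_penalty(&coords[3 * i], &coords[3 * j], dmin);
        return f;
    }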

Figure 3. (a) The Objective Function with OpenMP; (b) MPI calls before calling the Objective Function

The OpenMP and MPI code fragments that implement the OF parallelization are shown in Figures 3(a) and 3(b), respectively. In Figure 3(a) the loop induction variables are declared private, and the variable f stores the result of the reduction across the iteration threads. In Figure 3(b), MPI_BCAST is used to send the data required by the loop, which is then computed on the Xeon Phi co-processor. At the end of the computation, MPI_REDUCE is used to collect the partial sums into the final result global_f. The codes of Figures 3(a) and 3(b) are combined in different ways to implement the parallelization scenarios discussed above.
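The broadcast/compute/reduce pattern behind S3 and S4 can be sketched in C as follows (the paper's actual code is Fortran and appears only in the figures; objective_mpi and the round-robin distribution of outer-loop iterations are assumptions of this sketch):

    #include <mpi.h>

    /* Hypothetical penalty kernel, as sketched earlier. */
    static double pair_penalty(const double *a, const double *b, double dmin) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        double diff = dmin * dmin - (dx * dx + dy * dy + dz * dz);
        return diff > 0.0 ? diff * diff : 0.0;
    }

    /* Rank 0 broadcasts the coordinates, every rank computes its slice
       of the outer loop with an OpenMP reduction, and MPI_Reduce sums
       the partial local_f values into global_f on rank 0. */
    double objective_mpi(double *coords, long natoms, double dmin) {
        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Figure 3(b)'s MPI_BCAST: ship the input data to all ranks. */
        MPI_Bcast(coords, (int)(3 * natoms), MPI_DOUBLE, 0, MPI_COMM_WORLD);

        double local_f = 0.0;
        /* Each rank takes outer-loop iterations i = rank, rank+nprocs, ... */
        #pragma omp parallel for reduction(+ : local_f) schedule(dynamic)
        for (long i = rank; i < natoms; i += nprocs)
            for (long j = i + 1; j < natoms; j++)
                local_f += pair_penalty(&coords[3 * i], &coords[3 * j], dmin);

        double global_f = 0.0;   /* Figure 3(b)'s MPI_REDUCE target */
        MPI_Reduce(&local_f, &global_f, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        return global_f;         /* meaningful on rank 0 */
    }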

4. Experimental Results

The platform used to evaluate the parallelization scenarios above is composed of a 10-core Intel Xeon processor with 64GB of memory, attached to a Xeon Phi card with 60 cores and 8GB of memory. Packmol starts with the number of molecules nmols = 10,000,000, a minimum distance dmin = 5.0, and a number of atoms n = 3 * nmols.

Figure 4. (a) Speedup for S1 using OpenMP; (b) Speedup for S2 using OpenMP; (c) Speedup for S3 and S4 when using OpenMP and MPI

Figures 4(a), 4(b) and 4(c) show the resulting speedups for scenarios S1, S2 and S3/S4, respectively. In scenario S1, where the code is parallelized on the Xeon host only, OpenMP yields a speedup of approximately 13x with respect to the sequential code on the Xeon host. The second scenario (S2) uses OpenMP on the Xeon Phi only and results in a speedup of approximately 72x over the Xeon Phi serial code. Figure 4(c) considers the case when MPI is combined with OpenMP. When the MPI processes and OpenMP threads are executed entirely inside the co-processor (S3), the speedups approach 100x. When MPI is instead used from the host to offload the computation to the Xeon Phi co-processor (S4), a speedup of 50x is achieved.
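For concreteness, a system at this scale would be described to Packmol through an input file along the following lines. This is a minimal sketch using Packmol's input keywords; the file names, box dimensions and single-structure layout are hypothetical, since the paper does not list its input.

    # Hypothetical Packmol input for a run at this scale.
    # 'tolerance' is the minimum inter-atomic distance (dmin = 5.0).
    tolerance 5.0
    filetype pdb
    output packed_system.pdb

    structure water.pdb
      number 10000000
      inside box 0. 0. 0. 1000. 1000. 1000.
    end structure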

5. Conclusion

The experiments show that OpenMP can produce very good speedups when more cores are available (72x). Given that OpenMP provides an easy way to parallelize code, this makes it an attractive solution to speed up MP solvers. On the other hand, the combination of OpenMP and MPI, running with Xeon Phi support, can increase the application speedup even further (100x), even without the use of vectorization and vector alignment. Nevertheless, even with the large number of threads available on the Xeon Phi, a considerable share of the potential speedup is consumed by the latency required to transfer data through the PCI bus to the co-processor. This suggests that additional research needs to be done to find new parallelization scenarios capable of hiding this communication overhead.

This work was supported by FAPESP, grants 2010/16947-9, 2013/05475-7 and 2013/08293-7.

References

J. Reinders, M. D. McCool (2012). Structured Parallel Programming: Patterns for Efficient Computation. Morgan Kaufmann, 1st edition.

L. Martinez, R. Andrade, E. G. Birgin, J. M. Martinez (2009). Packmol: A package for building initial configurations for molecular dynamics simulations. Journal of Computational Chemistry, 30(13):2157-2164.

L. Martinez, J. M. Martinez (2003). Packing optimization for the automated generation of complex system's initial configurations for molecular dynamics and docking. Journal of Computational Chemistry, 24(7):819-825.

Pacheco, P. S. (2011). An Introduction to Parallel Programming. Morgan Kaufmann, 1st edition.