2012 SC Companion: High Performance Computing, Networking Storage and Analysis
Accelerating Hydrocodes with OpenACC, OpenCL and CUDA

J. A. Herdman and W. P. Gaudin
High Performance Computing, UK Atomic Weapons Establishment, Aldermaston, UK
email: {Andy.Herdman, Wayne.Gaudin}@awe.co.uk

S. McIntosh-Smith and M. Boulton
Microelectronics Group, Department of Computer Science, University of Bristol, UK
email: [email protected]

D. A. Beckingsale, A. C. Mallinson and S. A. Jarvis
Performance Computing and Visualisation, Department of Computer Science, University of Warwick, UK
email: {dab,acm}@dcs.warwick.ac.uk

Abstract—Hardware accelerators such as GPGPUs are becoming increasingly common in HPC platforms and their use is widely recognised as being one of the most promising approaches for reaching exascale levels of performance. Large HPC centres, such as AWE, have made huge investments in maintaining their existing scientific software codebases, the vast majority of which were not designed to effectively utilise accelerator devices. Consequently, HPC centres will have to decide how to develop their existing applications to take best advantage of future HPC system architectures. Given limited development and financial resources, it is unlikely that all potential approaches will be evaluated for each application. We are interested in how this decision making can be improved, and this work seeks to directly evaluate three candidate technologies—OpenACC, OpenCL and CUDA—in terms of performance, programmer productivity, and portability using a recently developed Lagrangian-Eulerian explicit hydrodynamics mini-application. We find that OpenACC is an extremely viable programming model for accelerator devices, improving programmer productivity and achieving better performance than OpenCL and CUDA.

Keywords: OpenACC, OpenCL, CUDA, Hydrodynamics, High Performance Computing
I. INTRODUCTION
The increasing number of transistors on a chip (as predicted by Moore's law) has provided a continuous, dependable improvement in processor performance for several decades. Traditionally, these additional transistors were used to increase clock speeds, but since the mid-1990s they have been used to increase the number of cores on a single die. Future exascale machines will continue this multi-core trend but, limited by power and heat constraints, they will need to be composed of a much larger number of lower-power, lower-performance cores. Current architectures that offer this style of parallelism include Graphics Processing Units (GPUs), Intel's Xeon Phi, and Accelerated Processing Units (APUs) such as AMD's Fusion. Programming for the large number of lightweight cores offered by these devices means departing from the traditional distributed MPI approach to a tiered programming model designed to harness both coarse- and fine-grained parallelism. Accelerated programming platforms like Nvidia's CUDA [1] and the Khronos Group's Open Computing Language (OpenCL) [2] require explicit parallel programming using library calls and specially written compute kernels. The OpenACC Application Programming Interface [3] offers a directive-based approach, similar to that found in OpenMP [4], for describing how to manage data and execute sections of code on the device.

These two approaches to programming accelerators have different effects on factors such as programmer productivity, the time required to modify the code, the programming language of choice, the application performance achieved, and portability. Evaluating the strengths of these approaches using production codes is difficult, as applications are often complex and consist of hundreds of thousands of lines of code. Mini-applications—small, self-contained programs that embody the essential performance characteristics of key applications—provide a viable way to trial new programming methodologies and/or architectures [5].

In this paper we present a study of the performance of a Lagrangian-Eulerian hydrodynamics code ported to GPUs using three technologies: OpenACC, OpenCL and CUDA. Specifically, we make the following key contributions:

• We provide the first detailed documentation of the CloverLeaf mini-application, which can be found as part of the Mantevo project [6]. We describe its hydrodynamics scheme as well as the features which make it amenable to parallelisation on GPU platforms.

• In the context of this mini-application we present, to our knowledge, the first direct comparison between OpenACC, OpenCL and CUDA, the three dominant programming models for GPU architectures.

• We provide a quantitative and qualitative comparison of these three approaches with regard to code development, maintenance, portability and performance on two problem sizes of interest on a GPU-based cluster (Cray XK6).
The remainder of this paper is organised as follows: Section II discusses related work in this field; Section III presents the hydrodynamics scheme employed within the CloverLeaf mini-application; Section IV provides details on each of the three implementations used in this study, as well as the changes needed to make the overall algorithm more amenable to parallelisation on the GPU architecture; the results of our study are presented in Section V together with a description of our experimental setup; and finally, Section VI concludes the paper and outlines plans for future work.
Figure 1: The advection routine calculates updates based on the direction of the "wind" (the material flow), and performs sweeps in the x and y directions to update quantities in a 1D temporary array, corresponding to one row/column of cells. Using a larger, temporary, 2D array allows all cells in the mesh to be updated in parallel.

II. RELATED WORK

Nvidia's CUDA is currently the most mature and widely used technology for developing applications for GPUs. However, directive-based approaches such as OpenACC, driven by the work of the Center for Accelerated Application Readiness (CAAR) team at Oak Ridge National Laboratory (ORNL), are becoming increasingly used [7]. To date, much less work exists which examines whether OpenCL is a viable alternative programming model. There has also been little work that directly evaluates these programming models against one another, either in general or for Lagrangian-Eulerian explicit hydrodynamics applications.

Wienke et al. present the only direct comparison between OpenACC and OpenCL of which we are aware [8]. Their work, however, is focused on two applications from significantly different domains to our work: the simulation of bevel gear cutting, and a neuromagnetic inverse problem. Similarly, Reyes et al. present work which focuses on comparing different directive-based approaches, in particular several OpenACC implementations [9].

A reasonable body of work exists which has examined porting Smoothed Particle Hydrodynamics (SPH) applications to GPU-based systems [10]–[13]. These approaches generally use mesh-less, particle-based, Lagrangian numerical methods and are therefore significantly different to the hydrodynamics scheme on which our work is based. Studies involving SPH have also predominantly focused on utilising CUDA and have not sought to compare its performance, productivity or portability with alternative approaches such as OpenCL or OpenACC, which is a key focus of our work.

Whilst Bergen et al. produce an OpenCL version of a finite-volume hydrodynamics application which is similar to that involved in our work, they do not present any performance results or compare the development or performance of the application to other alternative approaches [14]. The GAMER library also provides similar functionality to that employed here; however, it is implemented entirely in CUDA and therefore does not provide a vehicle for evaluating the different development approaches available [15]. Brook et al. present their experiences of porting two computational fluid dynamics (CFD) applications to an accelerator (a Euler-based solver and a BGK model Boltzmann solver) [16]. Whilst the Euler-based solver is similar to the one documented in our work, they focus on the Intel Xeon Phi architecture and employ the OpenMP programming model.

Lattice Quantum Chromodynamics (QCD) is an additional domain which has seen numerous applications successfully ported to GPUs. However, a number of these studies employ the QUDA library [17]–[19]. This library is based on Nvidia's CUDA technology and therefore these studies do not examine alternative approaches such as OpenCL or OpenACC.
III. HYDRODYNAMICS SCHEME

CloverLeaf is a mini-application that solves the compressible Euler equations, a system of equations describing the conservation of energy, mass, and momentum. The equations are solved on a Cartesian grid in two dimensions. Each grid cell stores three quantities: energy, density, and pressure; each cell corner stores a velocity vector. CloverLeaf solves the equations with second-order accuracy, using an explicit finite-volume method. Each cycle of the application consists of two steps: (i) a Lagrangian step advances the solution in time using a predictor-corrector scheme, distorting the cells as they move with the fluid flow; (ii) an advection step restores the cells to their original positions. This simple hydrodynamics scheme is representative of other codes, but such schemes are often written in a way that causes unnecessary dependencies in key computational sections. To avoid this, all scientific computation in CloverLeaf is carried out in small kernel functions, making long, complex loops containing many subroutine calls unnecessary.
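For reference, and purely as standard background rather than material reproduced from the CloverLeaf documentation, the two-dimensional compressible Euler equations can be written in conservative form as

    \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{u}) = 0
    \frac{\partial (\rho \mathbf{u})}{\partial t} + \nabla \cdot (\rho \mathbf{u} \otimes \mathbf{u}) + \nabla p = 0
    \frac{\partial E}{\partial t} + \nabla \cdot \left[ (E + p)\, \mathbf{u} \right] = 0

where \rho is the density, \mathbf{u} the velocity, e the specific internal energy, and E = \rho e + \tfrac{1}{2}\rho|\mathbf{u}|^2 the total energy per unit volume. The system is closed by an equation of state for the pressure p; for the ideal-gas benchmark problems used later, p = (\gamma - 1)\rho e.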
IV. IMPLEMENTATIONS

Profiling CloverLeaf shows that approximately 95% of the execution time is contained in six kernels. However, to achieve full GPU residency, i.e. the physics algorithm executed exclusively on the GPU with necessary data residing in device rather than host memory, fourteen unique kernels are required to be ported to the accelerator device. Only control code is executed on the host CPU. To create each implementation we started with the OpenMP implementation of the mini-application. Exposing the loop-level parallelism required by OpenMP required a combination of loop splitting and adding extra temporary data storage.

Figure 2: Key differences between implementations of the advection computational kernel in CloverLeaf. The "data-parallel" version of the kernel can be easily ported to all of the programming methods used.

(a) Original

    DO k = y_min, y_max
      DO j = x_min, x_max
        ! flux calculation
      ENDDO
      DO j = x_min, x_max
        ! mass updates
      ENDDO
      DO j = x_min, x_max
        ! velocity updates
      ENDDO
    ENDDO

(b) Data-parallel

    DO k = y_min, y_max
      DO j = x_min, x_max
        ! flux calculation
      ENDDO
    ENDDO
    DO k = y_min, y_max
      DO j = x_min, x_max
        ! mass updates
      ENDDO
    ENDDO
    DO k = y_min, y_max
      DO j = x_min, x_max
        ! velocity updates
      ENDDO
    ENDDO

(c) OpenCL

    queue.enqueueNDRangeKernel(); // flux kernel
    queue.enqueueNDRangeKernel(); // mass kernel
    queue.enqueueNDRangeKernel(); // velocity kernel
    ---
    for(k = get_global_id(1); …) {
      for(j = get_global_id(0); …) {
        // flux calculation
      }
    }
    ---
    for(k = get_global_id(1); …) {
      for(j = get_global_id(0); …) {
        // mass updates
      }
    }
    ---
    for(k = get_global_id(1); …) {
      for(j = get_global_id(0); …) {
        // velocity updates
      }
    }

(d) OpenACC

    !$ACC PARALLEL LOOP …
    DO k = y_min, y_max
      DO j = x_min, x_max
        ! flux calculation
      ENDDO
    ENDDO
    !$ACC END PARALLEL LOOP
    ---
    !$ACC PARALLEL LOOP …
    DO k = y_min, y_max
      DO j = x_min, x_max
        ! mass updates
      ENDDO
    ENDDO
    !$ACC END PARALLEL LOOP
    ---
    !$ACC PARALLEL LOOP …
    DO k = y_min, y_max
      DO j = x_min, x_max
        ! velocity updates
      ENDDO
    ENDDO
    !$ACC END PARALLEL LOOP

(e) CUDA

    flux_kernel_cuda<<<…>>>();
    mass_kernel_cuda<<<…>>>();
    velocity_kernel_cuda<<<…>>>();
    ---
    const int glob_id = threadIdx.x + blockIdx.x * blockDim.x;
    const int row = glob_id / (x_max + 4);
    const int column = glob_id % (x_max + 4);
    // flux calculation
    ---
    const int glob_id = …
    // mass updates
    ---
    const int glob_id = …
    // velocity updates
An example of using extra storage is shown in Fig. 1, where additional temporary data is stored in order to enable the advection loops to be parallelised. Whilst porting the code to the accelerator languages, we improved the data parallelism within each kernel, and applied these changes back to the OpenMP version to increase CPU performance. The development of the advection kernel in each of the three programming models is shown in Fig. 2; a similar approach was used for each of the remaining kernels. The original Fortran code was first modified in order to remove dependencies between loop iterations. The loops, however, must still be completed sequentially, as each loop uses data calculated by the previous loop.
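To make the data independence concrete, the fragment below sketches an x-direction sweep in the style of Fig. 1 and the "data-parallel" form of Fig. 2. It is an illustrative sketch only: the array names and the update formula are ours, not the actual CloverLeaf advection kernels. Because the intermediate results are written to a 2D temporary rather than a reused 1D work array, no iteration depends on any other and the whole loop nest can be offloaded.

    SUBROUTINE x_sweep(x_min, x_max, y_min, y_max, mass_flux_x, pre_mass, post_mass)
      IMPLICIT NONE
      INTEGER, INTENT(IN) :: x_min, x_max, y_min, y_max
      ! Mass flux through each vertical cell face (one extra face in x).
      REAL(KIND=8), INTENT(IN)  :: mass_flux_x(x_min:x_max+1, y_min:y_max)
      REAL(KIND=8), INTENT(IN)  :: pre_mass(x_min:x_max, y_min:y_max)
      ! 2D temporary: every row of the sweep gets its own storage.
      REAL(KIND=8), INTENT(OUT) :: post_mass(x_min:x_max, y_min:y_max)
      INTEGER :: j, k

      ! Each (j,k) update reads only pre-sweep data, so rows no longer have
      ! to be processed one at a time through a shared 1D work array.
      DO k = y_min, y_max
        DO j = x_min, x_max
          post_mass(j,k) = pre_mass(j,k) + mass_flux_x(j,k) - mass_flux_x(j+1,k)
        ENDDO
      ENDDO
    END SUBROUTINE x_sweep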
A. OpenACC

The OpenACC Application Program Interface is a high-level programming model based on the use of directive pragmas. Supported by an initial group of three compiler vendors—CAPS [20], Cray [21], and PGI [22]—the aim is to add directives to the source code, minimising the modifications required to existing code. This provides ease of programmability and portability by allowing a single source code to be run on both CPUs and accelerators. To convert the data-parallel version of the kernel to OpenACC, loop-level pragmas were added to specify how the loops should be run on the GPU and to describe their data dependencies.

For effective use of the GPU, data transfers between the host processor and the accelerator must be kept to a minimum. CloverLeaf is fully resident on the device; this was achieved by applying OpenACC data "copy" clauses at the start of the program, which results in a one-off initial data transfer to the device. The computational kernels exist at the lowest level within the application's call-tree and therefore no data copies are required; we employ the OpenACC "present" clause to indicate that all input data is already available on the device. As in any block-structured, distributed MPI application, halo data must be exchanged between MPI tasks. In the accelerated versions, however, this data resides on the GPU local to the host CPU, so the data to be exchanged is transferred from the accelerator to the host via the OpenACC "update host" clause. MPI send/receive pairs exchange data in the usual manner, and the updated data is then transferred from the host back to its local accelerator using the OpenACC "update device" clause. A key point to note is that the explicit data packing (for sending) and unpacking (for receiving) is carried out on the device for maximum performance.
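The fragment below is a minimal, self-contained sketch of this pattern, with illustrative array names, problem size and clause lists rather than the actual CloverLeaf directives: a data region created once at start-up, kernels that simply assert "present", and a compiler-implemented minimum reduction of the kind the timestep calculation requires.

    PROGRAM acc_sketch
      IMPLICIT NONE
      INTEGER, PARAMETER :: nx = 960, ny = 960
      REAL(KIND=8), ALLOCATABLE :: density(:,:), energy(:,:), pressure(:,:)
      REAL(KIND=8) :: dt
      INTEGER :: j, k

      ALLOCATE(density(nx,ny), energy(nx,ny), pressure(nx,ny))
      density = 1.0_8
      energy  = 2.5_8
      dt      = HUGE(1.0_8)

      ! One-off transfer: the fields live on the device for the whole run.
    !$ACC DATA COPY(density, energy, pressure)

      ! A typical kernel: the data is already resident, so the directive only
      ! asserts 'present' instead of triggering another transfer.
    !$ACC PARALLEL LOOP COLLAPSE(2) PRESENT(density, energy, pressure)
      DO k = 1, ny
        DO j = 1, nx
          pressure(j,k) = (1.4_8 - 1.0_8) * density(j,k) * energy(j,k)
        ENDDO
      ENDDO
    !$ACC END PARALLEL LOOP

      ! A minimum reduction, implemented by the compiler, of the kind used to
      ! pick the simulation timestep.
    !$ACC PARALLEL LOOP COLLAPSE(2) PRESENT(pressure) REDUCTION(MIN:dt)
      DO k = 1, ny
        DO j = 1, nx
          dt = MIN(dt, 0.5_8 / SQRT(pressure(j,k)))
        ENDDO
      ENDDO
    !$ACC END PARALLEL LOOP

      ! In the full code, halo exchanges sit inside this data region: pack on
      ! the device, '!$ACC UPDATE HOST(buffer)', MPI send/receive, then
      ! '!$ACC UPDATE DEVICE(buffer)'.
    !$ACC END DATA

      PRINT *, 'minimum dt = ', dt
    END PROGRAM acc_sketch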
B. OpenCL

OpenCL is an open standard enabling parallel programming of heterogeneous architectures. Managed by the Khronos Group and implemented by over ten vendors—including AMD [23], Intel [24], IBM [25], and Nvidia [26]—OpenCL code can be run on many architectures without recompilation. The C bindings that form the interface to the functionality described by the OpenCL standard mean that integrating directly with the Fortran codebase of CloverLeaf is difficult. To ease programmability, a C++ header file is provided by the Khronos Group which allows access to the OpenCL routines in a more object-oriented manner [27]. This header file was used by a static C++ class to manage the interaction between the original Fortran code and the new OpenCL kernels. The class holds details about all the buffers and kernels used by the application, allowing C functions (which are easily callable from Fortran) to be written that initiate kernels and transfer data as needed.

As with the OpenACC version of the code, data transfers between the host processor and the device must be minimised in order to maximise performance. This is achieved by creating and initialising all data items on the device, and allowing these to reside on the GPU throughout execution. Data is only copied back to the host in order to write out visualisation files, and for MPI communications.
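As an illustration of this Fortran-to-C boundary (the function name and argument list here are hypothetical, not the actual CloverLeaf interface), the Fortran control code can declare the C entry points with ISO_C_BINDING and call them like ordinary subroutines; the same pattern serves the CUDA version described next.

    MODULE device_interface
      USE, INTRINSIC :: ISO_C_BINDING
      IMPLICIT NONE
      INTERFACE
        ! Implemented in the C/C++ layer that owns the OpenCL (or CUDA)
        ! buffers and kernels; it enqueues the advection flux kernel.
        SUBROUTINE enqueue_flux_kernel(x_min, x_max, y_min, y_max) &
            BIND(C, NAME='enqueue_flux_kernel')
          IMPORT :: C_INT
          INTEGER(C_INT), VALUE :: x_min, x_max, y_min, y_max
        END SUBROUTINE enqueue_flux_kernel
      END INTERFACE
    END MODULE device_interface

    ! Called from the Fortran control code in place of the original loop nest:
    !   USE device_interface
    !   CALL enqueue_flux_kernel(x_min, x_max, y_min, y_max)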
C. CUDA

CUDA is Nvidia's parallel platform and programming model, and provides support for general-purpose computing on Nvidia GPUs [1]. The CUDA implementation is almost identical in design to the OpenCL implementation, and was implemented using a global class that coordinates data transfer and computation on the GPU, with helper functions to handle interoperability between the CUDA and Fortran code.

V. RESULTS

A. Experimental Setup

All experiments were conducted on Chilean Pine, a Cray XK6 hosted at the Atomic Weapons Establishment (AWE) (see Table I). The default Fortran and C compilers are the Cray Compiling Environment (CCE) and the choice of MPI is MPICH2. The OpenCL version, however, was built with the GNU compiler environment, as we have been unable to successfully utilise the Cray compiler with the C++ OpenCL constructs. The CUDA kernels were compiled with the appropriate flags to enable double-precision calculation on the Fermi architecture (-gencode arch=compute_20,code=sm_21). Cray's CrayPAT profiling tool was used to produce the timing profile for the OpenACC version, whereas for the OpenCL and CUDA versions kernel timings were derived by querying the event objects returned by each kernel invocation.

Table I: Summary of Chilean Pine hardware.

    Processor          AMD Opteron 6272
    GPU                Nvidia X2090
    Compute Nodes      40
    CPUs/Node          1
    GPUs/Node          1
    Total CPUs         40
    Total GPUs         40
    CPU Memory/Node    32GB
    GPU Memory/Node    6GB
    Interconnect       Cray "Gemini"

B. Performance Analysis

The performance of the three accelerated implementations of CloverLeaf was tested using a representative square-shock benchmark problem. During the simulation, a small high-density region of ideal gas expands into a larger low-density region of ideal gas, causing a shock front to form. Two test configurations, described in terms of the number of cells in the computational mesh, were used in our experiments: a large 3840²-cell problem and a smaller 960²-cell problem. We analysed the performance of these two problems using one node of Chilean Pine containing a single Nvidia X2090 GPU.

On the 960²-cell problem the overall walltimes (of non-profiled runs) for the OpenACC, OpenCL and CUDA versions were 2.0569s, 2.5580s, and 2.7798s respectively (see the rightmost column of Figure 3). The OpenACC implementation was therefore 1.24× faster overall than the OpenCL version and 1.35× faster than the CUDA version. Due to time constraints, only initial, unoptimised CUDA and OpenCL versions have been employed in this study, and we expect their performance to improve as optimisations are implemented. The leftmost seven columns of Figure 3 present the walltimes for the various kernels of each version. These timings include an overhead caused by the various profiling mechanisms, and therefore the sum of the individual kernel execution times is greater than the total execution time shown in the rightmost column.

The calc_dt kernel utilises a reduction operation to calculate the minimum timestep that the simulation should take. The accumulated time spent in this kernel shows a significant disparity between the versions, likely due to the efficiency of the reduction function implemented in each. The OpenACC reduction, implemented by Cray's OpenACC compiler, is likely to be highly optimised, and spends only 0.112s in the calc_dt kernel. The CUDA version of the code uses a reduction coded partially by hand and partially provided by the Thrust library, and spends 0.167s in this kernel. The OpenCL version of the code uses a hand-coded reduction (see Section V-C), and achieves the worst performance in this kernel, taking 0.189s (7% of the total application runtime).
Figure 3: Runtimes of selected kernels and total application for the 960²-cell problem. (Bar chart: kernel time in seconds for the OpenACC, OpenCL and CUDA versions of advec_cell, advec_mom, calc_dt, PdV, update_halo, viscosity and the remaining kernels, alongside the total time in seconds.)
For the 3840²-cell problem, OpenACC provides a 4.91× improvement over one 16-core "Interlagos" socket. The CUDA version was 1.16× slower than the OpenACC version on this particular problem. Unfortunately, it was not possible to execute the OpenCL version of CloverLeaf on this problem size due to an "out of memory" error. This problem class requires approximately 4.5GB of device memory for data storage alone. We are currently investigating the cause of this problem, and in particular why it only occurs with the OpenCL version.

Insights and optimisations identified whilst undertaking the OpenACC porting work, which enabled the compiler to generate partitioned threaded code for the accelerator, also enabled compilers to improve the code generated for the CPU. These optimisations improved CPU performance by over 1.4× compared to the original CPU implementation.
C. Productivity Analysis

In order to assess the programmer productivity offered by each approach, we measured the number of words of code (WOC) added for each version, considered whether computational kernels needed rewriting, and examined tool support. By counting words of code (not including symbols such as braces and parentheses) we were able to produce a metric that overcame the variations caused by different programming styles, something which affects the lines of code (LOC) metric.

In terms of programmer productivity, OpenACC proved superior to both OpenCL and CUDA, requiring the addition of only 184 OpenACC pragmas (1,510 words of code). The OpenCL and CUDA versions of the code required an additional 17,930 and 13,085 words of code respectively. However, of the 17,930 words required by OpenCL, 3,958 can be attributed to the static class created to manage OpenCL objects. This static class could be used in other similar applications with little modification, meaning the total amount of OpenCL code unique to CloverLeaf is 13,972 words. Additionally, OpenCL and CUDA both required extra work to rewrite the computational kernels in C-style code. However, the simple design of the Fortran kernels eliminated much of the work that might be required in a legacy code.

Developing the OpenACC version in an incremental manner (i.e. one kernel at a time) proved to be a straightforward process, which made validating and debugging the code considerably easier. Whilst it was also possible to develop the OpenCL and CUDA versions of the code in an incremental manner, the significantly larger code volumes required for each increment and the immaturity of the tool support, particularly for OpenCL, made debugging problems harder and more time consuming.

CloverLeaf requires the use of several reduction operations. Under OpenACC, reductions were described by pragmas and implemented by the OpenACC compiler. In CUDA, the reductions were implemented using a simple two-stage approach. The first stage reduces within each block (a block being the CUDA equivalent of an OpenCL work-group), producing an array of partial reductions (i.e. one per block). The second stage then combines the elements of the resulting partial-reduction array; this was implemented using the Thrust C++ library. We were unable to find an optimised library within the OpenCL "ecosystem" which provided equivalent functionality, and therefore implemented our own reductions using a similar multi-stage approach.
D. Portability Analysis

The Cray OpenACC implementation is the only one to have been utilised extensively in this work. In the authors' experience the interpretation of the OpenACC standard differs between vendors, and hence we experienced significant problems trying to port the OpenACC version of the code to other OpenACC implementations, such as PGI's. We are currently working with these vendors to address these issues and will examine their performance in future work. Therefore, whilst in the long term OpenACC will provide portability between different compiler and system vendors, our current implementation is constrained to the Cray platform.
Table II: Key development metrics for the three versions.

    Version    WOC (Total)   WOC (Device)   WOC (Host)   Kernel Language   Tools     Portability
    OpenACC    1,510         -              -            Fortran           Good      Average
    OpenCL     17,930        4,327          13,608       OpenCL C          Poor      Good
    CUDA       13,628        5,830          7,798        CUDA C            Average   Poor
Similarly, utilising CUDA as a mechanism to take advantage of accelerator devices limits the choice of officially supported hardware platforms available to an organisation, as Nvidia only supports CUDA on its own hardware. The Ocelot project [28] and PGI's CUDA Fortran compiler [29] do, however, provide alternatives for other languages and hardware. The OpenCL version of the code exhibited the highest portability: using this version we were able to execute the application on both AMD and Nvidia GPU devices, AMD and Intel CPUs, and also a pre-production Intel Xeon Phi. We will conduct a more detailed study into the performance of CloverLeaf on these platforms in future work.
VI. CONCLUSION AND FUTURE WORK

This paper has shown that the explicit hydrodynamics scheme employed by the CloverLeaf mini-application is amenable to accelerator technology. The key to improving the performance of the code on the GPU architecture was to maximise data parallelism within each of the main computational kernels, restructuring the loops to remove data dependencies between iterations. Whilst time consuming, this activity was necessary regardless of the programming model employed and was therefore a constant cost across the three implementations. Feedback from the Cray compiler (CCE) proved to be crucial in understanding the partitioning of the threaded code on the accelerator. We used the generated listing files and runtime debugging options to capture which data items were actually being transferred.

The performance results demonstrate the effectiveness of Cray's OpenACC implementation. Producing functionally equivalent CUDA and OpenCL versions of CloverLeaf required considerably more programmer effort than the OpenACC version, and the OpenACC version of the mini-application also exhibits superior performance compared to the CUDA and OpenCL versions. However, it is likely that additional work to optimise the OpenCL and CUDA versions of the code could deliver performance improvements. In future work we plan to optimise both the OpenCL and CUDA implementations to determine whether it is possible to achieve superior performance compared to OpenACC for this class of code. We find that OpenACC is an extremely attractive and viable programming model for accelerator devices going forward, offering comparable performance and significantly improving programmer productivity compared to both CUDA and OpenCL.

Using the optimised OpenCL version of the code we also intend to conduct a study into OpenCL's suitability for expressing "on-node" parallelism for hydrodynamics applications on CPU platforms, compared to alternative approaches such as OpenMP. This work will ultimately strive to determine whether, for this class of application, OpenCL optimisations for accelerators can also deliver performance improvements for CPU devices, and whether it is possible to use OpenCL to deliver performance across different architectures. We also plan to investigate hybrid programming models in which the host CPU is utilised for some of the computation rather than idling during the majority of the run. Finally, we plan to use the OpenCL and OpenACC versions of CloverLeaf to evaluate the performance of a range of different platforms, including AMD GPUs and APUs, ARM-based APUs such as Nvidia's Tegra and CARMA platforms, Intel's Xeon Phi range and Altera's FPGA devices.
ACKNOWLEDGEMENTS

This work is supported in part by The Royal Society through their Industry Fellowship Scheme (IF090020/AM) and by the UK Atomic Weapons Establishment under grants CDK0660 (The Production of Predictive Models for Future Computing Requirements) and CDK0724 (AWE Technical Outreach Programme). The authors would like to express their thanks to Cray, in particular Alistair Hart of the Cray European Exascale Research Initiative, for their help with OpenACC, and also to John Pennycook of the University of Warwick for his help and advice on OpenCL.

CloverLeaf is available via GitHub (see http://warwickpcav.github.com/CloverLeaf/), and is also released as part of the Mantevo project hosted at Sandia National Laboratories, which can be accessed at https://software.sandia.gov/mantevo/. Sandia National Laboratories is a multiprogram laboratory managed and operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

REFERENCES

[1] "CUDA API Reference Manual version 4.2," http://developer.download.nvidia.com, April 2012.
[2] "The OpenCL Specification version 1.2," http://www.khronos.org/registry/cl/specs/opencl-1.2.pdf, November 2011.

[3] "The OpenACC Application Programming Interface version 1.0," http://www.openacc.org/sites/default/files/OpenACC.1.0_0.pdf, November 2011.

[4] "OpenMP Application Program Interface version 3.1," http://www.openmp.org/mp-documents/OpenMP3.1.pdf, July 2011.

[5] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards, A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich, "Improving Performance via Mini-applications," Sandia National Laboratories, Tech. Rep., 2009.

[6] "Mantevo - Home," https://software.sandia.gov/mantevo/, November 2012.

[7] A. Bland, J. Wells, O. Messer, O. Hernandez, and J. Rogers, "Titan: Early experience with the Cray XK6 at Oak Ridge National Laboratory," in Cray User Group, 2012.

[8] S. Wienke, P. Springer, C. Terboven, and D. Mey, "OpenACC - First Experiences with Real-World Applications," in Euro-Par 2012, LNCS, Springer Berlin/Heidelberg, 2012.

[9] R. Reyes, I. Lopez, J. Fumero, and F. Sande, "A Comparative Study of OpenACC Implementations," Jornadas Sarteco, 2012.

[10] E. Rustico, G. Bilotta, G. Gallo, and A. Herault, "Smoothed Particle Hydrodynamics simulations on multi-GPU systems," in 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing, 2012.

[11] X. Gao, Z. Wang, H. Wan, and X. Long, "Accelerate Smoothed Particle Hydrodynamics Using GPU," IEEE, 2010.

[12] J. Pier, I. Figueroa, and J. Huegel, "CUDA-enabled Particle-based 3D Fluid Haptic Simulation," in Electronics, Robotics and Automotive Mechanics Conference, 2011.

[13] J. Junior, E. Clua, A. Montenegro, and P. Pagliosa, "Fluid simulation with two-way interaction rigid body using a heterogeneous GPU and CPU environment," in Brazilian Symposium on Games and Digital Entertainment, 2010.

[14] B. Bergen, M. Daniels, and P. Weber, "A Hybrid Programming Model for Compressible Gas Dynamics using OpenCL," in 39th International Conference on Parallel Processing Workshops, 2010.

[15] H. Shukla, T. Woo, H. Schive, and T. Chiueh, "Multi-Science Applications with Single Codebase - GAMER - for Massively Parallel Architectures," in SC '11: Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, 2011.

[16] R. Brook, B. Hadri, V. Betro, R. Hulguin, and R. Braby, "Early Application Experiences with the Intel MIC Architecture in a Cray CX1," in Cray User Group, 2012.

[17] G. Shi, S. Gottleib, and M. Showerman, "Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters," in Cray User Group, 2012.

[18] "QUDA: A library for QCD on GPUs," http://lattice.github.com/quda/, October 2012.

[19] R. Babich, M. Clark, and B. Joo, "Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics," IEEE, 2010.

[20] "CAPS OpenACC Compiler: The fastest way to manycore programming," http://www.caps-entreprise.com, November 2012.

[21] "OpenACC accelerator directives," http://www.training.prace-ri.eu/uploads/tx_pracetmo/OpenACC.pdf, November 2012.

[22] B. Lebacki, M. Wolfe, and D. Miles, "The PGI Fortran and C99 OpenACC Compilers," in Cray User Group, 2012.

[23] "Accelerated Parallel Processing (APP) SDK," http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/, November 2012.

[24] "Intel SDK for OpenCL Applications 2012," http://software.intel.com/en-us/vcsource/tools/opencl-sdk, November 2012.

[25] "OpenCL Lounge," https://www.ibm.com/developerworks/community/alphaworks/tech/opencl, November 2012.

[26] "OpenCL NVIDIA Developer Zone," https://developer.nvidia.com/opencl, November 2012.

[27] "OpenCL 1.1 C++ Bindings Header File," http://www.khronos.org/registry/cl/api/1.2/cl.hpp, November 2012.

[28] "gpuocelot - A dynamic compilation framework for PTX," http://code.google.com/p/gpuocelot/, November 2012.

[29] "PGI — Resources — CUDA Fortran," http://www.pgroup.com/resources/cudafortran.htm, November 2012.