Accelerating Astrophysical Particle Simulations with Programmable Hardware (FPGA and GPU)

R. Spurzem · P. Berczik · G. Marcus · A. Kugel · G. Lienhart · I. Berentzen · R. Männer · R. Klessen · R. Banerjee
Received: date / Accepted: date
Abstract In a previous paper we have shown that direct gravitational N-body simulations in astrophysics scale very well on moderately parallel supercomputers (of order 10-100 nodes). The best balance between computation and communication is reached if the nodes are accelerated by special-purpose hardware. In this paper we describe the implementation of particle-based astrophysical simulation codes on new types of accelerator hardware (field programmable gate arrays, FPGA, and graphical processing units, GPU). In addition to direct gravitational N-body simulations we also use the algorithmically similar "smoothed particle hydrodynamics" (SPH) method as a test application; the algorithms are used for astrophysical problems such as the evolution of galactic nuclei with central black holes and gravitational wave generation, and star formation in galaxies and galactic nuclei. We present the code performance on a single node using different kinds of special hardware (traditional GRAPE, FPGA, and GPU) and discuss some implementation aspects (e.g. accuracy). The results show that GPU hardware for real application codes is as fast as GRAPE, but at an order of magnitude lower price, and that FPGA is useful for the acceleration of complex sequences of operations (like SPH). We discuss future prospects and new cluster computers built with new generations of FPGA and GPU cards.

Keywords astrophysics · FPGA · GPU · special purpose accelerators · particle simulations

R. Spurzem, P. Berczik, I. Berentzen
University of Heidelberg, Astronomisches Rechen-Institut (ARI-ZAH), Mönchhofstr. 12-14, 69120 Heidelberg
E-mail: {berczik,spurzem}@ari.uni-heidelberg.de

R. Banerjee, I. Berentzen, R. Klessen
University of Heidelberg, Inst. für Theor. Astrophysik (ITA-ZAH), Albert-Ueberle-Str. 2, 69120 Heidelberg
E-mail: {klessen,iberent,banerjee}@ita.uni-heidelberg.de

G. Marcus, A. Kugel, G. Lienhart, R. Männer
University of Heidelberg, Institute for Computer Engineering (ZITI), B6, 26, 68131 Mannheim
E-mail: {guillermo.marcus,andreas.kugel,reinhard.maenner}@ziti.uni-heidelberg.de
1 Introduction

Numerical algorithms for solving the gravitational N-body problem [1] have evolved along two basic lines in recent years. Direct-summation codes compute the complete set of N² interparticle forces at each time step; these codes are designed for systems in which the finite-N graininess of the potential is important or in which binary or multiple star systems form, and until recently they were limited by their O(N²) scaling to moderate (N ≤ 10^5) particle numbers. A second class of N-body algorithms replaces the direct summation of forces from distant particles by an approximation scheme. Examples are the Barnes-Hut tree code [2], which reduces the number of force calculations by subdividing particles into an oct-tree, and fast multipole algorithms, which represent the large-scale potential via a truncated basis-set expansion [3, 4] or on a grid [5, 6]. These algorithms have a milder, O(N log N) or even O(N), scaling for the force calculations and can handle much larger particle numbers, although their accuracy can be substantially lower than that of the direct-summation codes [7].

A natural way to increase both the speed and the particle number of an N-body simulation is to parallelize [8, 9]. Communication inevitably becomes the bottleneck when the particle number increases. The best such schemes use systolic algorithms (in which the particles are rhythmically passed around a ring of processors) coupled with non-blocking communication between the processors to reduce the latency [10, 11].

A major breakthrough in direct-summation N-body simulations came in the late 1990s with the development of the GRAPE series of special-purpose computers [12], which achieve spectacular speed-ups by implementing the entire force calculation in hardware and placing many force pipelines on a single chip. The design, construction and implementation of GRAPE chips, and their further integration into performant accelerator boards called GRAPE boards, goes back to an initiative of D. Sugimoto, J. Makino (Univ. of Tokyo) and P. Hut (Inst. for Advanced Study in Princeton) [13]. Statistical models had predicted a certain type of gravothermal oscillation in the dense cores of globular star clusters (which are, to first order, million-body gravitating N-body systems); it had so far been impossible to prove that such a physical phenomenon, similar to heat conduction in gaseous systems, exists in a discrete stellar N-body system. With GRAPE, gravothermal oscillations were successfully detected in direct N-body models [15]. One of the authors of this paper (RS) visited the Univ. of Tokyo several times in the 1990s and witnessed how astrophysics students and postdocs were wire-wrapping the prototypes of the new GRAPE hardware. It became a quasi-standard for superfast N-body hardware, several Gordon Bell prizes were won, and many institutions in the world still use GRAPE today (e.g. Rochester Inst. of Technology, Rochester, New York, USA; Univ. Amsterdam, NL; Univ. Marseille, FR; Univ. Tokyo and the RIKEN Institute, Japan; Univ. Heidelberg, D; to mention only a few selected examples). The present GRAPE-6, in its standard implementation (32 chips, 192 pipelines), can achieve sustained speeds of about 1 Tflop/s at a cost of just ∼ $50K. In a standard setup, the GRAPE-6 is attached to a single host workstation, in much the same way that a floating-point or graphics accelerator card is used.
Advancement of particle positions [O(N)] is carried out on the host computer, while the interparticle forces [O(N²)] are computed on the GRAPE. More recently, "mini-GRAPEs" (GRAPE-6A) [20] have become available, which are designed to be incorporated into the nodes of a parallel computer. The mini-GRAPEs place four processor chips on a single PCI card and deliver a theoretical peak performance of ∼ 131 Gflop/s for systems of up to 128k particles, at a cost of ∼ $6K.
By incorporating mini-GRAPEs into a cluster, both large (≥ 10^6) particle numbers and high (≥ 1 Tflop/s) speeds can in principle be achieved.

The requirements of the GRAPE hardware, however, forced scientists to use less efficient algorithms than were originally available. Without significant loss of accuracy the Ahmad-Cohen (AC) neighbour scheme could be used [22], as implemented in codes like NBODY5 or NBODY6 [24]. In the AC scheme, the full forces are computed only every tenth timestep or so; in the smaller intervals, the forces from nearby particles (the "irregular" force) are updated using a high order scheme, while those from the more distant ones (the "regular" force) are extrapolated using a second-degree polynomial. A parallel implementation of NBODY6, including the AC scheme, exists, but only for general-purpose parallel computers [7]; the algorithm has not yet been adapted to systems with GRAPE hardware. The main reason is that the GRAPE hardware, and the custom-built GRAPE chip, have a fixed design (ASIC = application specific integrated circuit), with design cycles of several years for a new version. The AC scheme needs partial forces and, in some parts, higher accuracy than the GRAPE provides. To illustrate the problem it is instructive to look at the scaling of the N-body algorithms:

T = αN + δN·N_n + βN²/γ    (1)
Here T is the wall-clock time needed by our algorithm to advance the system by a certain physical time, N is the total particle number, and N_n is a typical neighbour number for the AC scheme (of order 50-200). α, δ, β are time constants and γ is the efficiency factor of the AC scheme; γ = 10 means that the full gravitational force on any particle needs to be computed only every tenth step. The GRAPE hardware is unable to compute neighbour forces efficiently, so one is forced to use N_n = 0 and consequently γ = 1. Since the speed-up achieved by GRAPE for the β term is so large (typically several hundred), this is a price worth paying in order to still reach a decent overall speed-up. So, on hardware with GRAPE, the use of simpler parallel N-body algorithms has been enforced [23]. The reader interested in more details of the algorithms and their performance analysis is referred to [14, ?].

One may wonder why the N² part of the algorithms could not be removed using TREE schemes or even fast multipole methods [2, 4]. The answer lies partly in the very high accuracy requirements of very long-term integrations and partly in the strongly inhomogeneous nature of astrophysical systems, where a combination of brute-force algorithms with individual time step schemes has proved to be more efficient even than fast multipoles [7, 25].

Particle-based astrophysical simulations in general comprise a much larger class of applications than just gravitational N-body simulations. Gas dynamical forces (pressure forces, viscous forces) are frequently modelled using the smoothed particle hydrodynamics (SPH) approach [16]. This algorithm assigns thermodynamic quantities (such as pressure and temperature) to particles, while in standard N-body models particles just have a mass. By appropriate averaging over neighbour particles (see the detailed description of the algorithm below) the proper fields of thermodynamic quantities can be defined; moving the particles, much like in a gravitating N-body model, under the influence of all forces with some appropriate time step then provides fairly accurate solutions of complex partial differential equations, such as the Navier-Stokes equations for astrophysical plasmas [17]. State-of-the-art parallel algorithms for SPH simulations exist, e.g. in the form of the Gadget code [18]. An analysis of the algorithm reveals practically the same structure
as before for the direct N-body code using an AC scheme, with two important differences: (i) the number of floating point operations required for the neighbour forces is some factor of five larger, because we have to deal with much more complex forces than just gravitational interactions; (ii) for many applications a TREE approximation is sufficiently accurate for the distant gravitational interactions, and therefore the potential speed-up obtained by using GRAPE for this component is small. The interested reader can find a TREE implementation using GRAPE [21, ?] as well as reports from a broad community in science and engineering using SPH in the yearly SPHERIC conferences [19]. Note that some of the industrial applications do not require gravity (long-range forces) at all.
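Returning to the scaling of Eq. (1), the following small host-side sketch evaluates the timing model for a GRAPE-style configuration (N_n = 0, γ = 1) and for an AC-style configuration (N_n = 100, γ = 10). The time constants α, δ, β used here are purely illustrative placeholders, not measured values.

```cuda
#include <cstdio>

// Timing model of Eq. (1): T = alpha*N + delta*N*Nn + beta*N^2/gamma.
// The constants below are illustrative placeholders only.
static double model_time(double N, double Nn, double gamma_ac,
                         double alpha = 1e-7, double delta = 1e-8, double beta = 1e-9)
{
    return alpha * N + delta * N * Nn + beta * N * N / gamma_ac;
}

int main()
{
    const double N = 1.0e6;                       // one million particles
    double t_full = model_time(N, 0.0, 1.0);      // full force every step (GRAPE-style)
    double t_ac   = model_time(N, 100.0, 10.0);   // AC scheme: ~100 neighbours, gamma = 10
    std::printf("full summation: %g  AC scheme: %g  ratio: %g\n",
                t_full, t_ac, t_full / t_ac);
    return 0;
}
```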
Fig. 1 Left: GRACE titan cluster at the Astronomisches Rechen-Institut, ZAH, Univ. of Heidelberg; 32 nodes with GRAPE and FPGA accelerators, 4 Tflop/s peak speed, installed 2006. Right: Frontier kolob cluster at ZITI Mannheim; 40 nodes with GPU accelerators, 17 Tflop/s peak speed, installed 2008.
Recently, graphical processing units (GPU) have become increasingly usable for high-performance numerical computations [36, 26, 30]. In particular, the astrophysical simulation community seems eager to use this new facility (see the Princeton workshop http://astrogpu.org ). The CUDA libraries have been particularly helpful in encouraging astrophysicists and other scientists to start testing and using GPUs [26]. Academic institutions and industrial sites are currently planning to build large clusters using GPUs as accelerator boards. Why is that so? Three main reasons can be identified: (i) GPU cards are produced for the industrial mass market, their production cycle is short and their price-performance ratio is very good; (ii) GPUs can reach very high performance for simple algorithmic structures and data flows (see citations above), comparable to if not faster than GRAPE; (iii) GPUs can be programmed to work also for more complex algorithms, potentially with less speed-up, but allowing the implementation of SPH and AC-scheme algorithms. Compared to GPUs, GRAPE cards in particular have a very long production cycle and are not as cheap as GPUs, while FPGA hardware tends to have a relatively high price-performance ratio. One has to bear in mind, however, that such statements are subject to quick change, due to fast product cycles as well as to cycles imposed by scientific project funding. The promising property of GPUs, in our view, is that they can potentially substitute both types of special hardware
we have used so far: GRAPE for highest-performance distant forces, and MPRACE for medium-performance but more complex force algorithms. In this paper we describe and compare the performance obtained for the example of the SPH algorithm (the AC neighbour scheme behaves very similarly); the architectures compared are those of our presently running GRACE cluster and a recent GPU. More details on the hardware used and on the variant of the SPH algorithm used are given below. We stress that while this comparison may seem a bit unfair (comparing an older FPGA board with one of the most recent GPUs), it is nevertheless extremely instructive, and it is selected like that because this is the hardware we currently have available, including all the necessary software environment. The present cluster containing FPGAs will still be used in heavy production for some time, while the GPU card may be the hardware of choice for a next-generation system, which is currently under consideration. We are also currently testing a new generation of FPGA hardware, called MPRACE-2, which at the time of writing this article was not ready for benchmarks with application codes. In the discussion we interpret our results in terms of future perspectives for each kind of special hardware, FPGA and GPU. The bottom line is that we think our comparison results are still extremely useful, because they use practically the same software and software interfaces, just different accelerator hardware. At the same time, however, one has to take into account the different generations of hardware used and should not one-dimensionally interpret the results in favour of either GPU or FPGA.

Current GPU implementations of the gravitational force computation achieve over 300 GFlop/s from a single board. This is quite logical: this part represents > 90% of the time of pure-CPU algorithms, it is O(N²) in its most basic form, and it is highly parallelizable. With particle numbers > 10^6, the SPH part is also computationally very demanding once the gravity has been accelerated. It is also more complex, as it is based on an interaction list between particles. We support such calculations, which can be applied to a wide variety of practical problems besides astrophysics, with programmable hardware. Related work includes the developments of Hamada and Nakasato [35], and similar work has been done for Molecular Dynamics (MD) [37], as the algorithms are comparable.
2 Astrophysical Terms Implemented

We implemented forces between particles: purely gravitational forces on all kinds of hardware, and hydrodynamical forces using the SPH algorithm on FPGA and GPU hardware. In this paper we only describe comparative profiling for the single-node computation of gravitational forces (on GRAPE and GPU) and of SPH forces (on FPGA and GPU). In both cases the field particle data (masses, positions, velocities, and for high order algorithms also higher derivatives) are transferred to the memory linked to our accelerator boards; thereafter a stream of test particles is passed to the hardware, which then computes in parallel pipelines the expression

a_i = − Σ_{j≠i} m_j (r_i − r_j) / |r_i − r_j|³    (2)
where we have used a coordinate scaling such that the gravitational constant G = 1. For obvious reasons the particles in the memory of the accelerator board are called j-particles, and the test particles, for which the simulation code requires new gravitational forces, are called i-particles.
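As an illustration of Eq. (2), the following CUDA sketch computes the acceleration of one i-particle per thread by direct summation over all j-particles held in device memory. It is a minimal sketch only: softening, the individual hierarchical time steps and the higher derivatives required by the Hermite scheme are omitted, and all names are ours rather than those of the actual GRAPE/GPU library interface.

```cuda
#include <cuda_runtime.h>

// One thread per i-particle: direct summation of Eq. (2) over all j-particles.
// Code units with G = 1; no softening, no Hermite derivatives (illustration only).
__global__ void gravity_direct(const float4 *jp,   // j-particles: x,y,z = position, w = mass
                               const float4 *ip,   // i-particles: x,y,z = position
                               float3 *acc, int n_j, int n_i)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_i) return;

    float3 ri = make_float3(ip[i].x, ip[i].y, ip[i].z);
    float3 a  = make_float3(0.0f, 0.0f, 0.0f);

    for (int j = 0; j < n_j; ++j) {
        float dx = jp[j].x - ri.x;
        float dy = jp[j].y - ri.y;
        float dz = jp[j].z - ri.z;
        float r2 = dx * dx + dy * dy + dz * dz;
        if (r2 == 0.0f) continue;                  // skip self-interaction
        float inv_r3 = rsqrtf(r2) / r2;            // 1 / |r_j - r_i|^3
        a.x += jp[j].w * inv_r3 * dx;              // attraction towards particle j,
        a.y += jp[j].w * inv_r3 * dy;              // equivalent to Eq. (2)
        a.z += jp[j].w * inv_r3 * dz;
    }
    acc[i] = a;
}
```

In practice the j-particles would be staged through shared memory in tiles to reduce global-memory traffic; the plain loop above is kept deliberately simple.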
For details about the algorithm with individual hierarchical time steps and a copy algorithm for parallelization the reader is referred to [7, 11, 1].

The SPH method is only described very briefly here; the interested reader is referred to [28] for our implementation-specific issues. Several thorough general reviews of this method have been published by Benz [27], Monaghan [34] and Liu [32]. In SPH the gaseous matter is represented by particles. To form a continuous gas distribution from these particles, they are smoothed by convolving their discrete mass distribution with a smoothing kernel W. This kernel is a strongly peaked function around zero and is non-zero only in a limited region. In contrast to purely gravitational simulations the SPH method requires two steps, the first of which computes the new density at the location of the i-particles from the j-particle distribution, using the smoothing kernel over N_n neighbour particles (with N_n ≈ 50-200); the discretized formula for the density ρ_i reads:
ρ_i = Σ_{j=1}^{N} m_j W(r_i − r_j, h)    (3)
The next equation shows the physical formula for the acceleration of an i-particle due to the force from the gradient of the pressure P and the so-called artificial viscosity a_i^visc:

a_i = − (1/ρ_i) ∇P_i + a_i^visc    (4)
The artificial viscosity is introduced to make the method capable of simulating shock waves [28, 34]. The SPH method is typically computed in a two-stage algorithm. In the first stage, denoted SPH Step 1, the densities are calculated according to Eq. (3); subsequently the pressure P is calculated via an equation of state. In the second stage, denoted SPH Step 2, the forces on the particles and other quantities, like the dissipated energy, are calculated as expressed in Eq. (4). Using these forces, the particle trajectories are integrated before starting again with SPH Step 1 of the next time step.
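As a sketch of the two-stage scheme, the following host-side routine evaluates SPH Step 1 (the density of Eq. (3) followed by the pressure) for precomputed neighbour lists, using a cubic spline kernel. The kernel choice, the ideal-gas equation of state and the data layout are simplifying assumptions of this illustration, not the implementation of [28]; Step 2 follows the same neighbour-loop pattern with the kernel gradient ∇W in place of W and the artificial viscosity added.

```cuda
#include <cmath>
#include <vector>

// 3D cubic spline kernel W(r, h) (a common SPH choice; assumed here for illustration).
static double kernel_W(double r, double h)
{
    const double PI = 3.14159265358979323846;
    const double q = r / h, norm = 1.0 / (PI * h * h * h);
    if (q < 1.0) return norm * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
    if (q < 2.0) return norm * 0.25 * (2.0 - q) * (2.0 - q) * (2.0 - q);
    return 0.0;
}

struct Particle { double x[3]; double m, h, u, rho, P; };   // u = internal energy

// SPH Step 1: density via Eq. (3), then pressure from an ideal-gas equation of state.
void sph_step1(std::vector<Particle> &p,
               const std::vector<std::vector<int>> &neighbours, double gamma_eos)
{
    for (std::size_t i = 0; i < p.size(); ++i) {
        double rho = 0.0;
        for (int j : neighbours[i]) {
            double d2 = 0.0;
            for (int k = 0; k < 3; ++k) {
                const double dx = p[i].x[k] - p[j].x[k];
                d2 += dx * dx;
            }
            rho += p[j].m * kernel_W(std::sqrt(d2), p[i].h);
        }
        p[i].rho = rho;
        p[i].P   = (gamma_eos - 1.0) * rho * p[i].u;   // P = (gamma - 1) * rho * u
    }
}
```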
3 FPGA implementation

In this section we describe the implementation of an FPGA-based coprocessor for SPH computations. Field Programmable Gate Arrays (FPGA) are specialized chips that allow dynamic reconfiguration of their building blocks. Given this capability, they are able to implement electronic circuits, or hardware designs (as we call them), by means of programming the interconnections between the blocks. For the SPH coprocessor this involves the creation of a library of floating point operators to support the required computations, their assembly into a processing pipeline, and building the required control and communication logic. As our platform of choice we use the MPRACE-1 coprocessor boards developed in-house, which are equipped with a Xilinx XC2V3000 FPGA, 8 MB of ZBT-RAM and an external PLX9656 for PCI communication with the host at up to 264 MB/s. The migration to our new MPRACE-2 boards is currently in progress.
Fig. 2 Design features of the new MPRACE-2 board
3.1 Hardware Architecture

The coprocessor design consists of one or more computational pipelines, external memory to store the particle data and an output FIFO at the end of the pipeline to store the results, plus the required communication and control logic. An overview is shown in Fig. 3, with the communication with the host at the bottom and the external memory modules at the top. The current design fits one SPH pipeline working at 62.5 MHz and capable of storing up to 256k particles in memory. For each timestep, the coprocessor loads the particle data (position, velocity, mass, etc.) into the external memory. After selecting Step 1 or Step 2 computations, neighbour lists are sent in sequence from the host in the format ip, N, jp1..jpN, where ip is the index of the i-particle, N the number of neighbours, and jpX the corresponding j-particle indices. Neighbour lists received from the host are processed immediately at the rate of one neighbour interaction per cycle. Accumulated results are pushed into an output FIFO for later retrieval. Since the FIFO is of very limited size, results must be retrieved by the software regularly. In order to reduce the size of the pipelines, increase their speed and accommodate more particles in the available memory, the precision of the floating point operators is reduced to a 16-bit mantissa and an 8-bit exponent, except for the accumulators.

Given the described scheme, it is clear that the algorithm is of order O(nm), where n is the number of particles and m the average number of neighbours. Therefore, the overall performance of the coprocessor is driven by the communication time (how fast the neighbour lists can be sent to the board) and the clock frequency of the coprocessor (how fast an interaction can be dispatched). To optimize communication, the particle data are loaded in two parts. For Step 1, positions, velocities, masses and smoothing lengths (plus sound speeds) are loaded, while the additional values for Step 2 (including the updated density, Balsara factor and pressure) are uploaded only before starting the Step 2 computations. Additionally, as mentioned earlier, the pipelines are capable of switching between Step 1 and Step 2 computations, reusing several common parts of the operation flow. This allows further savings in chip area, making it possible to use a single design for all required operations.
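The flat neighbour-list stream described above (ip, N, jp1..jpN for each i-particle) can be illustrated with a small host-side packing routine; the function and variable names are ours and not part of the MPRACE driver library. The offset table produced as a byproduct is the same kind of "cutpoints" array that the GPU implementation described later (Sect. 4.2) uses to let each thread locate its own list.

```cuda
#include <cstdint>
#include <vector>

// Pack per-particle neighbour lists into the flat stream sent to the coprocessor:
// for each i-particle the stream holds [ ip, N, jp_1, ..., jp_N ].
// 'cutpoints' records the offset of each list (used by the GPU version, see Sect. 4.2).
void pack_neighbour_lists(const std::vector<std::vector<std::uint32_t>> &neighbours,
                          std::vector<std::uint32_t> &stream,
                          std::vector<std::uint32_t> &cutpoints)
{
    stream.clear();
    cutpoints.clear();
    for (std::uint32_t ip = 0; ip < neighbours.size(); ++ip) {
        cutpoints.push_back(static_cast<std::uint32_t>(stream.size()));
        stream.push_back(ip);                                               // i-particle index
        stream.push_back(static_cast<std::uint32_t>(neighbours[ip].size())); // neighbour count N
        for (std::uint32_t jp : neighbours[ip])
            stream.push_back(jp);                                           // j-particle indices
    }
}
```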
Fig. 3 FPGA Architecture
3.2 Software Tools

Each floating point operator required considerable development time, and many operations can be optimized depending on their operator size or on whether results are signed or unsigned. Other specialized operations, like squaring a value, can also be optimized compared to a generic multiplier. To gather all these advantages into a coherent interface, a floating point library was developed. This library provides parameterized operators to select the desired precision (exponent and mantissa), specialized operators optimized for speed and area, simulation capabilities to simplify hardware verification, and a range of high performance accumulators.

The pipelines for the SPH computations contain tens of operators, and putting them together is a time-consuming and error-prone task. As a way of adding flexibility and reliability, a pipeline generator was developed. This software tool receives a program-like description of the operations to perform and produces a hardware pipeline, using the floating point library as the building blocks. This makes the pipeline correct by design, and it is very flexible to adjust the precision of operands or to introduce changes in the operations performed. A more complete description can be found in [31].
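To give a feeling for the reduced operator precision mentioned above (16-bit mantissa, 8-bit exponent), the following host-side helper truncates an IEEE-754 single, whose exponent width already matches, to a 16-bit mantissa. This is our own illustration of the precision reduction, not the emulation core of the floating point library.

```cuda
#include <cstdint>
#include <cstring>

// Truncate an IEEE-754 single (23-bit mantissa) to a 16-bit mantissa, keeping the
// 8-bit exponent unchanged. This only approximates the FPGA operators, which may
// round rather than truncate; illustration only, not the library's emulation core.
float truncate_mantissa_16(float x)
{
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    bits &= 0xFFFFFF80u;        // clear the lowest 7 of the 23 mantissa bits
    std::memcpy(&x, &bits, sizeof bits);
    return x;
}
```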
3.3 Software Library Interfaces

In order to use the capabilities of the coprocessor efficiently, we developed a software library for the C/C++/FORTRAN languages. This library provides the user with a clean interface to the SPH functionality while hiding coprocessor-specific details like design loading, buffer creation, data partitioning and FIFO flushing. In addition, an emulation core is provided that allows the library to perform the same operations on the host CPU only.
Particular attention was paid to the interface of the library with existing applications. Since the coprocessor performance is directly proportional to the communication between the host and the board, a generic buffer management scheme [33] was implemented, allowing the library to access the data structures of the application directly, for direct conversion between the formats of the application and the coprocessor.
4 GPU implementation

Graphics Processing Units (GPU) are highly parallel, programmable processors used primarily for image processing and visualization in workstations and entertainment systems. As they have grown in complexity, they have added many features, including programmability, floating point capability and wider and faster interfaces. Most recently, APIs have been added to program them as custom accelerators, enabling us to test the performance of the platform with relative ease.
4.1 Hardware and Software Model

As said previously, GPUs are in essence a highly parallel, programmable architecture. In its current incarnation, a board can have up to 128 parallel processing elements and 1.5 GB of high-speed RAM on a 384-bit bus. Such a level of parallelism comes at the cost of the memory hierarchy and coherency, committing most on-chip resources to computation instead of cache memory.
Fig. 4 Schematic organization of a GPU: each thread has its own registers and local memory; threads within a multiprocessor share a common shared memory; all threads access the global, constant and texture memories.
A sketch of the organization of an NVIDIA GPU like the one used for this publication is depicted in Fig. 4. The board (and processor), a GeForce 8800 GTX, consists of 16 multiprocessors, each one capable of managing 8 threads in parallel in SIMD fashion, for a total of 128 threads, and communicates with the host over a 16-lane PCI-Express bus. Each multiprocessor has a small shared memory (16 KB) accessible to all threads of the same multiprocessor, a big register file (8K registers) which is split among the allocated threads, and interfaces to the main (global) memory.
It is important to note that none of these memories are cached. In contrast, special read-only regions referred to as constant memory (for constant values) and texture memory (for large data references/interpolated data) are cached. Having regions without cache, without write coherency and with several penalties for the memory access patterns used adds complications to the implementation of the algorithms. Fortunately, NVIDIA provides an API and computational model to make efficient use of the processors. The CUDA (Compute Unified Device Architecture) library provides a C-like programming language and compiler, with specific extensions for the platform. The library and API make interfacing with the board a very easy task.

In contrast to other architectures, NVIDIA GPUs base their computing model on several levels of units of work, where the most simple one is very similar to a lightweight thread (referred to simply as a thread), instead of the actual number of processing elements available. Threads are grouped in blocks, which are assigned to a single multiprocessor. Blocks are in turn organized in a grid, which represents the current workload. This thread model allows the hardware to scale more easily to the number of computational units available, and, by means of using a very high number of threads and a simple scheduler, to effectively hide the latency of memory accesses for each thread. At the same time, it still allows precise control over the work assigned to a single multiprocessor and the proper discovery of sibling threads for communication.
4.2 Algorithm description

The current implementation of the SPH algorithm on the GPU is based on the communication interface developed originally for our FPGA accelerator. This means we try to minimize changes to the internal data formats and the communication patterns. As a preparation stage, the software library preallocates the necessary memory on the device to store the particle data, the neighbour lists and the results. The neighbour lists are described as for the FPGA, as one big array in the format ip, N, jp1..jpN, where ip is the index of the i-particle, N the number of neighbours, and jpX the corresponding j-particle indices. Processing is still divided into two steps, but the neighbour lists are sent to the board memory only once, during Step 1, and reused for Step 2. Results are read back all at once at the end of each step.

As to how to parallelize the work among the threads, we were confronted with several options. We finally decided to create one thread per i-particle, i.e. one thread per neighbour list. For this, we provide the GPU with an additional array containing the initial offset of each neighbour list in the NL array, which we call cutpoints. This array is a byproduct of the creation of the NL array, so we incur no extra cost. It enables each thread to find the starting point of its own neighbour list without the need to scan the NL array. Each thread stores its results in shared memory and writes them to main memory when it finishes. We partition the problem in such a way as to use as much of the shared memory as possible.
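A minimal CUDA sketch of the one-thread-per-neighbour-list mapping described above is given below, for the density pass (Step 1). The data layout, the device kernel function and all identifiers are illustrative assumptions and not the actual library code; in particular, this sketch accumulates directly in registers, whereas the real implementation also stages results in shared memory.

```cuda
#include <cuda_runtime.h>

// 3D cubic spline kernel (an assumed choice; W as in Eq. (3)).
__device__ float spline_W(float r, float h)
{
    const float PI = 3.14159265f;
    float q = r / h, norm = 1.0f / (PI * h * h * h);
    if (q < 1.0f) return norm * (1.0f - 1.5f * q * q + 0.75f * q * q * q);
    if (q < 2.0f) return norm * 0.25f * (2.0f - q) * (2.0f - q) * (2.0f - q);
    return 0.0f;
}

// One thread per i-particle: each thread walks its own neighbour list, found via
// the 'cutpoints' offsets, and accumulates the SPH density of Eq. (3).
__global__ void sph_density_step1(const float4 *part,       // x,y,z = position, w = mass
                                  const float  *hsml,       // smoothing lengths
                                  const unsigned *nl,        // flat lists: ip, N, jp1..jpN
                                  const unsigned *cutpoints, // offset of each list in nl
                                  float *rho, int n_lists)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n_lists) return;

    unsigned off = cutpoints[t];
    unsigned ip  = nl[off];          // index of this thread's i-particle
    unsigned cnt = nl[off + 1];      // number of neighbours N

    float4 pi  = part[ip];
    float  h   = hsml[ip];
    float  sum = 0.0f;

    for (unsigned k = 0; k < cnt; ++k) {
        float4 pj = part[nl[off + 2 + k]];
        float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
        float r  = sqrtf(dx * dx + dy * dy + dz * dz);
        sum += pj.w * spline_W(r, h);
    }
    rho[ip] = sum;
}
```

A launch configuration such as sph_density_step1<<<(n_lists + 127) / 128, 128>>>(...) assigns one thread per neighbour list in blocks of 128 threads; the block size is again only an illustrative choice.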
4.3 Software Library Interfaces

Our software library is designed from the ground up to support multiple implementations of the supported SPH algorithms. Therefore, the effort of extending it to support GPUs was low. In addition, using the same interface allows us to use the same
applications as before without any significant change in the program interface: only a switch in the initialization function has to be changed to select the type of processing core to use.
5 Results

The CPU and GPU runs are from a workstation equipped with an Intel Core 2 Quad at 2.4 GHz, 4 GB of RAM and a GeForce 8800 GTX GPU with 768 MB. The MPRACE runs are from one node of the titan cluster at ARI Heidelberg, with two Intel Xeon CPUs at 3.2 GHz, 4 GB of RAM, a GRAPE board and an MPRACE-1 board in a PCI-X slot. All presented runs are serial runs, i.e. they use only one core. A GRAPE board is an accelerator board for gravity interactions.

Performance and accuracy measures are from a serial N-body simulation with gravity and SPH forces, running one single step using shared timesteps. Accuracy is compared relative to the original double-precision implementation on the CPU. A parallel version also exists, and more information is available in [28]. Since the FPGA designs use limited precision, the accuracy of the results is compared to a double-precision implementation on the host. The results for the time evolution of the energies during adiabatic collapse test runs with different particle numbers (10K, 50K and 100K) show that the absolute error in total energy conservation during the whole period of integration was less than 0.1%, and that the absolute differences in acceleration between runs on the CPU and on the MPRACE are of the same order of magnitude, making them not significant. For more detailed information, consult [28]. The GPU uses single-precision computations, but has a limited range for certain operations. The accuracy of its results is high, comparable to the use of SSE instructions on the CPU.

First we show in Fig. 5 an implementation of a pure N-body algorithm with individual time steps [23], as it is used in practical applications, e.g. for the evolution of galactic nuclei with black holes [45]. Despite the required emulation of double precision, the GPU card reaches the same speed as GRAPE, for a small fraction of the price. Moreover, it provides programming flexibility, e.g. for a recently developed new 6th order scheme (Nitadori et al. 2008), as shown in Fig. 5.
Fig. 5 Speed of N-body simulation. Left: traditional 4th order scheme, compared on the host, SSE, GRAPE board and GPU. Right: new 6th order scheme (see Nitadori et al.), compared on the host, SSE, and GPU.
Fig. 6 Speed of N-body simulation compared on the host, GRAPE board and GPU.
Second, we present the current profiling status of our SPH code implementation. Here we assume that the large-scale gravitational forces have been ported to the GRAPE or GPU hardware. The remaining bottleneck lies in the hydrodynamic forces exerted by neighbour particles (see the middle term of Eq. (1)). In the following we discuss a profiling of this part of the code alone.
Fig. 7 Time and speedup of the SPH computations for the FPGA and the GPU: wall-clock time ΔT per shared timestep [sec] versus particle number N [in K], for SPH on the CPU, on the MPRACE-1 and on the GPU, together with the ratios CPU/MPRACE-1 and CPU/GPU.
Fig. 7 shows the time spent in the SPH computations in all the different cases, as well as the speedup (ratio) with respect to the CPU time. The plot shows that the speedup is sustained across the particle sets, amounting to 6-11 times for the FPGA and about 17 times for the GPU. More detailed profiling is required to explore why the MPRACE and GPU curves seem to converge at some point around 1M particles (currently unreachable by the MPRACE hardware).

When considering the computational efficiency, our FPGA designs have a theoretical peak performance of 3.4 GFlop/s for Step 1 and 4.3 GFlop/s for Step 2, and from our measurements the library overhead is negligible and the efficiency is close to 100% for big data sets. In the case of the GPU, however, the theoretical peak performance is 518.4 GFlop/s (1.35 GHz × 16 multiprocessors × 24 warps), but the achieved performance is at most twice that of the FPGA. This means the GPU is quite far from its maximum, leading us to believe there is plenty of room to improve the algorithm and obtain better performance.

In terms of power consumption, the GPU board consumes on the order of 150 W, while the MPRACE-1 uses a mere 20 W. This corresponds to a power efficiency of 4.65 W/GFlop on the MPRACE-1 and 16 W/GFlop on the GPU, or at best 0.28 W/GFlop if the full performance of the GPU could be achieved. Therefore, the MPRACE-1 is about 3.4 times more power efficient than the GPU.
6 Conclusions

We have presented implementations of force computations between particles for astrophysical simulations, using field programmable gate arrays (FPGA) and graphical processing units (GPU), and compared the results with the earlier application-specific GRAPE hardware. The new software is used in custom parallel supercomputers, featuring FPGA and GPU accelerator cards with ultra-fast and low-latency communication networks (InfiniBand) and a moderate number of nodes (32 for the titan GRACE cluster in Heidelberg, and 40 for the new Frontier GPU cluster in Mannheim). The overall parallelization of our codes is very good in this regime of a moderate number of accelerated nodes, as has been published in [23]. For a recent astrophysical application see [?].

The complexity of our particle-based algorithms for gravitating N-body simulations and for smoothed particle hydrodynamics (SPH) simulations scales as

T = αN + δN·N_n + βN²/γ    (5)
where N is the particle number and the other parameters are explained for Eq. (1). The last term on the right-hand side is the long-range gravitational force (complexity N²), which has been implemented on GRAPE and GPU cards with nearly the theoretical peak speed of order 120 Gflop/s; this is close to the peak speed of 160 Gflop/s expected for double-precision computations on this GPU board. The total sustained speed achieved on the titan GRACE cluster for a direct N-body simulation of four million particles was 3.5 Tflop/s [23]. This cluster uses the GRAPE cards [20], with a peak speed of 128 Gflop/s per card (4 Tflop/s for the entire cluster). Our new frontier kolob cluster (40 nodes accelerated with NVIDIA Tesla GPU boards, peak speed of 17 Tflop/s) is presently in the commissioning phase. In full operation we expect a further speed-up to 14 Tflop/s sustained for our N-body simulations; with the most recent Tesla cards providing 1 Tflop/s per card the projected numbers
would go up to 40 Tflop/s peak and 28 Tflop/s sustained, obtained at yet larger particle numbers in the simulation. Because we operate at a balance between computation and communication costs, our results can only be scaled up to larger node numbers if communication bandwidth and latency improve accordingly. This is presently impossible for any large general-purpose supercomputer on the market: while their peak performance is much larger and approaching the Petaflop/s range, our fully interacting direct gravitational N-body systems reach the largest speed on our custom clusters with special hardware.

More advanced direct N-body codes, as well as simulation codes including hydrodynamical forces such as smoothed particle hydrodynamics (SPH), have a more complex scaling, in which the middle term of Eq. (1) becomes dominant at second order once the long-range forces have successfully been accelerated by GRAPE or GPU. This term is usually due to intermediate-range forces. The architecture of our present and future special-purpose supercomputers is tailored to this approach. It is a hybrid architecture of special hardware for astrophysical particle simulations: GPU (or GRAPE) special hardware computes the distant forces (in our applications: gravitational), and other programmable hardware (FPGA or GPU) computes the neighbour forces, which can be gravitational as well, or can represent gas-dynamical forces as they occur in smoothed particle hydrodynamics (SPH) simulations. Once the GPU (or GRAPE) hardware has sped up the distant-force part, the remaining bottleneck lies in the computation of the SPH forces. We have shown here that the computation of SPH forces on programmable FPGA hardware as well as on graphical processing units (GPU) delivers a significant speed-up of this part of the algorithm. Our comparison focused on the FPGA hardware MPRACE-1, currently used in production, and on a new GPU card, just obtained for testing. The new GPU card performs somewhat better than the FPGA hardware, but this is only a momentary result and will change again with newer FPGA hardware (see below).

Both FPGAs and GPUs are viable options for accelerating high performance computing applications. While GPUs provide a very fast and cost-efficient development path, FPGAs may remain competitive. GPUs can be programmed with relative ease and provide raw computational power, which can be exploited more easily in some applications than in others. FPGAs provide custom architectures tailored to the requirements, with a longer development cycle but more efficient solutions. In addition, the low raw power consumption of FPGAs gives them an advantage over GPUs for large installations, where the power costs are a significant part of the operational costs, or at locations where the power supply is limited (e.g. remote locations). However, this gap is expected to close over time, as new and more efficient boards are released. Whether the cost can be justified depends on the power efficiency of the application in question. GPUs could also be used in the future to replace the GRAPE hardware, because they deliver comparable speed for the relatively simple task of computing long-range gravitational forces [26, 30]. Currently we are commissioning our next custom computing cluster "Frontier" with 40 NVIDIA Tesla GPU accelerator cards on 40 dual quad-core Intel nodes.
The peak performance of the new cluster is 17 Tflop/s ( http://www.uni-heidelberg.de/presse/news08/pm281127-1kolob-e.html ), and from previous scalings and benchmarks [23] we expect sustained performances of order 14 Tflop/s for our application codes, with particle numbers out of reach for standard supercomputers.
Acknowledgements We thank the Volkswagenstiftung for funding the GRACE project under I/80 041-043, and the Ministry of Science, Research and the Arts of Baden-Württemberg (Az: 823.219-439/30 and /36). This project is also partly funded by the German Science Foundation (DFG) under SFB 439 (sub-project B11) "Galaxies in the Young Universe" at the University of Heidelberg. Furthermore we acknowledge a computing time grant obtained from the DEISA project with FZ Jülich. PB thanks the Ukrainian National Academy of Sciences for the special support of his work under the Main Astronomical Observatory "GRAPE/GRID" computing cluster project. The "Frontier" cluster is funded by the excellence funds of the University of Heidelberg in the Frontier scheme.
References
1. Aarseth, S. J., 2003. Gravitational N-Body Simulations. ISBN 0521432723. Cambridge, UK: Cambridge University Press, November 2003.
2. Barnes, J., Hut, P., 1986. A Hierarchical O(N log N) Force-Calculation Algorithm. Nature 324, 446–449.
3. van Albada, T. S., van Gorkom, J. H., Jan. 1977. Experimental Stellar Dynamics for Systems with Axial Symmetry. A&A 54, 121.
4. Greengard, L., Rokhlin, V., Dec. 1987. A fast algorithm for particle simulations. Journal of Computational Physics 73, 325–348.
5. Miller, R. H., Prendergast, K. H., Feb. 1968. Stellar Dynamics in a Discrete Phase Space. ApJ 151, 699.
6. Efstathiou, G., Eastwood, J. W., 1981. On the clustering of particles in an expanding universe. MNRAS 194, 503–525.
7. Spurzem, R., Sep. 1999. Direct N-body Simulations. Journal of Computational and Applied Mathematics 109, 407–432.
8. Dubinski, J., 1996. A parallel tree code. New Astronomy 1, 133–147.
9. Pearce, F. R., Couchman, H. M. P., 1997. Hydra: a parallel adaptive grid code. New Astronomy 2, 411–427.
10. Makino, J., Oct. 2002. An efficient parallel algorithm for O(N²) direct summation method and its variations on distributed-memory parallel machines. New Astronomy 7, 373–384.
11. Dorband, E. N., Hemsendorf, M., Merritt, D., 2003. Systolic and hyper-systolic algorithms for the gravitational N-body problem, with an application to Brownian motion. Journal of Computational Physics 185, 484–511.
12. Makino, J., Taiji, M., 1998. Scientific Simulations with Special-Purpose Computers: The GRAPE Systems. Chichester; Toronto: John Wiley and Sons.
13. Sugimoto, D., Chikada, Y., Makino, J., Ito, T., Ebisuzaki, T., Umemura, M., 1990. Nature 345, 33.
14. Makino, J., Hut, P., 1988. Performance analysis of direct N-body calculations. ApJS 68, 833.
15. Makino, J., 1996. Postcollapse Evolution of Globular Clusters. ApJ 471, 796.
16. Monaghan, J. J., 1992. Smoothed particle hydrodynamics. ARA&A 30, 543.
17. Klessen, R., Clark, P. C., 2007. Modeling Star Formation with SPH. In SPHERIC Proceedings, page 133.
18. Springel, V., 2005. The cosmological simulation code GADGET-2. MNRAS 364, 1105.
19. SPHERIC SPH European Research Interest Community, http://wiki.manchester.ac.uk/spheric/index.php.
20. Fukushige, T., Makino, J., Kawai, A., 2005. GRAPE-6A: A single-card GRAPE-6 for parallel PC-GRAPE cluster systems. PASJ 57, 1009–1021.
21. Kawai, A., Makino, J., Ebisuzaki, T., 2004. Performance Analysis of High-Accuracy Tree Code Based on the Pseudoparticle Multipole Method. ApJS 151, 13.
22. Ahmad, A., Cohen, L., Feb. 1973. Random Force in Gravitational Systems. ApJ 179, 885–896.
23. Harfst, S., Gualandris, A., Merritt, D., Spurzem, R., Portegies Zwart, S., Berczik, P., 2007. Performance Analysis of Direct N-Body Algorithms on Special-Purpose Supercomputers. New Astronomy 12, 357.
24. Aarseth, S. J., Nov. 1999. From NBODY1 to NBODY6: The Growth of an Industry. PASP 111, 1333–1346.
25. Makino, J., Aarseth, S. J., 1992. On a Hermite integrator with Ahmad-Cohen scheme for gravitational many-body problems. PASJ 44, 141–151.
26. Belleman, R., Bédorf, J., Portegies Zwart, S., Feb. 2008. High performance direct gravitational N-body simulations on graphics processing units II: An implementation in CUDA. New Astronomy 13(2), 103–112.
27. Benz, W., 1990. Smooth particle hydrodynamics: A review. In The Numerical Modelling of Nonlinear Stellar Pulsations, pages 269–288.
28. Berczik, P., Nakasato, N., Berentzen, I., Spurzem, R., Marcus, G., Lienhart, G., Kugel, A., Maenner, R., Burkert, A., Wetzstein, M., Naab, T., Vazquez, H., Vinogradov, S., 2007. Special, hardware accelerated, parallel SPH code for galaxy evolution. In SPHERIC Proceedings, pages 5–8.
29. Hamada, T., Fukushige, T., Kawai, A., Makino, J., 1998. PROGRAPE-1: a programmable special-purpose computer for many-body simulations. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, pages 256–257, 15-17 Apr 1998.
30. Hamada, T., Iitaka, T., 2007. The chamomile scheme: An optimized algorithm for N-body simulations on programmable graphics processing units. http://jp.arxiv.org/abs/astro-ph/0703100v1
31. Lienhart, G., Kugel, A., Maenner, R., 2006. Rapid development of high performance floating-point pipelines for scientific simulation. In RAW Proceedings.
32. Liu, G., Liu, M., 2005. Smoothed Particle Hydrodynamics: a meshfree particle method. World Scientific, Singapore.
33. Marcus, G., Lienhart, G., Kugel, A., Maenner, R., 2006. On buffer management strategies for high performance computing with reconfigurable hardware. In FPL, pages 343–348. IEEE.
34. Monaghan, J., 1994. Simulating free surface flows with SPH. Journal of Computational Physics 110, 399–406.
35. Nakasato, N., Hamada, T., 2005. Astrophysical hydrodynamics simulations on a reconfigurable system. In FCCM Proceedings, pages 279–280. IEEE.
36. Nguyen, H., 2008. GPU Gems 3. Addison-Wesley, New York.
37. Scrofano, R., Gokhale, M. B., Trouw, F., Prasanna, V. K., 2007. Accelerating molecular dynamics simulations with reconfigurable computers. IEEE Transactions on Parallel and Distributed Systems.
38. Milosavljević, M., Merritt, D., 2001. Formation of Galactic Nuclei. ApJ 563, 34.
39. Milosavljević, M., Merritt, D., 2003. Long-Term Evolution of Massive Black Hole Binaries. ApJ 596, 860.
40. Makino, J., Funato, Y., 2004. Evolution of Massive Black Hole Binaries. ApJ 602, 93.
41. Berczik, P., Merritt, D., Spurzem, R., 2005. Long-Term Evolution of Massive Black Hole Binaries. II. Binary Evolution in Low-Density Galaxies. ApJ 633, 680.
42. Berczik, P., Merritt, D., Spurzem, R., Bischof, H.-P., 2006. Efficient Merger of Binary Supermassive Black Holes in Nonaxisymmetric Galaxies. ApJ 642, L21.
43. Kustaanheimo, P., Stiefel, E., 1965. Journal für die reine und angewandte Mathematik 218, 204.
44. Khalisi, E., Omarov, C. T., Spurzem, R., Giersz, M., Lin, D. N. C., 2003. In "High Performance Computing in Science and Engineering", Springer Verlag, pp. 71–89.
45. Berentzen, I., Preto, M., Berczik, P., Merritt, D., Spurzem, R., 2009. ApJ, in press.
46. Preto, M., Berentzen, I., Berczik, P., Spurzem, R., 2008, in prep.
47. Vinogradov, S., Berczik, P., 2006. The study of colliding molecular clumps evolution. A&ApTr 25(4), 299–316.