Accelerating Astrophysical Particle Simulations with Programmable Hardware (FPGA and GPU)

R. Spurzem (2), P. Berczik (2), G. Marcus (1), A. Kugel (1), G. Lienhart (3), I. Berentzen (2,4), R. Männer (1), R. Klessen (4), R. Banerjee (4)

(1) University of Heidelberg, Dept. of Computer Science V, Central Inst. of Computer Engineering, located in Mannheim; {marcus,kugel,maenner}@ti.uni-mannheim.de
(2) University of Heidelberg, Astronomisches Rechen-Institut (ZAH); {berczik,berentzen,spurzem}@ari.uni-heidelberg.de
(3) Silicon Software; [email protected]
(4) University of Heidelberg, Inst. für Theor. Astrophysik (ZAH); {klessen,banerjee}@ita.uni-heidelberg.de

March 31, 2010
Abstract

In a previous paper we have shown that direct gravitational N-body simulations in astrophysics scale very well on moderately parallel supercomputers (of order 10-100 nodes). The best balance between computation and communication is reached if the nodes are accelerated by special-purpose hardware. In this paper we describe the implementation of particle-based astrophysical simulation codes on new types of accelerator hardware (field-programmable gate arrays, FPGA, and graphical processing units, GPU). In addition to direct gravitational N-body simulations we also use the algorithmically similar "smoothed particle hydrodynamics" (SPH) method as a test application; the algorithms are used for astrophysical problems such as the evolution of galactic nuclei with central black holes and gravitational wave generation, and star formation in galaxies and galactic nuclei. We present the code performance on a single node using different kinds of special hardware (traditional GRAPE, FPGA, and GPU) and some implementation aspects (e.g. accuracy). The results show that GPU hardware for real application codes is as fast as GRAPE, but at an order of magnitude lower price, and that FPGAs are useful for the acceleration of complex sequences of operations (like SPH). We discuss future prospects and new cluster computers built with new generations of FPGA and GPU cards.
1 Introduction
Numerical algorithms for solving the gravitational N-body problem [?] have evolved along two basic lines in recent years. Direct-summation codes compute the complete set of N^2 interparticle forces at each time step; these codes are designed for systems in which the finite-N graininess of the potential is important or in which binary- or multiple-star systems form, and until recently they were limited by their O(N^2) scaling to moderate (N ≤ 10^5) particle numbers. A second class of N-body algorithms replaces the direct summation of forces from distant particles by an approximation scheme. Examples are the Barnes-Hut tree code [?], which reduces the number of force calculations by subdividing particles into an oct-tree, and fast multipole algorithms, which represent the large-scale potential via a truncated basis-set expansion [?, ?] or on a grid [?, ?]. These algorithms have a milder, O(N log N) or even O(N), scaling for the force calculations and can handle much larger particle numbers, although their accuracy can be substantially lower than that of the direct-summation codes [?]. A natural way to increase both the speed and the particle number in an N-body simulation is to parallelize [?, ?]. Communication inevitably becomes the bottleneck when increasing the particle number. The best such schemes use systolic algorithms (in which the particles are rhythmically passed around a ring of processors) coupled with non-blocking communication between the processors to reduce the latency [?, ?].

A major breakthrough in direct-summation N-body simulations came in the late 1990s with the development of the GRAPE series of special-purpose computers [?], which achieve spectacular speedups by implementing the entire force calculation in hardware and placing many force pipelines on a single chip. The design, construction and implementation of GRAPE chips, and their further integration into performant accelerator boards called GRAPE boards, goes back to an initiative of D. Sugimoto, J. Makino (Univ. of Tokyo) and P. Hut (Inst. for Advanced Study in Princeton) [?]. Statistical models had predicted a certain type of gravothermal oscillation in the dense cores of globular star clusters (which are, to first order, million-body gravitating N-body systems); it had so far been impossible to prove that such a physical phenomenon, similar to heat conduction in gaseous systems, exists in a discrete stellar N-body system. With GRAPE the successful detection of gravothermal oscillations in direct N-body models was made [?]. One of the authors of this paper (RS) visited Univ. of Tokyo several times in the 1990s and witnessed how astrophysics students and postdocs were wire-wrapping the prototypes of the new GRAPE hardware. It became a quasi-standard for superfast N-body hardware, several Gordon Bell prizes were won, and many institutions in the world still use GRAPE today (e.g. Rochester Inst. of Technology, Rochester, New York, USA; Univ. Amsterdam, NL; Univ. Marseille, FR; Univ. Tokyo and RIKEN Institute, Japan; Univ. Heidelberg, D; to mention only a few selected examples).

The present GRAPE-6, in its standard implementation (32 chips, 192 pipelines), can achieve sustained speeds of about 1 Tflop/s at a cost of just ~$50K. In a standard setup, the GRAPE-6 is attached to a single host workstation, in much the same way that a floating-point or graphics accelerator card is used. Advancement of particle positions [O(N)] is carried out on the host computer, while interparticle forces [O(N^2)] are computed on the GRAPE. More recently, "mini-GRAPEs" (GRAPE-6A) [?] have become available, which are designed to be incorporated into the nodes of a parallel computer. The mini-GRAPEs place four processor chips on a single PCI card and deliver a theoretical peak performance of ~131 Gflop/s for systems of up to 128k particles, at a cost of ~$6K. By incorporating mini-GRAPEs into a cluster, both large (≥ 10^6) particle numbers and high (≥ 1 Tflop/s) speeds can in principle be achieved.

The requirements of the GRAPE hardware, however, forced scientists to use less efficient algorithms than were originally available. Without significant loss of accuracy, the Ahmad-Cohen (AC) neighbour scheme can be used [?], as implemented in codes like NBODY5 or NBODY6 [?]. In the AC scheme, the full forces are computed only every tenth timestep or so; in the smaller intervals, the forces from nearby particles (the "irregular" force) are updated using a high-order scheme, while those from the more distant ones (the "regular" force) are extrapolated using a second-degree polynomial. A parallel implementation of NBODY6, including the AC scheme, exists, but only for general-purpose parallel computers [?]; the algorithm has not yet been adapted to systems with GRAPE hardware. The main reason is that the GRAPE hardware, and the custom-built GRAPE chip, have a fixed design (ASIC = application-specific integrated circuit), with design cycles of several years for a new version. The AC scheme needs partial forces and, in some parts, higher accuracy than the GRAPE provides. To illustrate the problem it is instructive to look at the scaling of the N-body algorithms:

    T = αN + δN·N_n + βN²/γ        (1)

Here T is the wall-clock time needed by our algorithm to advance the system by a certain physical time, N is the total particle number, and N_n is a typical neighbour number for the AC scheme (of order 50-200). α, δ, β are time constants and γ is the efficiency factor of the AC scheme; γ = 10 means that the full gravitational force for any particle needs to be computed only every tenth step. The GRAPE hardware is unable to compute neighbour forces efficiently, so one is forced to use N_n = 0 and consequently γ = 1. Since the speed-up achieved by GRAPE for β is so large (typically several hundred), this is a price worth paying to still reach a decent overall speedup. So, on hardware with GRAPE, the use of simpler parallel N-body algorithms has been enforced [?]. The reader interested in more details of the algorithms and performance analysis is referred to [?, ?].

One may wonder why the N^2 part of the algorithms could not be removed using TREE schemes or even fast multipole methods [?, ?]. The answer lies partly in the very high accuracy requirements of very long-term integrations and partly in the strongly inhomogeneous nature of astrophysical systems, where a combination of brute-force algorithms with individual time step schemes proved to be more efficient even than fast multipoles [?, ?].

Particle-based astrophysical simulations in general comprise a much larger class of applications than just gravitational N-body simulations. Gas dynamical forces (pressure forces, viscous forces) are frequently modelled using the smoothed particle hydrodynamics (SPH) approach [?]. This algorithm assigns thermodynamic quantities (such as pressure and temperature) to particles, while in standard N-body models particles just have a mass. By appropriate averaging over neighbour particles (see the detailed description of the algorithm below) the proper fields of thermodynamic quantities can be defined; moving the particles, much like in a gravitating N-body model, under the influence of all forces with some appropriate time step then provides fairly accurate solutions of complex partial differential equations, such as the Navier-Stokes equations for astrophysical plasmas [?]. State-of-the-art parallel algorithms for SPH simulations exist, e.g. in the form of the Gadget code [?]. An analysis of the algorithm reveals practically the same structure as before for the direct N-body code using an AC scheme, with two important differences: (i) the number of floating-point operations required for the neighbour forces is some factor of five larger, because we have to deal with much more complex forces than just gravitational interactions; (ii) for many applications a TREE approximation is sufficiently accurate for the distant gravitational interactions, so the potential speed-up obtained by using GRAPE for this component is small. The interested reader can find a TREE implementation using GRAPE [?, ?] as well as reports from a broad community in science and engineering using SPH in the yearly SPHERIC conferences [?]. Note that some of the industrial applications do not require gravity (long-range forces) at all.
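Returning to the scaling relation Eq. (1): to make the balance between its three terms concrete, the short host-side C++ sketch below evaluates them for a few particle numbers. The constants α, δ, β and the AC parameters used here are placeholder values chosen only for illustration; they are not measurements from this paper.

    #include <cstdio>

    // Illustrative evaluation of T = alpha*N + delta*N*Nn + beta*N^2/gamma (Eq. 1).
    // All constants below are assumed placeholders, not measured values.
    int main() {
        const double alpha = 1.0e-7;   // host work per particle [s] (assumed)
        const double delta = 1.0e-8;   // neighbour-force cost per pair [s] (assumed)
        const double beta  = 1.0e-9;   // long-range force cost per pair [s] (assumed)
        const double Nn    = 100.0;    // typical AC neighbour number
        const double gamma = 10.0;     // AC efficiency: full forces every ~10th step

        for (double N = 1.0e4; N <= 1.0e6; N *= 10.0) {
            const double tHost  = alpha * N;
            const double tNeigh = delta * N * Nn;
            const double tFar   = beta * N * N / gamma;
            std::printf("N = %.0e: host %.3g s, neighbour %.3g s, long-range %.3g s\n",
                        N, tHost, tNeigh, tFar);
        }
        return 0;
    }

Even with such rough numbers one sees how the N^2 term dominates at large N unless it is accelerated, after which the neighbour term becomes the bottleneck.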
Figure 1: Left: GRACE titan cluster at the Astronomisches Rechen-Institut, ZAH, Univ. of Heidelberg; 32 nodes with GRAPE and FPGA accelerators, 4 Tflop/s peak speed, installed 2006. Right: Frontier kolob cluster at ZITI Mannheim; 40 nodes with GPU accelerators, 17 Tflop/s peak speed, installed 2008.

Recently, graphical processing units (GPUs) have become increasingly usable for high-performance numerical computations [?, ?, ?]. In particular the astrophysical simulation community seems to be eager to use this new facility (see the Princeton workshop http://astrogpu.org). The CUDA libraries have been particularly helpful in encouraging astrophysicists and other scientists to start testing and using GPUs [?]. Academic institutions and industrial sites are currently planning to build large clusters using GPUs as accelerator boards. Why is that so? Three main reasons can be identified: (i) GPU cards are produced for the industrial mass market, their production cycle is short and their price-performance ratio is very good; (ii) GPUs can reach very high performance for simple algorithmic structures and data flows (see citations above), comparable to if not faster than GRAPE; (iii) GPUs can be programmed to work also for more complex algorithms, potentially with less speed-up, but allowing the implementation of SPH and AC-scheme algorithms. Compared to GPUs, GRAPE cards in particular have a very long production cycle and are not as cheap as GPUs, while FPGA hardware tends to have a relatively high price-performance ratio. One has to bear in mind, however, that such statements are subject to quick change, due to fast product cycles as well as to cycles imposed by scientific project funding. The promising property of GPUs, in our view, is that they can potentially substitute both types of special hardware we have used so far: GRAPE for highest-performance distant forces, MPRACE for medium-performance but complex force algorithms.

In this paper we describe and compare the performance obtained for the example of the SPH algorithm (the AC neighbour scheme behaves very similarly); the architectures compared are those of our presently running GRACE cluster and a recent GPU. More details on the hardware and on the variant of the SPH algorithm used are given below. We would also stress that while this comparison may seem a bit unfair (comparing an older FPGA board with one of the most recent GPUs), it is nevertheless extremely instructive, and it is chosen because this is the hardware we currently have available, including all the necessary software environment. The present cluster containing FPGAs will be used for some time still in heavy production, while the GPU card may be the hardware of choice for a next-generation system, which is currently under consideration. We are also currently testing a new generation of FPGA hardware, called MPRACE-2, which at the time of writing this article was not ready for benchmarks with application codes. In the discussion we will interpret our results in terms of future perspectives for each type of special hardware, FPGA and GPU. The bottom line is that we think our comparison results are still extremely useful, because they use practically the same software and software interfaces, just different accelerator hardware. At the same time, however, one has to take into account the different generations of hardware used and should not one-dimensionally interpret the results in favour of either GPU or FPGA.

Current GPU implementations of the gravitational part achieve over 300 GFlops on a single board. This is plausible: the gravity calculation accounts for > 90% of the time in pure-CPU codes, is O(N^2) in its basic form, and is highly parallelizable. With particle numbers > 10^6, the SPH part is also computationally very demanding once the gravity has been accelerated. It is also more complex, as it is based on an interaction list between particles. We support such calculations, which can be applied to a wide variety of practical problems besides astrophysics, by programmable hardware. Related work includes the developments of Hamada and Nakasato [?]; similar work has been done for molecular dynamics (MD) [?], as the algorithms are comparable.
2 Astrophysical Terms Implemented

We implemented forces between particles: purely gravitational forces on all kinds of hardware, and hydrodynamical forces using the SPH algorithm on FPGA and GPU hardware. In this paper we only describe comparative profiling for the single-node computation of gravitational forces (on GRAPE and GPU) and of SPH forces (on FPGA and GPU). In both cases field particle data (masses, positions, velocities, for high-order algorithms also higher derivatives) are transferred to the memory linked to our accelerator boards; thereafter a stream of test particles is passed to the hardware, which then computes in parallel pipelines the expression

    a_i = − Σ_{i≠j} m_j (r_i − r_j) / |r_i − r_j|³        (2)

where we have used a coordinate scaling such that the gravitational constant G = 1. For obvious reasons the particles in the memory of the accelerator board are called j-particles, and the test particles, for which the simulation code requires new gravitational forces, are called i-particles. For details about the algorithm with individual hierarchical time steps and a copy algorithm for parallelization the reader is referred to [?, ?, ?].

The SPH method will only be described very briefly here; the interested reader is referred to [?] for our implementation-specific issues. Several thorough general reviews of this method have been published by W. Benz [?], Monaghan [?] and Liu [?]. In SPH the gaseous matter is represented by particles. To form a continuous distribution of gas from these particles, they are smoothed by convolving their discrete mass distribution with a smoothing kernel W. This kernel is a strongly peaked function around zero and is non-zero only in a limited area. In contrast to the purely gravitational simulations, the SPH method requires two steps, the first of which computes the new density at the location of the i-particles from the j-particle distribution, using the smoothing kernel over N_n neighbour particles (with N_n ≈ 50-200); the discretized formula for the density ρ_i reads

    ρ_i = Σ_{j=1}^{N} m_j W(r_i − r_j, h)        (3)

The next equation shows the physical formula for the acceleration of an i-particle due to the forces from the gradient of the pressure P and the so-called artificial viscosity a_i^visc:

    a_i = −(1/ρ_i) ∇P_i + a_i^visc        (4)

The artificial viscosity is introduced to make the method capable of simulating shock waves [?, ?]. The SPH method is typically computed as a two-stage algorithm. In the first stage, which shall be denoted as SPH Step 1, the densities are calculated according to formula (3). Subsequently the pressure P is calculated via an equation of state. In the second stage, denoted as SPH Step 2, the forces on the particles and other quantities like the dissipated energy are calculated as expressed in formula (4). According to the forces, the trajectories of the particles are integrated before starting again with SPH Step 1 of the next time step.
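As a point of reference for the accelerator implementations described in the following sections, the plain C++ sketch below spells out the two SPH stages for precomputed neighbour lists. The cubic spline kernel and the simple polytropic equation of state are assumptions made only for this illustration (neither is specified above), the symmetrized pressure-gradient form is one common discretization of Eq. (4), and the artificial viscosity term is omitted.

    #include <cmath>
    #include <vector>

    // Reference CPU sketch of SPH Step 1 (density, Eq. 3) and Step 2 (pressure force, Eq. 4).
    // Assumed for this illustration: cubic spline kernel, polytropic equation of state,
    // symmetrized pressure-gradient form; artificial viscosity omitted.

    struct Particle {
        double r[3], v[3];     // position, velocity
        double m, h;           // mass, smoothing length
        double rho, P;         // density, pressure
        double a[3];           // acceleration
    };

    static const double PI = 3.14159265358979323846;

    // Cubic spline kernel W(|r|, h) in 3D.
    static double kernelW(double rlen, double h) {
        const double q = rlen / h, sigma = 1.0 / (PI * h * h * h);
        if (q < 1.0) return sigma * (1.0 - 1.5 * q * q + 0.75 * q * q * q);
        if (q < 2.0) { const double t = 2.0 - q; return sigma * 0.25 * t * t * t; }
        return 0.0;
    }

    // Radial derivative dW/dr of the same kernel.
    static double kernelDW(double rlen, double h) {
        const double q = rlen / h, sigma = 1.0 / (PI * h * h * h * h);
        if (q < 1.0) return sigma * (-3.0 * q + 2.25 * q * q);
        if (q < 2.0) { const double t = 2.0 - q; return -sigma * 0.75 * t * t; }
        return 0.0;
    }

    // SPH Step 1: density at each i-particle from its neighbour list (Eq. 3), then pressure.
    void sphStep1(std::vector<Particle>& p,
                  const std::vector<std::vector<int>>& neighbours,
                  double gammaEos = 5.0 / 3.0, double entropy = 1.0) {
        for (std::size_t i = 0; i < p.size(); ++i) {
            double rho = 0.0;
            for (int j : neighbours[i]) {
                const double dx = p[i].r[0] - p[j].r[0];
                const double dy = p[i].r[1] - p[j].r[1];
                const double dz = p[i].r[2] - p[j].r[2];
                rho += p[j].m * kernelW(std::sqrt(dx * dx + dy * dy + dz * dz), p[i].h);
            }
            p[i].rho = rho;
            p[i].P   = entropy * std::pow(rho, gammaEos);   // placeholder equation of state
        }
    }

    // SPH Step 2: pressure-gradient acceleration (Eq. 4, artificial viscosity omitted).
    void sphStep2(std::vector<Particle>& p,
                  const std::vector<std::vector<int>>& neighbours) {
        for (std::size_t i = 0; i < p.size(); ++i) {
            double acc[3] = {0.0, 0.0, 0.0};
            for (int j : neighbours[i]) {
                double d[3] = { p[i].r[0] - p[j].r[0], p[i].r[1] - p[j].r[1],
                                p[i].r[2] - p[j].r[2] };
                const double rlen = std::sqrt(d[0] * d[0] + d[1] * d[1] + d[2] * d[2]);
                if (rlen == 0.0) continue;                  // skip self-interaction
                const double fac = p[j].m
                    * (p[i].P / (p[i].rho * p[i].rho) + p[j].P / (p[j].rho * p[j].rho))
                    * kernelDW(rlen, p[i].h) / rlen;
                for (int k = 0; k < 3; ++k) acc[k] -= fac * d[k];
            }
            for (int k = 0; k < 3; ++k) p[i].a[k] = acc[k];
        }
    }

This reference structure (Step 1, equation of state, Step 2) is exactly what is mapped onto the FPGA and GPU pipelines described below.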
3 FPGA implementation
In this section we describe the implementation of an FPGA-based coprocessor for SPH computations.
Field Programmable Gate Arrays (FPGAs) are specialized chips whose building blocks can be dynamically reconfigured. With this capability they can implement electronic circuits, or hardware designs (as we call them), by programming the interconnections between the blocks. This involves the creation of a library of floating-point operators to support the required computations, their assembly into a processing pipeline, and the construction of the required control and communication logic. As our platform of choice we use the MPRACE-1 coprocessor boards developed in-house, which are equipped with a Xilinx XC2V3000 FPGA, 8 MB of ZBT-RAM and an external PLX9656 for PCI communication with the host at up to 264 MB/s. The migration to our new MPRACE-2 boards is currently in progress.

Figure 2: Design features of the new MPRACE-2 board

3.1 Hardware Architecture

The coprocessor design consists of one or more computational pipelines, external memory to store the particle data, an output FIFO at the end of the pipeline to store the results, plus the required communication and control logic. An overview is shown in Fig. 3, which includes at the bottom the communication with the host and at the top the external memory modules used. The current design fits one SPH pipeline working at 62.5 MHz and capable of storing up to 256k particles in memory. For each timestep, the coprocessor loads the particle data (position, velocity, mass, etc.) into external memory. After selecting Step 1 or Step 2 computations, neighbour lists are sent in sequence from the host in the format ip,N,jp1..jpN, with ip the index of the i-particle, N the number of neighbours, and jpX the corresponding j-particle indexes. Neighbour lists are received from the host and processed immediately at a rate of one neighbour interaction per cycle. Accumulated results are pushed into an output FIFO for later retrieval. Since the FIFO is of very limited size, results must be retrieved by the software regularly. In order to reduce the size of the pipelines, increase their speed and accommodate more particles in the available memory, the precision of the floating-point operators is reduced to a 16-bit mantissa and an 8-bit exponent, except for the accumulators. Given the described scheme, it is clear that the algorithm is of order O(nm), where n is the number of particles and m the average number of neighbours. Therefore, the overall performance of the processor is driven by the communication time (how fast the neighbour lists can be sent to the board) and the clock frequency of the coprocessor (how fast an interaction can be dispatched).
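To make the interface concrete, the following sketch packs neighbour lists into the ip,N,jp1..jpN stream described above before they are sent to the board. The flat integer vector used as a transfer buffer is an illustrative stand-in for the actual DMA buffer handling of the library.

    #include <cstdint>
    #include <vector>

    // Pack neighbour lists into the stream format "ip, N, jp1 .. jpN" consumed by the
    // coprocessor. The flat uint32_t buffer stands in for the real DMA transfer buffer;
    // particle indices are assumed to fit into 32 bits.
    std::vector<uint32_t> packNeighbourLists(
            const std::vector<std::vector<uint32_t>>& neighbours) {
        std::vector<uint32_t> stream;
        for (std::size_t ip = 0; ip < neighbours.size(); ++ip) {
            const auto& list = neighbours[ip];
            stream.push_back(static_cast<uint32_t>(ip));           // index of the i-particle
            stream.push_back(static_cast<uint32_t>(list.size()));  // number of neighbours N
            stream.insert(stream.end(), list.begin(), list.end()); // jp1 .. jpN
        }
        return stream;
    }

Because the pipeline consumes one neighbour interaction per cycle, the time to stream this buffer to the board is the quantity that dominates the overall throughput.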
Figure 3: FPGA architecture

To optimize communication, the loading of the particle data is done in two parts. For Step 1, positions, velocities, mass and smoothing length (plus sound speed) are loaded, while the additional values for Step 2 (including the updated density, Balsara factor and pressure) are uploaded only before starting the Step 2 computations. Additionally, as mentioned earlier, the pipelines are capable of switching between Step 1 and Step 2 computations, reusing several common parts of the operation flow. This allows further savings in chip area, making it possible to use a single design for all required operations.
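The reduced operator precision mentioned in Sect. 3.1 (16-bit mantissa, 8-bit exponent) can be pictured as repacking an IEEE 754 single-precision value into a narrower word. The layout below (1 sign bit, 8 exponent bits, 16 mantissa bits, mantissa truncated) is only a plausible illustration; the actual encoding and rounding used in the MPRACE-1 pipelines are not specified here.

    #include <cstdint>
    #include <cstring>

    // Illustrative reduction of an IEEE 754 single-precision value to a 25-bit format
    // with 8 exponent bits and 16 mantissa bits (sign | exponent | mantissa). The real
    // bit layout and rounding of the MPRACE-1 operators may differ; this sketch simply
    // truncates the lowest 7 mantissa bits.
    uint32_t toReducedPrecision(float x) {
        uint32_t bits;
        std::memcpy(&bits, &x, sizeof(bits));          // reinterpret the IEEE 754 layout
        uint32_t sign     = bits >> 31;
        uint32_t exponent = (bits >> 23) & 0xFFu;      // 8 exponent bits kept as-is
        uint32_t mantissa = (bits >> 7) & 0xFFFFu;     // keep the 16 most significant mantissa bits
        return (sign << 24) | (exponent << 16) | mantissa;
    }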
3.2 Software Tools

Each floating-point operator required considerable development time, and many operations can be optimized depending on their operand size or signed/unsigned results. Other specialized operations, like squaring a value, can also be optimized compared to a generic multiplier. To gather all these advantages into a coherent interface, a floating-point library was developed. This library provides parameterized operators to select the desired precision (exponent and mantissa), specialized operators optimized for speed and area, simulation capabilities to simplify hardware verification, and a range of high-performance accumulators. The pipelines for the SPH computations contain tens of operators, and putting them together is a time-consuming and error-prone task. To add flexibility and reliability, a pipeline generator was developed. This software tool receives a program-like description of the operations to perform and produces a hardware pipeline, using the floating-point library as the building blocks. This makes the pipeline correct by design, and it is very flexible for adjusting the precision of operands or introducing changes in the operations performed. A more complete description can be found in [?].

3.3 Software Library Interfaces

In order to use the capabilities of the coprocessor efficiently, we developed a software library for the C/C++/FORTRAN languages. This library provides the user with a clean interface to the SPH functionality while hiding coprocessor-specific details like design loading, buffer creation, data partitioning and FIFO flushing. In addition, an emulation core is provided, which allows the library to perform the same operations with the host CPU only. Particular attention was paid to the interface of the library with existing applications. Since the coprocessor performance is directly proportional to the communication between the host and the board, a generic buffer management scheme [?] was implemented, allowing the library to access the data structures of the application directly, for direct conversion between the formats of the application and the coprocessor.
4 GPU implementation

Graphics Processing Units (GPUs) are highly parallel, programmable processors used primarily for image processing and visualization in workstations and entertainment systems. As they have grown in complexity, they have added many features, including programmability, floating-point capability and wider and faster interfaces. As a final addition, the latest versions include APIs to program them as custom accelerators, enabling us to test the performance of the platform with relative ease.

4.1 Hardware and Software Model

As said previously, GPUs are in essence a highly parallel, programmable architecture. In its current incarnation, a board can have up to 128 parallel processing elements and 1.5 GB of high-speed RAM on a 384-bit bus. Such a level of parallelism comes at the cost of the memory hierarchy and coherency, committing most on-chip resources to computation instead of cache memory.

Figure 4: Schematic organization of a GPU (per-thread registers and local memory, per-multiprocessor shared memory, and global, constant and texture memory accessible to all threads).

A sketch of the organization of an NVIDIA GPU like the one used for this publication is depicted in Fig. 4. The board (and processor), referred to as a GeForce 8800 GTX, consists of 16 multiprocessors, each one capable of managing 8 threads in parallel in SIMD fashion, for a total of 128 threads, and communicates with the host over a 16-lane PCI-Express bus. Each multiprocessor has a small shared memory (~16 KB) accessible to all threads of the same multiprocessor, a big register file (8K registers) which is split among the allocated threads, and interfaces to the main (global) memory. It is important to note that none of these memories are cached. In contrast, special read-only regions referred to as constant memory (for constant values) and texture memory (for large data references/interpolated data) are cached. Having regions without cache, without write coherency and with several penalties for the memory access patterns used adds complications to the implementation of the algorithms. Fortunately, NVIDIA provides an API and computational model to make efficient use of the processors. The CUDA (Compute Unified Device Architecture) library provides a C-like programming language and compiler, with specific extensions for the platform. The library and API make interfacing with the board a very easy task. In contrast to other architectures, NVIDIA GPUs base their computing model on several levels of units of work, where the most basic one is very similar to a lightweight thread (referred to simply as a thread), instead of the actual number of processing elements available. Threads can be grouped into blocks, which are assigned to a single multiprocessor. Blocks are in turn organized in a grid, which represents the current workload. This thread model allows the hardware to scale more easily with the number of computational units available and, by using a very high number of threads and a simple scheduler, to effectively hide the latency of memory accesses for each thread. At the same time, it still allows precise control of the work assigned to a single multiprocessor and the proper discovery of sibling threads for communication.
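The following minimal CUDA example illustrates the thread/block/grid model just described: each thread handles one array element, threads are grouped into blocks that are scheduled onto multiprocessors, and a small __shared__ buffer plays the role of the per-multiprocessor on-chip memory. The sizes are illustrative, not tuned values from our codes.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Minimal illustration of the thread/block/grid model: one element per thread,
    // threads grouped into blocks, a per-block __shared__ buffer as on-chip memory.
    __global__ void scaleKernel(const float* in, float* out, int n) {
        __shared__ float tile[128];                       // per-block shared memory
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = (idx < n) ? in[idx] : 0.0f;   // stage data through shared memory
        __syncthreads();
        if (idx < n) out[idx] = 2.0f * tile[threadIdx.x]; // trivial per-thread work
    }

    int main() {
        const int n = 1 << 20;
        float *dIn = nullptr, *dOut = nullptr;
        cudaMalloc(&dIn, n * sizeof(float));              // input left uninitialized: this
        cudaMalloc(&dOut, n * sizeof(float));             // only demonstrates the launch

        const int threadsPerBlock = 128;                  // must match the tile size above
        const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // the grid
        scaleKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, n);
        cudaDeviceSynchronize();

        cudaFree(dIn);
        cudaFree(dOut);
        return 0;
    }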
4.2 Algorithm description

The current implementation of the SPH algorithm on the GPU is based on the communication interface developed originally for our FPGA accelerator. This means we try to minimize changes to the internal data formats and the communication patterns. In the preparation stage, the software library preallocates the necessary memory on the device to store the particle data, the neighbour lists and the results. The neighbour lists are described as for the FPGA, as one big array in the format ip,N,jp1..jpN, with ip the index of the i-particle, N the number of neighbours, and jpX the corresponding j-particle indexes. Processing is still divided into two steps, but the neighbour lists are sent to the board memory only once, during Step 1, and reused for Step 2. Results are read back all at once at the end of each step. As to how to parallelize the work among the threads, we were confronted with several options. We finally decided to create one thread per i-particle, i.e. one thread per neighbour list. For this, we provide the GPU with an additional array containing the initial offset of each neighbour list in the NL array, which we call cutpoints. This array is a byproduct of the creation of the NL array, so we incur no extra cost. It enables each thread to compute the starting point of its own neighbour list without the need to scan the NL array. Each thread stores its results in shared memory and writes them to main memory when it finishes. We partition the problem in such a way as to use as much of the shared memory as possible.
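A much simplified CUDA sketch of this decomposition is given below: one thread per neighbour list, with the cutpoints array providing the offset of each list in the packed NL array. Only a density-like Step 1 accumulation in single precision is shown, the kernel function is an assumed cubic spline, and particle data are kept in plain global-memory arrays rather than the shared-memory partitioning used in our implementation.

    #include <cuda_runtime.h>

    // One thread per i-particle (one per neighbour list). 'nl' holds all lists packed
    // as ip, N, jp1..jpN; 'cutpoints' holds the offset of each list in 'nl', so a
    // thread finds its list without scanning.
    // Launch example: sphStep1Kernel<<<(numLists + 127) / 128, 128>>>(...);
    __global__ void sphStep1Kernel(const unsigned* nl, const unsigned* cutpoints,
                                   int numLists,
                                   const float4* posMass,   // xyz = position, w = mass
                                   const float* h,          // smoothing lengths
                                   float* rho) {
        const float kPi = 3.14159265f;
        int list = blockIdx.x * blockDim.x + threadIdx.x;
        if (list >= numLists) return;

        unsigned off = cutpoints[list];
        unsigned ip  = nl[off];             // index of the i-particle
        unsigned n   = nl[off + 1];         // number of neighbours

        float4 pi = posMass[ip];
        float  hi = h[ip];
        float  norm = 1.0f / (kPi * hi * hi * hi);
        float  sum = 0.0f;

        for (unsigned k = 0; k < n; ++k) {
            float4 pj = posMass[nl[off + 2 + k]];
            float dx = pi.x - pj.x, dy = pi.y - pj.y, dz = pi.z - pj.z;
            float q  = sqrtf(dx * dx + dy * dy + dz * dz) / hi;
            float w  = 0.0f;                // cubic spline kernel (assumed for this sketch)
            if (q < 1.0f)      w = 1.0f - 1.5f * q * q + 0.75f * q * q * q;
            else if (q < 2.0f) { float t = 2.0f - q; w = 0.25f * t * t * t; }
            sum += pj.w * w * norm;         // Eq. (3): rho_i = sum_j m_j W(r_i - r_j, h)
        }
        rho[ip] = sum;
    }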
4.3 Software Library Interfaces

Our software library is designed from the ground up to support multiple implementations of the supported SPH algorithms. Therefore, the effort of extending it to support GPUs was low. In addition, using the same interface allows us to use the same applications as before without any significant change to the program interface: one just changes a switch in the initialization function to define the type of processing core to use.
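The following sketch illustrates this design choice; all class and function names are invented for the illustration and are not the library's real interface.

    #include <memory>

    // Hypothetical illustration of the "switch in the initialization function": the
    // application requests a processing core once and then talks only to the common
    // interface. All names here are invented for this sketch.
    struct SphCore {
        virtual void step1() = 0;                    // density pass
        virtual void step2() = 0;                    // force pass
        virtual ~SphCore() = default;
    };

    struct CpuEmulationCore : SphCore { void step1() override {} void step2() override {} };
    struct FpgaCore         : SphCore { void step1() override {} void step2() override {} };
    struct GpuCore          : SphCore { void step1() override {} void step2() override {} };

    enum class CoreType { CpuEmulation, Fpga, Gpu };

    std::unique_ptr<SphCore> makeCore(CoreType type) {   // the one switch at initialization
        switch (type) {
            case CoreType::Fpga: return std::make_unique<FpgaCore>();
            case CoreType::Gpu:  return std::make_unique<GpuCore>();
            default:             return std::make_unique<CpuEmulationCore>();
        }
    }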
5 Results

CPU runs and GPU runs are from a workstation equipped with an Intel Core 2 Quad at 2.4 GHz, 4 GB of RAM and a GeForce 8800 GTX GPU with 768 MB. MPRACE runs are from one node of the titan cluster at ARI Heidelberg, with two Intel Xeon CPUs at 3.2 GHz, 4 GB of RAM, a GRAPE board and an MPRACE-1 board in a PCI-X slot. All presented runs are serial runs, i.e. they use only one core. A GRAPE board is an accelerator board for gravity interactions. Performance and accuracy measures are from a serial N-body simulation with gravity and SPH forces, running one single step using shared timesteps. Accuracy is compared relative to the original double-precision implementation on the CPU. A parallel version also exists, and more information is available in [?].

Since the FPGA designs use limited precision, the accuracy of the results is compared to a double-precision implementation on the host. The results for the time evolution of the energies during adiabatic collapse test runs with different particle numbers (10K, 50K and 100K) show that the absolute error in total energy conservation during the whole period of integration was less than 0.1%, and the absolute differences in acceleration between runs on the CPU and on the MPRACE are of the same order of magnitude, making them not significant. For more detailed information, consult [?]. In the case of the GPU, it uses single-precision computations, but has a limited range for certain operations. The accuracy of the results is high, comparable to the use of SSE instructions on the CPU.

First we show in Fig. 5 an implementation of a pure N-body algorithm with individual time steps [?], as it is used in practical applications, e.g. for the evolution of galactic nuclei with black holes [?]. Despite the need to emulate double precision, the GPU card reaches the same speed as GRAPE, for a small fraction of the price. Moreover it provides programming flexibility, e.g. for a recently developed new 6th-order scheme (Nitadori et al. 2008), as is shown in Fig. 5.
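For orientation, the textbook-style CUDA kernel below shows how the O(N²) sum of Eq. (2) is typically mapped onto a GPU, with j-particles staged tile by tile through shared memory. It is not the production code benchmarked here, which uses individual block time steps, high-order Hermite schemes and emulated double precision; it only illustrates why this part of the algorithm fits the GPU so well.

    #include <cuda_runtime.h>

    // Textbook-style O(N^2) gravity sum (cf. Eq. 2) with shared-memory tiling.
    // Launch example:
    //   gravityKernel<<<(n + 255) / 256, 256, 256 * sizeof(float4)>>>(pm, acc, n, eps2);
    __global__ void gravityKernel(const float4* posMass, float4* accel, int n, float eps2) {
        extern __shared__ float4 tile[];                  // one tile of j-particles per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float4 pi = (i < n) ? posMass[i] : make_float4(0.f, 0.f, 0.f, 0.f);
        float ax = 0.f, ay = 0.f, az = 0.f;

        for (int base = 0; base < n; base += blockDim.x) {
            int j = base + threadIdx.x;                   // cooperative load of the tile
            tile[threadIdx.x] = (j < n) ? posMass[j] : make_float4(0.f, 0.f, 0.f, 0.f);
            __syncthreads();
            for (int k = 0; k < blockDim.x; ++k) {
                float4 pj = tile[k];
                float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
                float r2 = dx * dx + dy * dy + dz * dz + eps2;   // Plummer softening;
                float rinv = rsqrtf(r2);                         // eps2 > 0 also removes i == j
                float f = pj.w * rinv * rinv * rinv;             // m_j / r^3 (zero-mass padding is harmless)
                ax += f * dx; ay += f * dy; az += f * dz;
            }
            __syncthreads();
        }
        if (i < n) accel[i] = make_float4(ax, ay, az, 0.f);
    }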
Figure 5: Speed of the N-body simulation. Left: traditional 4th-order scheme, compared on the host, SSE, GRAPE board and GPU. Right: new 6th-order scheme (see Nitadori et al.), compared on the host, SSE, and GPU.

Figure 6: Speed of the N-body simulation compared on the host, GRAPE board and GPU.

Figure 7: Time (ΔT for one shared timestep, in seconds, versus N in units of 1K particles) and speedup of the SPH computations on the CPU, MPRACE-1 and GPU, together with the ratios CPU/MPRACE-1 and CPU/GPU.
Second, we present the current profiling status of our SPH code implementation. Here we assume that the large-scale gravitational forces have been ported to the GRAPE or GPU hardware. The remaining bottleneck lies in the hydrodynamic forces exerted by neighbour particles (see the middle term of Eq. (1)). In the following we discuss the profiling of this part of the code alone. Fig. 7 shows the time spent in the SPH computations in all the different cases, as well as the speedup (ratio) against the CPU time. From this plot it can be seen that the speedup is sustained across the particle sets, and is 6-11 times for the FPGA and about 17 times for the GPU. More detailed profiling is required to explore why the MPRACE and GPU curves seem to converge at some point around 1M particles (currently unreachable by the MPRACE hardware). Considering the computational efficiency, our FPGA designs have a theoretical peak performance of 3.4 GFlops for Step 1 and 4.3 GFlops for Step 2, and from our measurements the library overhead is negligible and the efficiency is close to 100% for big data sets. In the case of the GPU, however, the theoretical peak performance is 518.4 GFlops (1.35 GHz × 16 multiprocessors × 24 warps), but the achieved performance is at most only about twice that of the FPGA. This means the GPU is quite far from its maximum, leading us to believe there are plenty of possibilities to improve the algorithm and obtain better performance.
In terms of power consumption, the GPU board consumes on the order of 150 W, while the MPRACE-1 uses a mere 20 W. This corresponds to a power efficiency of 4.65 W/GFlop for the MPRACE-1 and 16 W/GFlop for the GPU, which would drop to 0.28 W/GFlop if the full peak performance of the GPU could be achieved. Therefore, the MPRACE-1 is about 3.4 times more power efficient than the GPU.
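The efficiency figures quoted above follow from simple ratios; the snippet below reproduces the arithmetic, taking the power and throughput numbers as stated in the text.

    #include <cstdio>

    // Back-of-the-envelope check of the efficiency figures quoted above.
    int main() {
        const double fpgaPowerW = 20.0, gpuPowerW = 150.0;
        const double fpgaStep2Gflops = 4.3;                // MPRACE-1 peak for Step 2
        const double gpuPeakGflops = 1.35 * 16 * 24;       // 518.4 GFlops theoretical peak

        std::printf("MPRACE-1: %.2f W/GFlop\n", fpgaPowerW / fpgaStep2Gflops);        // ~4.65
        std::printf("GPU at full peak: %.2f W/GFlop\n", gpuPowerW / gpuPeakGflops);   // ~0.29
        // The quoted 16 W/GFlop for the GPU corresponds to a sustained rate of roughly:
        std::printf("implied GPU sustained: %.1f GFlops\n", gpuPowerW / 16.0);        // ~9.4
        return 0;
    }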
6 Conclusions
We have presented implementations of force computations between particles for astrophysical simulations, using field-programmable gate arrays (FPGA) and graphical processing units (GPU), and have compared the results with the earlier application-specific GRAPE hardware. The new software is used in custom parallel supercomputers featuring FPGA and GPU accelerator cards with ultra-fast, low-latency communication networks (Infiniband) and a moderate number of nodes (32 for the titan GRACE cluster in Heidelberg, and 40 for the new Frontier GPU cluster in Mannheim). The overall parallelization of our codes is very good in this regime of a moderate number of accelerated nodes, as has been published in [?]. For a recent astrophysical application see [?]. The complexity of our particle-based algorithms for gravitating N-body simulations and for smoothed particle hydrodynamics (SPH) simulations scales as

    T = αN + δN·N_n + βN²/γ        (5)

where N is the particle number and the other parameters are explained in Eq. (1). The last term on the right-hand side is the long-range gravitational force (complexity N²), which has been implemented on GRAPE and GPU cards at nearly the theoretical peak speed, of order 120 Gflop/s, close to the peak speed of 160 Gflop/s expected for double-precision computations on this GPU board. The total sustained speed achieved on the titan GRACE cluster for a simulation of four million particles in direct N-body mode was 3.5 Tflop/s [?]. This cluster uses the GRAPE cards [?], with a peak speed of 128 Gflop/s per card (4 Tflop/s for the entire cluster). Our new Frontier kolob cluster (40 nodes accelerated with NVIDIA Tesla GPU boards, peak speed of 17 Tflop/s) is presently in the commissioning phase. In full operation we expect a further speed-up to 14 Tflop/s sustained for our N-body simulations; with the most recent Tesla cards providing 1 Tflop/s per card, the projected numbers would rise to 40 Tflop/s peak and 28 Tflop/s sustained, obtained at yet larger particle numbers in the simulation. Because we operate at a balance between computation and communication costs, our results can only be scaled up to larger node numbers if communication bandwidth and latency improve accordingly. This is presently impossible for any large general-purpose supercomputer on the market: while their peak performance is much larger and approaching the Petaflop/s range, our direct, gravitationally fully interacting large N-body systems exhibit the largest speed on our custom clusters with special hardware.

More advanced direct N-body codes, as well as simulation codes including hydrodynamical forces such as smoothed particle hydrodynamics (SPH), have a more complex scaling, in which the middle term of Eq. (1) becomes the next dominant one once the long-range forces have been successfully accelerated by GRAPE or GPU. This term is usually due to intermediate-range forces. The architecture of our present and future special-purpose supercomputers is tailored to this approach. It is a hybrid architecture of special hardware for astrophysical particle simulations: GPU (or GRAPE) special hardware computes the distant forces (in our applications: gravitational), and another programmable hardware (FPGA or GPU) computes the neighbour forces, which can be gravitational as well, or can represent gas-dynamical forces as they occur in smoothed particle hydrodynamics (SPH) simulations. Once the GPU (or GRAPE) hardware has sped up the distant-force part, the remaining bottleneck lies in the computation of the SPH forces. We have shown here that the computation of SPH forces on programmable FPGA hardware as well as on graphical processing units (GPU) delivers a significant speedup of this part of the algorithm. Our comparison focused on the FPGA hardware MPRACE-1, currently used in production, and on a new GPU card just obtained for testing. The new GPU card performs somewhat better than the FPGA hardware, but this is just a momentary result and will change again with newer FPGA hardware (see below).
Both FPGAs and GPUs are viable options for accelerating high-performance computing applications. While GPUs provide a very fast and cost-efficient development path, FPGAs may remain competitive. GPUs can be programmed with relative ease and provide raw computational power, which can be exploited more easily in some applications than in others. FPGAs provide custom architectures tailored to the requirements, with a longer development cycle but more efficient solutions. In addition, the low raw power consumption of FPGAs gives them an advantage over GPUs for large installations, where power costs are a significant part of the operational costs, or at locations where the power supply is limited (e.g. remote locations). However, this gap is expected to close over time, as newer and more efficient boards are released. Whether the extra cost can be justified depends on the power efficiency of the application in question. GPUs could also be used in the future to replace the GRAPE hardware, because they deliver comparable speed for the relatively simple task of computing long-range gravitational forces [?, ?]. Currently we are commissioning our next custom computing cluster, "Frontier", with 40 NVIDIA Tesla GPU accelerator cards on 40 dual quad-core Intel nodes. The peak performance of the new cluster is 17 Tflop/s ( http://www.uni-heidelberg.de/presse/news08/pm281127-1kolobe.html ) and we expect from previous scalings and benchmarks [?] sustained performances of order 14 Tflop/s for our application codes, with particle numbers out of reach for standard supercomputers.
Acknowledgments

We thank the Volkswagenstiftung for funding the GRACE project under I/80 041-043, and the Ministry of Science, Research and the Arts of Baden-Württemberg (Az: 823.219-439/30 and /36). This project is also partly funded by the German Science Foundation (DFG) under SFB 439 (sub-project B11) "Galaxies in the Young Universe" at the University of Heidelberg. Furthermore we acknowledge a computing time grant obtained from the DEISA project with FZ Jülich. PB thanks for the special support of his work by the Ukrainian National Academy of Sciences under the Main Astronomical Observatory "GRAPE/GRID" computing cluster project. The "Frontier" cluster is funded by the excellence funds of the University of Heidelberg in the Frontier scheme.