Implementation of a Lattice-Boltzmann-Method for numerical Fluid Mechanics using the nVidia CUDA Technology

Eugen Riegel & Thomas Indinger
Technische Universität München, Lehrstuhl für Aerodynamik
FluiDyna GmbH
15.05.2009


Outline
• CUDA introduction
• LBM introduction
• LBM on CUDA
  • SunlightLB CUDA port
  • LBultra
• GPU hardware solutions
• Conclusions


CUDA Introduction
• Why CUDA?
• What is CUDA?
• CUDA – a highly parallel execution model
• Heterogeneous programming
• Restrictions


CUDA – Why?

GPU vs. CPU performance comparison (figure):
• GT200: 240 hardware cores
• Harpertown: 4 hardware cores


CUDA – What is CUDA?
• Programming model for direct programming of nVIDIA GPUs
• Developed by nVIDIA, released at the end of 2006
• C (C++)-like high-level programming language
• Free development tool set at http://www.nvidia.com/cuda
  • nvcc compiler (part of the CUDA toolkit)
  • programming guides
  • simple CUDA example programs (part of the CUDA SDK)
  • available for Linux, Windows and MacOS
• CUBLAS (linear algebra) and CUFFT (fast Fourier transform) libraries
  • utilize the CUDA hardware for large computations
  • usable from any programming language


CUDA – a highly parallel execution model

CUDA physical and logical organization (figure)

CUDA – heterogeneous programming
• Host: consists of the PC's CPU and main memory, the usual software execution environment
• Device: consists of the nVIDIA GPU chip and GPU RAM
• Host defines and prepares the Device's computational tasks
• Host launches the highly parallel Device execution and waits for it to finish
• Host retrieves the Device's results and prepares the next Device computational task
• Host ⇒ organizational tasks
• Device ⇒ computational tasks
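This host/device split maps onto a fixed call pattern in CUDA C. Below is a minimal sketch of that pattern (allocate, copy in, launch, wait, copy back); the kernel, problem size and variable names are illustrative and are not taken from the software presented here.

    #include <cstdlib>

    // The "computational task": scale every element of an array.
    __global__ void scaleKernel(float* data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per element
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int    n     = 1 << 20;
        const size_t bytes = n * sizeof(float);

        // Host prepares the computational task
        float* h_data = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

        float* d_data = 0;
        cudaMalloc((void**)&d_data, bytes);
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // explicit Host -> Device copy

        // Host launches the highly parallel Device execution ...
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);
        cudaThreadSynchronize();                                    // ... and waits for it to finish

        // Host retrieves the Device's results
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }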


CUDA – Restrictions
• The computational algorithm must be highly parallel to utilize all GPU computing resources (at least 3840 threads today!)
• High computational performance of up to ~1 TeraFLOP can be achieved in single precision; double precision is about 4 times slower!
• Data exchange between Host and Device
  • Host code can't access Device memory directly
  • Device threads can't access Host memory directly
  • the cudaMemcpy() function must be used for explicit data exchange between Host and Device


Lattice-Boltzmann-Method - Discretisation
• The simulation space is divided into many cubic nodes
• The flow state in each node is defined by a discrete distribution of the density over a set of discrete molecular velocity vectors
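The kernels presented later use the D3Q15 lattice, i.e. 15 discrete molecular velocities per node. As an illustration of such a discretisation, the velocity set, its weights and one possible structure-of-arrays storage could look as follows; this layout is an assumption made for the examples in these notes, not the layout of SunlightLB or LBultra.

    #define Q 15

    // D3Q15 velocity set: rest vector, 6 face neighbours, 8 corner neighbours
    __constant__ int E[Q][3] = {
        { 0, 0, 0},
        { 1, 0, 0}, {-1, 0, 0}, { 0, 1, 0}, { 0,-1, 0}, { 0, 0, 1}, { 0, 0,-1},
        { 1, 1, 1}, {-1,-1,-1}, { 1, 1,-1}, {-1,-1, 1},
        { 1,-1, 1}, {-1, 1,-1}, { 1,-1,-1}, {-1, 1, 1}
    };

    // Lattice weights belonging to the velocities above
    __constant__ float W[Q] = {
        2.0f/9,
        1.0f/9, 1.0f/9, 1.0f/9, 1.0f/9, 1.0f/9, 1.0f/9,
        1.0f/72, 1.0f/72, 1.0f/72, 1.0f/72, 1.0f/72, 1.0f/72, 1.0f/72, 1.0f/72
    };

    // Structure-of-arrays storage: all entries of one direction are contiguous,
    // so adjacent nodes (handled by adjacent threads) sit at adjacent addresses.
    struct Lattice {
        int    nx, ny, nz;    // domain size in nodes
        float* f;             // Q * nx * ny * nz distribution entries
    };

    __host__ __device__ inline int cellIndex(const Lattice& L, int x, int y, int z)
    {
        return (z * L.ny + y) * L.nx + x;
    }

    __host__ __device__ inline int fIndex(const Lattice& L, int i, int cell)
    {
        return i * (L.nx * L.ny * L.nz) + cell;   // direction-major layout
    }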


Lattice-Boltzmann-Method - Process
• The simulation process is divided into a propagation step and a collision step
• The propagation step transfers the distribution function entries along their molecular velocities to the adjacent nodes
• The collision step computes the collision of the distribution function entries arriving from the adjacent nodes at the centre of the current node
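As a worked example of the collision step, the device function below applies the widely used single-relaxation-time (BGK) collision to the 15 entries of one node, using the D3Q15 constants E and W from the sketch above. The relaxation time tau is an assumed parameter; the slides do not state which collision operator the presented kernels actually use.

    // BGK collision for one node: relax the Q entries towards the local equilibrium.
    __device__ void collideBGK(float f[Q], float tau)
    {
        float rho = 0.0f, ux = 0.0f, uy = 0.0f, uz = 0.0f;
        for (int i = 0; i < Q; ++i) {                 // macroscopic density and velocity
            rho += f[i];
            ux  += f[i] * E[i][0];
            uy  += f[i] * E[i][1];
            uz  += f[i] * E[i][2];
        }
        ux /= rho;  uy /= rho;  uz /= rho;

        float usq = ux*ux + uy*uy + uz*uz;
        for (int i = 0; i < Q; ++i) {
            float eu  = E[i][0]*ux + E[i][1]*uy + E[i][2]*uz;
            float feq = W[i] * rho * (1.0f + 3.0f*eu + 4.5f*eu*eu - 1.5f*usq);
            f[i] += (feq - f[i]) / tau;               // relaxation towards equilibrium
        }
    }
    // The propagation step then moves each f[i] to the neighbouring node at offset E[i].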


Lattice-Boltzmann-Method – boundary data setup
• Static obstacle geometry is taken into account by "reflection" of distribution function entries whose molecular velocities cross the obstacle's outline
• The distribution functions of in- and outflows can be calculated from given macroscopic velocity, density and stresses
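The "reflection" at static obstacles is commonly realized as a bounce-back rule: an entry that would propagate into an obstacle node is returned to its node of origin with reversed velocity. Below is a minimal sketch reusing the Lattice layout and D3Q15 constants defined above; the obstacle mask and helper names are assumptions for illustration only.

    // OPP[i] is the direction with the opposite velocity: E[OPP[i]] == -E[i]
    __constant__ int OPP[Q] = { 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13 };

    // Propagation of one entry with bounce-back at solid nodes: if the target node
    // is an obstacle, the entry is returned to its source node with reversed velocity.
    __device__ void propagateEntry(const Lattice& L, float* fDst,
                                   const unsigned char* obstacle,
                                   int x, int y, int z, int i, float value)
    {
        int tx = x + E[i][0], ty = y + E[i][1], tz = z + E[i][2];  // assumed to stay inside the domain
        int src = cellIndex(L, x, y, z);
        int dst = cellIndex(L, tx, ty, tz);

        if (obstacle[dst])
            fDst[fIndex(L, OPP[i], src)] = value;     // reflected back into the source node
        else
            fDst[fIndex(L, i, dst)]      = value;     // normal propagation to the neighbour
    }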


Lattice-Boltzmann-Method – Why?
• Allows extremely complex object geometry
  • uses simple Cartesian grids
  • no need to generate complex geometry-adapted grids
• Delivers accurate simulation results for unsteady flows
• Is a perfectly parallel algorithm
  ⇒ very suitable for CUDA
• Allows efficient domain decomposition because of the weak sub-domain coupling
  ⇒ very suitable for distribution among multiple CUDA devices (multi-GPU)


LBM on CUDA
• SunlightLB CUDA port
• LBultra
  • strategies for a better CUDA kernel
  • performance
  • validation
  • live presentation
  • outlook


SunlightLB CUDA port
• Porting the existing open-source LBM C software "SunlightLB" to CUDA
• Speed-up of the port:
  • GPU: GeForce 8800 GTS (~450 GigaFLOPs)
  • CPU: Core2Duo (~30 GigaFLOPs)
  • estimated: ~1500%
  • achieved: 150%
• The estimated speed-up was missed by a factor of 10! Why?
  • the port made the CUDA kernel highly parallel ⇒ good
  • the port did not take the CUDA memory access patterns into account ⇒ bad
  • bad GPU memory access patterns can cause a performance drop of up to a factor of 32!
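The factor of up to 32 stems from uncoalesced global memory access: when the 32 threads of a warp touch addresses that lie far apart, the hardware issues up to one memory transaction per thread instead of one per warp. The contrast below (reusing Q and the structure-of-arrays idea from the earlier sketch) is purely illustrative and is not SunlightLB code.

    // 'cell' is the consecutive thread index; each thread reads one entry of direction i.

    // Direction-major (structure-of-arrays) layout: the 32 threads of a warp read
    // 32 consecutive floats -> one coalesced memory transaction.
    __device__ float readCoalesced(const float* f, int nCells, int i, int cell)
    {
        return f[i * nCells + cell];
    }

    // Node-major (array-of-structures) layout: the same threads read addresses
    // Q floats apart -> up to one transaction per thread, roughly a 32x slowdown.
    __device__ float readUncoalesced(const float* f, int i, int cell)
    {
        return f[cell * Q + i];
    }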


LBultra
• Completely new implementation of the LBM C++ software "LBultra":
• Parallel use of different CPU and CUDA LBM kernels
  • D3Q15 fixed-refinement CUDA kernel
  • D3Q15 fixed-refinement CPU multi-core kernel
• Interfaces for custom boundary data setup and obstacle data setup
  • homogeneous velocity boundary surface
  • zero normal velocity gradient boundary surface
  • spheres, cylinders, boxes and slabs are available as obstacles
• Domain decomposition abilities (see the sketch after this list)
  • the simulation domain is divided into sub-domains
  • each sub-domain is assigned to its own LBM kernel instance
  • interface data is exchanged between sub-domains at the end of every time step
• Online interactive 3D visualization
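The domain decomposition described above can be pictured as the following loop; the class and method names are hypothetical stand-ins, not the actual LBultra API.

    #include <vector>

    // Hypothetical interface of one LBM kernel instance owning one sub-domain;
    // LBultra's real class and method names are not given in the slides.
    struct LbmKernelInstance {
        virtual void propagateAndCollide()   = 0;   // one LBM step on this sub-domain
        virtual void exchangeInterfaceData() = 0;   // copy boundary layers to the neighbours
        virtual ~LbmKernelInstance() {}
    };

    // One global time step: every sub-domain advances independently, then only the
    // thin interface layers are exchanged (weak coupling between sub-domains).
    void timeStep(std::vector<LbmKernelInstance*>& subDomains)
    {
        for (size_t s = 0; s < subDomains.size(); ++s)
            subDomains[s]->propagateAndCollide();

        for (size_t s = 0; s < subDomains.size(); ++s)
            subDomains[s]->exchangeInterfaceData();
    }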


LBultra
• Strategies for a better CUDA LBM kernel (see the sketch after this list):
  • more attention to memory access patterns
  • algorithmic reduction of data transfers by a joined propagation and collision phase
  • use of shared GPU memory for explicit data caching
• New CPU LBM kernel:
  • derived from the GPU kernel by "porting back"
  • offers high parallelism as well ⇒ can utilize multi-core CPUs
  • also profits from the algorithmic improvements of the CUDA kernel
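The joined propagation and collision phase can be sketched as a "pull" kernel that gathers the entries streaming into a node, collides them in registers and writes the result back in a single pass over global memory. The sketch below reuses the pieces defined earlier (Lattice, E, OPP, collideBGK); the thread mapping and interior-only indexing are simplifying assumptions, and the shared-memory caching mentioned above is not shown.

    // Fused propagation + collision ("pull" variant): each thread handles one node,
    // gathers the entries streaming into it, collides them and writes the result
    // back once -> one pass over global memory per time step.
    __global__ void streamCollideKernel(Lattice L, const float* fOld, float* fNew,
                                        const unsigned char* obstacle, float tau)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x + 1;   // interior nodes only
        int y = blockIdx.y + 1;
        if (x >= L.nx - 1 || y >= L.ny - 1) return;

        for (int z = 1; z < L.nz - 1; ++z) {
            int cell = cellIndex(L, x, y, z);
            if (obstacle[cell]) continue;

            float f[Q];
            for (int i = 0; i < Q; ++i) {
                // gather the entry that propagates into this node along direction i
                int src = cellIndex(L, x - E[i][0], y - E[i][1], z - E[i][2]);
                f[i] = obstacle[src]
                     ? fOld[fIndex(L, OPP[i], cell)]   // bounce-back from a solid neighbour
                     : fOld[fIndex(L, i, src)];
            }

            collideBGK(f, tau);                        // collision without an extra global pass

            for (int i = 0; i < Q; ++i)
                fNew[fIndex(L, i, cell)] = f[i];       // coalesced writes along x
        }
    }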


LBultra - performance
• Performance comparison
  • GPU: nVIDIA Tesla C1060 (~933 GigaFLOPs): 78.0 MVPS
  • CPU: AMD Phenom X4 (~41 GigaFLOPs): 8.42 MVPS
  • estimated: ~2240%
  • achieved: 926%
• Performance analysis:
  • performance is still not optimal
  • cause: bad memory access pattern in the "collection" of distribution function entries for propagation
  • solution: in progress


LBultra - performance
• Operational performance
  • GPU: 3x nVIDIA Tesla C1060 with 4 GB GPU RAM each ⇒ 12 GB RAM for simulation data
  • simulation domain size: 915x457x457 ⇒ ~191 million voxels ⇒ Re number ~4800
  • obstacle: sphere, centre point at (228, 228, 228), radius = 57 nodes
  • calculation speed: ~0.75 time steps per second ⇒ 143 MVPS


LBultra - validation
• Common test case of a flow around a sphere
• Scenario 1 (Re range: [2.46–31.0])
  • domain size: 112x48x48 vertices
  • sphere: position (24; 24; 24), radius 8 vertices
• Scenario 2 (Re range: [77.4–1208])
  • domain size: 224x96x96 vertices
  • sphere: position (48; 48; 48), radius 16 vertices
• Boundary surfaces setup
  • inflow: homogeneous velocity, equal to the average velocity in the adjacent plane
  • other: zero normal velocity gradient
• Flow acceleration (see the sketch after this list)
  • acceleration using a volumetric force in the predefined inflow direction
  • the strength of the force is computed from the difference between the flow's current velocity at the inflow surface and the predefined velocity
• Measurement
  • 10000 time steps to allow the flow to converge
  • cD (drag coefficient) measurement and averaging over 1000 further time steps
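The volumetric force described above acts like a simple proportional controller on the inflow velocity; a minimal sketch, with the gain as an assumed tuning constant not stated in the slides:

    // Illustrative sketch of the volumetric driving force: proportional to the gap
    // between the predefined inflow velocity and the currently measured average
    // velocity in the inflow surface. 'gain' is an assumed tuning constant.
    float drivingForce(float uTarget, float uMeasured, float gain)
    {
        return gain * (uTarget - uMeasured);   // vanishes once the target velocity is reached
    }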


LBultra - validation


LBultra - live presentation


LBultra - outlook
• Implementation of new LBM kernels (CUDA and CPU) capable of local refinement (current work)
• Interfaces for CAD data import
• Interfaces for simulation data export and import to enable tool chaining
• Improvements to the 3D online visualization
• Interactive GUI to improve usability
• Extension of the LBM kernels to enable turbulence modelling for higher Re numbers


CUDA hardware
• nVIDIA Tesla preferred provider for customers in research & development
  • hardware distribution of GPU components, workstations and clusters
  • CUDA software development and optimization services
• Hardware
  • Tesla C1060 (1x Quadro FX 5800 GPU, 4 GB 512-bit GDDR3)
  • FluiDyna Tesla workstations, up to 4x Tesla C1060
  • Tesla S1070 (4x Quadro FX 5800 GPU, 16 GB 512-bit GDDR3)


CUDA hardware - FluiDyna preconfigured cluster
• Configuration: FluiDyna 4x Tesla S1070 preconfigured cluster
• GPUs: 16x Tesla T10 GPU
• CPU servers: 4x HPC server FluiDyna WS 1xIQ-16 (single socket)
• CPU: 4x Intel Core i7-920 (quad-core, 2.66 GHz)
• Memory: 64 GB GPU memory, up to 96 GB CPU DDR3-1333 memory
• Cluster head node: 1x HPC server FluiDyna WS 1xIQ-16 (single socket), i7-920
• Storage: up to 32 TB SATA drives
• HPC network: SDR InfiniBand with 8-way switch
• Communication: GBit Ethernet


Conclusions
• The Lattice-Boltzmann-Method is very suitable for implementation with CUDA
  • achieves high computational performance
  • delivers valid simulation results
• The C-like programming language enables a fast entry into the CUDA technology
• High-performance CUDA programs require a high optimization effort
  • simple software porting is not enough!
• nVIDIA's GPU devices offer revolutionary performance in many fields:
  • ~2000% faster
  • ~1000% cheaper
  • ~3700% less space consumption
  • ~1000% greener (better energy efficiency)
⇒ The additional effort yields a big profit!

Further Information: www.fluidyna.de
