Implementation of a Lattice-Boltzmann-Method for numerical Fluid Mechanics using the nVidia CUDA Technology
Eugen Riegel & Thomas Indinger
Technische Universität München, Lehrstuhl für Aerodynamik
FluiDyna GmbH
15.05.2009
Outline
• CUDA introduction
• LBM introduction
• LBM on CUDA
  – SunlightLB CUDA port
  – LBultra
• GPU hardware solutions
• Conclusions
CUDA Introduction
• Why CUDA?
• What is CUDA?
• CUDA – a highly parallel execution model
• Heterogeneous programming
• Restrictions

What is CUDA?
• Programming model for direct programming of nVIDIA GPUs
• Developed by nVIDIA, released at the end of 2006
• C (C++) like high-level programming language
• Free development tool set at http://www.nvidia.com/cuda
  – nvcc compiler (part of the CUDA toolkit)
  – Programming guides
  – Simple CUDA programs (part of the CUDA SDK)
  – Available for Linux, Windows and MacOS
• CUBLAS (linear algebra) and CUFFT (fast Fourier transform) libraries
  – utilize CUDA hardware for large computations
  – usable from any kind of programming language
CUDA – a highly parallel execution model
[Figure: CUDA physical and logical organization]
CUDA – heterogeneous programming
• Host: consists of the PC’s CPU and main memory, the usual software execution environment
• Device: consists of the nVIDIA GPU chip and GPU RAM
• Host defines and prepares the Device’s computational tasks
• Host launches the highly parallel Device execution and waits for it to finish
• Host retrieves the Device’s results and prepares the next Device computational task
• Host ⇒ organizational tasks
• Device ⇒ computational tasks
CUDA – Restrictions
• The computation algorithm must be highly parallel to utilize all GPU computing resources (at least 3840 threads today!)
• High computation performance of up to ~1 TeraFLOPS can be achieved in single-precision calculations; double precision is 4 times slower!
• Data exchange between Host and Device
  – Host code can’t access Device memory directly
  – Device threads can’t directly access Host memory
  – the cudaMemcpy() function must be used for explicit data exchange between Host and Device
Lattice-Boltzmann-Method - Discretisation
• The simulation space is divided into many cubic nodes
• The flow state in each node is defined by a discrete distribution function that assigns a density to each of a set of discrete molecular velocity vectors
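As a minimal sketch of this discretisation, assume the D3Q15 velocity set (one rest vector, six axis vectors, eight cube diagonals) with its standard lattice weights. The macroscopic density and velocity of a node are then the zeroth and first moments of its distribution function. The names `D3Q15_VELOCITIES`, `D3Q15_WEIGHTS` and `moments` are illustrative and not taken from the software presented here.

```python
from itertools import product

# D3Q15 velocity set: 1 rest vector, 6 axis neighbours, 8 cube diagonals.
REST = [(0, 0, 0)]
AXES = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
DIAGONALS = list(product((1, -1), repeat=3))
D3Q15_VELOCITIES = REST + AXES + DIAGONALS

# Standard D3Q15 lattice weights (they sum to 1).
D3Q15_WEIGHTS = [2 / 9] + [1 / 9] * 6 + [1 / 72] * 8


def moments(f):
    """Return (density, velocity) of one node from its 15 distribution entries."""
    rho = sum(f)
    u = tuple(
        sum(fi * c[axis] for fi, c in zip(f, D3Q15_VELOCITIES)) / rho
        for axis in range(3)
    )
    return rho, u


# A fluid at rest with density 1: each entry equals its lattice weight.
rho, u = moments(D3Q15_WEIGHTS)
```

For the resting fluid in the last line, the density comes out as 1 and all three velocity components vanish, because the velocity vectors cancel in pairs.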
Lattice-Boltzmann-Method - Process
• The simulation process is divided into a propagation step and a collision step
• The propagation step transfers distribution function entries along their molecular velocities to adjacent nodes
• The collision step calculates the collision of the distribution function entries that arrive from adjacent nodes in the middle of the current node
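The two steps can be sketched on a tiny periodic box. The slides do not name a collision model, so this sketch assumes the widely used BGK (single relaxation time) collision and the D3Q15 velocity set; `TAU`, `N` and all function names are illustrative choices, not values from the talk.

```python
from itertools import product

C = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
C += list(product((1, -1), repeat=3))      # D3Q15 velocities
W = [2 / 9] + [1 / 9] * 6 + [1 / 72] * 8   # D3Q15 weights
N = 4                                      # nodes per axis of the periodic box
TAU = 0.8                                  # BGK relaxation time (assumed)


def equilibrium(rho, u):
    """Second-order equilibrium distribution (lattice sound speed squared = 1/3)."""
    uu = sum(a * a for a in u)
    feq = []
    for w, c in zip(W, C):
        cu = sum(a * b for a, b in zip(c, u))
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * uu))
    return feq


def propagate(f):
    """Move entry i of every node to the neighbour in direction C[i]."""
    out = {node: [0.0] * 15 for node in f}
    for (x, y, z), entries in f.items():
        for i, (cx, cy, cz) in enumerate(C):
            target = ((x + cx) % N, (y + cy) % N, (z + cz) % N)
            out[target][i] = entries[i]
    return out


def collide(f):
    """Relax every node towards its local equilibrium (BGK collision)."""
    out = {}
    for node, entries in f.items():
        rho = sum(entries)
        u = [sum(fi * c[a] for fi, c in zip(entries, C)) / rho for a in range(3)]
        feq = equilibrium(rho, u)
        out[node] = [fi - (fi - fe) / TAU for fi, fe in zip(entries, feq)]
    return out


# Uniform fluid at rest with one slightly denser node, then ten time steps.
f = {node: equilibrium(1.0, (0.0, 0.0, 0.0)) for node in product(range(N), repeat=3)}
f[(0, 0, 0)] = equilibrium(1.1, (0.0, 0.0, 0.0))
mass_before = sum(sum(entries) for entries in f.values())
for _ in range(10):
    f = collide(propagate(f))
mass_after = sum(sum(entries) for entries in f.values())
```

Both steps conserve the total mass: propagation only moves entries between nodes, and the BGK collision leaves each node's density unchanged.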
Lattice-Boltzmann-Method – boundary data setup
• Static obstacle geometry is accounted for by “reflection” of distribution function entries along molecular velocities that cross an obstacle’s outline
• Distribution functions at inflows and outflows can be calculated from the given macroscopic velocity, density and stresses
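The “reflection” at obstacles can be sketched as a simple bounce-back rule: an entry that would propagate into a solid node stays at its fluid node and is written into the opposite direction. This is one common way to realise such a reflection and is an assumption here; the slides do not state which variant is used. `OPPOSITE` and `propagate_with_bounce_back` are illustrative names.

```python
from itertools import product

C = [(0, 0, 0), (1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
C += list(product((1, -1), repeat=3))  # D3Q15 velocities

# Index of the velocity pointing in the opposite direction of C[i].
OPPOSITE = [C.index(tuple(-a for a in c)) for c in C]


def propagate_with_bounce_back(f, solid, shape):
    """Propagate on a periodic box; entries heading into a solid node are reflected.

    f maps fluid node coordinates to their 15 entries, solid is a set of
    obstacle node coordinates, shape is the box size (nx, ny, nz).
    """
    nx, ny, nz = shape
    out = {node: [0.0] * 15 for node in f}
    for (x, y, z), entries in f.items():
        for i, (cx, cy, cz) in enumerate(C):
            target = ((x + cx) % nx, (y + cy) % ny, (z + cz) % nz)
            if target in solid:
                out[(x, y, z)][OPPOSITE[i]] = entries[i]  # reflect back
            else:
                out[target][i] = entries[i]               # normal propagation
    return out


# 3x3x3 box with a single obstacle node in the centre.
shape = (3, 3, 3)
solid = {(1, 1, 1)}
fluid = [n for n in product(range(3), repeat=3) if n not in solid]
f = {n: [float(i + 1) for i in range(15)] for n in fluid}
g = propagate_with_bounce_back(f, solid, shape)
```

In the example, node (0, 1, 1) sits directly next to the obstacle, so its entry for direction (1, 0, 0) reappears at the same node in direction (-1, 0, 0).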
Lattice-Boltzmann-Method – Why?
• allows extremely complex object geometry
  – uses simple Cartesian grids
  – no need to generate complex geometry-adapted grids
• delivers accurate simulation results for unsteady flows
• is a perfectly parallel algorithm ⇒ very suitable for CUDA
• allows efficient domain decomposition because of weak sub domain coupling ⇒ very suitable for distribution among multiple CUDA devices (multi GPU)
LBM on CUDA
• SunlightLB CUDA port
• LBultra
  – Strategies for a better CUDA kernel
  – Performance
  – Validation
  – Live presentation
  – Outlook
SunlightLB CUDA port
• Porting the existing open-source LBM C software “SunlightLB” to CUDA
• Speed-up of the port:
  – GPU: GeForce 8800GTS (~450 GigaFLOPS)
  – CPU: Core2Duo (~30 GigaFLOPS)
  – Estimated: ~1500%
  – Achieved: 150%
• The estimated speed-up was missed by a factor of 10! Why?
  – porting made the CUDA kernel highly parallel ⇒ good
  – porting did not take CUDA memory access patterns into account ⇒ bad
  – bad GPU memory access patterns can cause a performance drop of up to a factor of 32!
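One common source of bad access patterns is the data layout; the slides do not say which layout SunlightLB or its port used, so the following is only an illustrative assumption. If thread k processes node k and all threads read the same distribution entry i, an array-of-structures layout places neighbouring threads 15 elements apart, while a structure-of-arrays layout places them next to each other, which GPU hardware can combine into few memory transactions. All names below are hypothetical.

```python
Q = 15  # distribution entries per node (D3Q15, assumed)


def aos_index(node, i):
    """Array of structures: the Q entries of one node are contiguous."""
    return node * Q + i


def soa_index(node, i, n_nodes):
    """Structure of arrays: entry i of all nodes is contiguous."""
    return i * n_nodes + node


def stride_between_threads(index_fn, i, **kwargs):
    """Distance in elements between the accesses of thread 0 and thread 1."""
    return index_fn(1, i, **kwargs) - index_fn(0, i, **kwargs)


aos_stride = stride_between_threads(aos_index, i=3)
soa_stride = stride_between_threads(soa_index, i=3, n_nodes=1024)
```

Here `aos_stride` is 15 and `soa_stride` is 1, so only the second layout lets consecutive threads touch consecutive memory locations.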
LBultra
• Completely new implementation: the LBM C++ software “LBultra”
• Parallel usage of different CPU and CUDA LBM kernels
  – D3Q15 fixed refinement CUDA kernel
  – D3Q15 fixed refinement CPU multi-core kernel
• Interfaces for custom boundary data setup and obstacle data setup
  – homogeneous velocity boundary surface
  – zero normal velocity gradient boundary surface
  – spheres, cylinders, boxes and slabs are available as obstacles
• Domain decomposition abilities
  – the simulation domain is divided into sub domains
  – each sub domain is assigned to its own LBM kernel instance
  – interface data is exchanged between sub domains at the end of every time step
• Online interactive 3D visualization
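The domain decomposition described above can be sketched as follows, assuming the domain is cut into slabs along the x axis (the slides do not state the cutting direction). Each plane is represented only by its x index as a stand-in for the real distribution data; `split_domain` and `exchange_interfaces` are hypothetical names.

```python
def split_domain(nx, parts):
    """Split nx node planes into `parts` contiguous slabs of near-equal size."""
    base, extra = divmod(nx, parts)
    slabs, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        slabs.append((start, start + size))
        start += size
    return slabs


def exchange_interfaces(subdomains):
    """Copy each slab's outermost plane into the neighbouring slab's halo."""
    for left, right in zip(subdomains, subdomains[1:]):
        right["halo_low"] = left["planes"][-1]
        left["halo_high"] = right["planes"][0]


# Example: 915 planes distributed over 3 kernel instances.
slabs = split_domain(915, 3)
subdomains = [
    {"planes": list(range(a, b)), "halo_low": None, "halo_high": None}
    for a, b in slabs
]
exchange_interfaces(subdomains)  # would run at the end of every time step
```

Only one plane per interface is exchanged in each time step, which is why the coupling between sub domains is weak compared with the work done inside each slab.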
LBultra
• Strategies for a better CUDA LBM kernel:
  – more attention to memory access patterns
  – algorithmic reduction of data transfers by joining the propagation and collision phases
  – using shared GPU memory for explicit data caching
• New CPU LBM kernel:
  – derived from the GPU kernel, in effect “porting back”
  – offers high parallelism as well ⇒ can utilize multi-core CPUs
  – also profits from the algorithmic improvements in the CUDA kernel
• Performance analysis:
  – performance still not optimal
  – cause: bad memory access pattern in the “collection” of distribution function entries for propagation
  – solution: in progress
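The joined propagation and collision phase can be illustrated on a one-dimensional D1Q3 lattice, used here only as a small stand-in for D3Q15. The two-pass version writes the propagated entries to an intermediate array and reads them again for the collision; the fused version gathers the incoming entries of each node and collides them immediately, so the intermediate array disappears. The BGK collision, `TAU` and the function names are assumptions for this sketch, not details of the LBultra kernel.

```python
C = [0, 1, -1]               # D1Q3 velocities
W = [2 / 3, 1 / 6, 1 / 6]    # D1Q3 weights
TAU = 0.8                    # BGK relaxation time (assumed)


def relax(entries):
    """BGK collision of the three entries of one node."""
    rho = sum(entries)
    u = sum(fi * c for fi, c in zip(entries, C)) / rho
    feq = [
        w * rho * (1 + 3 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
        for w, c in zip(W, C)
    ]
    return [fi - (fi - fe) / TAU for fi, fe in zip(entries, feq)]


def two_pass_step(f):
    """Propagate into an intermediate array, then collide in a second pass."""
    n = len(f)
    moved = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        for i, c in enumerate(C):
            moved[(x + c) % n][i] = f[x][i]
    return [relax(entries) for entries in moved]


def fused_step(f):
    """Gather the incoming entries of each node and collide them at once."""
    n = len(f)
    return [relax([f[(x - c) % n][i] for i, c in enumerate(C)]) for x in range(n)]


# Periodic line of 8 nodes with a density bump at node 0.
f0 = [[w * (1.2 if x == 0 else 1.0) for w in W] for x in range(8)]
a = two_pass_step(f0)
b = fused_step(f0)
```

Both variants produce the same distribution functions; the fused one simply touches memory once per entry and time step instead of twice.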
LBultra - performance
• Operational performance
  – GPU: 3x nVIDIA Tesla C1060 with 4 GB GPU RAM each ⇒ 12 GB RAM for simulation data
  – Simulation domain size: 915x457x457 (~191 million voxels), Reynolds number ~4800
  – Obstacle: sphere, center point at (228, 228, 228), radius = 57 nodes
  – Calculation speed: ~0.75 time steps per second ⇒ 143 MVPS
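The figures above can be cross-checked with a few lines of arithmetic. Reading MVPS as million voxels per second is consistent with the numbers: the voxel count times the step rate gives about 143 million. The memory estimate additionally assumes a single copy of 15 single-precision entries per voxel, which is not stated on the slide.

```python
nx, ny, nz = 915, 457, 457      # simulation domain size from the slide
steps_per_second = 0.75         # calculation speed from the slide

voxels = nx * ny * nz
mvps = voxels * steps_per_second / 1e6

# Assumption: one copy of 15 single-precision (4 byte) entries per voxel.
bytes_per_voxel = 15 * 4
gib_needed = voxels * bytes_per_voxel / 2**30
```

Under that assumption the distribution data alone needs a little under 11 GiB, which fits into the 12 GB provided by the three cards.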
LBultra - validation
• Common test case of a flow around a sphere
• Scenario 1 (Re range: [2.46–31.0])
  – Domain size: 112x48x48 vertices
  – Sphere: position (24; 24; 24), radius 8 vertices
• Scenario 2 (Re range: [77.4–1208])
  – Domain size: 224x96x96 vertices
  – Sphere: position (48; 48; 48), radius 16 vertices
• Boundary surfaces setup
  – Inflow: homogeneous velocity, equal to the average velocity in the adjacent plane
  – other: zero normal velocity gradient
• Flow acceleration
  – acceleration using a volumetric force in the predefined inflow direction
  – the strength of the force is computed from the difference between the flow’s current velocity in the inflow surface and the predefined velocity
• Measurement
  – 10000 time steps to enable convergence
  – cD measurement and averaging for 1000 further time steps
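The measurement procedure can be sketched with the standard definitions of the Reynolds number of a sphere, Re = u D / ν, and its drag coefficient, cD = 2 F / (ρ u² A) with frontal area A = π r². How the drag force is obtained from the lattice is not covered by the slides, so the sketch takes a list of per-step force values as given. All names and the sample numbers are hypothetical, and averaging the force before converting to cD assumes a constant reference velocity.

```python
import math


def reynolds_number(velocity, diameter, viscosity):
    """Re = u * D / nu, all quantities in lattice units."""
    return velocity * diameter / viscosity


def drag_coefficient(force, density, velocity, radius):
    """cD = 2 F / (rho * u^2 * A) with the sphere's frontal area A = pi r^2."""
    area = math.pi * radius ** 2
    return 2.0 * force / (density * velocity ** 2 * area)


def averaged_cd(forces, density, velocity, radius, warmup=10000, window=1000):
    """Skip the convergence phase, then average over the measurement window."""
    samples = forces[warmup:warmup + window]
    mean_force = sum(samples) / len(samples)
    return drag_coefficient(mean_force, density, velocity, radius)


# Made-up force history: zero during warm-up, constant afterwards.
forces = [0.0] * 10000 + [1.0] * 1000
cd = averaged_cd(forces, density=1.0, velocity=0.1, radius=8)
re = reynolds_number(velocity=0.1, diameter=16, viscosity=0.05)
```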
LBultra - live presentation
LBultra - outlook
• implementation of new LBM kernels (CUDA and CPU) capable of local refinement (current)
• interfaces for CAD data import
• interfaces for simulation data export and import to enable tool chaining
• improvements to the 3D online visualization
• interactive GUI to improve usability
• extension of the LBM kernels to enable turbulence modeling for higher Reynolds numbers
CUDA hardware
• nVIDIA Tesla preferred provider for customers in research & development
• Hardware distribution of GPU components, workstations and clusters
• CUDA software development and optimization services
• Hardware
  – Tesla C1060 (1x GPU Quadro FX 5800, 4 GB 512-bit GDDR3)
  – FluiDyna Tesla workstations, up to 4 Tesla C1060
  – Tesla S1070 (4x GPU Quadro FX 5800, 16 GB 512-bit GDDR3)
CUDA hardware: FluiDyna preconfigured cluster
Configuration: FluiDyna 4 Tesla S1070 Preconfigured Cluster
GPUs: 16 Tesla T10 GPUs
CPU servers: 4 HPC-Server FluiDyna WS 1xIQ-16 (single socket)
CPU: 4 Intel Core i7-920 QC 2.66 GHz
Memory: 64 GB GPU memory, up to 96 GB CPU DDR3 1333 memory
Cluster headnode: 1 HPC-Server FluiDyna WS 1xIQ-16 (single socket) i7-920
Storage: up to 32 TB SATA drives
HPC network: SDR Infiniband with 8 way switch
Communication: GBit Ethernet
Conclusions
• The Lattice-Boltzmann-Method is very suitable for implementation with CUDA
  – achieves high computational performance
  – valid simulation results
• The C-like programming language enables fast entry into CUDA technology
• High-performance CUDA programs require high optimization effort
  – simple software porting is not enough!
• nVIDIA’s GPU devices offer revolutionary performance in many fields:
  – ~2000% faster
  – ~1000% cheaper
  – ~3700% less space consumption
  – ~1000% greener (better energy efficiency)