16 – 18 September 2010, BULGARIA
Proceedings of the International Conference on Information Technologies (InfoTech-2010) 16-18 September 2010, Bulgaria
COMPARISON OF TWO PARALLEL ALGORITHMS FOR CLOTH SIMULATION¹
Tzvetomir Ivanov Vassilev
University of Ruse
e-mail: [email protected]
Bulgaria
Abstract: This paper describes two parallel algorithms for cloth simulation, using a mass-spring cloth model and an image-based approach to collision detection and response. The algorithms are implemented on a GPU using OpenCL, the newest API for parallel programming of GPUs, CPUs and other devices. The speed of the two algorithms is measured and the results are presented. Conclusions are drawn at the end of the paper.
Key words: parallel algorithms, GPU programming.
1. INTRODUCTION
Physical simulation and elastic deformable objects have been widely used by researchers in computer graphics. The main applications of garment simulation are in the entertainment industry, in fashion design and in electronic commerce, where customers shop for garments on the web and try them on in a virtual booth. The graphics processing unit (GPU) on today's commodity video cards has evolved into an extremely powerful and flexible processor (Houston, 2007). The latest graphics architectures provide tremendous memory bandwidth and computational power, with fully programmable vertex and pixel processing units that support vector operations up to full IEEE floating point precision. High-level languages have emerged for graphics hardware, making this computational power accessible. Architecturally, GPUs are highly parallel streaming processors optimized for vector operations, with single instruction, multiple data (SIMD) pipelines. Not
¹ Partially supported by the national research fund FNI10
surprisingly, these processors are capable of general-purpose computation beyond the graphics applications for which they were designed, and many researchers have utilized them in cloth modelling (Zeller, 2005), (Georgii and Westermann, 2005). The nature of the mass-spring system makes it suitable for GPU implementation. NVIDIA (Zeller, 2005) have provided a free sample demo of mass-spring cloth simulation on their graphics processors. Modern GPUs can also be used for many other general-purpose applications that are suitable for parallelization. There are two ways of programming these GPUs: through a graphics API, which can be used for both graphics applications and general-purpose computations, and through a parallel API, which is designed mainly for general computations. The aim of this work is to compare the performance of two algorithms for the parallel implementation of a mass-spring cloth model using OpenCL. The rest of the paper is organized as follows. The next section reviews cloth simulation techniques and describes the cloth model used in this work. Section 3 explains how the GPU can be programmed. Section 4 proposes two algorithms for the implementation of the cloth model on a parallel device, Section 5 gives results and the last section concludes the paper.
2. CLOTH MODELLING
Methods to model cloth for computer graphics have been investigated for more than two decades. Mass-spring particle systems are mainly used (Desbrun et al., 1999), (Provot, 1995), while some authors employ finite element methods (Volino et al., 1995). Provot introduced a simple mass-spring topology (see Figure 1), which is commonly used owing to its efficiency and simplicity. He used linear (Hooke) springs and applied explicit Euler integration. To account for the super-elongation caused by the linear springs, the particles' positions are constrained in a post-correction step, so that springs do not exceed their natural length by more than 5-10%. Vassilev et al.
(Vassilev et al., 2001) extended this approach by modifying the particles' velocities instead of their positions. The elastic model of cloth is a mesh of l×n mass points, each of them being linked to its neighbours by massless springs of natural length greater than zero. There are three different types of spring (Figure 1): "stretch", "shear" and "bending" springs. As the names indicate, the first spring type implements resistance to stretching, the second to shearing and the third to bending.
Figure 1. Spring types in the cloth model (stretch, shear and bend springs)
Let $p_{ij}(t)$, $v_{ij}(t)$, $a_{ij}(t)$, where i = 1,…,l and j = 1,…,n, be respectively the positions,
velocities, and accelerations of the mass points at time t. The system is governed by Newton's second law:
$f_{ij} = m_{ij} a_{ij}$,    (1)
where $m_{ij}$ is the mass of point ij and $f_{ij}$ is the sum of all forces applied at point ij. The force $f_{ij}$ can be divided into two categories: internal and external. Internal forces arise from the tensions of the springs. The overall internal force applied at the point ij is a result of the stiffness of all springs linking this point to its neighbours. The external forces can differ in nature depending on what type of simulation we wish to make. The forces most frequently included are gravity, viscous damping, collision response, seaming, etc. The fundamental equations of Newtonian dynamics can be integrated over time by a simple Euler method. The Euler method is known to be very fast and to give good results, provided the time step $\Delta t$ is less than the natural period of the system $T \approx \pi\sqrt{m/K}$, where K is the highest stiffness in the system. Numerous recent works in cloth simulation, see for example (Baraff and Witkin, 1998), have shown that improvements in stability are possible by using implicit integration. However, for complex garments with mapping of KES measurements to the spring properties, explicit integration still proved to be beneficial in terms of efficiency in our case. The advantages of Euler integration became particularly apparent when the computation of collision detection and response, which requires small time steps, was taken into consideration. Similar results were also reported by Volino et al. (1995).
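As a minimal sketch of the integration scheme described above, the following Python fragment computes a time step from the natural period, read here as T ≈ π√(m/K), and advances one mass point by one Euler step. The function names, the safety factor of 0.1 and the choice to update the position with the newly computed velocity (following the order of operations in the kernels of Section 4) are assumptions made for illustration; the paper does not specify them.

```python
import math


def stable_time_step(mass, max_stiffness, safety=0.1):
    """Return a time step that is a fraction of the natural period.

    The period is taken as T = pi * sqrt(m / K), with K the highest
    stiffness in the system. The safety factor is an assumed value.
    """
    period = math.pi * math.sqrt(mass / max_stiffness)
    return safety * period


def euler_step(position, velocity, force, mass, dt):
    """Advance one mass point by one Euler step (3-vectors as tuples)."""
    # Equation (1): a = f / m
    acceleration = tuple(f / mass for f in force)
    # Velocity first, then position from the new velocity (assumed order).
    new_velocity = tuple(v + a * dt for v, a in zip(velocity, acceleration))
    new_position = tuple(p + v * dt for p, v in zip(position, new_velocity))
    return new_position, new_velocity


if __name__ == "__main__":
    dt = stable_time_step(mass=0.01, max_stiffness=100.0)
    pos, vel = euler_step((0.0, 0.0, 0.0), (0.0, 0.0, 0.0),
                          (0.0, -0.0981, 0.0), 0.01, dt)
    print(dt, pos, vel)
```

In a full simulation the force argument would be the sum of the internal spring forces and the external forces listed above, and the step would be repeated for every mass point.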
3. API FOR GPGPU
3.1. Graphics APIs
GPGPU programmers usually use OpenGL or Direct3D (DirectX). OpenGL tends to be favoured in the academic community because of its platform portability: it works on Windows, Unix and MacOS. DirectX/Direct3D, on the other hand, tends to be favoured in the computer game industry, where dependence on Windows is not a particular impediment. OpenGL is used in this work because of its platform independence. There are two programming languages that can be used to write GPU programs (which are called shaders) with OpenGL. The first is Cg and the second is the GL shading language (GLSL). Cg is supported only by NVIDIA and requires additional libraries to be installed, while GLSL can be compiled by the driver of the graphics card if the OpenGL version is 2.0 or above.
3.2. Parallel APIs
Compute Unified Device Architecture, or CUDA, is a parallel computing architecture developed by NVIDIA, which enables dramatic increases in computing performance by harnessing the power of its GPUs: GeForce, ION, Quadro, and
Tesla. Programmers use 'C for CUDA', which is C with NVIDIA extensions, to write programs for parallel execution on the GPU, which are called kernels. This language is compiled through a PathScale Open64 C compiler. CUDA is supported only by NVIDIA graphics cards.
OpenCL (Open Computing Language) is a framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. Its architecture is similar to NVIDIA's CUDA. OpenCL was initially developed by Apple, but later refined by the Khronos group (Khronos Group, 2008). It is a relatively new API and was released in Mac OS X v10.6 ("Snow Leopard") in August, 2009. ATI and NVIDIA released their first drivers supporting OpenCL in the second half of 2009.
4. CLOTH SIMULATION ON A PARALLEL DEVICE
There are two possible algorithms for the implementation of the mass-spring model, described in Section 2, on a parallel device.
Algorithm 1. Edge-point algorithm (EPA)
Kernel 1:
  For each spring
    Compute spring force
  End for
Kernel 2:
  For each mass point
    Add spring forces
    Compute external forces
    Compute velocity
    Correct super-elasticity
    Do collision detection and response
    Compute new position
  End for
Algorithm 2. Pure point algorithm (PPA)
Kernel 1:
  For each mass point
    Compute spring forces
    Compute external forces
    Compute velocity
    Correct super-elasticity
    Do collision detection and response
    Compute new position
  End for
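The difference between the two algorithms can be illustrated with a sequential Python emulation of the spring-force part only. This is a sketch under stated assumptions, not the OpenCL implementation used in the paper: each loop iteration stands in for one parallel work-item, springs are given as hypothetical tuples (first point, second point, stiffness, natural length), the two ends of a spring are assumed never to coincide, and the remaining kernel steps (external forces, velocity, super-elasticity correction, collisions, position) are omitted.

```python
import math


def spring_force(p_first, p_second, stiffness, rest_length):
    """Hooke force acting on the first end of a spring."""
    delta = [b - a for a, b in zip(p_first, p_second)]
    length = math.sqrt(sum(d * d for d in delta))  # assumed non-zero
    scale = stiffness * (length - rest_length) / length
    return [scale * d for d in delta]


def forces_epa(positions, springs):
    """Edge-point algorithm: two passes (two kernels)."""
    # Kernel 1: one work-item per spring computes the spring force once.
    per_spring = [spring_force(positions[a], positions[b], k, l0)
                  for a, b, k, l0 in springs]
    # Kernel 2: one work-item per mass point gathers the stored forces,
    # with a negative sign when the point is the second end of the spring.
    totals = []
    for point in range(len(positions)):
        total = [0.0, 0.0, 0.0]
        for (a, b, _k, _l0), force in zip(springs, per_spring):
            if point == a:
                total = [t + f for t, f in zip(total, force)]
            elif point == b:
                total = [t - f for t, f in zip(total, force)]
        totals.append(total)
    return totals


def forces_ppa(positions, springs):
    """Pure point algorithm: one pass, each spring evaluated at both ends."""
    totals = []
    for point in range(len(positions)):
        total = [0.0, 0.0, 0.0]
        for a, b, k, l0 in springs:
            if point == a:
                other = b
            elif point == b:
                other = a
            else:
                continue
            force = spring_force(positions[point], positions[other], k, l0)
            total = [t + f for t, f in zip(total, force)]
        totals.append(total)
    return totals


if __name__ == "__main__":
    points = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
    springs = [(0, 1, 10.0, 1.0), (1, 2, 5.0, 1.0)]
    print(forces_epa(points, springs))
    print(forces_ppa(points, springs))
```

Both functions return the same net spring force per mass point; the PPA variant simply evaluates every spring twice in exchange for needing a single pass. Scanning all springs for every point is done here only for brevity.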
In fact the "for loops" are not specified in the kernels: the operations inside are performed in parallel and are scheduled by the GPU control unit. Algorithm 1 (EPA) requires two kernels. The first one computes the tension force for each spring. The second is executed for each mass point: it adds the spring and external forces, computes the new velocities, tests for collisions and applies responses, and finally computes the new positions.
Data structures needed for the EPA algorithm:
• Spring buffer: stores the first and second point, the spring stiffness and the natural length of each spring.
• Connectivity buffer: needed to distribute the spring forces to each mass point. It has a maximum of 16 entries for each mass point and each entry holds the number of a spring in which this point is involved. In addition, a positive number shows that the mass point is the first end of the spring and a negative number shows that it is the second end. This is necessary because the spring force is computed with respect to the first spring point and should be applied to the second point with a negative sign.
• Velocity and position buffers: store the velocity and position of each mass point.
• Normal vector buffer: stores the normal vector at each cloth vertex.
• Seam buffer: stores the seaming information for each mass point.
The idea of Algorithm 2 (PPA) is that there is no loop over the springs. The computation of internal forces is included in the loop over the mass points. In this way the force due to each spring is computed twice, but only one kernel is required, so this may be more efficient than first iterating over every spring. There is no need for a spring buffer in this algorithm, because the information for each spring is kept in the connectivity buffer.
Data structures needed for the PPA algorithm:
• Connectivity buffer: needed to distribute the spring forces to each mass point.
It has a maximum of 16 entries for each mass point and each entry holds the point number of the other spring end, the spring stiffness and the natural length.
• Velocity and position buffers: store the velocity and position of each mass point.
• Normal vector buffer: stores the normal vector at each cloth vertex.
• Seam buffer: stores the seaming information for each mass point.
Figure 2 shows the scenario used to test the implemented algorithms. Collision detection is performed between the cloth and the table, as well as between the separate cloth pieces. In both cases an image-space based algorithm is applied, as described in (Vassilev et al., 2001).
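The two connectivity buffers described above could be laid out as flat, fixed-size arrays with 16 slots per mass point. The Python sketch below shows one possible layout; it is illustrative rather than the paper's actual memory format. In particular, the 1-based spring numbering (needed so that the sign of spring 0 is not lost), the use of 0 and None to mark unused slots, and the function names are assumptions.

```python
MAX_SPRINGS_PER_POINT = 16  # per-point limit stated in the text


def build_epa_connectivity(num_points, springs):
    """EPA layout: signed, 1-based spring numbers; 0 marks an unused slot.

    A positive entry means the point is the first end of that spring,
    a negative entry means it is the second end.
    """
    buffer = [0] * (num_points * MAX_SPRINGS_PER_POINT)
    used = [0] * num_points
    for number, (first, second, _k, _l0) in enumerate(springs, start=1):
        for point, entry in ((first, number), (second, -number)):
            if used[point] == MAX_SPRINGS_PER_POINT:
                raise ValueError("too many springs at point %d" % point)
            buffer[point * MAX_SPRINGS_PER_POINT + used[point]] = entry
            used[point] += 1
    return buffer


def build_ppa_connectivity(num_points, springs):
    """PPA layout: (other point, stiffness, natural length) per slot.

    None marks an unused slot; no separate spring buffer is needed.
    """
    buffer = [None] * (num_points * MAX_SPRINGS_PER_POINT)
    used = [0] * num_points
    for first, second, stiffness, rest_length in springs:
        for point, other in ((first, second), (second, first)):
            if used[point] == MAX_SPRINGS_PER_POINT:
                raise ValueError("too many springs at point %d" % point)
            slot = point * MAX_SPRINGS_PER_POINT + used[point]
            buffer[slot] = (other, stiffness, rest_length)
            used[point] += 1
    return buffer


if __name__ == "__main__":
    # Hypothetical springs: (first point, second point, stiffness, length)
    springs = [(0, 1, 10.0, 1.0), (1, 2, 5.0, 1.5)]
    print(build_epa_connectivity(3, springs)[:18])
    print(build_ppa_connectivity(3, springs)[:18])
```

With this layout a work-item for mass point i reads slots 16·i to 16·i + 15, so each point can find its springs without searching the whole spring list.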
Figure 2. Cloth pieces draping on a round table
5. RESULTS
The two algorithms were implemented in OpenCL under Windows and Linux. Several pieces of cloth were draped on a virtual round table as shown in Figure 2. OpenGL was used for the visualization and for rendering the maps used in collision detection and response. In order to compare performance, the time for 2000 cloth simulation iterations was measured. Measurements in Windows and Linux were similar, so only the Windows results are shown in Figures 3 and 4. Times were measured for three different numbers of cloth mass points (1: 3976, 2: 4659, 3: 6134) and two variations of the simulation: without cloth-cloth collision detection and response (Figure 3) and with cloth-cloth collision detection and response (Figure 4). One would expect the EPA algorithm to perform better, because the spring forces are computed only once and then distributed to the mass points. However, the results show that the PPA algorithm outperforms the EPA by an average of 25 per cent if no cloth-cloth collisions are considered and by 19 per cent with cloth-cloth collision detection and response. This is due to the parallel character of the computations and probably because the pure point approach can be implemented in only one kernel.
6. CONCLUSION
Two parallel algorithms for a mass-spring cloth model were implemented on modern GPUs using OpenCL. The edge-point algorithm (EPA) first computes the force for each spring and then distributes the forces to each mass point. The pure point algorithm (PPA) computes the forces for each mass point, which means that each spring force is computed twice. The results, however, show that the PPA outperforms the EPA by 19 to 25 per cent, which is due to the parallel character of the computations.
[Chart: time (sec) on NVIDIA GPU GeForce GT 330M for the EPA and PPA, test cases 1 to 3]
Figure 3. Speed comparison of the two algorithms without cloth-cloth collision detection
[Chart: time (sec) on NVIDIA GPU GeForce GT 330M for the EPA and PPA, test cases 1 to 3]
Figure 4. Speed comparison of the two algorithms with cloth-cloth collision detection
REFERENCES
Baraff, D., A. Witkin (1998). Large steps in cloth simulation. Computer Graphics, Proceedings of SIGGRAPH'98, Annual Conference Series, pp. 43–54.
Desbrun, M., P. Schroeder, A. Barr (1999). Interactive animation of structured deformable objects. Proceedings of Graphics Interface, pp. 1–8. Canadian Computer-Human Communications Society.
Georgii, J., R. Westermann (2005). Mass-spring systems on the GPU. Simulation Modelling Practice and Theory, 13:693–702.
Houston, M. (2007). General-Purpose Computation on Graphics Hardware. SIGGRAPH 2007 GPGPU Course, http://www.gpgpu.org/s2007/
Khronos Group (2008). Khronos Launches Heterogeneous Computing Initiative. http://www.khronos.org/news/press/releases/khronos_launches_heterogeneous_computing_initiative/. Retrieved 2008.
Provot, X. (1995). Deformation constraints in a mass-spring model to describe rigid cloth behaviour. Proceedings of Graphics Interface'95, pp. 141–155.
Vassilev, T., B. Spanlang, Y. Chrysanthou (2001). Fast Cloth Animation on Walking Avatars. Computer Graphics Forum, 20(3), 260–267.
Volino, P., M. Courchesne, N. Magnenat-Thalmann (1995). Versatile and efficient techniques for simulating cloth and other deformable objects. Proceedings of SIGGRAPH'95, pp. 137–144.
Zeller, C. (2005). Cloth simulation on the GPU. SIGGRAPH'05: ACM SIGGRAPH 2005 Sketches, page 39, New York, NY, USA.