GPU-Accelerated Crack Path Computation Based on a Phase Field Approach for Brittle Fracture

Alexander Schlüter, Institute of Applied Mechanics, University of Kaiserslautern
Adrian Willenbücher, Institute of Computer Science, University of Kaiserslautern
Charlotte Kuhn, Institute of Applied Mechanics, University of Kaiserslautern
Abstract—In recent years, a new approach to the analysis of fracturing has been developed. The so-called phase field models approximate cracks by a scalar, macroscopic field variable that distinguishes between broken and undamaged material. The phase field approach to fracture has significant advantages over more established methods. However, it is necessary to solve a coupled set of nonlinear partial differential equations to compute the evolving crack path. Often a finite element scheme is employed to compute the solution numerically, which leads to a large number of unknowns. Here, parallel computing techniques like Graphics Processing Unit (GPU) computing can significantly decrease the computing time. This work illustrates how GPU computing can accelerate the computationally expensive calculations associated with a phase field model for dynamic brittle fracture.

Keywords—dynamic fracture, phase field, GPU computing, CUDA.
I. INTRODUCTION

Dynamic fracture phenomena occur in many technical applications that include impact loading. Hence it is important to provide tools that can predict those failure mechanisms correctly. For more complex problems it is necessary to apply numerical methods. Most established methods, like the Extended Finite Element Method (X-FEM), e.g. in [1], and the Virtual Crack Closure Technique, e.g. in [2], model the crack explicitly. This requires tracking of the evolving crack, which can become very difficult, especially in 3D. In contrast to this, the phase field method has no need to track the crack path because the crack evolution follows implicitly from the solution of a set of partial differential equations. This set of equations describes not only the mechanics of the regarded body but also the evolution of an additional scalar field variable, the phase field, which distinguishes between undamaged and broken material. However, the quasi-static phase field model for brittle fracture presented in [3] neglects the material's inertia and is therefore not able to treat dynamic fracture problems. One aspect of this paper is the presentation of a phase field approach that extends the model
A. Schlüter, Institute of Applied Mechanics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany, e-mail: [email protected]
A. Willenbücher, Institute of Computer Science, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany, e-mail: [email protected]
C. Kuhn, Institute of Applied Mechanics, University of Kaiserslautern, Postfach 3049, 67653 Kaiserslautern, Germany, e-mail: [email protected]
from [3] to the dynamic case. Similar phase field models for dynamic brittle fracture are presented, for example, in [4] and [5]. As mentioned above, the application of a phase field model to crack growth is particularly beneficial in 3D. However, the numerical solution of 3D problems, e.g. with finite elements, always involves a large number of unknowns. In fact, a large algebraic system of linear equations (SLE) has to be solved in every iteration step of every time step. Typically this is done by iterative solvers, which perform a number of highly parallelizable matrix-vector operations. Since dynamic problems often require many time steps, parallel computing techniques like GPU computing can considerably decrease the computing time. GPUs have up to a few hundred cores that can deliver a much greater computing power than a CPU of the same size. However, GPUs are focused solely on computation, which means that CPUs are still needed to access data from the disk, to perform serial tasks, etc. Hence GPU-accelerated applications combine the flexibility of CPUs with the computing power of GPUs. For the aspects of GPU programming, the reader is referred to the literature (e.g. [6] and [7]). The second part of this paper describes the application of GPU computing to a finite element implementation of the aforementioned extended phase field model for dynamic brittle fracture. Two benchmark problems show the performance of the GPU-accelerated computations relative to standard serial computations on a single CPU.

II. A PHASE FIELD MODEL FOR DYNAMIC FRACTURE
A linear elastic body Ω ⊂ R³ with Lamé constants λ and μ, mass density ρ and an internal crack set Γ is considered, see Fig. 1a). The mechanical displacement of a point x ∈ Ω is labelled u(x, t), where the field u(x, t) satisfies Dirichlet boundary conditions u(x, t) = u*(x, t) on ∂Ω_u and traction boundary conditions σn = t* on ∂Ω_t. Here, σ denotes the Cauchy stress tensor and n is the outward directed normal vector on the boundary ∂Ω = ∂Ω_u ∪ ∂Ω_t. The linearized strain tensor
$$\varepsilon = \tfrac{1}{2}\left(\nabla u + \nabla^T u\right)$$
serves as a strain measure. In the present model, the phase field s(x, t) varies smoothly between s = 1 in undamaged material and s = 0 in broken material, see Fig. 1b). The model is based on the idea of the classical Griffith theory, which predicts crack growth if it becomes energetically favourable. Therefore it is necessary to formulate an energy functional that describes the fracture, the elastic and the kinetic energy of the body.
Figure 1. Body with internal discontinuities (sharp cracks) Γ a) and approximation of internal discontinuities by a phase field s(x, t) b).
The approximation of the fracture energy reads
$$\int_\Gamma G_c \,\mathrm{d}A \;\approx\; \int_\Omega \underbrace{G_c \left[ \frac{(1-s)^2}{4\epsilon} + \epsilon\,|\nabla s|^2 \right]}_{\psi_s} \mathrm{d}V, \qquad (1)$$
where G_c is the cracking resistance and ε is a length scale characterizing the width of the regularized crack. In the limit ε → 0 the phase field approximation of the fracture energy (1) is exact, see [8] for details. The phase field s is linked to the elastic energy density ψ_e to model the loss of stiffness in broken material. If the crack field indicates broken material, i.e. s = 0, the material stiffness is reduced to a small residual stiffness, controlled by the parameter η ≪ 1, which is introduced to avoid numerical problems. The definition
$$\psi_e = \underbrace{\frac{K}{2}\left(\mathrm{tr}^-(\varepsilon)\right)^2}_{\text{compression}} + (s^2 + \eta)\,\underbrace{\left[\frac{K}{2}\left(\mathrm{tr}^+(\varepsilon)\right)^2 + \mu\,\left(\varepsilon^D : \varepsilon^D\right)\right]}_{\psi_{e,ts}} \qquad (2)$$
of the elastic energy is taken from [9]. The expressions tr⁻(ε) = min{0, tr(ε)} and tr⁺(ε) = max{0, tr(ε)} describe the negative and the positive volumetric strain, whereas ε^D = ε − (tr(ε)/3) 1 denotes the deviatoric part of ε. The bulk modulus can be expressed in terms of the Lamé constants, i.e. K = λ + (2/3) μ. In order to achieve realistic crack behaviour in compression, the compressive part of ψ_e is not affected by the crack field. By means of (1) and (2) it is possible to formulate Hamilton's principle
$$\delta \int_{t_1}^{t_2} \left[\, \int_\Omega L(\dot{u}, \nabla u, s, \nabla s)\,\mathrm{d}V + \int_{\partial\Omega} t^* \cdot u\,\mathrm{d}A \right] \mathrm{d}t = 0 \qquad (3)$$
for the considered body. The Lagrange density L can be expressed as
$$L(\dot{u}, \nabla u, s, \nabla s) = \frac{1}{2}\,\rho\,\dot{u}\cdot\dot{u} - \psi_e - \psi_s. \qquad (4)$$
Equation (3) yields the coupled Euler–Lagrange equations
$$\rho\,\ddot{u} = \mathrm{div}\,\sigma, \qquad 2\,s\,\psi_{e,ts} - G_c\left(2\,\epsilon\,\Delta s + \frac{1-s}{2\,\epsilon}\right) = 0, \qquad \sigma n = t^* \text{ on } \partial\Omega_t, \qquad \nabla s \cdot n = 0 \text{ on } \partial\Omega. \qquad (5)$$
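For readers who want to trace the step from (3)-(4) to the second equation of (5), the variation with respect to the crack field can be written out explicitly; this brief sketch only uses the definitions (1), (2) and (4) given above and is not part of the original derivation.
$$\frac{\partial \psi_s}{\partial s} = -G_c\,\frac{1-s}{2\epsilon}, \qquad \frac{\partial \psi_s}{\partial \nabla s} = 2\,G_c\,\epsilon\,\nabla s, \qquad \frac{\partial \psi_e}{\partial s} = 2\,s\,\psi_{e,ts}.$$
Inserting these into the Euler–Lagrange equation for s,
$$\frac{\partial L}{\partial s} - \mathrm{div}\!\left(\frac{\partial L}{\partial \nabla s}\right) = -2\,s\,\psi_{e,ts} + G_c\,\frac{1-s}{2\epsilon} + 2\,G_c\,\epsilon\,\Delta s = 0,$$
and multiplying by −1 recovers the evolution equation in (5); the equation of motion ρü = div σ follows analogously from the variation with respect to u, with σ = ∂ψ_e/∂ε.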
Additionally, it is necessary to model the irreversibility of crack growth, which is done by prescribing the boundary condition
$$s(x, t > t^*_x) = 0 \quad \text{if} \quad s(x, t^*_x) = 0 \qquad (6)$$
for the crack field. In (6), t*_x is the time when the crack field becomes zero at the location x for the first time.
The set of partial differential equations (5) is solved simultaneously for the displacement field u and the crack field s with a finite element scheme. The geometry is discretized by isoparametric, eight-node brick elements with trilinear shape functions and four degrees of freedom at each node. Furthermore, an implicit Newmark method
$$\dot{u}_{n+1} = \dot{u}_n + \Delta t\,\ddot{u}_\gamma, \qquad \ddot{u}_\gamma = (1-\gamma)\,\ddot{u}_n + \gamma\,\ddot{u}_{n+1}, \quad 0 \le \gamma \le 1,$$
$$u_{n+1} = u_n + \Delta t\,\dot{u}_n + \tfrac{1}{2}\,\Delta t^2\,\ddot{u}_\beta, \qquad \ddot{u}_\beta = (1-2\beta)\,\ddot{u}_n + 2\beta\,\ddot{u}_{n+1}, \quad 0 \le \beta \le 1, \qquad (7)$$
with parameters γ = β = 0.5 is used for time integration (t_n → t_{n+1} = t_n + Δt). The time step size Δt is chosen by an automatic time step size control that adapts the time step size to the number of Newton iterations needed for convergence in the previous time step.
The finite element discretization of (5) and the discretization in time (7) lead to a system of nonlinear equations
$$R(d_{n+1}) = 0 \qquad (8)$$
for the unknown nodal displacements u_{n+1} and crack field values s_{n+1} at time t_{n+1}, which are summarized in the vector d_{n+1}. Newton's method requires a linearization of (8),
$$R\!\left(d_{n+1}^{(k+1)}\right) = R\!\left(d_{n+1}^{(k)}\right) + \Delta R\!\left(d_{n+1}^{(k)}\right) \approx R\!\left(d_{n+1}^{(k)}\right) - S_{n+1}^{(k)}\,\Delta d_{n+1}^{(k)} = 0, \qquad (9)$$
where the index (k) indicates the current Newton iteration and S^{(k)}_{n+1} is the overall tangent matrix of the problem. Equation (9) defines an iteration scheme in which a system of linear equations (SLE) has to be solved for Δd^{(k)}_{n+1} in each iteration. Subsequently, the vector of unknowns is updated according to
$$d_{n+1}^{(k+1)} = d_{n+1}^{(k)} + \Delta d_{n+1}^{(k)} \qquad (10)$$
and the norm ‖R(d^{(k+1)}_{n+1})‖ is compared to a predefined tolerance of δ = 10⁻¹⁶ to check for convergence.
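To make the interplay of (8)-(10) concrete, the following minimal C++ sketch shows one Newton solve with the convergence check described above. It is not FEAP code: the residual and the linear solve are passed in as placeholder callables that stand in for the assembly routines and the PCG solver discussed in the remainder of the paper, and the toy problem in main is purely illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

static double norm(const Vec& v) {
    double s = 0.0;
    for (double x : v) s += x * x;
    return std::sqrt(s);
}

// Newton iteration (9)-(10) for one time step: the residual R(d) and the
// linear solve S(d)^-1 R are supplied as callables standing in for the FEAP
// assembly routines and the (GPU-accelerated) PCG solver.
bool newtonStep(Vec& d,
                const std::function<Vec(const Vec&)>& residual,
                const std::function<Vec(const Vec&, const Vec&)>& solve,
                int kMax = 20, double tol = 1e-12) {
    for (int k = 0; k < kMax; ++k) {
        Vec R = residual(d);                        // R(d^(k))
        if (norm(R) < tol) return true;             // convergence check
        Vec delta = solve(d, R);                    // S^(k) * delta = R(d^(k))
        for (std::size_t i = 0; i < d.size(); ++i)
            d[i] += delta[i];                       // update (10)
    }
    return false;                                   // no convergence
}

int main() {
    // Toy scalar problem: residual R(d) = 8 - d^3, tangent S = 3 d^2.
    Vec d{1.0};
    auto residual = [](const Vec& x) { return Vec{8.0 - x[0] * x[0] * x[0]}; };
    auto solve = [](const Vec& x, const Vec& R) { return Vec{R[0] / (3.0 * x[0] * x[0])}; };
    newtonStep(d, residual, solve);
    std::printf("root: %f\n", d[0]);                // converges to 2.0
}
```

In the actual implementation, the linear solve is the GPU-accelerated PCG method described below, and a Newton loop that fails to converge triggers the automatic time step size control.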
The required solution of the SLEs in every iteration step of the Newton algorithm is done with a Preconditioned Conjugate Gradient (PCG) method in this work. A description of the CG method can be found in [10]. Here the algorithm is summarized:
• Initialization: Choose Δd^{(k)}_{n+1,0} arbitrarily and set p_0 = r_0 = R(d^{(k)}_{n+1}) − S^{(k)}_{n+1} Δd^{(k)}_{n+1,0}.
• Loop: i = 0, 1, ...
  ◦ if ‖p_i‖ < tolerance: stop, Δd^{(k)}_{n+1,i} is the solution of (9)
  ◦ else:
$$a_i = \frac{r_i^T r_i}{p_i^T S_{n+1}^{(k)} p_i}, \qquad \Delta d_{n+1,i+1}^{(k)} = \Delta d_{n+1,i}^{(k)} + a_i\,p_i, \qquad r_{i+1} = r_i - a_i\,S_{n+1}^{(k)} p_i,$$
$$\beta_i = \frac{r_{i+1}^T r_{i+1}}{r_i^T r_i}, \qquad p_{i+1} = r_{i+1} + \beta_i\,p_i.$$
The CG method requires only one matrix-vector multiplication, S^{(k)}_{n+1} p_i, and the storage of four vectors, Δd^{(k)}_{n+1,i}, r_i, p_i and S^{(k)}_{n+1} p_i. Additionally, six inner products need to be computed. The speed of convergence of the CG method increases with decreasing condition number c = cond(S^{(k)}_{n+1}). This behaviour can be exploited by so-called preconditioning techniques, which accelerate the CG method. The idea is to approximate the positive definite matrix S^{(k)}_{n+1} by another positive definite matrix B, the preconditioner, so that B⁻¹ S^{(k)}_{n+1} is close to the unity matrix 1. Then it is possible to rewrite (9) as
$$B^{-\frac{1}{2}}\, S_{n+1}^{(k)}\, B^{-\frac{1}{2}}\; B^{\frac{1}{2}}\, \Delta d_{n+1}^{(k)} = B^{-\frac{1}{2}}\, R\!\left(d_{n+1}^{(k)}\right). \qquad (11)$$
The matrix
$$S' = B^{-\frac{1}{2}}\, S_{n+1}^{(k)}\, B^{-\frac{1}{2}} \qquad (12)$$
has a much smaller condition number than S^{(k)}_{n+1}, i.e. cond(S′) ≪ cond(S^{(k)}_{n+1}). As a consequence the CG method will need fewer iterations to solve
$$S'\,\Delta d' = R' \qquad (13)$$
with
$$\Delta d' = B^{\frac{1}{2}}\, \Delta d_{n+1}^{(k)} \qquad (14)$$
and
$$R' = B^{-\frac{1}{2}}\, R\!\left(d_{n+1}^{(k)}\right). \qquad (15)$$
The desired solution Δd^{(k)}_{n+1} follows from (14). Hence another criterion for a good choice of the preconditioner B is that (14) should be easily solvable. In this work the preconditioner is chosen to be the diagonal of the tangent matrix, i.e.
$$B = \mathrm{diag}\!\left(S_{n+1}^{(k)}\right). \qquad (16)$$
The highly parallelizable matrix-vector operations which are fundamental to the PCG method can be accelerated on a GPU. Since an SLE has to be solved in every Newton iteration of every time step, the overall computing time for a finite element analysis can be reduced considerably. This is especially true if the solution of an SLE is the main computational effort, which is the case for problems with a large number of unknowns, e.g. 3D problems.
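As an illustration of the algorithm summarized above, the following self-contained C++ sketch implements the CG iteration with the diagonal (Jacobi) preconditioner (16) for a small dense test matrix. Here the preconditioner is applied directly to the residual, which is algebraically equivalent to solving the transformed system (11)-(15). The sketch is serial and only meant to make the sequence of vector operations explicit; the GPU implementation in Section III performs the same operations with CUSP on sparse matrices, and the stopping tolerance and test data are arbitrary.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;   // dense, row-major; the real code uses sparse storage

static Vec matVec(const Mat& A, const Vec& x) {
    Vec y(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i)
        for (std::size_t j = 0; j < x.size(); ++j)
            y[i] += A[i][j] * x[j];
    return y;
}

static double dot(const Vec& a, const Vec& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Conjugate gradient with Jacobi (diagonal) preconditioning, cf. (16):
// solves S * x = b for a symmetric positive definite matrix S.
Vec pcgJacobi(const Mat& S, const Vec& b, int maxIter = 1000, double tol = 1e-12) {
    const std::size_t n = b.size();
    Vec x(n, 0.0);                                   // initial guess x = 0
    Vec r = b;                                       // r = b - S*x
    Vec z(n);
    for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / S[i][i];   // z = B^{-1} r
    Vec p = z;
    double rz = dot(r, z);
    for (int k = 0; k < maxIter && std::sqrt(dot(r, r)) > tol; ++k) {
        Vec Sp = matVec(S, p);                       // the single matrix-vector product
        double alpha = rz / dot(p, Sp);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * Sp[i]; }
        for (std::size_t i = 0; i < n; ++i) z[i] = r[i] / S[i][i];
        double rzNew = dot(r, z);
        double beta = rzNew / rz;
        rz = rzNew;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return x;
}

int main() {
    Mat S = {{4, 1}, {1, 3}};                        // small SPD test matrix
    Vec b = {1, 2};
    Vec x = pcgJacobi(S, b);
    std::printf("x = (%f, %f)\n", x[0], x[1]);       // exact solution (1/11, 7/11)
}
```

Apart from the single matrix-vector product per iteration, the loop consists only of inner products and vector updates, which is exactly the structure that maps well onto a GPU.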
III. PARALLELIZED GPU-IMPLEMENTATION

Fig. 2 shows the basic work flow of a finite element analysis with the program FEAP¹ and the additionally developed GPU-accelerated subroutines. The finite element code FEAP serves as a basis, whereas the computation of the matrix structure and the solving of the SLE are reimplemented. These two tasks are chosen because they are parallelizable and require a significant amount of runtime. NVIDIA's CUDA 5.0² is used as a toolkit to develop these applications. CUDA allows GPU applications to be developed in high-level programming languages such as FORTRAN, C and C++. Therefore it is not necessary to employ assembly languages, which made GPU programming very uncomfortable in the past. Today there exist several applications that implement CUDA to allow GPU acceleration in commercial software such as MATLAB.

A. Sparse Matrix Formats

In this work, the open source C++ library CUSP³ is used to perform the sparse matrix operations that are fundamental to the PCG method. CUSP supports the following sparse matrix formats:
• Coordinate (COO)
• Compressed Sparse Row (CSR)
• Diagonal (DIA)
• ELLPACK's⁴ sparse matrix format (ELL)
• Hybrid ELL/COO format (HYB).
Unlike dense matrices, which are usually represented by two-dimensional arrays, sparse matrices need more sophisticated memory formats which avoid storing most or all zero entries, yet still allow for efficient operations. Some formats, like COO or CSR, are suitable for all sparse matrices, while others, like DIA or ELL, are optimized for special sparsity patterns, but are unsuited for matrices with different non-zero structures [11]. The GPU-based solver was tested with various formats, and the hybrid ELL/COO format proved to be the fastest one in the benchmarks.
For an m × n matrix with at most k entries per row, the ELL format [12] stores the matrix elements in an m × k array data and the column indices of the elements in an m × k array indices; the row index of each element corresponds to its row index in data and indices. For example, the 5 × 4 matrix A, where

    A = [ 1 0 2 0
          3 4 5 0
          0 6 0 0
          0 0 7 8
          0 0 0 9 ]

and k = 3, is stored as

    data = [ 1 2 −        indices = [ 0 2 −
             3 4 5                    0 1 2
             6 − −                    1 − −
             7 8 −                    2 3 −
             9 − − ]                  3 − − ].

¹ http://www.ce.berkeley.edu/projects/feap/
² http://www.nvidia.de/object/cuda-parallel-computing-de.html
³ http://cusplibrary.github.io/
⁴ http://cs.purdue.edu/ellpack/
Figure 2. Work flow of a GPU-accelerated finite element analysis with FEAP. Red color indicates tasks that are performed on the GPU whereas blue color illustrates the overall time loop.
Note that rows with fewer than k non-zeros, represented by "−", have to be padded in the data and indices arrays. The ELL format is particularly well-suited for matrices in which the average number of non-zeros per row is not much lower than the maximum number; if this is not the case, a lot of memory is wasted due to padding.
The COO format stores a matrix with k non-zero entries in three arrays row, col, and data with k elements each. The above matrix A with k = 9 would be stored as

    row  = [0 0 1 1 1 2 3 3 4]
    col  = [0 2 0 1 2 1 2 3 3]
    data = [1 2 3 4 5 6 7 8 9].

This format is simple and works equally well for all sparsity patterns, but a matrix-vector multiplication for it is not as fast as one for the ELL format. Therefore, a hybrid of these two formats (HYB) works well for matrices in which most rows have approximately the same number of non-zeros – which are stored in ELL format – and a few rows have significantly more than that – which are stored in COO format [11].
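To make the role of the data and indices arrays concrete, the following small C++ sketch performs the product y = A x directly on the ELL storage of the example matrix above. It is a serial illustration only; in a GPU kernel the loop over the rows is distributed so that each thread handles one row [11]. The padding convention used here (value 0 and column index −1) is an assumption of this sketch, not necessarily the one used by CUSP.

```cpp
#include <cstdio>
#include <vector>

// ELL storage of the 5x4 example matrix A with at most k = 3 entries per row.
// Unused slots are padded with value 0.0 and column index -1.
const int m = 5, k = 3;
const double data[m][k] = {{1, 2, 0}, {3, 4, 5}, {6, 0, 0}, {7, 8, 0}, {9, 0, 0}};
const int indices[m][k] = {{0, 2, -1}, {0, 1, 2}, {1, -1, -1}, {2, 3, -1}, {3, -1, -1}};

// y = A * x on the ELL arrays: each row is processed independently, which is
// why the format maps well onto one GPU thread per row.
std::vector<double> ellSpMV(const std::vector<double>& x) {
    std::vector<double> y(m, 0.0);
    for (int row = 0; row < m; ++row)
        for (int j = 0; j < k; ++j)
            if (indices[row][j] >= 0)
                y[row] += data[row][j] * x[indices[row][j]];
    return y;
}

int main() {
    std::vector<double> x = {1, 1, 1, 1};
    std::vector<double> y = ellSpMV(x);
    for (int row = 0; row < m; ++row)
        std::printf("y[%d] = %g\n", row, y[row]);   // 3, 12, 6, 15, 9 (row sums of A)
}
```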
B. Determination of the Tangent Matrix Structure

Before the matrix S^{(k)}_{n+1} can be assembled, its structure, i.e., the coordinates of the non-zero elements, has to be determined. In FEAP, this is done in the comproa and comprob subroutines. First, comproa counts the number of non-zeros, and an array of sufficient length for storing all coordinates is allocated. Then, comprob fills this array with the coordinates.
For the execution on the GPU, the memory accesses are rearranged in this process so that the outer loop, which iterates over the equations, has no data dependencies between its iterations. This way, it is possible to process as many equations in parallel as there are threads running on the GPU. A configuration of 6 blocks per multiprocessor and 512 threads per block, i.e., 16 threads per CUDA core and block, is found to result in the fastest execution. Similar to the CPU code, the GPU implementation first counts the number of non-zeros, so that an array for the coordinates can be allocated. However, instead of only incrementing a counter (like comproa does), it also stores the number of non-zeros for each equation in an array. Then, the prefix sums over this array are computed, resulting in an array of offsets into the matrix element coordinates array. This is necessary so that all threads can write the locations computed by them in parallel to a contiguous memory area.
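The counting-and-offsets step can be illustrated with a short serial C++ sketch: the per-equation non-zero counts are turned into write offsets by an exclusive prefix sum, so that every equation knows where to place its coordinates in the shared array. The example counts are made up, and countNonzeros is only a stand-in for the per-equation connectivity analysis, not FEAP's actual comproa logic; in the GPU implementation the count loop runs with one thread per equation and the scan is performed by a parallel prefix-sum primitive.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Stand-in: number of non-zero matrix entries contributed by each equation.
// In the real code this is determined from the element connectivity.
std::vector<int> countNonzeros() { return {3, 5, 2, 4}; }

int main() {
    std::vector<int> counts = countNonzeros();

    // Exclusive prefix sum: offsets[i] is where equation i starts writing its
    // coordinates, so all equations can write in parallel without clashing.
    std::vector<int> offsets(counts.size() + 1, 0);
    std::partial_sum(counts.begin(), counts.end(), offsets.begin() + 1);

    // offsets.back() is the total number of non-zeros, i.e. the length of the
    // coordinate array that has to be allocated.
    for (std::size_t i = 0; i < counts.size(); ++i)
        std::printf("equation %zu writes %d entries at offset %d\n", i, counts[i], offsets[i]);
    std::printf("total non-zeros: %d\n", offsets.back());
}
```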
C. Tangent Matrix Completion and Iterative Solution

The PCG method implemented in CUSP needs the full matrix S^{(k)}_{n+1} even if it is symmetric. However, FEAP generally only assembles a triangular matrix for symmetric problems. Therefore S^{(k)}_{n+1} has to be completed to a full matrix. The assembling part is still performed on the CPU, but the present implementation completes the matrix on the GPU using the following steps:
1) Copy the matrix to GPU memory.
2) Convert from CSR format to COO format for simpler processing.
3) Append two copies of the triangular matrix, with the row and column indices swapped in the second copy.
4) Set the values of the diagonal elements in the second copy to zero.
5) Sort the entries by row and column.
6) Reduce (by addition) elements with identical coordinates in order to remove the duplicate diagonal entries.
This algorithm essentially performs the calculation
$$S_{\text{full}} := S_{\text{tri}} + \left(S_{\text{tri}}^T - \mathrm{diag}(S_{\text{tri}})\right). \qquad (17)$$
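A serial C++ sketch of these steps (omitting the CPU-GPU copy and the format conversion) is given below: the transposed triples are appended with a zeroed diagonal, the result is sorted by coordinates, and duplicates are reduced, which realizes (17) on COO data. The data structures and names are illustrative only; the GPU version expresses the same append/sort/reduce sequence with parallel primitives.

```cpp
#include <algorithm>
#include <cstdio>
#include <tuple>
#include <vector>

struct Entry { int row, col; double val; };      // one COO triple

// Complete a triangular matrix, stored as COO triples, to the full symmetric
// matrix: S_full = S_tri + (S_tri^T - diag(S_tri)), cf. (17).
std::vector<Entry> completeSymmetric(const std::vector<Entry>& tri) {
    std::vector<Entry> full = tri;
    for (const Entry& e : tri)                                        // append transposed copy ...
        full.push_back({e.col, e.row, e.row == e.col ? 0.0 : e.val}); // ... with zeroed diagonal
    std::sort(full.begin(), full.end(), [](const Entry& a, const Entry& b) {
        return std::tie(a.row, a.col) < std::tie(b.row, b.col);       // sort by (row, col)
    });
    std::vector<Entry> reduced;                   // add up entries with equal coordinates
    for (const Entry& e : full) {
        if (!reduced.empty() && reduced.back().row == e.row && reduced.back().col == e.col)
            reduced.back().val += e.val;
        else
            reduced.push_back(e);
    }
    return reduced;
}

int main() {
    // Lower triangle of the 2x2 symmetric matrix [[4, 1], [1, 3]].
    std::vector<Entry> tri = {{0, 0, 4.0}, {1, 0, 1.0}, {1, 1, 3.0}};
    for (const Entry& e : completeSymmetric(tri))
        std::printf("(%d,%d) = %g\n", e.row, e.col, e.val);
    // prints (0,0)=4, (0,1)=1, (1,0)=1, (1,1)=3
}
```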
Although this could be done in a simpler way on the CPU, the above steps are necessary to allow a parallel execution on the GPU.

IV. BENCHMARKS

The benchmarks are performed on a 4-core AMD Athlon II X4 645 processor with 8 GiB of 1333 MHz DDR3-SDRAM, and an NVIDIA Tesla C2075 with 6 GiB memory. All matrix computations were done with 64-bit double precision floating-point numbers on both CPU and GPU.

A. Linear Elastic Cube
Figure 3. Benchmark problem: linear elastic cube with edge length a = 10 [mm] that is discretized by a Cartesian grid of N×N×N elements. The cube is loaded by a displacement load u* = 2 [mm] e₃ at the surface x₃ = a while the displacements at the surface x₃ = 0 are all set to zero. The material parameters are Young's modulus E = 210000 [MPa] and Poisson's ratio ν = 0.3.
In a first step, a simple linear elastic cube serves as a benchmark problem (see Fig. 3). This example does not include the phase field model from Section II. Fig. 4 shows the measured runtimes with respect to the number of elements per edge N. The GPU acceleration greatly decreases the overall computing time in comparison to standard FEAP, i.e. speedup ∼ 4, where the speedup is defined as
$$S_p = \frac{T_1}{T_2} \qquad (18)$$
Figure 4. Runtimes (in s) over the number of elements per edge N for the linear elastic cube example. The plot shows the total runtimes for standard FEAP on the one hand and for an analysis employing the GPU-accelerated setup on the other hand. Furthermore, the particular runtimes for matrix structure calculation and solving of the SLE are displayed for the GPU-accelerated setup.

Table I. Time in seconds for matrix structure computation for N = 80 on GPU and CPU, for full and for triangular matrices.

                full    triangular
    GPU          1.7       0.8
    CPU         26        11
    Speedup     15.3      13.7
with the serial computing time on a single processor T₁ and the computing time for the GPU-accelerated implementation T₂. The overall computing time for the GPU-accelerated implementation includes serial tasks remaining on the CPU as well as the matrix structure computation and the solution of the SLE, which are outsourced to the GPU.
In addition, the effect of the GPU implementation on the reimplemented parts is investigated exclusively. For that purpose, the computing time is measured for the matrix structure computation and the solving of the SLE in particular. Table I shows the time needed for computing the matrix structure in the case of N = 80. As can be seen, the GPU implementation achieves a speedup of about 14. In fact, it is so fast that it takes only 2-3% of the total runtime, so any further improvements would yield negligible gain. For N = 80, the GPU solver, including triangular matrix completion, runs for 8.2 seconds (of 39 seconds total runtime), resulting in a speedup of about 10 compared to FEAP's CG solver, which requires 88 seconds.

B. Crack Path Computation in 3D

To illustrate the advantages of GPU computing in a more practical example, a specimen with initial cracks on two sides is considered (see Fig. 5). Subsequently, the evolving crack pattern is computed with the present phase field approach. The following dimensionless material data and numerical settings are used in the simulations:
• Material: λ = μ = 10⁵, η = 10⁻⁴, ρ = 1.0, Gc = 1.0 and ε = 0.1
• Meshing: 20×60×80 evenly distributed elements
• Newmark parameters: γ = β = 0.5
• Load: linearly increasing, predefined displacement u*(t) = 0.0075t e₂ on the surfaces x₂ = const, where e₂ is the unit vector in x₂-direction
Figure 5. Specimen with initial cracks on two sides.
Fig. 6 shows the crack path that is computed by both implementations, i.e. unmodified FEAP and the GPU-accelerated implementation of FEAP. One can observe the typical thumb shape of the crack front, which is a result of different mechanical states in the bulk and at the surfaces.
In order to compare the GPU performance to standard CPU computations, the analysis is carried out with unmodified FEAP in the first step and then repeated for the GPU-accelerated set-up shown in Fig. 2. In both cases the PCG method with diagonal preconditioning is utilized. The runtime for unmodified, serial FEAP is 86494 s (∼ 24 h) whereas the GPU-accelerated set-up reduced the overall runtime to only 7877 s (∼ 2.2 h). This results in a speedup of ∼ 10, which is significantly greater than the total speedup observed in the linear elastic cube benchmark. Therefore the reimplemented parts, i.e. matrix structure computation and solving of an SLE, must play a bigger role in such computations compared to their effect on the overall computing time in the elastic cube benchmark problem.
Figure 6. Crack path for the 3D crack problem at times t = 0.500, t = 0.560 and t = 0.562. The crack is visualized as an isovolume where the crack field satisfies 0 < s < 0.1.

V. CONCLUSION

This paper outlines a GPU-accelerated finite element implementation of a phase field model for dynamic brittle fracture. The phase field model is an extension of the quasi-static model presented in [3] to the dynamic case. The main impact of this work is the GPU-acceleration of the finite element code FEAP by means of NVIDIA's CUDA and the open source library CUSP.
GPU-acceleration affects two main tasks in the FEAP workflow: the computation of the tangent matrix structure and the solution of the system of linear equations (SLE). These tasks are chosen because they take up a considerable amount of runtime and are well suited for parallelization. Compared to a computation on a single CPU, the GPU-accelerated implementation achieves a speedup of about 14 for the matrix structure computation and a speedup of 10 for the solving of the SLE in the first benchmark problem. The second benchmark problem assesses the performance of the GPU-accelerated implementation for a 3D crack problem. Here, the present phase field model for dynamic brittle fracture is used to find the crack pattern. In this case, the total runtime is decreased to less than one-tenth of the total runtime on a single CPU. The GPU-accelerated implementation of FEAP greatly improves the speed of finite element analyses in comparison to the state of the art, which is in many cases running FEAP on a single CPU. This allows for more efficient finite element studies of computationally expensive problems like phase field fracture modelling in 3D.

REFERENCES

[1] R. Krueger, "Virtual crack closure technique: History, approach, and applications," Appl. Mech. Rev., vol. 57, no. 2, p. 109, 2004.
[2] N. Moës, J. Dolbow, and T. Belytschko, "A finite element method for crack growth without remeshing," Int. J. Numer. Meth. Eng., vol. 46, no. 1, pp. 131-150, 1999.
[3] C. Kuhn and R. Müller, "A continuum phase field model for fracture," Eng. Fract. Mech., vol. 77, no. 18, pp. 3625-3634, 2010, Computational Mechanics in Fracture and Damage: A Special Issue in Honor of Prof. Gross.
[4] M. Hofacker and C. Miehe, "A phase field model of dynamic fracture: Robust field updates for the analysis of complex crack patterns," Int. J. Numer. Meth. Eng., vol. 93, no. 3, pp. 276-301, 2013.
[5] M. J. Borden, C. V. Verhoosel, M. A. Scott, T. J. R. Hughes, and C. M. Landis, "A phase-field description of dynamic brittle fracture," Comput. Meth. Appl. Mech. Eng., vol. 217-220, pp. 77-95, 2012.
[6] L. Polok and P. Smrz, "Fast linear algebra on GPU," in High Performance Computing and Communications (HPCC). Liverpool, United Kingdom: IEEE Computer Society, 2012, pp. 439-444.
[7] K. Kumar Matam and K. Kothapalli, "Accelerating sparse matrix vector multiplication in iterative methods using GPU," in International Conference on Parallel Processing (ICPP). Taipei, Taiwan: IEEE Computer Society, 2011, pp. 612-621.
[8] B. Bourdin, "Numerical implementation of the variational formulation of quasi-static brittle fracture," Interfaces Free Bound., vol. 9, pp. 411-430, 2007.
[9] H. Amor, J.-J. Marigo, and C. Maurini, "Regularized formulation of the variational brittle fracture with unilateral contact: Numerical experiments," J. Mech. Phys. Solid., vol. 57, no. 8, pp. 1209-1229, 2009.
[10] J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, ser. Texts in Applied Mathematics. Springer, 2002, vol. 3.
[11] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ser. SC '09. New York, NY, USA: ACM, 2009, pp. 18:1-18:11. [Online]. Available: http://doi.acm.org/10.1145/1654059.1654078
[12] R. G. Grimes, D. R. Kincaid, and D. M. Young, ITPACK 2.0 User's Guide. Center for Numerical Analysis, The University of Texas at Austin, 1979.