Journal of Applied Crystallography
A new parallel and GPU version of a TREOR-based algorithm for indexing powder diffraction data
ISSN 1600-5767
Received 25 July 2014; accepted 1 December 2014
Ivan Šimeček,a* Jan Rohlíček,b Tomáš Zahradnickýa and Daniel Langra
aCzech Technical University in Prague, Faculty of Information Technology, Department of Computer Systems, Thákurova 9, 160 00 Prague 6, Czech Republic, and bInstitute of Physics AS CR, v. v. i., Na Slovance 2, 182 21 Prague 8, Czech Republic. Correspondence e-mail:
[email protected]
© 2015 International Union of Crystallography
One of the key parts of the crystal structure solution process from powder diffraction data is indexing – the determination of the lattice parameters from experimental data. This paper presents a modification of the TREOR indexing method that makes the algorithm suitable and efficient for execution on graphics processing units (GPUs). The TREOR algorithm was implemented in its pure form, which can be simply described as a ‘brute-force’ approach. The effectiveness and time consumption of such an algorithm were tested on several data sets, including monoclinic and triclinic examples. The results show the potential of using GPUs for indexing powder diffraction data.
1. Introduction

In general, crystal structure determination from diffraction data involves the following stages: (a) determination of the unit cell, which is called indexing, (b) space-group assignment, (c) structure solution and (d) structure refinement. Thus, indexing represents the starting point of the crystal structure determination process. Indexing of high-resolution powder diffraction data is a well solved problem, but in contrast to the recent advances in techniques for crystal structure determination (Černý & Favre-Nicolin, 2007; Markvardsen et al., 2002; Altomare et al., 2013; David et al., 2002), there has been relatively little fundamental development of indexing methods since the basic works were published over 20 years ago. Many different approaches have been used to develop the algorithms, such as trial-and-error methods (TREOR; Werner et al., 1985), dichotomy methods (DICVOL; Boultif & Louër, 1991), grid search or Monte Carlo methods (McMaille; Le Bail, 2004), and other approaches [ITO (Visser, 1969), SVD-Index (Coelho, 2003), X-Cell (Neumann, 2003)]. Most of these algorithms have been continuously improved from their initial to current states; for example, TREOR90 evolved into N-TREOR09 (Altomare et al., 2009) and DICVOL into DICVOL06 (Louër & Boultif, 2007), but even the newer versions inherit some of the drawbacks of their predecessors. In particular, they do not exploit the full computational power offered by current computers when searching for the unit-cell parameters. Today’s computer hardware allows for the implementation of indexing algorithms in a nonstandard way by utilizing brute-force methods. Thus, several problematic situations not fully covered by the existing methods can be solved.
doi:10.1107/S1600576714026466
1.1. Related work
There have already been efforts to implement new approaches to the indexing process, which have resulted in the McMaille (Le Bail, 2004), X-Cell (Neumann, 2003) and N-TREOR09 (Altomare et al., 2009; Werner et al., 1985) software packages. McMaille utilizes a simulated annealing search or even an exhaustive grid search. X-Cell uses dichotomy methods; it solves, in a sophisticated way, the impurity-phase problem and utilizes space-group reflection extinctions. N-TREOR09, the software most closely related to our work, is also based on the trial-and-error approach. As far as we know, there are only a few papers related to the utilization of general-purpose graphics processing unit (GPGPU) programming in different crystallographic areas, for example, in single-particle electron microscopy (Schmeisser et al., 2009), in computation of fast Fourier transforms for crystal structure solution and refinement (Shalaby & Oliveira, 2013), in computation of diffuse scattering patterns and application to magnetic neutron scattering (Gutmann, 2010), in computing of scattering maps of nanostructures (Favre-Nicolin et al., 2011), in real-space calculation of powder diffraction patterns (Gelisio et al., 2010), and in implementation of a dichotomy method suitable for a GPU (Šimeček, 2013). There are also works on GPU-accelerated preprocessing of data (Sauter et al., 2004, 2013), but none of them applies GPGPU to the indexing process itself.

1.2. Technologies used
Brute-force algorithms try to find a solution by exploring all possible solution combinations, so significant computing power is necessary. Such power can be attained by parallel computing, and this paper considers two possible alternatives.
The first uses parallel processing in a shared-memory multiprocessing environment with OpenMP (Chapman et al., 2007), while the second uses parallel computation provided by modern graphics processing units via the CUDA (https://developer.nvidia.com/cuda-tools-ecosystem) architecture.

1.2.1. OpenMP. OpenMP is a cross-platform standard for parallel processing. The OpenMP API (application programming interface) specification is defined as a collection of compiler directives, library routines and environment variables extending the C, C++ and Fortran programming languages. They can be used to create portable parallel programs utilizing shared memory. The process of parallelization is, however, not automated; the programmer is responsible for the correct usage of the OpenMP API and for the avoidance of race conditions, deadlocks and other data-consistency issues related to the shared-memory environment. The core of OpenMP is the fork–join execution model. A typical OpenMP application starts with a single thread and spawns additional threads and/or uses other available computation resources to perform parallel tasks. Each program can be compiled as either sequential (serial) or OpenMP parallel by using an appropriate compiler command-line option. This, however, does not mean that the program will produce correct results in both versions, as this is the responsibility of the programmer.

1.2.2. CUDA. CUDA is a common general-purpose GPU technology. It is an API for GPGPU programming mainly focused on performance through massive parallelism. Programmers are required to have more knowledge of the underlying hardware than is usual in CPU-based parallel programming. Parallel programs written in CUDA are, however, limited to NVIDIA hardware. The CUDA parallel computing approach differs from that of the CPU in several aspects. Whereas current CPUs are made to accommodate several threads (typically one or two threads per physical core), graphics processors are designed to run thousands of threads. GPU threads are organized into equally sized blocks, which are then mapped onto hardware building blocks of the GPU called streaming multiprocessors.
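As a generic illustration of this execution model (not taken from the program described in this paper), the following CUDA C++ fragment launches a grid of equally sized thread blocks, each thread processing one array element; the kernel name, block size and problem size are purely illustrative.

#include <cuda_runtime.h>

// Each thread squares one element; blocks of 256 threads are mapped by the
// hardware onto the streaming multiprocessors of the GPU.
__global__ void squareKernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1 << 20;
    float *dIn, *dOut;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    // ... fill dIn with input data (omitted) ...
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    squareKernel<<<blocks, threadsPerBlock>>>(dIn, dOut, n);
    cudaDeviceSynchronize();
    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}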
2. Current state of the art

2.1. TREOR
The direct lattice parameters a, b, c, α, β and γ define the parameter space of the problem, and the goal of the indexing process is to find their correct values (or values of a subset of these parameters, depending on the crystal system) for a given experimental powder diffraction pattern. For each observed line, its interplanar spacing, denoted by d_{hkl}, can be computed and consequently used in indexing in the form of Q_{hkl}, which is denoted as 1/d_{hkl}^2:

Q_{hkl} = h^2 A_{11} + k^2 A_{22} + l^2 A_{33} + hk A_{12} + hl A_{13} + kl A_{23},    (1)

where hkl are Miller indices and A_{xy} represents a set of reciprocal unit-cell parameters.
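For illustration, equation (1) maps directly onto a small helper function; the struct layout and the names below are hypothetical and only sketch the computation, assuming the six reciprocal-cell coefficients A_{11} ... A_{23} are already known.

// Reciprocal unit-cell coefficients appearing in equation (1) (hypothetical layout).
struct ReciprocalCell {
    double A11, A22, A33, A12, A13, A23;
};

// Q value of a reflection with Miller indices (h, k, l) according to equation (1).
double computeQ(const ReciprocalCell &A, int h, int k, int l)
{
    return h * h * A.A11 + k * k * A.A22 + l * l * A.A33
         + h * k * A.A12 + h * l * A.A13 + k * l * A.A23;
}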
The idea of the TREOR method is based on searching for the correct Miller indices of the selected observed Q values. For a set of observed lines, different combinations of Miller indices are tested by a simple trial-and-error method. If the set of observed lines is represented by the vector Y and the square matrix M contains the Miller indices, the proposed unit-cell parameters A are calculated from

A = M^{-1} Y.    (2)
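In practice, equation (2) amounts to solving a small s × s linear system M A = Y rather than inverting M explicitly. The sketch below, with hypothetical names and a plain Gaussian elimination with partial pivoting, illustrates one such trial for the triclinic case (s = 6); each row of M holds the coefficients (h^2, k^2, l^2, hk, hl, kl) of equation (1) for one assigned line, and the solution vector x corresponds to the proposed parameters A.

#include <array>
#include <cmath>
#include <utility>

constexpr int S = 6;                       // unknowns for a triclinic cell
using Vec = std::array<double, S>;
using Mat = std::array<std::array<double, S>, S>;

// Row of M built from one trial assignment of Miller indices (h, k, l).
Vec millerRow(int h, int k, int l)
{
    return {double(h * h), double(k * k), double(l * l),
            double(h * k), double(h * l), double(k * l)};
}

// Solve M x = y by Gaussian elimination with partial pivoting.
// Returns false if M is (numerically) singular, i.e. the trial is rejected.
bool solve(Mat M, Vec y, Vec &x)
{
    for (int col = 0; col < S; ++col) {
        int pivot = col;
        for (int r = col + 1; r < S; ++r)
            if (std::fabs(M[r][col]) > std::fabs(M[pivot][col])) pivot = r;
        if (std::fabs(M[pivot][col]) < 1e-12) return false;
        std::swap(M[col], M[pivot]);
        std::swap(y[col], y[pivot]);
        for (int r = col + 1; r < S; ++r) {
            double f = M[r][col] / M[col][col];
            for (int c = col; c < S; ++c) M[r][c] -= f * M[col][c];
            y[r] -= f * y[col];
        }
    }
    for (int r = S - 1; r >= 0; --r) {     // back substitution
        double s = y[r];
        for (int c = r + 1; c < S; ++c) s -= M[r][c] * x[c];
        x[r] = s / M[r][r];
    }
    return true;
}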
The dimensions of the Y vector and the rank of the square matrix M have to be equal to the number of unknown variables s (e.g. for the triclinic unit cell, s = 6). For each trial unit cell, the theoretical positions of all allowed Miller indices are calculated and compared with the remaining observed lines of the powder diffraction pattern. If the defined criteria are satisfied, the unit cell is saved. This simple procedure works nicely for all symmetries, but there is a problem with the number of trials to be performed for low symmetries, especially for triclinic systems, where the theoretical number of trials can be calculated as
STEPS × C(n_hkl, s) × s!,    (3)
where STEPS is the number of random combinations of s distinct lines taken into account, n_hkl is the number of hkl triples of all allowed Miller indices, s is the number of unknown unit-cell parameters and C(n_hkl, s) denotes the binomial coefficient. For a triclinic unit cell, with only one set of six observed lines and maximal values of h, k and l equal to 3, the number of trials is approximately 10^12, which makes the algorithm time consuming and practically unusable. This fact compelled the authors of previous versions of the TREOR algorithm to reduce the number of trials. One of the improvements of the algorithm was an estimation of the unit-cell volume directly from the observed lines without knowledge of the unit-cell parameters. Another improvement was the dominant-zone test, which dramatically reduced the number of trials. These improvements enable TREOR to find the unit-cell parameters in a relatively short time, even for triclinic cells.
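A back-of-the-envelope evaluation of equation (3) shows why the pure search explodes for low symmetries; the small helper below is only illustrative, and the value of n_hkl depends entirely on the user-chosen limits on h, k and l.

#include <cstdio>

// Binomial coefficient C(n, s), computed iteratively to avoid overflow of n!.
double binomial(int n, int s)
{
    double result = 1.0;
    for (int i = 1; i <= s; ++i)
        result *= double(n - s + i) / double(i);
    return result;
}

// Theoretical number of TREOR trials according to equation (3).
double numberOfTrials(int steps, int n_hkl, int s)
{
    double perms = 1.0;
    for (int i = 2; i <= s; ++i) perms *= i;   // s!
    return steps * binomial(n_hkl, s) * perms;
}

int main()
{
    // Example: one set of six observed lines (STEPS = 1), s = 6 unknowns
    // and an assumed n_hkl of a few hundred allowed hkl triples.
    std::printf("%g trials\n", numberOfTrials(1, 200, 6));
    return 0;
}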
3. GPGPU TREOR algorithm and its improvements
A detailed description of the TREOR procedure that we implemented for the GPGPU is given in algorithm 1. The weakness and also the strength of such an algorithm is its pure implementation of the TREOR idea. The weakness is the very large number of trial tests in the case of a monoclinic or triclinic unit cell, but the advantage of such an implementation is that the algorithm tests all possible combinations defined by the user at the beginning of the indexing process.
The algorithm uses the following global parameters: threshold, the minimal number of indexed elements; threshold2, the maximal total error of the indexed elements; threshold3, the minimal value of the figure of merit for an acceptable solution; and DB2, which stores all tested unit-cell parameters. The algorithm returns a database DB containing potential solutions. The function no_combs(n_hkl, s) denotes the number of combinations of all possible Miller indices (n_hkl) of size s, while the no_perms(RHS) function denotes the number of permutations of the RHS (right-hand side) vector elements.
In algorithm 2, ε denotes the maximal tolerance of the Q values, n_hkl is the number of hkl triples taken into account, and N is the number of observed peaks (the size of the input data set). The function returns a pair of values (indexed, sum_error), where indexed is the number of successfully indexed elements of Qm (obviously indexed ≤ N), while sum_error stands for the sum of the absolute errors between Qt and Qm of the successfully indexed elements (obviously sum_error ≤ ε · indexed).

In real cases, the most time-consuming tasks are the calls to the EvaluateError subroutine, while the calculation of equation (2) is very fast. For this reason, these two tasks were separated into distinct loops. In the first loop of algorithm 1 (code lines 4–13), the database of suggested unit cells, obtained by solving equation (2) for all combinations of Miller indices with positions of observed lines, is created and saved into a temporary buffer DB2. The buffer is then used within the second loop (code lines 14–21), which evaluates the correctness of each suggested unit cell. If a unit cell is acceptable, it is consequently refined. If its figure of merit (FOM) is higher than a user-defined limit, the solution is saved.

In the shared-memory environment, the parallelization is implemented as follows:
(1) All iterations of the i loop (code line 4) can proceed in parallel. There is only a single critical section: the insertion of items into DB2.
(2) All iterations of the i loop (code line 14) can proceed in parallel. There is also only one critical section: the insertion of items into DB.
For GPU computing, additional changes need to be made:
(1) The buffer DB2 is copied from the CPU memory into the GPU memory.
(2) The GPU executes EvaluateError for all items in DB2.
(3) The results are copied back to the CPU memory.
Other routines, such as RefineCell and CalculateFOM, are executed on the CPU, as they are not suitable for GPU execution because they do not respect the limitations of the GPU programming model.
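The structure described above can be sketched as follows. This is not the authors' algorithm 1 verbatim (the listing is not reproduced here); it is a simplified C++/OpenMP outline under assumptions, with hypothetical names (TrialCell, SolveEquation2, EvaluateError, RefineCell, CalculateFOM given as declarations only) and with the CUDA offload of EvaluateError indicated only by comments.

#include <array>
#include <utility>
#include <vector>

struct TrialCell {
    std::array<double, 6> A{};   // reciprocal-cell parameters from equation (2)
    int indexed = 0;
    double sumError = 0.0;
};

// Hypothetical helpers standing in for the routines named in the text.
bool SolveEquation2(const std::array<std::array<int, 3>, 6> &hkl,
                    const std::array<double, 6> &Y, std::array<double, 6> &A);
std::pair<int, double> EvaluateError(const std::array<double, 6> &A,
                                     const std::vector<double> &Qm,
                                     double eps, int nHkl);
void RefineCell(TrialCell &cell, const std::vector<double> &Qm);
double CalculateFOM(const TrialCell &cell, const std::vector<double> &Qm);

std::vector<TrialCell> IndexPattern(
    const std::vector<double> &Qm,
    const std::vector<std::array<std::array<int, 3>, 6>> &hklCombos,
    const std::vector<std::array<double, 6>> &rhsPerms,
    double eps, int nHkl, int threshold, double threshold2, double threshold3)
{
    std::vector<TrialCell> DB2;   // all suggested unit cells (first loop)
    std::vector<TrialCell> DB;    // accepted solutions (second loop)

    // First loop: solve equation (2) for every combination of Miller indices
    // and every permutation of the chosen observed lines (the RHS vector).
    #pragma omp parallel for
    for (long i = 0; i < (long)hklCombos.size(); ++i) {
        for (const auto &Y : rhsPerms) {
            TrialCell cell;
            if (!SolveEquation2(hklCombos[i], Y, cell.A)) continue;
            #pragma omp critical(db2)   // the only critical section of this loop
            DB2.push_back(cell);
        }
    }

    // Second loop: evaluate each suggested cell.  In the GPU version, DB2 is
    // copied to device memory, EvaluateError runs as a CUDA kernel over all
    // items and the (indexed, sumError) pairs are copied back before refinement.
    #pragma omp parallel for
    for (long i = 0; i < (long)DB2.size(); ++i) {
        TrialCell cell = DB2[i];
        auto err = EvaluateError(cell.A, Qm, eps, nHkl);
        cell.indexed = err.first;
        cell.sumError = err.second;
        if (cell.indexed < threshold || cell.sumError > threshold2) continue;
        RefineCell(cell, Qm);                               // CPU-only routine
        if (CalculateFOM(cell, Qm) < threshold3) continue;  // CPU-only routine
        #pragma omp critical(db)        // the only critical section of this loop
        DB.push_back(cell);
    }
    return DB;
}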
4. Testing the implemented method

4.1. Experiment configuration
We have implemented all algorithms in C/C++ using OpenMP and the CUDA API for the evaluation of performance and scalability.
4.1.1. HW and SW configurations.
(a) Testing configuration 1: some experiments were performed on a small university cluster called ‘star’. Each star node is an IBM BladeCenter LS22 module with the following configuration: (i) two AMD Opteron 6C Processor Model 2435 CPUs (2.6 GHz, 6 MB L3 cache; 12 computing cores in total); (ii) 8 GB RAM PC2-6400 CL6 ECC DDR2 800 VLP RDIMM; (iii) Gentoo Linux version 4.4.3-r2 p1.2 operating system; (iv) C compiler (gcc) and C++ compiler (g++), version 4.4.3 with the -O3 switch; (v) Sun Grid Engine job scheduler.
(b) Testing configuration 2: some experiments were performed on a GPU station (Gentoo Linux version 4.4.3, CUDA SDK version 5.5) with a common low-cost CPU (Intel Core i5-760, four cores at 2.80 GHz), 8 GB RAM at 1333 MHz
and an old yet still powerful GPU: an NVIDIA GeForce GTX 480.
(c) Testing configuration 3: some experiments were performed on a GPU station (Gentoo Linux version 4.4.3, CUDA SDK version 5.5) with a common CPU (Intel Core i7-950, four cores at 3.07 GHz), 24 GB RAM at 1600 MHz and one of the fastest current GPUs: an NVIDIA Tesla K40c.
4.1.2. Indexing data files. The following data files were used:
(a) Ortho1: test1b.dat [Cd3(OH)5(NO3), N = 20, orthorhombic crystal system, correct solution a = 3.4203 (3), b = 10.0292 (6), c = 11.0295 (6) Å, as reported by Plévert et al. (1989)];
(b) Mono1: cim.dat [cimetidine, C10H16N6S, N = 21, monoclinic crystal system, correct solution a = 6.821 (1), b = 18.818 (3), c = 10.374 (2) Å, β = 106.42 (1)°, as reported by Hadicke et al. (1978)];
(c) Mono2: Taxol.dat [C45H49NO13·3C4H8O2, N = 20, monoclinic crystal system, correct solution a = 16.329 (2), b = 17.704 (2), c = 17.504 (1) Å, β = 100.61 (1)°, as reported by Gao & Parker (1996)];
(d) Mono3: vanil.dat [C19H14O6, N = 24, monoclinic crystal system, correct solution a = 14.3181 (4), b = 8.04071 (9), c = 13.5524 (3) Å, β = 100.3559 (13)°, as reported by Ghouili et al. (2014)];
(e) Tri1: pw-7.dat [C22H28CuIN2O4, N = 48, triclinic crystal system, correct solution a = 7.73, b = 11.29, c = 14.33 Å, α = 80.28, β = 81.26, γ = 75.88°];
(f) Tri2: test6b.dat [K2(S2O8), N = 25, triclinic crystal system, correct solution a = 5.115 (1), b = 7.034 (2), c = 5.505 (1) Å, α = 106.32 (2), β = 90.18 (2), γ = 106.12 (2)°, as reported by Naumov et al. (1997)].
Input restrictions for TREOR are given in Table 1. For orthorhombic and monoclinic cells the values (the maximal values of the hkl triplets) used for time measurement are much higher than the minimal possible values.
Table 1
Input restrictions for TREOR: minimal possible values of the hkl triples and the values used for time measurement.

Data file   Min. hkl   Min. |h| + |k| + |l|   Used hkl   Used |h| + |k| + |l|
Ortho1      113        4                      333        5
Mono1       121        3                      222        4
Mono2       111        2                      222        4
Mono3       112        3                      222        4
Tri1        112        3                      112        3
Tri2        111        3                      112        3
Table 2
Measured total times (in seconds) for the indexing on testing configuration 1 (TC1). The symbol ′ denotes the time spent in the EvaluateError subroutine.

Data file   TC1 (1 thread)   TC1 (12 threads)   TC1′ (1 thread)   TC1′ (12 threads)
Ortho1      7.80             0.77               7.47              0.75
Mono1       25.76            2.64               25.10             2.24
Mono2       10.75            1.20               10.31             1.12
Mono3       58.44            5.94               49.30             5.97
Tri1        4236.25          554.50             4265.58           542.61
Tri2        1050.64          97.62              1053.44           93.35
Table 3
Measured total times (in seconds) for the indexing on testing configuration 3 (TC3). The symbol ′ denotes the time spent in the EvaluateError subroutine.

Data file   TC3 (1 thread)   TC3 (4 threads)   TC3′ (1 thread)   TC3′ (4 threads)
Ortho1      5.30             1.05              5.00              1.01
Mono1       17.50            3.60              16.80             3.00
Mono2       7.30             1.64              6.90              1.50
Mono3       39.70            8.10              33.00             8.03
Tri1        2877.70          756.02            2855.08           726.61
Tri2        713.70           133.10            705.10            125.02
Table 4
Measured times (in seconds) for the indexing with GPU support. The symbol ′ denotes the time spent in the EvaluateError subroutine for testing configurations 2 and 3.

Data file   TC2      TC3     TC2′     TC3′
Ortho1      1.19     0.54    0.70     0.10
Mono1       2.53     0.76    0.60     0.10
Mono2       2.09     1.20    0.30     0.20
Mono3       9.08     1.74    1.10     0.30
Tri1        127.63   38.66   105.38   23.51
Tri2        30.30    10.46   20.08    6.03
4.2. Evaluation of results
Measured times for the single-threaded and multi-threaded variants are given in Tables 2 and 3. The speedup due to multi-threaded execution ranges from 7.7 to 10.8 for testing configuration 1 and from 3.8 to 5.4 for testing configuration 3. We can conclude that this application scales very well. Measured times for the GPU version are presented in Table 4. A comparison with the CPU version is difficult because the GPU accelerates only part of the computation (the EvaluateError subroutine). The speedup of this subroutine ranges from 4.4 to 11.1 (testing configuration 2 with the CPU versus testing configuration 2 with the GPU) or from 10 to 30 (testing configuration 3 with the CPU versus testing configuration 3 with the GPU). The overall speedup due to the GPU acceleration ranges from 1.2 to 6.4 for testing configuration 2 and from 1.9 to 19.6 for testing configuration 3. We can conclude that the GPU version gains a significant speedup over the multi-threaded version even on a relatively old GPU (mainly for the triclinic search).
4.3. Comparison with N-TREOR09

We found the idea behind the TREOR program rewarding, and an interest in extending it further to handle data of low-symmetry (monoclinic and triclinic) systems better has remained. Our program is not an extension of the existing programs; it is built from the ground up. In the current version, it is a ‘pure’ implementation of the TREOR idea. The main difference from the previous versions of the TREOR algorithm lies in the basic idea that the program simply searches all combinations. Instead of trying to encode crystallographic experience, the program aims at maximal hardware utilization (including multi-threaded execution and the possibility of using a GPU), and therefore the achieved computing times are at an acceptable level even for triclinic cases.
5. Conclusions

The evolution of GPUs is opening up new opportunities for the acceleration of computations used in crystallography. The main objective of this paper was to test the feasibility of a GPU-accelerated algorithm for the indexing of powder diffraction data. We have presented another implementation of the existing TREOR algorithm, which can be simply described as a ‘brute-force’ approach. The massive parallelism of modern graphics processing units is fully utilized, and a significant speedup of the computation was achieved. The computing times achieved by the GPU implementation are at an acceptable level even for triclinic cases. Several tests on real examples of different complexity show that this approach is viable, so the GPU represents a cost-effective solution for the indexing of powder diffraction data.
This research has been supported by SGS grant No. SGS14/106/OHK3/1T/18. This work was also supported by the IT4Innovations Centre of Excellence project (CZ.1.05/1.1.00/02.0070), funded by the European Regional Development Fund and the national budget of the Czech Republic via the Research and Development for Innovations Operational Programme, as well as the Czech Ministry of Education, Youth and Sports via the project Large Research, Development and Innovations Infrastructures (LM2011033).
References

Altomare, A., Campi, G., Cuocci, C., Eriksson, L., Giacovazzo, C., Moliterni, A., Rizzi, R. & Werner, P.-E. (2009). J. Appl. Cryst. 42, 768–775.
Altomare, A., Cuocci, C., Giacovazzo, C., Moliterni, A., Rizzi, R., Corriero, N. & Falcicchio, A. (2013). J. Appl. Cryst. 46, 1231–1235.
Boultif, A. & Louër, D. (1991). J. Appl. Cryst. 24, 987–993.
Černý, R. & Favre-Nicolin, V. (2007). Z. Kristallogr. 222, 105–113.
Chapman, B., Jost, G. & van der Pas, R. (2007). Using OpenMP: Portable Shared Memory Parallel Programming, Scientific and Engineering Computation Series. Cambridge: The MIT Press.
Coelho, A. A. (2003). J. Appl. Cryst. 36, 86–95.
David, W. I. F., Shankland, K., McCusker, L. B. & Baerlocher, Ch. (2002). Structure Determination from Powder Diffraction Data. Oxford Science Publications.
Favre-Nicolin, V., Coraux, J., Richard, M.-I. & Renevier, H. (2011). J. Appl. Cryst. 44, 635–640.
Gao, Q. & Parker, W. L. (1996). Tetrahedron, 52, 2291.
Gelisio, L., Azanza Ricardo, C. L., Leoni, M. & Scardi, P. (2010). J. Appl. Cryst. 43, 647–653.
Ghouili, A., Rohlíček, J., Ayed, T. B. & Hassen, R. B. (2014). Powder Diffr. 29, 361–365.
Gutmann, M. J. (2010). J. Appl. Cryst. 43, 250–255.
Hadicke, E., Frickel, F. & Franke, A. (1978). Chem. Ber. 111, 3222.
Le Bail, A. (2004). Powder Diffr. 19, 249–254.
Louër, D. & Boultif, A. (2007). Z. Kristallogr. Suppl. 2007, 191–196.
Markvardsen, A. J., David, W. I. F. & Shankland, K. (2002). Acta Cryst. A58, 316–326.
Naumov, D. Y., Virovets, A. V., Podberezskaya, N. V., Novikov, P. B. & Politov, A. A. (1997). J. Struct. Chem. 38, 772–777.
Neumann, M. A. (2003). J. Appl. Cryst. 36, 356–365.
Plévert, J., Louër, M. & Louër, D. (1989). J. Appl. Cryst. 22, 470–475.
Sauter, N. K., Grosse-Kunstleve, R. W. & Adams, P. D. (2004). J. Appl. Cryst. 37, 399–409.
Sauter, N. K., Hattne, J., Grosse-Kunstleve, R. W. & Echols, N. (2013). Acta Cryst. D69, 1274–1282.
Schmeisser, M., Heisen, B. C., Luettich, M., Busche, B., Hauer, F., Koske, T., Knauber, K.-H. & Stark, H. (2009). Acta Cryst. D65, 659–671.
Shalaby, E. M. & Oliveira, M. A. (2013). J. Appl. Cryst. 46, 594–600.
Šimeček, I. (2013). Adv. Intelligent Systems Comput. 188, 409–416.
Visser, J. W. (1969). J. Appl. Cryst. 2, 89–95.
Werner, P.-E., Eriksson, L. & Westdahl, M. (1985). J. Appl. Cryst. 18, 367–370.