© 2016 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Full citation: H. C. da Silva, F. Pisani and E. Borin, “A Comparative Study of SYCL, OpenCL, and OpenMP,” 2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), Los Angeles, CA, 2016, pp. 61-66. DOI: 10.1109/SBAC-PADW.2016.19. Keywords: C++ language; application program interfaces; message passing; parallel programming; program compilers; API functions; API methods; C++ programming model; CPU; GPU; OpenCL; OpenMP; SYCL; compilers; hardware accelerators; heterogeneous computing devices; programmability; runtime systems; standard C++ compilation frameworks; Benchmark testing; C++ languages; Kernel; MOS devices; Performance evaluation; Program processors; Programming; OpenCL; OpenMP; SYCL; parallel programming; performance evaluation; programmability evaluation. The following manuscript was accepted for publication by IEEE. The IEEE-published version can be found at: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7803697&isnumber=7803657.
A Comparative Study of SYCL, OpenCL, and OpenMP

Hércules Cardoso da Silva∗, Flávia Pisani†, and Edson Borin‡
Institute of Computing
University of Campinas (UNICAMP)
Campinas, SP, Brazil
Email: ∗[email protected], †[email protected], ‡[email protected]
Abstract—Recent trends indicate that future computing systems will be composed of a group of heterogeneous computing devices, including CPUs, GPUs, and other hardware accelerators. These devices provide increased processing performance; however, creating efficient code for them may require that programmers manage memory assignments and use specialized APIs, compilers, or runtime systems, thus making their programs dependent on specific tools. In this scenario, SYCL is an emerging C++ programming model for OpenCL that allows developers to write code for heterogeneous computing devices that is compatible with standard C++ compilation frameworks. In this paper, we analyze the performance and programming characteristics of SYCL, OpenMP, and OpenCL using both a benchmark and a real-world application. Our performance results indicate that programs that rely on available SYCL runtimes are not on par with the ones based on OpenMP and OpenCL yet. Nonetheless, the gap is getting smaller if we consider the results reported by previous studies. In terms of programmability, SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement kernels and also fewer calls to essential API functions and methods.
I. INTRODUCTION

When it comes to the parallel execution of programs, devices such as GPUs, DSPs, and FPGAs offer great performance potential. Furthermore, due to the ever-increasing demand for computational performance, the use of these devices is rapidly becoming prevalent. However, they have different execution and programming models than those of a traditional CPU, so the responsibility of choosing the correct tools and re-implementing the code when working with heterogeneous systems falls on the programmer. OpenCL [1] and CUDA [2] are among the most popular frameworks employed in heterogeneous programming [3]. Nonetheless, they still present a few challenges to less experienced developers, for instance, requiring a certain level of explicit control over memory management and also that different code be written for the host and the device. SYCL [4] is an emerging programming model that relies on the fundamental concepts, portability, and efficiency of OpenCL, while also including much of the usability and flexibility of C++. It facilitates the creation of OpenCL programs by embedding the device source code inline with the host code, thus reducing the learning effort for programmers and allowing them to focus on parallelization techniques rather than syntax.
Comparing the performance of SYCL with well-established models, such as OpenMP and OpenCL, enables us to come closer to understanding some of the practical characteristics of this new approach. This is exactly the focus of this study, in which we analyze the implementation of two applications in each of the aforementioned languages. The first is a simple program called 27stencil, which belongs to the EPCC OpenACC benchmark suite [5], and the second is a real-world application that implements the seismic processing method Common Midpoint (CMP) [6]. By analyzing the contrast between several characteristics of the implementations, such as execution time, memory usage, kernel size, and number of API calls, we were able to develop a better grasp of the features present in SYCL, in particular, the facilities it provides for increasing the programmability of systems, which will be further discussed in the experimental results section. We evaluated the SYCL implementation provided by AMD, named triSYCL, and found that the performance of SYCL-based programs is not on par with the ones implemented in OpenMP and OpenCL yet. However, the gap is getting smaller if we consider the results reported by previous studies [7], [8]. In order to increase the reproducibility of our experiments, we are making the implementations we used for the 27stencil and CMP tests public¹. We hope that this repository can also be a first step towards the creation of a new benchmark that can be used to compare SYCL with OpenMP and OpenCL.

This paper is organized as follows: Section II discusses studies related to SYCL. Section III describes the 27stencil and CMP methods used in our tests. Section IV presents our experimental setup and the results we obtained by comparing the OpenCL, OpenMP, and SYCL implementations of the 27stencil benchmark and CMP application. Section V presents our conclusions and possibilities for future work.

II. RELATED WORK

A. Programmability of Heterogeneous Systems

Although there has been considerable research aiming to improve the programmability of heterogeneous devices, the SYCL API is still in its early stages, so there are only a few published studies that discuss it.

¹https://github.com/herculeshcs/SYCL-OpenCL-OpenMP-Benchmark
In terms of other initiatives that intend to make heterogeneous programming more efficient, we have the CLOP language and compiler [9]. Implemented in the D programming language, this platform allows the seamless embedding of compute kernels in heterogeneous applications, unlike OpenCL and CUDA, which require separate host and device code. CLOP takes advantage of compile-time code rewriting and goes in the same direction as SYCL. There is also the Heterogeneous Programming Library (HPL) [10], a recent approach to improving programmability while still providing portability. Its programming model is similar to CUDA and OpenCL, but HPL kernels are written in a C-like language embedded in C++. Native OpenCL C kernel support in HPL was proposed by Viñas et al. [11], favoring code reuse. This is also possible with SYCL, but at the expense of providing the kernel as a compiled OpenCL object. In an effort to increase the portability and performance of a Vectorized Geometry package (VecGeom, which is part of Geant4, a simulation framework used by the main LHC experiments), Bíró explored the use of OpenCL instead of the current CUDA backend [12]. He cites SYCL as a way of providing OpenCL support with much less code modification, making it a reasonable option if the existing structure is to be kept as much as possible.

B. Comparative Studies Involving SYCL

Trigkas [7] investigated a prototype implementation of SYCL made by Codeplay Software Ltd., called SYCLONE. He compared SYCL with two established parallel programming models, OpenCL and OpenMP, in terms of programmability (number of lines of code) and performance (execution time on both Intel Xeon and Intel Xeon Phi devices). To this end, six scientific applications from the EPCC OpenACC benchmark suite [5] and the Intel OpenCL code samples [13] were implemented using each of the models. In our paper, we use an adapted implementation of the 27stencil code presented by Trigkas as our benchmark application. Beattie [8] used a more recent version of Codeplay's SYCL, called ComputeCpp [14], to conduct another investigation of this API. Although this version is also unavailable to the public, it is possible to ask its developers for notifications about it through their website. He analyzed Trigkas' results in comparison to the ones obtained with the new SYCL implementation, finding a few cases where SYCL's performance has improved. He also examined new results obtained by porting the SLAMBench benchmark to SYCL.

III. CASE STUDIES

To motivate the choice of the 27stencil and CMP case studies for this paper, this section describes each of these problems and the reasons why they were selected for this comparative analysis of SYCL, OpenCL, and OpenMP.

A. 27stencil

The main consideration behind the use of the 27stencil application for our study is the fact that it was previously
evaluated by Trigkas [7] and Beattie [8], also using OpenCL, OpenMP, and SYCL. Trigkas ported the original OpenACC benchmark to these three APIs and listed the main segments of the code in his master's thesis. We combined the partial code in his text with the original benchmark and built our OpenCL, OpenMP, and SYCL versions of the 27stencil application. Besides the advantage of giving us a baseline for our comparison, the 27stencil problem is important by itself, as stencil operations are typically used in Partial Differential Equation (PDE) solvers, which are common in many scientific applications such as fluid dynamics. In this type of operation, every point inside a multidimensional grid is updated in both time and space, based on the weighted contribution of a subgroup of its neighbors. For 27stencil, we have a 3D grid and each point is updated based on the values of its 26 neighbors [7], [8], as illustrated by the sketch below.
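As a reference for the description above, the following serial sketch performs one time step of the 27-point update on an n×n×n grid with a one-cell halo. The flattened-array layout and the plain average over the 26 neighbors are our assumptions for illustration, not necessarily the exact weighting used by the EPCC benchmark.

```cpp
#include <vector>

// Serial sketch of one time step of the 27-point stencil on an n*n*n
// grid with a one-cell halo. The flattened layout and the plain average
// over the 26 neighbors are illustrative assumptions, not necessarily
// the exact weighting used by the EPCC benchmark.
void stencil27_step(const std::vector<double> &a0, std::vector<double> &a1,
                    int n) {
  auto idx = [n](int i, int j, int k) { return (i * n + j) * n + k; };
  for (int i = 1; i < n - 1; ++i)
    for (int j = 1; j < n - 1; ++j)
      for (int k = 1; k < n - 1; ++k) {
        double sum = 0.0;
        for (int di = -1; di <= 1; ++di)           // visit the 3x3x3 cube
          for (int dj = -1; dj <= 1; ++dj)
            for (int dk = -1; dk <= 1; ++dk)
              if (di != 0 || dj != 0 || dk != 0)   // skip the center point
                sum += a0[idx(i + di, j + dj, k + dk)];
        a1[idx(i, j, k)] = sum / 26.0;             // average of 26 neighbors
      }
}
```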
B. Common Midpoint Method (CMP)

When processing seismic data, the Common Midpoint (CMP) optimization method is typically used to improve the signal-to-noise ratio. In this technique, a group of seismic traces that share the same midpoint, called a CMP gather, are added together to produce a single improved trace, the stacked trace. Before adding the traces, a Normal Moveout (NMO) correction is applied to them according to the distances between their sources and receivers, causing signals that are produced by the same reflectors to be added together. In this sense, the quality of the stacked trace depends on the quality of the NMO curve [6]. In the CMP method, the NMO curve is defined by a hyperbolic curve, also known as the traveltime curve, which depends on the offset between the source and the receiver that produced the seismic trace and on the average velocity with which the wave propagated during the seismic data acquisition. Although the offsets are known, the velocity usually is not, so it must be determined.
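For reference, standard seismic-processing texts such as [6] write this hyperbolic traveltime curve as follows (the notation below is ours):

$$ t^2(x) = t_0^2 + \frac{x^2}{v_{\mathrm{nmo}}^2}, $$

where \(t_0\) is the zero-offset traveltime, \(x\) is the offset between source and receiver, and \(v_{\mathrm{nmo}}\) is the NMO velocity that the search described next tries to determine.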
In order to find the velocity that provides the best stacking, the CMP method performs a search using several different velocities. For each of them, the CMP method computes the semblance, a coherence metric that indicates whether the NMO curve defined by a given velocity would produce a good stacking. This investigation involves computing the semblance function for several different velocities, which are defined by the search space. An auxiliary procedure computes the semblance for each velocity in the search space and returns the one that generated the maximum semblance value. Since seismic traces are represented in the computer as a discrete set of samples, called time samples, this procedure is performed for every time sample of every gather, searching for the velocity that provides the best semblance for a given time sample. Once the best NMO velocity is found, the traces are stacked, producing a single trace for each CMP gather. Since the search space is known, this investigation can be performed concurrently, making it what is called an embarrassingly parallel problem. In fact, each gather, time sample, and even semblance can be calculated simultaneously to accelerate the computation; a simplified sketch of this search appears at the end of this section.

In addition to its importance in real-world seismic data processing, we chose this application due to the fact that our research group had already designed and implemented a sequential version of the CMP method in C and parallelized versions using OpenMP and OpenCL. In this study, we ported our OpenCL code to use SYCL in order to compare the performance of all of these approaches.
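The sketch below illustrates the structure of the velocity search described above (our own simplification, not code from our repository). The semblance computation is abstracted behind a callable whose signature is an assumption made for illustration; the point is that every (time sample, velocity) pair is independent.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Sketch of the CMP velocity search. The callable `semblance(t0, v)`
// stands in for the coherence computation described in the text; its
// signature is an illustrative assumption. Every (time sample, velocity)
// pair is independent, which is what makes the search embarrassingly
// parallel.
std::vector<float> best_velocities(
    std::size_t num_samples, const std::vector<float> &search_space,
    const std::function<double(std::size_t, float)> &semblance) {
  std::vector<float> best(num_samples, 0.0f);
  for (std::size_t t0 = 0; t0 < num_samples; ++t0) {  // one velocity per sample
    double best_s = -1.0;
    for (float v : search_space) {                    // scan the search space
      double s = semblance(t0, v);
      if (s > best_s) {                               // keep the argmax
        best_s = s;
        best[t0] = v;
      }
    }
  }
  return best;
}
```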
IV. EXPERIMENTAL EVALUATION

In this section, we present the results we obtained in our tests with SYCL, OpenCL, and OpenMP and analyze them regarding performance and programmability. We reiterate that, in order to make our results more reproducible and also to start a new benchmark that facilitates the comparison of SYCL to other well-established APIs, such as OpenMP and OpenCL, the implementations discussed in this section are available at the public repository listed on the first page of this article.

A. Experimental Setup

We implemented three versions of the 27stencil application based on the code listed by Trigkas in his master's thesis [7]. Each version is implemented using a different parallel programming API, and they are referred to as 27stencil-OMP for OpenMP, 27stencil-OCL for OpenCL, and 27stencil-SYCL for SYCL. We also implemented three versions of the CMP stacking method [6] using the three aforementioned APIs. They are referred to as CMP-OMP, CMP-OCL, and CMP-SYCL, respectively. The machine used in our experiments has two 2.6 GHz Intel Xeon E5-2670 processors and 64 GB of DDR3 RAM. The operating system is Red Hat 4.4.7-16. All tests were compiled with g++ 4.9 and the -O3 flag unless stated otherwise. For our runtime environment, we used triSYCL² (the implementation provided by AMD of the OpenCL SYCL C++ layer provisional specification), the Intel OpenCL SDK (driver 1.2.0.43 and runtime 15.1), and OpenMP 4.0. Each implementation was executed 10 times, and we measured the running time of the code that effectively performs the computation, that is, the region of interest (ROI). The times reported in the following subsections are the average of these measurements and their respective standard deviations.

²https://github.com/amd/triSYCL

B. 27stencil-OMP Performance

As a first step in our investigation, we aimed at reproducing the tests performed by Trigkas [7], which we are naming "baseline", to establish a parameter for our comparison. The 27stencil-OMP application is the one we were best able to recreate, so we ran it for four different input sizes using several OpenMP configurations, each one with a different number of threads. Figure 1 shows the execution times measured for each number of threads in our system along with the baseline results reported by Trigkas. As the 27stencil partial code presented in his thesis has two main loops, one that performs the bulk of the computation and another that copies the values from an auxiliary matrix, we divided our time measurements between these two operations.

Fig. 1. 27stencil-OMP performance results in comparison to the baseline for different input sizes (1 to 100 million) and thread counts (8 to 80). The label above each column represents 27stencil-OMP's matrix copy and ROI execution times and the baseline ROI execution time in seconds, respectively.

The computing system used in Trigkas' experiments (2 x Intel Xeon E5-2650) has the same number of processors and cores as ours (2 x Intel Xeon E5-2670), but the clock frequency of our system (2.6 GHz) is slightly higher than his (2.0 GHz). However, our initial results were on average 5.07 times slower than the ones he obtained when comparing both ROI execution and matrix copy times with the baseline, and 3.34 times slower when considering only the ROI execution time. We conjecture that the main reason for this performance difference is the compiler used in the experiments. We used g++ 4.9 with the -O3 flag, while Trigkas used the Intel C++ Compiler with the default optimization flag. We also tried to use the Intel C++ Compiler (15.0.2) with the Intel (1.2.0.43) driver, but the results were slightly worse than the ones obtained with g++ 4.9. Since we do not know which compiler version was used by Trigkas, we cannot verify whether the compiler is in fact the major source of the discrepancy.

We then decided to experiment with different pragmas to investigate the impact this had on our execution times. By using "#pragma omp parallel for private(i,j,k) schedule(dynamic) shared(a1) num_threads(th)" instead of the same pragma employed by Trigkas, our total execution time results improved by an average of 1.63 times. The times reported in Figure 1 reflect this experiment, and we see that now our results are on average 3.75 times slower than his when comparing both ROI execution and matrix copy times with the baseline, and 2.00 times slower when considering only the ROI execution time. In the future, we intend to further explore code optimizations to find out other possible reasons for the difference in performance.
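For concreteness, this is how the modified pragma attaches to the stencil's main loop nest. This is a sketch only: the loop structure and the surrounding declarations (n for the grid dimension, th for the thread count, a1 for the output array) are our assumptions based on the pragma quoted above.

```cpp
#include <vector>

// Sketch of the modified OpenMP scheduling applied to the stencil's main
// loop nest. Variable names follow the pragma quoted in the text; the
// loop structure is our assumption, not the benchmark's exact code.
void stencil_roi(const std::vector<double> &a0, std::vector<double> &a1,
                 int n, int th) {
  int i, j, k;  // declared outside the loops so they can appear in private()
  #pragma omp parallel for private(i, j, k) schedule(dynamic) \
      shared(a1) num_threads(th)
  for (i = 1; i < n - 1; i++)
    for (j = 1; j < n - 1; j++)
      for (k = 1; k < n - 1; k++) {
        // ... 27-point update reading a0 and writing a1 (see Section III) ...
      }
}
```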
We also note that the results we obtained for 8 and 32 threads are not within the expected range. We used the Linux perf tool to analyze the execution and verified that, when executing with 32 threads, 76% of the CPU cycles are spent executing code from libgomp.so, the OpenMP runtime. We also found out that, when running with 32 threads, the program executes roughly twice the number of instructions it executes with 31 or 33 threads. These results suggest that this is an issue with the OpenMP runtime.

C. SYCL, OpenCL, and OpenMP Performance

1) 27stencil: Considering the results presented in the previous subsection, we pursued tests that involve not only OpenMP, but OpenCL and SYCL as well. For the OpenMP times reported in this subsection, we chose the best configuration presented in Section IV-B for each data size. Since Trigkas [7] only reported exact numbers for OpenMP executions, we do not present his results in our graphs in this subsection. Still, it is possible to estimate approximate running times from his graphs in order to make our analysis. Figure 2 displays the execution times for our tests, calculated according to the description in Section IV-A. The differences in the results for OpenMP were discussed in the previous subsection, but there is some divergence in the execution times obtained with the OpenCL implementation as well. While Trigkas' OpenCL results for the 100M input stay well under the 50-second mark (approximately between 20 and 30 s), ours go above 60 s. It is a much smaller difference compared to the one presented for the OpenMP implementation, however. If we consider only the ROI execution time of 31.93 s, we see that our result comes much closer to what he reported.
Fig. 2. ROI and matrix copy execution times for 27stencil. The label above each column represents the values of these two measures, respectively.

Beattie [8] also attempted to reproduce the results obtained by Trigkas with the Intel Xeon Phi accelerator. Despite finding the same pattern for all data sizes (SYCL was the slowest, followed by the serial and OpenCL versions with somewhat similar performance, and then OpenMP as the fastest by a large margin), his results diverge from Trigkas' as well. Still, we must consider that the difference is expected in this case, as the experimental setup of Beattie's virtual machine was considerably different from the machines used by Trigkas. Beattie points out that, relatively speaking, his OpenCL, OpenMP, and serial implementations have performances similar to what was reported by Trigkas, but ComputeCpp showed better execution times than SYCLONE: the code developed with this new SYCL implementation takes about 1.5 times the duration of the OpenCL execution, while the previous approach needed approximately 5 times as long [8].

Analyzing the graph in Figure 2, we can also see that we were able to obtain good results using the AMD triSYCL implementation. On average, the execution times of 27stencil-SYCL are 2.35 times those of 27stencil-OCL and 2.22 times those of 27stencil-OMP. We note that our SYCL measurements are closer to the OpenMP times than to the OpenCL ones. We consider this an interesting result, given that triSYCL is built on top of OpenMP instead of OpenCL, unlike Codeplay's approaches, SYCLONE and ComputeCpp. Another piece of information that can be used to evaluate the performance of our implementations is memory usage. Figure 3 presents the results obtained with the pmap tool for this metric.
Fig. 3. Memory usage for 27stencil.
We see that SYCL's layer of abstraction has increased the amount of memory used in comparison to the 64-thread OpenMP execution by 1.16 times on average. Nonetheless, it uses an average of 0.39 times the memory used by OpenCL, which is quite a considerable improvement.

2) CMP: To analyze the potential of SYCL in terms of real-world applications, we chose the seismic stacking algorithm CMP, for which we already had optimized OpenCL and OpenMP implementations. Figure 4 illustrates the execution times of the ROI of our CMP implementations in SYCL, OpenCL, and OpenMP for a problem size of 1.08 × 10⁸ semblances.
Fig. 4. ROI execution times for CMP.

Much like our results for 27stencil, CMP-OCL presented the best execution time, followed by CMP-OMP, and then CMP-SYCL. The proportion of the OpenCL results stayed similar, with SYCL having an execution time of 2.77 times that of OpenCL, but it performed a little better relative to OpenMP (1.38 times). Figure 5 displays the memory usage results for the CMP application. Again, OpenCL displayed the greatest memory consumption among the tests. However, this time CMP-SYCL used only 0.20 times the amount measured for CMP-OCL, and was also more efficient than CMP-OMP, presenting 0.32 times its memory usage.

Fig. 5. Memory usage for CMP.

This difference may be explained by the fact that our SYCL implementation uses the same data structures as the ones used with OpenCL, which differ from the ones used with OpenMP. In the future, we intend to re-implement our SYCL code to use the same data structures as the OpenMP code in order to evaluate the impact of this change.

D. SYCL, OpenCL, and OpenMP Programmability

In this subsection, we discuss and compare the programmability of SYCL with OpenMP and OpenCL. To this end, two metrics are adopted: the number of non-empty lines of kernel code and the number of API calls. We consider an "API call" any explicit call to a function or method defined in the SYCL [4], OpenCL [15], or OpenMP [16] specifications that, if removed, would cause the program not to behave as expected. Figure 6 shows the number of lines of code for the three versions of 27stencil and CMP. 27stencil-OMP presents the shortest kernel code for 27stencil, but we note that 27stencil-SYCL has fewer lines than 27stencil-OCL, indicating that it may be more easily programmable. For CMP, the OpenMP code has the largest kernel size. This is due to the fact that a few features implemented inside CMP-OMP's ROI are defined outside of the ROI in CMP-SYCL and CMP-OCL. Still, SYCL presents fewer lines of code than its OpenCL counterpart, again hinting at better programmability.

Fig. 6. 27stencil and CMP kernel size.

The number of API calls for each 27stencil and CMP implementation is displayed in Figure 7. We see that OpenMP is the most compact of the three APIs for both 27stencil and CMP. Nevertheless, SYCL shows a clear improvement over OpenCL.

Fig. 7. Number of API calls for 27stencil and CMP.
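As a concrete illustration of where this conciseness comes from, here is a minimal SYCL 1.2-style kernel submission (a sketch written for this discussion, not code taken from our repository). Device selection, buffer management, kernel compilation, and the copy back to the host are all implicit; in OpenCL, each of these steps requires one or more explicit API calls.

```cpp
#include <CL/sycl.hpp>
#include <vector>

// Minimal SYCL 1.2-style kernel submission (illustrative sketch).
// Device selection, memory allocation, kernel compilation, and the copy
// back to `data` when the buffer is destroyed are all implicit.
int main() {
  std::vector<float> data(1024, 1.0f);
  {
    cl::sycl::queue q;  // default device; no platform/context boilerplate
    cl::sycl::buffer<float, 1> buf(data.data(),
                                   cl::sycl::range<1>(data.size()));
    q.submit([&](cl::sycl::handler &cgh) {
      auto acc = buf.get_access<cl::sycl::access::mode::read_write>(cgh);
      cgh.parallel_for<class scale>(cl::sycl::range<1>(data.size()),
                                    [=](cl::sycl::id<1> i) { acc[i] *= 2.0f; });
    });
  }  // buffer goes out of scope: results are written back to `data`
  return 0;
}
```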
Although the metrics adopted in this study are not definitive proof that SYCL is more easily programmable than OpenCL, since programmability is a complex property to measure, we consider them indicators that it is possible to implement code that performs the same computations in both APIs with SYCL being the more concise of the two. Therefore, when we analyze these instances where the developer has to write fewer lines of code and insert fewer API calls when programming in SYCL, we can say that we saw in practice that this standard is heading in the direction of being a competitive alternative to OpenCL in terms of programmability.

E. SYCL Results Overview

Table I shows an overview of the relative performance of SYCL in comparison to OpenCL and OpenMP. The values in the table represent how many times the SYCL result is smaller or larger than the one obtained with each of the other APIs. For the 27stencil execution time and memory usage, the reported results are the average of the relative performances for each input size. The cases where SYCL outperformed the other approaches are those with values below 1.00.

TABLE I. OVERVIEW OF THE SYCL/OPENCL AND SYCL/OPENMP RESULTS.
                  27stencil [SYCL/* (×)]      CMP [SYCL/* (×)]
                  OpenCL       OpenMP         OpenCL       OpenMP
Execution time    2.35         2.22           2.77         1.38
Memory usage      0.39         1.16           0.20         0.32
Kernel size       0.73         1.19           0.85         0.46
API calls         0.45         4.50           0.75         25.00
F. Evolution of the triSYCL Implementation

The experiments detailed in this paper use a version of triSYCL downloaded from the AMD repository on GitHub in November of 2015. We tried experimenting with a newer version, downloaded in August of 2016, and we noticed that:
• The code could no longer be compiled with g++ 4.9. Nonetheless, it worked out of the box with g++ 5.2.
• Performance results for the 27stencil benchmark were exactly the same.
• It took 4.6 s to run the CMP-SYCL program with the most recent triSYCL version, while the time was only 4.1 s with the one we used previously. However, we believe this could be an issue related to the different compiler version, since we also verified that CMP-SYCL takes 4.6 s to run when using the old triSYCL version with g++ 5.2.

Despite the lack of performance improvement, triSYCL has been updated to support more SYCL features, including the SYCL 2.2 pipes and reservations and the blocking pipe extension from Xilinx.

V. CONCLUSIONS

In this study, we analyzed the performance and programming characteristics of SYCL, OpenMP, and OpenCL using a benchmark called 27stencil and the real-world seismic processing application CMP. Our investigation indicates that applications that rely on available SYCL runtimes are not on par with the ones based on OpenMP and OpenCL in terms of performance yet. In fact, our experiments indicate that our OpenCL- and OpenMP-based implementations are 2.35 to 2.77 times and 1.38 to 2.22 times faster than our SYCL-based programs, respectively. Still, if we consider the SYCL results reported by Trigkas [7] in 2014 (a 5-times slowdown compared to OpenCL), it is possible to see that this performance gap is getting smaller. Our analysis of the code produced with each one of these APIs indicates that, although OpenMP performs very well in terms of programmability, SYCL presents itself as a competitive alternative to OpenCL, requiring fewer lines of code to implement kernels that perform the same computations and fewer calls to essential API functions and methods. As future work, we expect to experiment with different SYCL runtimes and to be able to execute our SYCL programs on other machines and on hardware accelerators such as GPUs and the Intel Xeon Phi. We also intend to extend our performance analysis by further investigating data transfer times and our programmability study by adding more metrics.

ACKNOWLEDGMENTS

The authors thank Petrobras, CAPES, CNPq, and FAPESP for their invaluable financial support and our colleagues at the HPG and LMCAD labs for their contributions and support.

REFERENCES

[1] “The open standard for parallel programming of heterogeneous systems,” accessed Nov. 22, 2015. [Online]. Available: https://www.khronos.org/opencl/
[2] “Parallel Programming and Computing Platform,” accessed Nov. 22, 2015. [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html
[3] J. Kim, T. T. Dao, J. Jung, J. Joo, and J. Lee, “Bridging OpenCL and CUDA: A Comparative Analysis and Translation,” in Proc. SC, 2015, pp. 82:1–82:12.
[4] L. Howes and M. Rovatsou, “SYCL™ Specification - SYCL integrates OpenCL devices with modern C++,” accessed Nov. 21, 2015. [Online]. Available: https://www.khronos.org/registry/sycl/specs/sycl-1.2.pdf
[5] N. Johnson, “EPCC OpenACC Benchmarks,” accessed Nov. 22, 2015. [Online]. Available: https://github.com/EPCCed/epcc-openacc-benchmarks
[6] O. Yilmaz, Seismic Data Analysis: Processing, Inversion, and Interpretation of Seismic Data, 2nd ed. Society of Exploration Geophysicists, 2000. [Online]. Available: https://books.google.com.br/books?id=ceu1x3JqYGUC
[7] A. Trigkas, “Investigation of the OpenCL SYCL Programming Model,” Master's thesis, The University of Edinburgh, Edinburgh, UK, 2014.
[8] T. Beattie, “Investigation of the SYCL for OpenCL Programming Model,” Master's thesis, The University of Edinburgh, Edinburgh, UK, 2015.
[9] D. Makarov and M. Hauswirth, “CLOP: A Multi-stage Compiler to Seamlessly Embed Heterogeneous Code,” in Proc. GPCE, 2015, pp. 109–112.
[10] “Heterogeneous Programming Library,” accessed Nov. 24, 2015. [Online]. Available: http://hpl.des.udc.es
[11] M. Viñas, B. B. Fraguela, Z. Bozkus, and D. Andrade, “Improving OpenCL Programmability with the Heterogeneous Programming Library,” Procedia Comput. Sci., vol. 51, pp. 110–119, 2015.
[12] G. Bíró, “Investigation of OpenCL support in the VecGeom geometry package,” European Organization for Nuclear Research (CERN), Tech. Rep., 2014.
[13] “Intel® SDK for OpenCL™ Applications,” accessed Nov. 22, 2015. [Online]. Available: https://software.intel.com/en-us/intel-opencl
[14] “Codeplay - ComputeCpp,” accessed Nov. 30, 2015. [Online]. Available: https://www.codeplay.com/products/computecpp
[15] A. Munshi, “The OpenCL Specification,” accessed Nov. 30, 2015. [Online]. Available: https://www.khronos.org/registry/cl/specs/opencl-1.2.pdf
[16] OpenMP Architecture Review Board, “OpenMP Application Program Interface,” accessed Nov. 30, 2015. [Online]. Available: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf