Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor

Alexander Heinecke∗ (Technische Universität München), [email protected]
Dirk Pflüger (Universität Stuttgart), [email protected]
Dmitry Budnikov, Michael Klemm, Arik Narkis, Maxim Shevtsov, Ayal Zaks, Sergey Lyalin (Intel)
{dmitry.budnikov,michael.klemm,arik.narkis}@intel.com, {maxim.y.shevtsov,ayal.zaks,sergey.lyalin}@intel.com

∗ corresponding author

ABSTRACT
Many data-intensive tasks can be accelerated significantly by switching to accelerators such as GPUs and coprocessors. Alas, large portions of the algorithms often need to be re-implemented, as different accelerators typically require different programming interfaces. The heterogeneous programming model of OpenCL was designed as a standard to provide functional portability across a variety of supported devices. However, OpenCL does not guarantee performance portability: the same algorithm may need to be implemented differently, albeit using the same language and interfaces, to achieve optimal performance on different devices. We demonstrate how OpenCL facilitates different device-specific optimizations while maintaining the same kernel codebase. We also show how to optimize execution parameters based on runtime inputs such as dataset size and target device, which is made possible with OpenCL. We employ a grid-based data mining regression and classification application as an example. Our approach optimizes code on the fly, with only minimal modifications applied for a target device. We focus on general performance portability across the NVIDIA* Tesla K20X GPU and the new many-core Intel Xeon Phi coprocessor. Finally, we report and analyze performance measurements for both types of devices.
1. INTRODUCTION
In recent years, technical and scientific progress has encouraged the creation, collection, and storage of an ever-increasing amount of data. Large datasets arise in medical imaging, experimental physics, astrophysics, and numerical simulations (e.g., crash tests), or from browsing and purchasing in online stores, social networks, or financial transactions. To "mine data", i.e., to gain knowledge from these vast datasets, is thus one of the main challenges in data-driven applications. Sparse grids [3, 14] provide a numerical method for learning tasks in data mining which is, in contrast to most classical approaches, well-suited for dealing with huge amounts of data.

The number of potential combinations of platforms and architectures has also increased during the last years. Today, one can use all kinds of systems for data-mining tasks, ranging from big compute clusters down to personal computers, depending on the amount of data to be analyzed. Where huge amounts of data are involved, accelerators (GPUs, coprocessors, etc.) can speed up the analysis by a remarkable factor. However, each accelerator typically had, and still has, its own flavor of programming interface. With the release of a successor accelerator, code has to be re-engineered to make best use of the capabilities provided by the new architecture. The industry standard OpenCL targets this problem and defines a virtual machine that can be mapped to (data-)parallel processors such as GPUs and many-core coprocessors as well as general-purpose CPUs (with rich SIMD vector instruction sets). As OpenCL supports runtime code generation, optimizations that depend on runtime parameters can be applied easily. We demonstrate such a usage scenario for a data mining application and analyze its performance portability across different platforms: the NVIDIA* K20X GPU and the Intel Xeon Phi coprocessor.

Sect. 2 discusses the related work. Sect. 3 introduces the compute platforms under investigation. We then continue with a description of classification and regression in data mining in Sect. 4. Sect. 5 elaborates on the implementation of the sparse grid approach and shows that a direct parallelization of the classical approach is nearly impossible due to the recursive traversal of the underlying data structures. We evaluate the performance on several massive datasets in Sect. 6; each dataset consists of up to hundreds of thousands of data points. Sect. 7 concludes the paper.
2. RELATED WORK
Sparse grids have been shown to work well for a wide range of different applications: partial differential equations in various settings [2, 27] as well as applications in economics [15, 24], regression [10, 23], classification [4, 11, 22], and many more. This paper extends our earlier work [13] on sparse grids in data mining. There, we used much simpler ansatz functions for the spatial discretization, which allowed for much simpler implementations. That work also laid the foundation for targeting the Intel Xeon Phi coprocessor by investigating the performance and programmability of the Knights Ferry prototype. Furthermore, there is plenty of work available that handles regular and structured sparse grids on CPUs with vector extensions and on GPUs; please refer to [9, 16–18]. However, as these ideas do not (easily) extend to spatially adaptive sparse grids, those results cannot be used here.

There are several papers that investigate OpenCL performance for different kinds of applications, for instance [6, 12]. In [6], a triangular solver and matrix multiplication are used as proxy applications. We focus on an advanced data-mining application with a highly complex algorithm, a new and challenging domain for performance portability across accelerators. We share their view that automatic and dynamic tuning is key to achieving this. The authors of [12] examine how OpenCL code can be generated from an OpenMP-parallel and already optimized CPU version. In our case, we rely on a domain-specific code generator for OpenCL code to make optimal use of all hardware features for our application. Finally, [12] also provides insights into the performance difference of the same OpenCL code on NVIDIA and AMD GPUs and concludes that different optimizations are needed for each GPU to achieve the best possible performance. In our work, we come to a similar conclusion; however, we compare an NVIDIA GPU to the recently announced Intel coprocessor.
3. ARCHITECTURAL OVERVIEW
Today's scientific compute facilities need to satisfy a steadily increasing computational demand by the applications they run, while being more and more focused on reducing energy consumption, time to market, and total cost of ownership. For the future, experts expect heterogeneous architectures with a moderate number of "fat" cores and a large number of accelerators or coprocessors. The November 2012 Top500 list (http://www.top500.org) shows that next-generation supercomputers will utilize coprocessors or accelerators to speed up computation. The Intel Xeon Phi coprocessor [5] and the NVIDIA Tesla K20X GPU [21] are two instances of devices with increased compute capabilities compared to traditional (host) CPUs. The Intel coprocessor and the NVIDIA accelerator follow different design principles: whereas the Intel coprocessor is a many-core CPU that is based on the Intel Architecture (IA), the Tesla K20X is a massively parallel accelerator with specialized processing elements.

The Intel coprocessor plugs into standard PCI Express* slots and features 61 general-purpose cores that run at about 1100 MHz. The cores are based on a refreshed Intel Pentium (P54C) design with 64-bit scalar instructions and 512-bit SIMD vector instructions that operate on 16 single-precision (SP) or 8 double-precision (DP) floating-point elements per instruction. With the fused multiply-add instructions, the coprocessor provides about 2.1 TFLOPS for SP and about 1.1 TFLOPS for DP. Each core executes four hardware threads with round-robin scheduling between instruction streams, i.e., in each cycle the next instruction stream is selected. The coprocessor has a typical cache structure of separate 32 KB L1 caches for data and instructions and an inclusive 512 KB L2 cache per core. The L2 caches are connected through a high-bandwidth ring for fast on-chip communication. An L3 cache does not exist due to the high-bandwidth GDDR5 memory (350 GB/sec at 2750 MHz). Following the key principles of the IA platform, all caches and the coprocessor memory are fully coherent.

In contrast to the Intel coprocessor, the NVIDIA Tesla K20X does not contain general-purpose compute cores. It consists of 14 multiprocessors with 192 processing elements each. The processing elements run at a clock speed of 732 MHz, and the device offers a memory bandwidth of 250 GB/sec. A 1.5 MB L2 cache is shared across the 14 multiprocessors. As this device is based on the NVIDIA Kepler* GK110 architecture, it does not offer vector instructions like the Intel coprocessor; the 192 processing elements per multiprocessor are instead programmed according to the Single Instruction Multiple Threads (SIMT) paradigm: all processing elements execute either the same instruction or no-op instructions. The peak double-precision floating-point performance of the K20X is 1.3 TFLOPS (comparable to the Intel Xeon Phi coprocessor), but its peak SP performance of 3.95 TFLOPS is much higher than that of the Intel coprocessor.

Being based on IA, the Intel coprocessor can support all programming models that are available for traditional IA processors. The compilers for the coprocessor fully support standard Fortran (incl. Co-Array Fortran) and C/C++. OpenMP*, Intel Threading Building Blocks, Intel Cilk Plus, as well as POSIX* threads can be used to parallelize an application. The Intel Composer XE can automatically generate code for the Vector Processing Unit (VPU) either through auto-vectorization, semi-automatically by pragmas, or manually through intrinsic functions (a small example is sketched at the end of this section). Due to its special architecture, the Tesla K20X only supports a limited set of programming paradigms. The most important ones are CUDA* and OpenCL*, which are data-parallel programming models. OpenACC* borrows the concepts of OpenMP as a pragma-based programming model to offload computation to GPUs. Third-party compilers such as the PGI compiler suite or HMPP support offloading of Fortran code to the GPU. All three models restrict the language features in the offloaded code so as not to violate the GPU programming model.
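The following fragment is only an illustration of the pragma-based route to VPU code generation, not code taken from the application; #pragma simd is the Intel-compiler-specific hint (OpenMP 4.0 offers the equivalent #pragma omp simd):

// Hedged sketch: semi-automatic vectorization for the 512-bit VPU.
// The pragma asserts that the loop iterations are independent, so the
// compiler can emit 16-wide SP SIMD code with fused multiply-adds.
void saxpy(float a, const float* x, float* y, int n) {
    #pragma simd
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}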
4. DATA MINING WITH SPARSE GRIDS
Two prominent and related tasks in data mining are classification and regression. Both aim to generalize from known data and to predict a target property for new, previously unknown data. Thus, we start with a set S of M data points x_m ∈ R^d in a d-dimensional feature space for which some target values y_m ∈ K are known:
\[
S = \left\{ (\vec{x}_m, y_m) \in \mathbb{R}^d \times K \right\}_{m=1,\dots,M} .
\]
To learn and to generalize, we assume that we obtained S by a random and noisy sampling of an unknown function f, which we aim to reconstruct. We then use this function to predict a target value for new x. As we are dealing with finite data, we can restrict ourselves to the domain [0, 1]^d without loss of generality. For the task of binary classification, we learn a continuous function f and check whether the function value is negative or not; the absolute function value provides a measure of confidence in our prediction. We restrict ourselves to reconstructions f_N of f in some finite-dimensional function space V_N which is spanned by a basis {φ_j(x)}_{j=1,...,N}. Thus, f_N can be written as a linear combination of basis functions with coefficients α_j,
\[
f_N(\vec{x}) = \sum_{j=1}^{N} \alpha_j \varphi_j(\vec{x}) . \qquad (1)
\]
To obtain a unique f_N and to be able to deal with noise, we solve the regularized least squares problem
\[
f_N = \arg\min_{f_N \in V_N} \left( \frac{1}{M} \sum_{m=1}^{M} \bigl(y_m - f_N(\vec{x}_m)\bigr)^2 + \lambda \, \|\nabla f_N\|^2_{L_2} \right) , \qquad (2)
\]
see [11, 23] and the references cited therein for further details. On the one hand, we ensure closeness to our training data by minimizing the mean square error on it. On the other hand, we incorporate the smoothness assumption of data mining, which states that close data points are very likely to have a similar function value, by enforcing some degree of smoothness of f_N and preventing oscillations. This trade-off can be influenced by a good choice of the regularization parameter λ; for a given dataset, this can be achieved via cross-validation [1]. Note that this is a general approach: other classification methods can be formulated the same way by choosing different error and smoothness terms [8].
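As a brief sketch of how the linear system stated next arises (standard least-squares calculus, not spelled out in the original text, using the matrices B and C defined below), setting the gradient of (2) with respect to the coefficients to zero gives
\[
\frac{\partial}{\partial \alpha_j}\left( \frac{1}{M} \sum_{m=1}^{M} \bigl(y_m - f_N(\vec{x}_m)\bigr)^2 + \lambda\, \vec{\alpha}^{T} C \vec{\alpha} \right)
= -\frac{2}{M} \sum_{m=1}^{M} \varphi_j(\vec{x}_m)\bigl(y_m - f_N(\vec{x}_m)\bigr) + 2\lambda\,(C\vec{\alpha})_j = 0 ,
\]
and with b_{ij} = φ_i(x_j) this is exactly (1/M) B B^T α + λ C α = (1/M) B y.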
Minimizing (2) using (1), we obtain a system of linear equations,
\[
\left( \frac{1}{M} B B^{T} + \lambda C \right) \vec{\alpha} = \frac{1}{M} B \vec{y} , \qquad (3)
\]
the solution of which is the coefficient vector α of f_N. The N × N matrix C, c_{ij} = ∫ ∇φ_i(x) · ∇φ_j(x) dx, stems from the smoothness term; the N × M and M × N matrices B and B^T, b_{ij} = φ_i(x_j), and the vector y of the target values stem from the error term.

So far, we have not yet specified which kind of basis functions to use for learning. Conventional approaches for classification and regression mainly choose global ansatz functions which are associated with the locations of the training data points, aiming to represent f with as few basis functions as possible. But as they depend on the dataset at hand, they typically scale at least quadratically, or even worse, with the number of training data points M. Thus, they are unsuited for the mining of large datasets as we consider here. In contrast, we discretize the feature space [0, 1]^d and choose basis functions φ_j that are centered at grid points as in classical finite element schemes. Thus, we obtain an algorithm that scales only linearly in M and that is therefore well-suited for the treatment of an almost arbitrarily large amount of data to analyze. Unfortunately, a straightforward discretization with an equidistant mesh width h_n := 2^{-n} with respect to each dimension is affected by the curse of dimensionality: the number of grid points, (2^n + 1)^d, increases exponentially with the dimensionality d. This naturally restricts the approach to low-dimensional problems.
4.1 Sparse Grids
Sparse grids are an elegant way to beat the curse of dimensionality. They have been applied in various fields of application [3]. For sufficiently smooth functions (as we target in our setting), they reduce the number of grid points by many orders of magnitude to only O(2^n n^{d-1}) while keeping a high accuracy similar to that of the full grid case. In the following, we describe the basic principles of sparse grids as briefly as possible; see [3, 23] for details.

Sparse grids are based on a hierarchical (and thus inherently incremental and adaptive) formulation of the one-dimensional basis, which is extended to the d-dimensional setting via a tensor product approach. As we consider piecewise linear functions f_N defined on an equidistant mesh, we can derive one-dimensional basis functions φ_{l,i} that depend on a level l and an index i from the reference hat function φ(x) := max(1 − |x|, 0) via translation and scaling as φ_{l,i}(x) := φ(2^l x − i). The hierarchical basis for a particular level l with a mesh width of h_l = 2^{-l} then has three basis functions on level 1 and the odd-indexed basis functions on all other levels up to l, see Fig. 1 (left) for l = 3. If l and i are vectors of levels and indices per dimension for a certain basis function, we can formulate d-dimensional basis functions that are centered at x_{l,i} = (i_1 2^{-l_1}, ..., i_d 2^{-l_d}) as a product of the respective one-dimensional basis functions,
\[
\varphi_{\vec{l},\vec{i}}(\vec{x}) := \prod_{k=1}^{d} \varphi_{l_k, i_k}(x_k) ,
\]
where |l|_1 denotes the classical l_1 norm for vectors, i.e., the sum of the one-dimensional levels. In higher-dimensional settings, we obtain hierarchical increments (function spaces) W_l, for which the grid points are given by the Cartesian product of the one-dimensional grid points on the respective one-dimensional levels. Fig. 1 (right) shows the grids of the two-dimensional hierarchical increments W_l up to level 3 in each dimension. Note that in each W_l without any grid points located on the boundary, all basis functions have zero overlap, as in the one-dimensional case. Starting from a hierarchical scheme of increments W_l, we select those that, for sufficiently smooth functions, contribute most to the overall solution. Fortunately, the contribution of the basis functions decreases asymptotically with increasing |l|_1. Thus, we may omit those W_l that have many basis functions (and are therefore costly): a continuous knapsack problem leads to a diagonal cut-off in the tableau if the error is measured in the L_2 or maximum norm (see [3]).
Figure 1: One-dimensional basis functions up to level 3 (left), and tableau of hierarchical increments W_l up to level 3 in both dimensions (right).
This leads to the sparse grid space
\[
V_n := \bigoplus_{|\vec{l}|_1 \le n+d-1} W_{\vec{l}} ,
\]
a direct sum of an a-priori selection of subspaces W_l.

Note that the hierarchy of basis functions in one-dimensional problems can be considered a binary tree (except for the three basis functions on the first level): a basis function with level-index pair (l, i) has two hierarchical children, (l + 1, 2i − 1) and (l + 1, 2i + 1). In higher dimensions, the hierarchy forms some kind of d-dimensional binary tree of d-dimensional binary trees, where any node has 2d child nodes and up to d hierarchical ancestors. Also note that efficient algorithms working on sparse grids directly reflect this hierarchical structure via recursive function calls.

In addition to the a-priori selection of hierarchical increments, the incremental structure of the sparse grid basis allows one to start with a regular (non-adaptive) sparse grid for a low level n as defined before, to solve the linear system to obtain the surpluses α, and to adaptively select which grid points to add next. This a-posteriori refinement adapts to the problem at hand and permits spending more degrees of freedom in regions with a high density of the training data. A simple (and typically very effective) criterion for adaptive refinement, which we use in the following, is to select the refinement candidates with the highest absolute values of the hierarchical surpluses (coefficients α_j) and to create all of their 2d children if they do not yet exist. In both classification and regression, adaptive refinement is an important ingredient: spending too few basis functions locally prevents learning the underlying structure, but spending too many functions is also harmful, since the noisy training data is then fitted in too much detail, which leads to a low accuracy on new data points (overfitting).

Figure 2: Modified one-dimensional piecewise linear basis functions up to level 3.

Whereas sparse grids can cope with many more dimensions than full grids, the treatment of the boundary of the domain requires additional consideration. On level 1 (just one degree of freedom), the best guess is a function with constant value. On all other levels, we "fold up" the basis functions adjacent to the boundary by applying a linear extrapolation towards the boundary value (see Fig. 2). This way, we can deal with a high number of dimensions and start with only 2d + 1 grid points. However, this small number of grid points does not come for free, as the inner-most compute kernel requires a four-way if-branch that determines whether the currently treated grid point is on level 1, adjacent to the left or right boundary, or an inner grid point.
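To make the four-way branch concrete, the following minimal C++ sketch (our own illustration, not the application's code) evaluates one modified one-dimensional basis function of level l and index i at a point x, and forms the d-dimensional tensor product; the same case distinction reappears later in Alg. 3:

#include <algorithm>
#include <cmath>
#include <cstdint>

// Modified piecewise linear basis: constant on level 1, linearly
// extrapolated towards the boundary for the left-/right-most functions,
// and the standard hat function max(1 - |2^l x - i|, 0) in the interior.
double eval_basis_1d(std::uint32_t l, std::uint32_t i, double x) {
    const double scale = static_cast<double>(1u << l);                 // 2^l
    if (l == 1)             return 1.0;                                 // level 1: constant
    if (i == 1)             return std::max(2.0 - scale * x, 0.0);      // left boundary
    if (i == (1u << l) - 1) return std::max(scale * x - i + 1.0, 0.0);  // right boundary
    return std::max(1.0 - std::fabs(scale * x - i), 0.0);               // interior hat function
}

// d-dimensional basis value: tensor product over all dimensions.
double eval_basis(const std::uint32_t* l, const std::uint32_t* i,
                  const double* x, int d) {
    double s = 1.0;
    for (int k = 0; k < d; ++k)
        s *= eval_basis_1d(l[k], i[k], x[k]);
    return s;
}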
5. SPARSE GRID ALGORITHMS
In this section, we introduce the implementation details for the mathematical description of the sparse grid approach from Sect. 4. Before descending into the details of the individual variants of the algorithms we implemented, we provide several high-level remarks. Both the N × N matrices B B^T and C in the linear system (3) are not sparse. Assembling them explicitly would require O(N^2) storage space, which is infeasible for large grids. Furthermore, the matrix B B^T depends on the data S and thus does not exhibit any regular structure (apart from symmetry) that the algorithm could exploit for efficient matrix-vector operations or solvers. But both matrices can be efficiently applied to a vector by making use of the hierarchical tree structure of sparse grids. This encourages the use of iterative numerical solvers such as the conjugate gradient (CG) method. A drawback of using the matrix C that stems from the smoothness term ||∇f||^2_{L_2} in (2) is that the complexity depends exponentially on d through the factor 2^d. Recent research shows that, due to the hierarchical basis, a simpler term to enforce a degree of smoothness can be used instead [23]: minimizing Σ_{j=1}^{N} α_j^2 results in C = I, where I is the identity matrix. Thus, only B and B^T pose a challenge for implementation and parallelization.

As (v)_m := (B^T α)_m = Σ_{j=1}^{N} α_j φ_j(x_m) = f_N(x_m), the application of the matrix B^T to the surplus vector α can be computed by evaluating the sparse grid function f_N at all data points x_m.
The matrix-vector multiplication can then be parallelized over the set of evaluation points. Since there is at most one non-zero basis function per subspace for any evaluation point x_m (ignoring all basis functions with grid points on the grid boundary), we only have to identify which one and can disregard all the others (see Fig. 1). In one dimension, the search for the non-zero basis function can be performed by a recursive descent in the binary tree of basis functions. This requires the evaluation of only n basis functions, i.e., one per level. In multiple dimensions, the tensor product structure of the basis can be exploited: after the evaluation of the one-dimensional basis function φ_{l_k,i_k} in dimension k, the algorithm multiplies the result with the sum of the individual components in the remaining k − 1 dimensions of all basis functions that share φ_{l_k,i_k} in dimension k (see Fig. 3). Thus, we obtain a recursive structure in both the different dimensions (changing the direction of the descent) and the level (descending in one direction), with recursive function calls for each subspace.

The algorithm to compute the second matrix-vector product, u := B v = B (B^T α), is the transposed operation and is slightly more complex. Again, assembling the matrix would require evaluating each basis function for each data point. To evaluate only the basis functions that are affected by a certain data point, the algorithm traverses the hierarchical tree of basis functions as for the function evaluations before. But this time, it cannot compute a weighted sum of basis functions in one run; instead, it needs to add each partial result v_m φ_j(x_m) to the j-th entry of the result vector u during each tree traversal. Considering a straightforward parallelization over the data points, this requires either mutually exclusive access to avoid race conditions or a replication of the result vector for each thread, and the replicas have to be aggregated to form the final result vector in the end.

Figure 3: Evaluation of a data point (cross) on a sparse grid, recursive in both dimensionality and level: only the affected basis functions (colored) have to be evaluated.

Algorithm 1 Recursive version of the operation v = B^T α: recEval evaluates every data point x_m only at the grid points that are necessary to compute v_m = f_N(x_m) for the coefficient vector α.
  for all x_m ∈ S do
    v_m ← recEval(x_m, α, grid)
  end for

Algorithm 2 Recursive version of the operation u = B v.
  u ← 0
  for all x_m ∈ S do
    u_m ← recEvalT(x_m, v_m, grid)
    u ← u + u_m
  end for

5.1 Baseline Implementation
Any algorithm for the application of B and B^T that aims to minimize the number of computations inevitably has to exploit the hierarchical and tensor product properties of the underlying basis. However, this requires a (multi-)recursive scheme, traversing the d-dimensional tree of basis functions. For spatially adaptive sparse grids and standard, hash-based data structures [23], this results in random access patterns to the corresponding vectors u, v, and α. Both recursion and random memory access are difficult scenarios for parallelization. In contrast to the elegant and algorithmically efficient recursive approach, a very simple alternative is to iteratively evaluate all basis functions for all instances of the training dataset S and to accumulate the results. While this requires dealing with a lot of basis functions that are not affected by a certain data point, it leads to a straightforward implementation with nested loops that is well-suited for parallelization and that can be tuned for a particular compute platform through the standard set of optimization techniques.
5.1.1 Recursive Version
If we denote the recursive algorithm for the evaluation of a sparse grid function f_N with surplus vector α at a point x_m, as discussed in Sect. 5, by recEval(x_m, α, grid), the whole matrix-vector product B^T α can be obtained by a loop over all data points x_m of the training dataset S. In each iteration of the loop, recEval has to be called once, see Alg. 1. The outer loop can be trivially parallelized over x_m. For the computation of B v, we have the same outer loop as before, but we have to call a slightly modified version of recEval (denoted by recEvalT). This function call does not return a single (function) value, but a vector of length N. To parallelize the outer loop, independent result vectors u_m are required that later have to be summed up to obtain the final result vector u = B v, see Alg. 2.
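To illustrate the idea behind recEval, the following is a simplified sketch of our own, restricted to one dimension, written iteratively, and ignoring the modified boundary basis functions; the actual implementation recurses over the dimensions as well. The helper idx(l, i), which maps a level-index pair to the position of its coefficient, is hypothetical.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Descend the binary tree of hat functions: on each level exactly one
// basis function is non-zero at x, so only max_level terms are summed.
double rec_eval_1d(double x, int max_level,
                   const std::vector<double>& alpha,
                   const std::function<std::size_t(int, int)>& idx) {
    double result = 0.0;
    int i = 1;                                              // root: (l = 1, i = 1)
    for (int l = 1; l <= max_level; ++l) {
        const double scale = static_cast<double>(1 << l);   // 2^l
        result += alpha[idx(l, i)]
                  * std::max(1.0 - std::fabs(scale * x - i), 0.0);
        // descend to the child (l+1, 2i-1) or (l+1, 2i+1) whose support contains x
        i = (x < i / scale) ? 2 * i - 1 : 2 * i + 1;
    }
    return result;
}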
5.1.2 Iterative Version
The iterative approach disregards the tensor product structure of the recursive algorithm and its hierarchical sparse grid basis. For the v = B^T α operation, nested loops over the grid points g ∈ grid and the dimensionality d replace the function call to recEval (see Alg. 3). The innermost loop computes s := α_g φ_g(x_m), where φ_g(x_m) = Π_{k=1}^{d} φ_{g_{l_k}, g_{i_k}}(x_{m_k}).
Fig. 4 illustrates the data flows that take place during the execution of Alg. 3. This example shows the evaluation of a four-dimensional dataset and depicts the algorithm's streaming properties. It is obvious that technologies like hardware prefetching are able to speed up the execution significantly. This way, instead of a computationally efficient recursive algorithm, a simple nested loop causes the evaluation of a lot of basis functions with a value of zero. To give an impression: for the DR5 dataset (see Sect. 6), up to 16 times more basis functions (depending on the refinement step) have to be evaluated in the iterative version than in the recursive algorithm. Our results will show that this is worthwhile, as several standard tuning tricks like cache blocking and streaming support can be applied to further accelerate the iterative version. Furthermore, this algorithm offers possibilities for task- and data-parallel computation (e.g., through SIMD operations) by simultaneously evaluating one basis function at several different data points. Assuming a high number of training instances, this algorithm provides an enormous number of parallel tasks to feed a massively parallel processor such as a modern GPU or the Intel coprocessor.

Figure 4: Data containers for both grid data and dataset for a 4-dimensional example.

Algorithm 3 Iterative version of the operation v = B^T α: every instance x_m is evaluated at every grid point g, replacing the algorithmically efficient function evaluation recEval. We denote the level and index of grid point g in dimension k by g_{l_k} and g_{i_k}, and the corresponding coefficient by α_g.
  for all x_m ∈ S do
    v_m ← 0
    for all g ∈ grid do
      s ← α_g
      for k = 1 to d do
        if g_{l_k} = 1 then
          s ← s · 1
        else if g_{i_k} = 1 then
          s ← s · max(2 − 2^{g_{l_k}} · x_{m_k}; 0)
        else if g_{i_k} = 2^{g_{l_k}} − 1 then
          s ← s · max(2^{g_{l_k}} · x_{m_k} − g_{i_k} + 1; 0)
        else
          s ← s · max(1 − |2^{g_{l_k}} · x_{m_k} − g_{i_k}|; 0)
        end if
      end for
      v_m ← v_m + s
    end for
  end for

The operation u = B v can be realized similarly (see Alg. 4). Note that it is possible to interchange the three loops to avoid data dependencies. Hence, we swap the loop over all grid points with the loop running over all instances of the training dataset. This way, data dependencies between the grid points are removed, which is a huge advantage for parallelization. In general, both iterative operations are compute bound and can be regarded as embarrassingly parallel problems.

Algorithm 4 Iterative version of the operation u = B v.
  for all g ∈ grid do
    u_g ← 0
    for all x_m ∈ S do
      β ← v_m
      for k = 1 to d do
        if g_{l_k} = 1 then
          β ← β · 1
        else if g_{i_k} = 1 then
          β ← β · max(2 − 2^{g_{l_k}} · x_{m_k}; 0)
        else if g_{i_k} = 2^{g_{l_k}} − 1 then
          β ← β · max(2^{g_{l_k}} · x_{m_k} − g_{i_k} + 1; 0)
        else
          β ← β · max(1 − |2^{g_{l_k}} · x_{m_k} − g_{i_k}|; 0)
        end if
      end for
      u_g ← u_g + β
    end for
  end for
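In C++, the loop nest of Alg. 3 translates almost literally into nested loops. The following sketch is our own simplification (the Grid and Dataset accessors are hypothetical, and eval_basis_1d is the helper sketched in Sect. 4.1); it shows the streaming structure that the parallelization and vectorization in the next sections build on:

#include <cstddef>
#include <vector>

// v = B^T * alpha: evaluate f_N at every data point (iterative variant).
// grid.level(g, k), grid.index(g, k) and data.x(m, k) are assumed to be
// simple accessors into flat arrays; the outer loop over data points is
// embarrassingly parallel.
void mult_transposed(const Grid& grid, const Dataset& data,
                     const std::vector<double>& alpha,
                     std::vector<double>& v) {
    #pragma omp parallel for
    for (std::size_t m = 0; m < data.num_points; ++m) {
        double vm = 0.0;
        for (std::size_t g = 0; g < grid.num_points; ++g) {
            double s = alpha[g];
            for (int k = 0; k < grid.dims; ++k)
                s *= eval_basis_1d(grid.level(g, k), grid.index(g, k), data.x(m, k));
            vm += s;
        }
        v[m] = vm;
    }
}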
5.2 Manual SIMD Vectorization
We now focus on the iterative versions of applying B and B^T from Alg. 4 and Alg. 3. Similar to the recursive case, B^T is parallelized over the number of instances in the dataset. To enable the use of the Intel coprocessor's vector extensions, Alg. 3 needs some changes, as shown in Alg. 5. The loop over the instances is split into chunks with a size that is equal to the SIMD vector width of the underlying hardware (8 elements for DP, 16 elements for SP). No remainder handling is shown in the short pseudo-code, as no remainders occur: we are dealing with very large datasets that easily hide the overhead of padding the computation accordingly. Thus, at most veclength − 1 additional data points have to be handled during one application of B^T, which is straightforward (and for huge datasets negligible).

Algorithm 5 Parallelized iterative operation B^T α: every basis function is evaluated for veclength data points in a vector formulation. These vector operations are parallelized with OpenMP. The operator ⊙ symbolizes a component-wise vector multiplication, and broadcast sets all vector elements to a given scalar value.
Require: veclength (8 for DP, 16 for SP on the Intel Xeon Phi coprocessor)
  v ← 0
  #pragma omp parallel for
  for m = 1 to M with stride veclength do
    for all g ∈ grid do
      s ← broadcast(α_g, veclength)
      for k = 1 to d do
        gl_tmp ← broadcast(2^{g_{l_k}}, veclength)
        gi_tmp ← broadcast(g_{i_k}, veclength)
        t_tmp ← (x_{m,k}, x_{m+1,k}, ..., x_{m+veclength−1,k})
        if g_{l_k} = 1 then
          s ← s ⊙ 1
        else if g_{i_k} = 1 then
          s ← s ⊙ max(2 − gl_tmp ⊙ t_tmp; 0)
        else if g_{i_k} = 2^{g_{l_k}} − 1 then
          s ← s ⊙ max(gl_tmp ⊙ t_tmp − gi_tmp + 1; 0)
        else
          s ← s ⊙ max(1 − |gl_tmp ⊙ t_tmp − gi_tmp|; 0)
        end if
      end for
      v[m : m+veclength−1] ← v[m : m+veclength−1] + s
    end for
  end for
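To illustrate the chunking in Alg. 5 without committing to the actual 512-bit intrinsics, the following scalar C++ sketch of our own (reusing the hypothetical accessors from the previous sketch) processes VECLEN data points per grid point; the fixed-trip-count inner loops over j are exactly what the manual vectorization replaces with SIMD registers, broadcasts, and masked max operations:

#include <cstddef>
#include <vector>

constexpr int VECLEN = 8;   // 8 DP elements per 512-bit vector

// Blocked variant of v = B^T * alpha: each grid point is evaluated for
// VECLEN data points at once; M is assumed to be padded to a multiple
// of VECLEN as described in the text.
void mult_transposed_blocked(const Grid& grid, const Dataset& data,
                             const std::vector<double>& alpha,
                             std::vector<double>& v) {
    #pragma omp parallel for
    for (std::size_t m = 0; m < data.num_points; m += VECLEN) {
        double vm[VECLEN] = {0.0};
        for (std::size_t g = 0; g < grid.num_points; ++g) {
            double s[VECLEN];
            for (int j = 0; j < VECLEN; ++j) s[j] = alpha[g];          // broadcast
            for (int k = 0; k < grid.dims; ++k)
                for (int j = 0; j < VECLEN; ++j)                       // component-wise multiply
                    s[j] *= eval_basis_1d(grid.level(g, k), grid.index(g, k),
                                          data.x(m + j, k));
            for (int j = 0; j < VECLEN; ++j) vm[j] += s[j];
        }
        for (int j = 0; j < VECLEN; ++j) v[m + j] = vm[j];
    }
}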
5.3 OpenCL Implementation for GPUs

Since every instance can be evaluated independently, the operators' iterative formulation can be regarded as an embarrassingly parallel workload. Hence, they perfectly match the requirements of today's GPUs, which can be seen as many-core processors. It has been shown that GPUs are very popular for accelerating embarrassingly parallel kernels like dense linear algebra or Monte Carlo simulations [7, 26]. To enable GPUs for general-purpose computing, a special programming language has to be used. We chose OpenCL, as this is an open platform for GPUs from both NVIDIA and AMD as well as for the Intel Xeon Phi coprocessor. Considering GPUs, OpenCL is closely related to NVIDIA's CUDA programming language and environment [19].
Both share the same ideas: kernels are written in a shader-style language and are executed on the accelerator. Buffers and messages ensure the communication between the host and the GPU, and the memory model defines global and local sections. In contrast to CUDA, OpenCL programs can be provided as source code when compiling the host application (part of which runs on the CPU). Therefore, OpenCL can build kernels at runtime, which allows runtime code generation.

Starting with the GPU implementation, we built a prototype to determine which GPU programming model, CUDA or OpenCL, fits best. In the end, the runtime compiler turned the balance towards OpenCL: it is no secret that the GPU's local storage has to be used to fully exploit a GPU's potential. We have implemented a code generator that generates OpenCL code from the C++ application in the setup phase of the GPU kernels. At this time, every detail about the grid and the training dataset is known, and an OpenCL kernel can be tailored to match the requirements (the exact amount of required local memory, unrolling of loops with small trip counts, etc.). Since all these details depend on parameters which are only known at runtime, the runtime code generation capability of OpenCL turned out to be the most elegant way of generating essentially platform-independent code, except for the configuration of the local sizes, which is a heavily hardware-dependent knob. Of course, such a code generation could also be done by using expression templates in C++, but this would require generating code for all possible combinations and selecting the correct code path at runtime, based on the runtime parameters.
Referring to [13], the runtime code generation gains at least a factor of 2x over a plain CUDA implementation (on the same GPU) without the mentioned specialization for runtime parameters. For the optimized version of the OpenCL kernel (see the next section), we observe a benefit of up to 3x when this kind of unrolling is performed on the Intel coprocessor.

A shader-like kernel represents a so-called work item that belongs to a work group; a work group thus consists of several identical work items numbered by an index. The number of work items per work group is called the work-group size. Dividing the workload size by the work-group size gives the number of work groups that are used to execute the workload on the GPU. In order to reach optimal runtime conditions on the Kepler architecture, the local size should be a multiple of 32, which is the minimal warp size of this GPU. The optimal multiple has to be carefully determined by analyzing the workload (how many registers are used by a work item) and by several runtime tests. This can be achieved by auto-tuning approaches, for instance.

Let us consider the implementation of the operation B^T α with OpenCL to show how OpenCL can be used to implement its iterative version. Please note that OpenCL has two ways to enable data parallelism: through shader-style kernels and through the built-in vector data types. As NVIDIA recommends shader-style kernels [20], and as this style is quite similar to the C++ loop nest from Alg. 3, we decided to use a shader-style OpenCL kernel. The idea of the OpenCL kernel is as follows: a data point x_m is evaluated by one work item. Similarly to regular vectorization approaches, padding is required to assure optimal performance. As mentioned in the last paragraph, the local size should be a multiple of 32; our tests have shown that a local size of 64 gives the best performance on the GPU.

There are several applicable optimizations. First, the local storage of a work group can be used to prefetch data into the device's caches. Second, the runtime compilation of OpenCL can be exploited for runtime code generation: since the grid dimensionality is known at runtime, the code generator can completely unroll all loops over the dimensions to reduce the amount of control flow in the kernel (a minimal host-side sketch is given at the end of this section).

The implementation of the operation B v is nearly identical to that of B^T α. Due to the GPU architecture, a potential pitfall arises that might inhibit optimal performance: since the kernels working on B are parallelized over the number of grid points, the number of grid points needs to be a multiple of the local size for best performance. There are two ways to remedy this: (a) use a local size of 1 (i.e., a work group equals a work item), or (b) split the operator into a GPU and a CPU part. In the latter case, the GPU handles the largest multiple of 64 that is smaller than the total number of grid points, and the CPU is responsible for the rest. With increasing grid sizes (spending more refinements), the CPU part becomes less and less important. We make use of the second approach, as it provides better performance.
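The following is a minimal host-side sketch of this runtime code generation, written by us for illustration only: kernel and variable names are hypothetical, only the interior hat-function branch is generated, and error handling is omitted. The dimensionality d is pasted into the kernel source as a literal so that the loop over the dimensions is fully unrolled before the OpenCL compiler sees it.

#include <CL/cl.h>
#include <sstream>
#include <string>

// Generate, compile, and return a B^T-style kernel for a dimensionality
// that is only known at runtime.
cl_kernel build_mult_trans_kernel(cl_context ctx, cl_device_id dev, int dims) {
    std::ostringstream src;
    src << "#pragma OPENCL EXTENSION cl_khr_fp64 : enable\n"
           "__kernel void multTrans(__global const double* level,\n"
           "                        __global const double* index,\n"
           "                        __global const double* alpha,\n"
           "                        __global const double* data,\n"
           "                        __global double* v,\n"
           "                        const int storageSize) {\n"
           "  size_t m = get_global_id(0);\n"
           "  double vm = 0.0;\n"
           "  for (int g = 0; g < storageSize; ++g) {\n"
           "    double s = alpha[g];\n";
    for (int k = 0; k < dims; ++k)   // unrolled at code-generation time
        src << "    s *= fmax(1.0 - fabs(level[g*" << dims << "+" << k << "]"
            << " * data[m*" << dims << "+" << k << "]"
            << " - index[g*" << dims << "+" << k << "]), 0.0);\n";
    src << "    vm += s;\n"
           "  }\n"
           "  v[m] = vm;\n"
           "}\n";

    const std::string code = src.str();
    const char* text = code.c_str();
    const size_t length = code.size();
    cl_int err = CL_SUCCESS;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &text, &length, &err);
    clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
    return clCreateKernel(prog, "multTrans", &err);
}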
5.4 Algorithmic Improvements: Masking
The parallelization of B v over the grid points also avoids the global synchronization that would be necessary when parallelizing over the data points, since each data point can generate a contribution for an arbitrary grid point. However, using the grid points as the parallelization domain does not come for free: it introduces branch divergence, as each work item may take a different route through the aforementioned if-branches. Fortunately, this problem can be solved by moving to a masked version of B v. Due to a smart pre-processing step, we are able to limit the work of all branches to the amount of work done in the else branch of Alg. 4. As shown in Alg. 6, we can pre-compute a modified grid level (l_{g_k}), grid index (i_{g_k}), mask (m_{g_k}), and offset (t_{g_k}). While this requires a little more space on the device, it allows the additional load operations to be easily hidden behind the computation. Using these four pre-computed values, we are able to eliminate branch divergence on the device entirely, since the kernel

  β · max(or_bitwise(l_{g_k} · x_{m_k} + i_{g_k}, m_{g_k}) + t_{g_k}; 0)

covers all four branches by means of the host-side initialization of l_{g_k}, i_{g_k}, m_{g_k}, and t_{g_k}. This improved version mainly constitutes the "OpenCL opt." version discussed in Sect. 6.

Algorithm 6 Iterative version of the operation u = B v using masking.
  {pre-computation on the CPU}
  for all g ∈ grid do
    for k = 1 to d do
      if g_{l_k} = 1 then
        l_{g_k} ← 0,  i_{g_k} ← 0,  m_{g_k} ← 0x00...0,  t_{g_k} ← 1
      else if g_{i_k} = 1 then
        l_{g_k} ← −2^{g_{l_k}},  i_{g_k} ← 0,  m_{g_k} ← 0x00...0,  t_{g_k} ← 2
      else if g_{i_k} = 2^{g_{l_k}} − 1 then
        l_{g_k} ← 2^{g_{l_k}},  i_{g_k} ← −g_{i_k},  m_{g_k} ← 0x00...0,  t_{g_k} ← 1
      else
        l_{g_k} ← 2^{g_{l_k}},  i_{g_k} ← −g_{i_k},  m_{g_k} ← 0x80...0,  t_{g_k} ← 1
      end if
    end for
  end for
  {operator implementation on the device}
  for all g ∈ grid do
    u_g ← 0
    for all x_m ∈ S do
      β ← v_m
      for k = 1 to d do
        β ← β · max(or_bitwise(l_{g_k} · x_{m_k} + i_{g_k}, m_{g_k}) + t_{g_k}; 0)
      end for
      u_g ← u_g + β
    end for
  end for
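The bitwise-or trick can be reproduced in plain C++ as follows; this is a minimal sketch of our own (the device kernel expresses the same reinterpretation with OpenCL's as_uint/as_float built-ins):

#include <algorithm>
#include <cstdint>
#include <cstring>

// Set the given bits in the binary representation of v; with the mask
// 0x80000000u this forces the sign bit, i.e., it turns v into -|v|.
inline float or_bitwise(float v, std::uint32_t mask) {
    std::uint32_t bits;
    std::memcpy(&bits, &v, sizeof bits);
    bits |= mask;
    std::memcpy(&v, &bits, sizeof v);
    return v;
}

// Branch-free evaluation of one factor of the basis product, using the
// pre-computed values l, i, m, t from the host-side loop of Alg. 6:
//   l = 0,      i = 0,   m = 0x0,         t = 1 -> 1                        (level 1)
//   l = -2^lv,  i = 0,   m = 0x0,         t = 2 -> max(2 - 2^lv*x, 0)       (left boundary)
//   l =  2^lv,  i = -ix, m = 0x0,         t = 1 -> max(2^lv*x - ix + 1, 0)  (right boundary)
//   l =  2^lv,  i = -ix, m = 0x80000000,  t = 1 -> max(1 - |2^lv*x - ix|, 0) (interior)
inline float eval_masked(float l, float i, std::uint32_t m, float t, float x) {
    return std::max(or_bitwise(l * x + i, m) + t, 0.0f);
}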
5.5 Implementation for the Intel Coprocessor
For the Intel coprocessor, we started with the OpenCL algorithm for the GPU. In contrast to the native code, this version uses the AoS (Array of Structures) data layout. While AoS is well-suited for GPUs, SIMD-capable CPUs achieve greater performance with the SoA (Structure of Arrays) format, which permits moving data into vector registers with SIMD load instructions and eliminates the need for gather operations. As future work, we plan to port the OpenCL version to the SoA layout to provide a fairer comparison against the native implementation and to evaluate the performance differences to AoS.
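To make the layout difference concrete, a small sketch of our own (with hypothetical type names) of the two storage schemes for the training data:

#include <vector>

// Array of Structures (AoS): one record per data point; loading the k-th
// coordinate of several consecutive points requires a gather.
struct PointAoS { double x[5]; };            // d = 5 as in the experiments of Sect. 6
using DatasetAoS = std::vector<PointAoS>;

// Structure of Arrays (SoA): one contiguous array per coordinate; the k-th
// coordinate of 8/16 consecutive points is a single contiguous SIMD load.
struct DatasetSoA {
    std::vector<double> x[5];                // x[k][m] = k-th coordinate of point m
};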
Before diving into the details of the specific optimizations for the Intel coprocessor, let us recap the coprocessor architecture and relate it to OpenCL's terminology. The Intel coprocessor combines many CPU cores on a single chip, and each core executes four hardware threads. All the cores of a coprocessor together with their hardware threads constitute a single OpenCL device. Each hardware thread is treated as an individual OpenCL compute unit. The OpenCL notion of data parallelism allows a kernel to execute concurrently on multiple work items. The Intel coprocessor provides an additional level of parallelism through its SIMD instruction set. The auto-vectorization capability of the Intel OpenCL compiler enables processing multiple work items with SIMD, which is key to achieving good performance at the work-group level. A work group itself is the finest granularity of thread-level parallelism: different threads process different work groups. A reasonable amount of computation per work item needs to be coupled with the right work-group size as well as the resulting number of work groups that are available for parallel execution. The right balance is a key factor to achieve good scalability and performance on the coprocessor.

The first optimization for the coprocessor is to adjust the number of work groups to saturate the 244 hardware threads. The number of work groups is essential in the algorithm's early stages, when only few grid points have been created by refinement. Similarly to GPUs, there is a recommended multiplier for the local size on the architecture of the Intel coprocessor: the best performance is achieved with a local size that is a multiple of 16, which matches the underlying SIMD width of the coprocessor. The optimal multiple has to be carefully determined by analyzing the workload; in many cases, determining the optimal value can be left to the runtime (by passing NULL pointers for the local size values), thereby providing performance portability of this optimization (a host-side launch sketch is given at the end of this section). The work-group size can also be optimized per platform and workload using auto-tuning techniques [25]. We reduced the work-group size to sixteen work items, as this leaves enough headroom for the vectorization to generate SIMD code, while it creates at least 230 work groups even in the first iterations of the algorithm. This provides the best device saturation (Fig. 5) for the benchmarks we have tested and results in a speedup of 10-40%, depending on the test. With the value of sixteen, the resulting work-group count in the first iterations is slightly less than the total number of hardware threads. This can be seen as a suboptimal device saturation in the first 50 seconds of execution (approximately 5% of the total compute time, see Fig. 5, right). Still, the picture is much better than for a work-group size of 64 (Fig. 5, left): in that case, saturating the coprocessor with the initial work-group size of 64 takes almost 10x more time, around 500 seconds vs. 50 seconds for a work-group size of 16.

A second optimization is to remove the usage of local memory for the Intel coprocessor. The use of local memory is a typical GPU-targeted optimization for explicit caching. On the Intel Xeon Phi coprocessor, all OpenCL memory objects are cached automatically by the hardware, so explicit caching by use of local memory introduces a moderate overhead. More precisely, the barriers associated with the use of local memory introduce additional runtime costs.
Removing the use of local memory improved the performance on the Intel coprocessor by another 10-15%, depending on the test case (see Sect. 6).

Figure 5: Aggregated utilization of the Intel coprocessor's hardware threads on the time-line (collected with Intel VTune Amplifier XE 2013). The first 50 seconds of the OpenCL execution are shown; the left screenshot is for the execution with a local size of 64, the right screenshot for a local size of 16.
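Related to the work-group size discussion above, a host-side launch sketch of our own (the kernel and the surrounding variables are hypothetical): passing NULL as the local size defers the work-group choice to the OpenCL runtime, while an explicit value selects the device-specific sizes discussed in this and the previous section.

#include <CL/cl.h>
#include <cstddef>

// Enqueue a multTrans-style kernel over 'padded_points' work items.
cl_int launch_mult_trans(cl_command_queue queue, cl_kernel kernel,
                         std::size_t padded_points, bool is_xeon_phi,
                         bool let_runtime_choose) {
    std::size_t global_size = padded_points;          // padded to a multiple of the local size
    std::size_t local_size  = is_xeon_phi ? 16 : 64;  // device-specific choice from Sects. 5.3 and 5.5
    return clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global_size,
                                  let_runtime_choose ? nullptr : &local_size,
                                  0, nullptr, nullptr);
}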
6. PERFORMANCE EVALUATION**
Table 1 shows the performance numbers for the coprocessor OpenCL implementation of the algorithm and for the original GPU-optimized version that we discussed in the previous sections. For every device, the local size that gives the best performance is selected. For reference, we also provide native performance numbers, i.e., of the manually vectorized implementation, for the Intel Xeon Phi coprocessor.
For evaluation, we use two test scenarios with a moderate number of dimensions (d = 5) and distinct challenges. The first dataset contains 2^18 data points and classifies a regular 3 × ... × 3 checkerboard pattern. The second one is a real-world dataset from astrophysics (DR5): it predicts spectroscopic redshifts of galaxies based on more than 430,000 photometric measurements. For both, excellent numerical results are obtained using our method (see [14] for details).

The hardware for the evaluation is a two-socket Intel Xeon processor E5-2670 system with 128 GB of DDR3 memory (1600 MHz), running Red Hat Enterprise Linux 6.3 and Intel Composer XE 2013 v13.0.1.117. The Intel Xeon Phi coprocessor chip is B0 ES2 silicon with driver version 2.1.4982 and the Intel SDK for OpenCL Applications XE 2013 Beta v3.0.56860. The NVIDIA Tesla K20X in the system is a K20Xm with driver version 310.32 and CUDA 5.0.

Table 1: GFLOPS for both workloads for different variants and for SP/DP.

  System (Xeon Phi or GPU)   SP/DP   chk.brd. GFLOPS   DR5 GFLOPS
  Xeon Phi Native            SP      441               513
  Xeon Phi Native            DP      227               209
  Xeon Phi OCL basic         SP       51                74
  Xeon Phi OCL basic         DP       38                51
  Xeon Phi OCL opt.          SP      321               384
  Xeon Phi OCL opt.          DP      192               230
  K20X OCL basic             SP       61                87
  K20X OCL basic             DP       57                75
  K20X OCL opt.              SP      182               233
  K20X OCL opt.              DP      107               136

Even though the Intel coprocessor generally works best with the data in SoA layout, the out-of-box performance of the original GPU-oriented AoS version of the code ("OCL basic", Sect. 5.3) shows relatively good results. The "OCL opt." version with both the general optimizations from Sect. 5.4 and the device-specific tuning (e.g., the preferred local size, Sect. 5.5) gets close to the performance of the hand-tuned native version on the coprocessor and outperforms the NVIDIA K20X GPU.

The OpenCL version computes the remainder parts of the grid on the host, as described in Sect. 5.3, which skews the results towards OpenCL (compared to the native version).
This holds for both the coprocessor and the GPU scores. We thus report numbers for which the unaligned remainder was computed on the CPU (with regular C++ code parallelized with OpenMP). The influence of this padding is more visible for the SP versions of the tests and can be observed as a roughly 10% performance advantage for both devices. An interesting aspect of the performance on the Intel Xeon Phi coprocessor is the sustained threading scalability of the tests. To avoid the complexity of analyzing scalability in the presence of hyper-threading, we instead measured the performance while altering the number of cores that host the OpenCL work groups (each core running 4 hardware threads). In our experiments, going from 1 to 60 cores, the results show a very good scalability of about 50x to 52x for the SP version of the tests and 45x to 46x for the DP version, respectively. The direct comparison of the Intel Xeon Phi coprocessor with the NVIDIA K20X reveals that the coprocessor outperforms the GPU. The version of the OpenCL kernel ("OCL opt.") that was optimized for the coprocessor performs better than the original GPU-optimized version ("OCL basic") even on the GPU itself.
We attribute this to the fact that the optimizations that helped the coprocessor are generally also helpful for the GPU: mainly the reduction of SIMD (or, on the GPU, warp) divergence by eliminating control flow as described in Sect. 5.4, but also other general optimizations, such as amortizing calculations across work groups (by performing them once on the host), yield a performance boost on the GPU as well. By adding more GPU-specific tricks to the "OCL opt." kernels, we expect small additional performance gains for the GPU numbers. For instance, we estimate from other (simpler) data-mining tests that amortizing some data traffic in the kernel through the usage of local memory will gain another 10% on the GPU. However, this is subject of future work.
7. CONCLUSION
In this paper, we have shown the performance of a grid-based data-mining algorithm on industry-leading accelerators, namely the NVIDIA K20X and the Intel Xeon Phi coprocessor. We use OpenCL as the API of choice to keep the kernel code the same for the different devices. We argue that performance portability is a feasible requirement when implementing in OpenCL. For example, OpenCL allows us to easily adapt runtime parameters that are optimal for a particular device (e.g., the optimal work-group size) via simple host-side logic. A general contribution of the paper is the purely algorithmic (i.e., device-agnostic) improvement to the core data-mining routines that significantly boosts the performance on both devices. For the new algorithm version, which employs masking to reduce SIMD/warp divergence, even the out-of-box OpenCL performance of both the Intel Xeon Phi coprocessor and the K20X is reasonable. Finally, by applying relatively simple device-specific optimizations for the Xeon Phi coprocessor, it is possible to outperform the K20X and to match the speed of the full-blown native code on the Xeon Phi for the DP case, which is the one in primary use.
8. ACKNOWLEDGMENTS
Intel, the Intel logo, Cilk, Xeon, and Xeon Phi are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

* Other brands and names are the property of their respective owners.

** Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance. System configuration used: Supermicro X9DRG-HF baseboard with 2S Intel Xeon processor E5-2670 (128 GB DDR3 with 1600 MHz, Red Hat Enterprise Linux 6.3) and a single Intel C600 IOH, Intel Xeon Phi coprocessor with B0 ES2 silicon (GDDR with 5.5 GT/sec, driver v2.1.4982, flash v2.1.05.0375, device OS v2.6.38.8-g32944d0), Intel Composer XE 2013 U1, Intel SDK for OpenCL Applications XE 2013 Beta v3.0.56860, and NVIDIA Tesla K20Xm (GDDR with 5.2 GHz, driver 310.32, CUDA v5.0 with OpenCL support).
9. REFERENCES
[1] D.M. Allen. The Relationship between Variable Selection and Data Augmentation and a Method for Prediction. Technometrics, 16(1):125–127, 1974.
[2] J. Benk, H.-J. Bungartz, A.-E. Nagy, and S. Schraufstetter. An Option Pricing Framework Based on Theta-Calculus and Sparse Grids. In Progress in Industrial Mathematics at ECMI 2010, October 2010.
[3] H.-J. Bungartz and M. Griebel. Sparse Grids. Acta Numerica, 13:147–269, May 2004.
[4] H.-J. Bungartz, D. Pflüger, and S. Zimmer. Adaptive Sparse Grid Techniques for Data Mining. In Proc. of the High Performance Scientific Computing 2006, pages 121–130, Hanoi, Vietnam, June 2008.
[5] Intel Corporation. Intel Xeon Phi Coprocessor System Software Developers Guide, November 2012. IBL document ID 488596.
[6] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra. From CUDA to OpenCL: Towards a Performance-portable Solution for Multi-platform GPU Programming. Parallel Comput., 38(8):391–407, August 2012.
[7] A. Lee et al. On the Utility of Graphics Cards to Perform Massively Parallel Simulation of Advanced Monte Carlo Methods. Journal of Computational and Graphical Statistics, 19(4):769–789, 2010.
[8] T. Evgeniou, M. Pontil, and T. Poggio. Regularization Networks and Support Vector Machines. In Advances in Computational Mathematics, pages 1–50. MIT Press, 2000.
[9] A. Gaikwad and I.M. Toke. GPU Based Sparse Grid Technique for Solving Multidimensional Options Pricing PDEs. In Proc. of the 2nd Workshop on High Performance Computational Finance, pages 1–9, Portland, OR, November 2009.
[10] J. Garcke. Regression with the Optimised Combination Technique. In Proc. of the 23rd Intl. Conf. on Machine Learning, pages 321–328, Pittsburgh, PA, June 2006.
[11] J. Garcke, M. Griebel, and M. Thess. Data Mining with Sparse Grids. Computing, 67(3):225–253, October 2001.
[12] D. Grewe, Z. Wang, and M.F.P. O'Boyle. Portable Mapping of Data Parallel Programs to OpenCL for Heterogeneous Systems. In Proc. of the 11th Intl. Symp. on Code Generation and Optimization, Shenzhen, China, February 2013.
[13] A. Heinecke, M. Klemm, D. Pflüger, A. Bode, and H.-J. Bungartz. Extending a Highly Parallel Data Mining Algorithm to the Intel Many Integrated Core Architecture. In Euro-Par 2011: Parallel Processing Workshops, pages 375–384, Bordeaux, France, August 2011. LNCS 7156.
[14] A. Heinecke and D. Pflüger. Multi- and Many-Core Data Mining with Adaptive Sparse Grids. In Proc. of the 8th ACM Intl. Conf. on Computing Frontiers, pages 29:1–29:10, Ischia, Italy, May 2011.
[15] A. Heinecke, S. Schraufstetter, and H.-J. Bungartz. A Highly-parallel Black-Scholes Solver based on Adaptive Sparse Grids. Intl. Journal of Computer Mathematics, 89(9):1212–1238, June 2012.
[16] A. Murarasu, G. Buse, D. Pflüger, J. Weidendorfer, and A. Bode. FastSG: A Fast Routines Library for Sparse Grids. Procedia CS, 9:354–363, 2012.
[17] A. Murarasu, J. Weidendorfer, and A. Bode. Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation. In Euro-Par 2011: Parallel Processing Workshops, pages 345–354, August 2012. LNCS 7156.
[18] A. Murarasu, J. Weidendorfer, G. Buse, D. Butnaru, and D. Pflüger. Compact Data Structure and Scalable Algorithms for the Sparse Grid Technique. In Proc. of the 16th ACM Symp. on Principles and Practice of Parallel Programming, February 2011.
[19] NVIDIA. NVIDIA CUDA C Programming Guide, 2011.
[20] NVIDIA. OpenCL Best Practices Guide, 2011.
[21] NVIDIA. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110, v1.0, January 2013. http://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf.
[22] D. Pflüger, B. Peherstorfer, and H.-J. Bungartz. Spatially Adaptive Sparse Grids for High-dimensional Data-driven Problems. Journal of Complexity, 26(5):508–522, October 2010.
[23] D. Pflüger. Spatially Adaptive Sparse Grids for High-Dimensional Problems. Verlag Dr. Hut, München, August 2010.
[24] C. Reisinger and G. Wittum. Efficient Hierarchical Approximation of High-Dimensional Option Pricing Problems. SIAM Journal on Scientific Computing, 29(1):440–458, 2007.
[25] K. Sato, H. Takizawa, K. Komatsu, and H. Kobayashi. Automatic Tuning of CUDA Execution Parameters for Stencil Processing. Software Automatic Tuning, pages 209–228, 2010.
[26] V. Volkov and J.W. Demmel. Benchmarking GPUs to Tune Dense Linear Algebra. In Proc. of the 2008 ACM/IEEE Conf. on Supercomputing, Austin, TX, November 2008.
[27] G. Widmer, R. Hiptmair, and C. Schwab. Sparse Adaptive Finite Elements for Radiative Transfer. Journal of Computational Physics, 227:6071–6105, June 2008.