NQueens on CUDA: Optimization Issues
Frank Feinbube, Bernhard Rabe, Martin von Löwis, Andreas Polze
Hasso Plattner Institute at the University of Potsdam
Prof.-Dr.-Helmert-Str. 2-3, 14482 Potsdam, Germany
{frank.feinbube, bernhard.rabe, martin.vonloewis, andreas.polze}@hpi.uni-potsdam.de
Abstract—Today's commercial off-the-shelf computer systems are multicore computing systems that combine a CPU, a graphics processor (GPU), and custom devices. In comparison with CPU cores, graphics cards are capable of executing hundreds up to thousands of compute units in parallel. To benefit from these GPU computing resources, applications have to be parallelized and adapted to the target architecture. In this paper we report our experience in implementing a solver for the NQueens puzzle on GPUs using Nvidia's CUDA (Compute Unified Device Architecture) technology. Using the example of memory usage and memory access, we demonstrate that optimizations of CUDA programs may have contrary effects on different CUDA architectures. The evaluation results point out that it is not sufficient to use new programming languages or compilers to achieve the best results with emerging graphics card computing.

Keywords—GPGPU; memory access trade-off
I. INTRODUCTION

A recent trend in microprocessor design is the increase in the number of processing cores on a single processor chip; the resulting products are often called multicore or manycore processors. This trend originates from the desire to utilize the increased number of transistors which can be accommodated on a single chip, following the prediction of Moore's law. Other strategies for utilizing more transistors, such as pipelined and superscalar execution, have mostly been exhausted, leaving the integration of many computing cores as the major strategy to provide an ongoing increase of computing power. This trend can be seen both in central processors and in graphics processors.

For some years, graphics cards have been used not only to render pictures to screens but also for numerical processing. In these applications, shader languages or vendor-specific languages like AMD Brook+ [1], AMD Cal [1] or Nvidia Cg [2] were used. Current frameworks like Nvidia CUDA [3] and the AMD Stream Computing SDK [4] are based on the C programming language with a few extensions and have a general purpose nature. The next step will be the application of the emerging OpenCL [5], [6] programming framework, which allows writing programs that use either the CPU or the GPU as the underlying processing device. The OpenCL implementations that were available at the time of writing were not fully functional: ATI provided an implementation that was only capable of using CPUs, while Nvidia only supported the use of their CUDA-enabled graphics cards. Therefore we focus on CUDA, a well established "scalable parallel programming model and a software environment for parallel computing" [7].

The main contribution of this paper is a compilation of issues that we encountered when solving the NQueens problem using the CUDA framework. We first present the NQueens problem and the primary concepts of CUDA programming, report related work, and then report our own experiences with CUDA.
The final evaluation shows that the performance impact of code changes can vary heavily between different CUDA architectures. This makes it very difficult to optimize the runtime of programs for CUDA-enabled graphics cards.

II. THE NQUEENS PROBLEM

In order to gain a deep understanding of the CUDA programming model, we chose an actively researched problem that is complex enough to demonstrate the abilities and limits of programming for GPUs. The n queens puzzle fulfills these requirements. Its goal is to find all ways to place a given number n of queens on a chessboard of n times n fields. A configuration is only valid if no queen attacks another one. This holds if no queen is placed in a row, column or diagonal that is used by another queen.
Figure 1. One of the 92 solutions for the 8 queens problem.
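The validity condition can be made concrete with a small check (our illustration, not part of the paper's solver): since every row holds exactly one queen, two queens attack each other exactly if they share a column or a diagonal.

/* Illustration (not from the paper): queens at (row1, col1) and
   (row2, col2), one queen per row, attack each other iff they
   share a column or a diagonal. */
int attacks(int row1, int col1, int row2, int col2)
{
    return col1 == col2
        || row1 - row2 == col1 - col2   /* same positive diagonal */
        || row1 - row2 == col2 - col1;  /* same negative diagonal */
}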
For a regular chess board of 8 by 8 fields, there are 92 possible configurations of 8 queens such that they do not attack each other. Because the problem's state space grows factorially with the number of queens, the solutions of the puzzle are known today only up to 26 queens. Preußer et al. from TU Dresden [8] calculated that there are 22,317,699,616,364,044 valid configurations and published their results on July 11th, 2009 [9].

III. RELATED WORK

Queens@TUD (introduced in the previous section) utilized specialized FPGA (field-programmable gate array) boards for the calculations. The overall problem was subdivided into subproblems that were calculated in parallel. It took 270 days to calculate the solution for N = 26. This is particularly impressive if we take into account that only 26 Altera and Xilinx boards were used and that each of the boards was running at a very low clock frequency of 90 to 180 MHz. The great performance was achieved by minimizing the instruction set overhead.
This example also demonstrates that CPUs are not the only answer for high-performance parallel computing. Specialized hardware like FPGAs or graphics cards can be a better approach for some problem classes.

The NQueens@Home project hosted at the Universidad de Concepción, Chile [10] is a massively parallel distributed system with constantly varying participants. A middleware provides and distributes work packages over the Internet. The actual calculations are done by the personal computers of registered users. In contrast to similar systems, NQueens@Home does not exploit graphics cards for the calculations.

Both projects use a heavily optimized sequential algorithm written in C by J. Somers [11], which we took as the basis for our algorithm as well. It is a backtracking algorithm and works as follows. For a new queen it uses a bit-mask to select a free place. After placing the queen it updates three lists of masks: one stores all occupied columns, the second all occupied positive diagonals (positive meaning facing from the upper left corner to the bottom right corner), and the last one all occupied negative diagonals. For the next row, the positive diagonal is shifted to the right and the negative one to the left. After that, the placing bit-mask is recalculated and a new queen is set. If no place is available or all queens have been set, the algorithm removes the latest entries and sets the latest queen to a different position. In case all queens have been set, the solution counter is increased by one.

In [12] the general purpose graphics programming language Nvidia Cg is used to solve the NQueens problem. The authors evaluated their solution on a Nvidia GeForce 6800 Ultra graphics card and compared the results with a Pentium M 2.00 GHz processor. The processor dramatically outperformed the graphics card. [13] discusses different implementations of NQueens solvers and evaluates selected approaches. For that, a Nvidia GeForce 9600 GT running CUDA 1.0 implementations and an Intel Quad Core 2.4 GHz running C++ implementations were used. This evaluation also shows that general purpose usage of graphics cards is slow. The authors even state that CUDA is slower than Cg. Both publications study only a single graphics card architecture. We will show that there are significant differences in the way graphics cards with different CUDA architectures run the same code.
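The mask bookkeeping described above can be sketched as follows (a sketch of ours; the variable names are illustrative, not Somers' code):

/* Sketch of the bookkeeping described above (illustrative names).
   Bits set in freePlaces mark fields of the current row that are
   not attacked by any queen placed so far. */
unsigned int freePlaces = boardMask & ~(cols | posDiags | negDiags);
unsigned int lsb = freePlaces & (unsigned int)(-(int)freePlaces); /* lowest free place */
/* masks for the next row: the positive diagonal is shifted to the
   right, the negative one to the left */
unsigned int colsNext = cols | lsb;
unsigned int posNext  = (posDiags | lsb) >> 1;
unsigned int negNext  = (negDiags | lsb) << 1;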
IV. THE CUDA-PROGRAMMING MODEL

Nvidia CUDA [3] is a well established "scalable parallel programming model and a software environment for parallel computing" [7]. It allows writing programs that run on general purpose graphics processors (GPGPU) by Nvidia. Due to the hardware architecture of these devices, many complex computational problems, such as physical computations and video processing, can be solved much faster than on current CPUs. Nowadays, development for CUDA is done using a set of C extensions. The code is precompiled with the nvcc tool provided by Nvidia and eventually compiled into binaries that access CUDA-enabled graphics drivers. Because CUDA works with all modern graphics cards from Nvidia, even Nvidia ION [14], it is widely used in current computers. This makes it particularly interesting for research on parallel computing.

The CUDA programming model aims to enable programmers to develop parallel algorithms that scale to hundreds of cores using thousands of threads. CUDA developers should not need to think about the mechanics of a parallel programming language, but should be enabled to employ the CPU and the GPU in parallel [7]. Listing 1 shows the regular way to use CUDA. At first, memory is allocated on the host as well as on the device. Then the input parameters for the CUDA program are copied to the device. This is necessary because CUDA code cannot access host memory. The parts of the application that run on the graphics card are called kernels. A kernel is executed by a number of thread blocks that consist of a number of lightweight CUDA threads. The size of a thread block and the number of blocks can be configured at kernel launch time. The selection of these values is a non-trivial problem that is beyond the scope of this paper. CUDA threads that reside in the same block can communicate using slow device memory or fast shared memory. In addition, they can be synchronized. The amount of shared memory per block can also be configured at kernel launch time. While the kernel is running, the CPU is free to process additional workloads. In this example it simply waits until the calculations are finished, using the cudaThreadSynchronize statement. The final step is to load the result data from the graphics card's memory back to host memory.
// 1. allocate memory on the graphics card
cudaMalloc((void **)&pDataGpu, memSize);
// 2. copy to device memory of the graphics card
cudaMemcpy(pDataGpu, pData, memSize, cudaMemcpyHostToDevice);
// 3. run the calculation kernel
kernel<<<numBlocks, threadsPerBlock>>>(pDataGpu);
// 4. wait for the graphics card threads to finish calculations
cudaThreadSynchronize();
// 5. copy from device memory of the graphics card back to host memory
cudaMemcpy(pData, pDataGpu, memSize, cudaMemcpyDeviceToHost);

Listing 1. CUDA usage model
Each kernel fulfills the same tasks, as listed in Listing 2. At first, each thread calculates its unique thread identifier using constructs provided by the CUDA environment (blockIdx, blockDim, threadIdx). This identifier can be used to make control decisions and to compute memory addresses for accessing input parameters. Then the actual calculations take place. In the last step of a CUDA kernel, the calculated result is written back to the device memory, which can be accessed by the host program afterwards.
// 1. derive thread id from block id and relative thread id
int threadId = blockIdx.x * blockDim.x + threadIdx.x;
// 2. read from array (using thread id as index)
int parameter = dataGpu[threadId];
// 3. calculate the result
int result = parameter * parameter;
// 4. write to array (using thread id as index)
dataGpu[threadId] = result;

Listing 2. CUDA kernel model
There is a strict separation of kernels and normal program routines. Kernel methods must not be recursive and must not use static variables. On future CUDA-enabled architectures, atomic double-precision floating point operations will be supported via hardware, but at the moment only single-precision floating point operations are feasible. Using double-precision floating point anyway results in slow performance due to emulation with single-precision operations.

The CUDA memory model is shown in Table I. Similar to the memory of the host, every CUDA device has its own memory. This so-called device memory or global memory is the biggest memory region on the graphics device. It is off-chip, therefore relatively slow, and can be accessed both by the host and by kernels. In addition, it is persistent across kernel launches. The other extreme are the very fast registers of a thread. There are 16k registers (cards from the GeForce 8 and 9 series only had 8k) for all threads on a device. Taking a kernel that is executed with 32 blocks and 128 threads per block, only 4 registers are available for each thread. Local variables that do not fit into registers reside in so-called local memory. This memory is mapped into the off-chip device memory and is therefore very slow compared to registers. In addition, every thread block has a small amount (16 KB) of on-chip shared memory. That memory can be used either for communication and synchronization between threads or as storage for local variables. Because it is so restricted, only a small amount of shared memory can be used per thread when the thread count is large. In some cases it pays off to decrease the thread count and to use a greater amount of the fast shared memory for each thread. There are also some special cached read-only memories called texture and constant memory.

memory type       scope        host access   kernel access   speed           size
host memory       per host     R/W           None            medium          big
device memory     per device   R/W           R/W             medium          big
registers         per thread   None          R/W             very fast       very small
local memory      per thread   None          R/W             medium          medium
shared memory     per block    None          R/W             fast            small
constant memory   per device   R/W           R               fast (cached)   medium
texture memory    per device   R/W           R               fast (cached)   medium

Table I. CUDA memory model [3]
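The launch-time configuration of shared memory mentioned above can be sketched as follows (our sketch, not from the paper; the kernel reuses the square example of Listing 2):

// Sketch (ours): dynamically sized shared memory, configured at launch time.
__global__ void squareKernel(int *dataGpu)
{
    extern __shared__ int scratch[];  /* size in bytes set at launch time */
    int threadId = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = dataGpu[threadId];
    __syncthreads();                  /* synchronize the threads of this block */
    dataGpu[threadId] = scratch[threadIdx.x] * scratch[threadIdx.x];
}

/* launched with 32 blocks, 128 threads per block, and
   128 * sizeof(int) bytes of shared memory per block */
squareKernel<<<32, 128, 128 * sizeof(int)>>>(pDataGpu);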
V. APPROACH
After investigating the problem and different proofs of concept, we chose the promising algorithm of J. Somers [11] (see Section III) as the basis for our CUDA solution. This decision was founded on the fact that this algorithm does not use recursion, consumes little memory, and is used by several other implementations, especially the very successful ones [8], [10]. This NQueens algorithm is a backtracking algorithm. The first step is to place a queen in the first row and remember the occupied fields of the next row. The next queen is then placed in the second row, but cannot sit in the same column or diagonal as the first queen. Now we derive the free columns in the third row and place a queen on a free place. This procedure is repeated until there is no free field left. If we filled all rows, we have a solution; if not, we can exclude a set of configurations. In both cases we backtrack to the previous row, place its queen on the next free field, and go on with the procedure. Because flipping a board setting over the y axis creates a new unique solution, it is sufficient to calculate only one half of the solutions. For odd board sizes, a special handling of the middle column is needed, because it is mapped onto itself by the flip.
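A sketch of this counting scheme (the helper name is ours, purely illustrative): solutions whose first queen sits in the left half of the board are doubled, while solutions whose first queen sits in the middle column of an odd board are counted once.

/* Sketch of the mirror-symmetry trick (helper name is illustrative). */
long long countAllSolutions(int n)
{
    /* first-row queen in columns 0 .. n/2-1, doubled by the y-axis flip */
    long long total = 2 * countWithFirstQueenInColumns(n, 0, n / 2);
    if (n % 2 == 1)
        /* middle column of an odd board: its mirror image is itself */
        total += countWithFirstQueenInColumns(n, n / 2, n / 2 + 1);
    return total;
}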
A. Parallelization

The first and most important step in using CUDA for the calculations was to parallelize Somers' algorithm. Therefore it was modified in a way that allows the precalculation of all board settings for a given number of rows. Figure 2 shows such a precalculated board setting for two queens. These boards are used as input for the algorithm, which calculates all solutions starting from the given setting. Applied to CUDA, the first algorithm runs on the host and the second one as a kernel on the graphics card.

Figure 2. A precalculated board setting as a startup configuration for our parallel backtracking algorithm. The shaded area is occupied by the queens that are already placed on the board. The rest of the board is the solution space where the remaining 6 queens have to be placed by the worker thread.

The data layout of a precalculated board setting is shown in Figure 3. The data is represented as a single-dimensional array. Every thread has a chunk within this array that contains information about the precalculated rows. If we use 4 threads and have 3 precalculated rows per thread, the layout looks exactly like in the figure. Every row setting is represented by a bitfield where the free places are marked by 1s and the occupied ones by 0s.
Figure 3. Data layout of precalculated board settings. In this example we have 4 threads and 3 precalculated rows per thread.
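The chunk addressing implied by this layout might look as follows (a sketch with illustrative names, not the paper's code):

/* Sketch (illustrative names): each thread reads its own chunk of
   precalculated row bitfields from the flat input array. */
int chunkStart = threadId * rowsPerThread;  /* e.g. 3 rows per thread */
for (int row = 0; row < rowsPerThread; ++row) {
    unsigned int rowBitfield = dataGpu[chunkStart + row]; /* 1 = free, 0 = occupied */
    /* restore the occupied-column and diagonal masks from the bitfield ... */
}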
Listing 3 shows the host routine that calls the CUDA kernel. In the initData function we calculate the initial board settings using a given board size and a row count (called depth in the listing). Thereby we also determine the thread count, which is equal to the number of precalculated board settings. Next we allocate memory on the graphics card and fill it with our data. After that we start the kernel, wait for it to finish, and finally sum up the solutions of the parallel kernel runs. Each thread stores its result at the first position of its input data array chunk.
int threadCount, mem_size = 0;
int *data_gpu;
int *data = initData(boardsize, depth, &threadCount, &mem_size);
int blockSize = threadCount / threadsPerBlock + 1;

cudaMalloc((void **)&data_gpu, mem_size);
cudaMemcpy(data_gpu, data, mem_size, cudaMemcpyHostToDevice);

NqueensCUDA<<<blockSize, threadsPerBlock>>>(boardsize, threadCount, depth, data_gpu);
cudaThreadSynchronize();

cudaMemcpy(data, data_gpu, mem_size, cudaMemcpyDeviceToHost);

Listing 3. Host routine that calls the CUDA kernel
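Summing up the per-thread results on the host could then look like this (a sketch of ours; the chunk stride of depth entries is an assumption based on the layout in Figure 3):

/* Sketch: sum the per-thread solution counts; each thread wrote its
   count to the first slot of its chunk (stride assumed to be depth). */
long long solutions = 0;
for (int t = 0; t < threadCount; ++t)
    solutions += data[t * depth];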
The kernel itself adapts Somers' bit-mask algorithm (see Section III). The following excerpt shows the setup of the masks and the queen-placement step:

int aStack[MAX_BOARDSIZE + 2];
register int nStack;
register int numrows = 0;
register unsigned int lsb;          /* least sig. bit */
/* bits which are set mark free positions */
register unsigned int bitfield = 0;
int board_minus = boardsize - 1;
/* if boardsize is N, mask consists of N 1's */
register int mask = (1 << boardsize) - 1;

int half = boardsize >> 1;          /* divide by two */
/* fill in rightmost 1's in bitfield for half of board size */
bitfield = (1 << half) - 1;

/* ... */

/* place a queen on the least significant free position and derive
   the occupation masks of the next row; n is the index of the
   previous row (assumes a two's complement architecture) */
lsb = -((signed)bitfield) & bitfield;
qBitCol[numrows] = qBitCol[n] | lsb;
qBitPosDiag[numrows] = (qBitPosDiag[n] | lsb) >> 1;
qBitNegDiag[numrows] = (qBitNegDiag[n] | lsb) << 1;