Application Guidelines
Accelerating MATLAB Code Using GPUs
Using GPUs to Accelerate Complex Applications
Until recently, graphics processing units (GPUs) were used almost exclusively for 3D graphics rendering. Architectural advances and new software tools have made GPU hardware more amenable to general-purpose programming. As a result, GPUs can now be used to accelerate computationally intensive applications. The GPU's ability to operate on large data sets in parallel makes it more effective than a general-purpose CPU for a range of complex image processing, computational finance, and other applications.
About this Guide
This guide presents code and a procedure for connecting MATLAB® through the MATLAB MEX API to a GPU kernel written using NVIDIA® CUDA™. MEX files are dynamically linked subroutines produced from C, C++, or Fortran source code that, when compiled, can be run from within MATLAB in the same way as MATLAB files or functions. CUDA is a parallel computing architecture that leverages NVIDIA GPUs and can be programmed in a derivative of C.
The solution described here is intended for use on desktop computers. While it might be possible to apply the solution to clusters, that topic is outside the scope of this guide.
This guide uses MATLAB R2009b or later. For information on MATLAB and CUDA integration using R2007b, visit http://developer.nvidia.com/object/matlab_cuda.html
Contents
Applications that Can be Accelerated by GPUs
MEX Files and CUDA
Integrating MATLAB with a NVIDIA CUDA Kernel
Getting Started
Creating a Hybrid MEX/CUDA Source File
Compiling and Running the MEX/CUDA Source File
Applying the Steps to a More Complex Example

Hardware and Software Requirements
• MATLAB R2009b for 32-bit and 64-bit Windows® or Linux®, or 32-bit Mac® OS¹
• A recent NVIDIA graphics card (GeForce 8000 series or later)
• CUDA version 2.3 device driver, toolkit, and SDK (download at www.nvidia.com/object/cuda_get.html)
• A host compiler supported by CUDA, such as Microsoft Visual C++ 2008 on Windows and gcc on Linux and Mac
¹ At the time of writing, CUDA is not available for 64-bit Mac.

Download Code Samples
www.mathworks.com/programs/techkits/cudaWhitePaper.zip
http://www.mathworks.com/programs/techkits/matlab_gpu_conf.html
Applications that Can be Accelerated by GPUs
GPU architecture is fundamentally different from that of a traditional CPU. GPUs are massively parallel and enjoy fast internal memory access, but application (kernel) size is limited. And because a GPU is attached to the host CPU via the PCI-Express bus, moving data between the CPU and the GPU can be a costly operation. For an application to be accelerated by the use of GPUs, it must be:
• Massively task-parallel: It can be broken down into hundreds or thousands of independent units of work.
• Computationally intensive: The time spent on computation exceeds the time spent on reading or writing to memory.
• Limited in kernel size: The kernel code is small enough to fit onto the GPU (generally, no more than a few kilobytes in size).
Applications that do not satisfy these criteria might actually run slower on a GPU than on a CPU.

MEX Files and CUDA
When authoring a MEX file you must include a function named mexFunction(), which can be considered the "main()" function of a MEX plug-in. CUDA source files are given the .cu file extension. They contain both CUDA code (to be executed on the GPU) and C code (to be executed on the host CPU). CUDA source files are compiled with the CUDA compiler, nvcc, provided in the NVIDIA toolkit. To integrate CUDA into MATLAB you create a .cu file that incorporates both the CUDA kernel and the C code that implements the MEX-File API.

Integrating MATLAB with a NVIDIA CUDA Kernel
In the following sections we use a simple timestwo function to illustrate the steps required to create and integrate a CUDA kernel that can be invoked from MATLAB code (timestwo takes a scalar input and doubles it). We then apply these steps to a real-world application and review performance results.
Getting Started
Check that both CUDA and MATLAB are successfully installed and running independently.
For MATLAB: Run mex -setup and select your host compiler if you have not already done so. Confirm that MEX is operational by following the steps in the "Overview of Building the timestwo.c MEX-File" section of the MATLAB documentation (www.mathworks.com/timestwo-mex).
For CUDA: Compile and run one or more of the examples provided in the SDK, such as the "DeviceQuery" example.
Creating a Hybrid MEX/CUDA Source File
Modify the timestwo MEX example file to perform its computation on the GPU using CUDA. The following code incorporates these modifications:²

// nvtimestwo.cu - Adaption of MEX timestwo example for CUDA
// Copyright 2009 The MathWorks, Inc.
#include "mex.h"
#include "cuda.h"
#include "cuda_runtime.h"

// -------------- the kernel runs on the GPU ---------------------------
__global__ void timesTwoKernel( float *d_Operand1, float *d_Result1, int arraySize )
{
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    // Valid indices are 0 .. arraySize-1; surplus threads exit immediately
    if (idx >= arraySize) return;
    d_Result1[idx] = 2.0f * d_Operand1[idx];
}

// ----------------- the MEX driver runs on the CPU --------------------
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    mwSize mrows, ncols;

    /* Check for proper number of arguments. */
    if (nrhs != 1) {
        mexErrMsgTxt("One input required.");
    } else if (nlhs > 1) {
        mexErrMsgTxt("Too many output arguments.");
    }

    /* The input must be a noncomplex scalar single. */
    mrows = mxGetM(prhs[0]);
    ncols = mxGetN(prhs[0]);
    if ( mxGetClassID(prhs[0]) != mxSINGLE_CLASS || mxIsComplex(prhs[0]) ||
         !(mrows==1 && ncols==1) ) {
        mexErrMsgTxt("Input must be a noncomplex scalar single.");
    }

    /* Create matrix for the return argument. */
    plhs[0] = mxCreateNumericMatrix(mrows, ncols, mxSINGLE_CLASS, mxREAL);

    /* Allocate device buffers and copy the input to the GPU */
    int arraySize = mrows * ncols;
    int memSize = sizeof(float) * arraySize;
    float *d_Operand1, *d_Result1;
    if ( cudaMalloc( &d_Operand1, memSize ) != cudaSuccess )
        mexErrMsgTxt("Memory allocating failure on the GPU.");
    if ( cudaMalloc( &d_Result1, memSize ) != cudaSuccess )
        mexErrMsgTxt("Memory allocating failure on the GPU.");
    cudaMemcpy( d_Operand1, (float*) mxGetData(prhs[0]), memSize, cudaMemcpyHostToDevice);

    /* Run the kernel */
    int blockSize = 1;
    dim3 block(blockSize);
    dim3 grid(ceil(arraySize/(float)blockSize));
    timesTwoKernel<<<grid, block>>>( d_Operand1, d_Result1, arraySize);

    /* Get results back from the GPU and free device memory */
    cudaMemcpy( (float*) mxGetData(plhs[0]), d_Result1, memSize, cudaMemcpyDeviceToHost);
    cudaFree( d_Operand1 );
    cudaFree( d_Result1 );
}

² Tip: If you will be editing .cu files in the MATLAB Editor, add "cu" as one of the C language file extensions (File->Preferences->Editor->Language->C/C++). This turns on syntax coloring and smart indenting for files with a .cu file extension.
The most significant code changes were the following:
• The MEX file now operates on the MATLAB single data type. This modification is desirable because early CUDA devices do not support double-precision floating point. In addition, single-precision floating point significantly outperforms double-precision floating point on current CUDA devices.
• The GPU is initialized, and memory buffers are created.
• The timestwo function is replaced with a CUDA kernel, timesTwoKernel.
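The index arithmetic that maps GPU threads to array elements can be checked on the host without a GPU. The following is a minimal plain-C sketch, not part of the original code samples: grid_size_for and emulate_times_two are hypothetical names, and the nested loops only imitate the order-independent work that blocks and threads would do in parallel. It shows why the grid size is rounded up and why threads whose index is at or beyond the array length must exit.

```c
#include <assert.h>

/* Number of blocks needed so that grid_size * block_size >= array_size.
   Integer equivalent of ceil(arraySize / (float)blockSize). */
static int grid_size_for(int array_size, int block_size)
{
    assert(array_size >= 0 && block_size > 0);
    return (array_size + block_size - 1) / block_size;
}

/* Host-side emulation of a one-dimensional kernel launch (assumption:
   sequential loops stand in for the parallel blocks and threads).
   Returns the number of emulated threads that processed an element. */
static int emulate_times_two(const float *operand, float *result,
                             int array_size, int block_size)
{
    int grid_size = grid_size_for(array_size, block_size);
    int active = 0;

    for (int block_idx = 0; block_idx < grid_size; ++block_idx) {
        for (int thread_idx = 0; thread_idx < block_size; ++thread_idx) {
            /* Same formula as blockDim.x * blockIdx.x + threadIdx.x */
            int idx = block_size * block_idx + thread_idx;

            /* Valid indices are 0 .. array_size-1, so the guard is >= */
            if (idx >= array_size)
                continue;

            result[idx] = 2.0f * operand[idx];
            ++active;
        }
    }
    return active;
}
```

With 5 elements and 4 threads per block, for example, two blocks (8 emulated threads) are started, 5 of them write a result, and the remaining 3 skip the body because of the bounds guard.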
Compiling and Running the MEX/CUDA Source File
In a typical MATLAB workflow, MEX file compilation is handled by the MATLAB command mex, which compiles and links the C/MEX file. As of R2009b, however, MATLAB does not support the .cu file extension or the CUDA compiler nvcc. To prepare a file for use in MATLAB, do the following:
1. Shell to the NVIDIA compiler to create the host object file.
2. Invoke mex with the object file, not the source file, as the parameter.
A simple MATLAB program can manage both tasks. Note that path names might need to change depending on where the CUDA SDK and host compiler are installed on your computer:

function nvmex(cuFileName)
%NVMEX Compiles and links a CUDA file for MATLAB usage
%   NVMEX(FILENAME) will create a MEX-File (also with the name FILENAME) by
%   invoking the CUDA compiler, nvcc, and then linking with the MEX
%   function in MATLAB.

% Copyright 2009 The MathWorks, Inc.

% !!! Modify the paths below to fit your own installation !!!
if ispc % Windows
    CUDA_LIB_Location = 'C:\CUDA\lib';
    Host_Compiler_Location = '-ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin"';
    PIC_Option = '';
else % Mac and Linux (assuming gcc is on the path)
    CUDA_LIB_Location = '/usr/local/cuda/lib64';
    Host_Compiler_Location = '';
    PIC_Option = ' --compiler-options -fPIC ';
end
% !!! End of things to modify !!!

[~, filename] = fileparts(cuFileName);

% Step 1: nvcc compiles the .cu file into a host object file
nvccCommandLine = [ ...
    'nvcc --compile ' cuFileName ' ' Host_Compiler_Location ' ' ...
    ' -o ' filename '.o ' ...
    PIC_Option ...
    ' -I' matlabroot '/extern/include ' ...
    ];

% Step 2: mex links the object file against the CUDA runtime library
mexCommandLine = ['mex (''' filename '.o'', ''-L' CUDA_LIB_Location ''', ''-lcudart'')'];

disp(nvccCommandLine);
status = system(nvccCommandLine);
if status ~= 0 % system returns a nonzero status when the command fails
    error('Error invoking nvcc');
end

disp(mexCommandLine);
eval(mexCommandLine);

end
Invoke the nvmex function within MATLAB to create the MEX file:³
    nvmex('nvtimestwo.cu')
The MEX function is now available for use. Since it takes single-precision data as inputs, be sure to pass a value of type single:
    nvtimestwo(single(5))
The MEX/CUDA function can now be tested against the original (CPU) MEX file:
    nvtimestwo(single(5)) == single(timestwo(5))
Setup time and the overhead of moving data to and from the GPU slow down performance of the simple example described in the previous sections. However, a more complex example, such as calculating a power series, can greatly benefit from GPU acceleration.

Applying the Procedure to a More Complex Example
Suppose we want to compute the value of e^x using a power series for a vector of x values:

$$e^x = 1 + \frac{x}{1!} + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots$$

Truncating the series after the x^9/9! term and applying Horner's rule, the MATLAB implementation is as follows:

function a = powerSeries(x)
%POWERSERIES Computes e^x
%   A = POWERSERIES(X) returns the value of e^X

% Copyright 2009 The MathWorks, Inc.
a = 1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x.*(1 + x...
    /9)/8)/7)/6)/5)/4)/3)/2);
end
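To make the evaluation order of the nested expression explicit, the same truncated series can be written as a loop that works from the innermost term outward. The following is a minimal plain-C sketch rather than code from the original samples; power_series_exp and power_series_exp_array are hypothetical names, and the degree of 9 is taken from the MATLAB function above.

```c
#include <assert.h>

/* Degree-9 truncation of the e^x power series in Horner form:
   1 + x*(1 + x*(1 + ... x*(1 + x/9)/8 ... )/3)/2)
   Each pass wraps the running value in one more nesting level. */
static float power_series_exp(float x)
{
    float acc = 1.0f;                    /* value inside the innermost level */
    for (int k = 9; k >= 1; --k)
        acc = 1.0f + x * acc / (float)k; /* k = 9 gives (1 + x/9), and so on */
    return acc;
}

/* Element-wise version for an array of n inputs, the CPU counterpart of
   running one GPU thread per element. */
static void power_series_exp_array(const float *x, float *result, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i)
        result[i] = power_series_exp(x[i]);
}
```

For inputs between 0 and 1, such as those produced by rand, the first omitted term x^10/10! is at most 1/10!, roughly 2.8e-7, which is close to single-precision resolution for results in this range.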
The MEX/CUDA implementation is

// nvPowerSeries.cu - Calculate e^x using CUDA
// Copyright 2009 The MathWorks, Inc.
#include "mex.h"
#include "cuda.h"
#include "cuda_runtime.h"

// -------------- the kernel runs on the GPU ---------------------------
__global__ void powerSeriesKernel( float *d_Operand1, float *d_Result1, int arraySize )
{
    // Which index in the array is this thread going to operate on? This is
    // defined by my thread index and my block index.
    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    // There may be more threads than array elements, so if this
    // happens to be a thread that doesn't have an array element just exit
    if (idx >= arraySize) return;

    d_Result1[idx] =
        1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]*
        (1 + d_Operand1[idx]/9)/8)/7)/6)/5)/4)/3)/2);
}

// ----------------- the MEX driver runs on the CPU --------------------
void mexFunction( int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[] )
{
    mwSize m = mxGetM(prhs[0]);
    mwSize n = mxGetN(prhs[0]);
    mwSize arraySize = m * n;
    // Get the total memory needed for an array on the GPU (in bytes)
    size_t mem_size = sizeof(float) * arraySize;

    float *d_Operand1;
    float *d_Result1;

    // Allocate memory on the GPU to hold the input and output data
    if ( cudaMalloc( &d_Operand1, mem_size ) != cudaSuccess )
        mexErrMsgTxt("Memory allocating failure on the GPU.");
    if ( cudaMalloc( &d_Result1, mem_size ) != cudaSuccess )
        mexErrMsgTxt("Memory allocating failure on the GPU.");

    // Copy the input data across to the card
    cudaMemcpy( d_Operand1, (float*) mxGetData(prhs[0]), mem_size, cudaMemcpyHostToDevice);

    // Define the block and grid size - this will have 512 threads per
    // thread block and a sufficient grid so that there is a GPU thread per
    // element of the input array.
    int blockSize = 512;
    dim3 block(blockSize);
    dim3 grid(ceil(arraySize/(float)blockSize));

    // Use the CUDA runtime to run the kernel
    powerSeriesKernel<<<grid, block>>>( d_Operand1, d_Result1, arraySize);

    // Create the output array for MATLAB
    plhs[0] = mxCreateNumericMatrix(m, n, mxSINGLE_CLASS, mxREAL);

    // Copy the data from the card into the MATLAB array
    cudaMemcpy((float*)mxGetData(plhs[0]), d_Result1, mem_size, cudaMemcpyDeviceToHost);

    // Free the data on the card
    cudaFree( d_Operand1 );
    cudaFree( d_Result1 );
}

³ On Windows, warnings about a missing manifest can be ignored.
We compile and run the MEX implementation using the nvmex function
    nvmex('nvPowerSeries.cu');
and then test the CUDA implementation.
    A = rand(1000,1,'single');
    rCPU = powerSeries(A);
    scatter(A,rCPU);
    rGPU = nvPowerSeries(A);
    all(rGPU == rCPU)
It is likely that the call to all will return 0, indicating that the computational results are not identical. These differences are typically small and in most cases acceptable, but programmers concerned with numerical consistency should be aware of them.
We can experiment with different problem sizes to determine when a problem is substantial enough to benefit from the GPU:

% Compare Host and GPU power series performance
% Performs a power series computation (calculating e^x) on both the CPU
% and NVIDIA GPU for a number of different problem sizes, then plots
% results.
% Copyright 2009 The MathWorks, Inc.

testSizes = 100000:10000:2000000;
results = zeros(length(testSizes),2);
for i=1:length(testSizes)
    A = rand(testSizes(i),1,'single');
    tic; powerSeries(A); results(i,1) = toc;
    tic; nvPowerSeries(A); results(i,2) = toc;
end
scatter(testSizes, results(:,1), 'bo');
hold on
scatter(testSizes, results(:,2), 'rx');
hold off
title('Time to Execute');
xlabel('length')
ylabel('time')
legend('CPU','GPU','Location','NorthWest')
The resulting plot⁴ demonstrates a tenfold performance difference between the CUDA/GPU and CPU implementations (Figure 1). Note that the difference in performance increases with the problem size, and that for very small problem sizes (shown on the extreme left of the plot), the implementation that uses the GPU is slower than the CPU-only implementation.
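Regarding the exact-equality test all(rGPU == rCPU) used earlier in this section: when small floating-point differences are acceptable, one common alternative is to compare the two result arrays against a tolerance. The following is a minimal plain-C sketch of that idea, not part of the original code samples; max_abs_diff and all_within are hypothetical names, and a suitable tolerance is an assumption that depends on the application.

```c
#include <assert.h>

/* Largest absolute element-wise difference between two arrays of length n. */
static float max_abs_diff(const float *a, const float *b, int n)
{
    assert(n >= 0);
    float worst = 0.0f;
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;             /* absolute value without needing libm */
        if (d > worst)
            worst = d;
    }
    return worst;
}

/* Returns 1 when every pair of elements agrees to within tol, else 0.
   With tol = 0 this reduces to the exact-equality test. */
static int all_within(const float *a, const float *b, int n, float tol)
{
    return max_abs_diff(a, b, n) <= tol;
}
```

Reporting the largest difference alongside the pass/fail result makes it easier to judge whether a mismatch is ordinary rounding noise or a genuine error in the kernel.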
For More Information
Ken Atwell ([email protected])
Arjav Chakravarti ([email protected])
⁴ If you see an extreme outlier that "squishes" the chart plot, rerunning the experiment is recommended.
Figure 1. Performance comparison between the CUDA/GPU and CPU implementations.
Resources
Visit www.mathworks.com
Technical Support www.mathworks.com/support
Online User Community www.mathworks.com/matlabcentral
Demos www.mathworks.com/demos
Training Services www.mathworks.com/training
Third-Party Products and Services www.mathworks.com/connections
Worldwide Contacts www.mathworks.com/contact
E-mail [email protected]
© 2010 The MathWorks, Inc. MATLAB and Simulink are registered trademarks of The MathWorks, Inc. See www.mathworks.com/trademarks for a list of additional trademarks. Other product or brand names may be trademarks or registered trademarks of their respective holders.
91804v01 01/10