USING GRAPHICS PROCESSORS FOR A HIGH PERFORMANCE NORMALIZATION OF GENE EXPRESSIONS

Andrés Rodríguez, Oswaldo Trelles, Manuel Ujaldón
Computer Architecture Department, University of Malaga, Malaga, Spain
Email: {andres,ots,ujaldon}@uma.es

(This work was supported by the Junta de Andalucía of Spain, under Project of Excellence P06-TIC-02109.)

ABSTRACT

With the arrival of CUDA, Graphics Processors (GPUs) have unveiled an extraordinary power to accelerate data-intensive general-purpose computing, more so as time goes by. In parallel with this trend, the huge volume of molecular data produced by current high-throughput technologies in modern molecular biology has increased at a similar pace, challenging our capacity to process and understand those data. This work takes these two emerging trends and benefits from them side by side during the development of a high-performance version of a biomedical application of growing popularity: the quantile normalization of gene expressions for high-density oligonucleotide array data. A variety of experimental issues are analyzed, including cost, performance and scalability of the graphics architecture. Our study reveals advantages and drawbacks of using the GPU as the target platform, providing lessons that may benefit a broad set of existing bioinformatics applications, either those built on the same pillars or those with similar procedures. Our ultimate goal is to provide mechanisms for effective task scheduling, data partitioning and GPU mapping of the developed biomedical routines within an open-source parallel library which will be freely available for bindings with Perl, Python, Ruby and R through Biolib [1].

Keywords: Graphics Processing Units (GPUs), CUDA Programming, High Performance Computing, Bioinformatics, Normalization of Gene Expressions.

1. INTRODUCTION

We are witnessing the consolidation of GPUs in parallel computing. Represented by solid commercial products, GPUs provide a low-cost road towards high performance whose scope spreads beyond the graphics territory. Shader programming transformed vertex and pixel processors into configurable units able to compute very diverse problems, and unified shaders have strengthened this functionality while overcoming earlier constraints. With the advent of CUDA and, more recently, OpenCL, GPUs are developing an extraordinary power to accelerate data-intensive general-purpose computing with remarkable scalability.
Nowadays, a natural scientific field in which to deploy this computational force is bioinformatics. The Human Genome Project has brought to the foreground of parallel computing a broad spectrum of data-intensive biomedical applications where molecular biology and computer science join in a happy alliance between demanding software and powerful hardware. Since then, the bioinformatics community has generated computational solutions to support genomic and post-genomic research [2] in many subfields such as gene structure prediction, phylogenetic trees, protein docking and sequence alignment, just to mention a few.
In this work, we address a high-performance implementation of Q-norm [3], a quantile-based normalization method for high-density oligonucleotide array data based on variance and bias. Q-norm is an increasingly popular method for a fast and easy-to-understand normalization of multiple gene-expression datasets under the assumption that they share a common distribution. When running genetic experiments that involve multiple high-density oligonucleotide arrays, it becomes crucial to remove sources of variation of non-biological origin between samples. Normalization is a process for reducing this variation.
The rest of the paper is organized as follows. Section 2 describes the Q-norm algorithm and its input data set. Section 3 outlines the CUDA programming model and the hardware architecture of the GPU. Section 4 explains the parallelization of Q-norm on GPUs. Section 5 presents and discusses the performance numbers obtained when running our parallel Q-norm implementation on GPUs. Finally, Sections 6 and 7 conclude and provide some insights for future work.
2. Q-NORM: AN ALGORITHM FOR NORMALIZING OLIGONUCLEOTIDES

2.1. The input data set

The high computational cost and memory requirements of the Q-norm method derive from its huge input source, typically a matrix X composed of p > 6 million gene expression values and N > 1000 samples on a regular basis. N spreads over the matrix columns and p along its rows, so that a single matrix element, X[i,j], indicates the intensity of the i-th gene expression value in the j-th sample. The values of X are positive integers extracted by the high-density oligonucleotide microarray technology provided by the Affymetrix GeneChip infrastructure [4], widely used in many areas of biomedical research. Those integers are the target numbers to normalize, usually by means of some kind of average over every array element placed in the same quantile [5, 6]. In our experimental case, the actual numbers are p = 6,553,600 gene expression values and N = 470 samples. The input dataset was taken from the GEO (Gene Expression Omnibus) Web repository [7] as submitted by Affymetrix under the GeneChip Human Mapping 500K Array Set (platform GPL3718).

2.2. The algorithm

The ultimate goal of the Q-norm method is to make the distribution of probe intensities the same for each array in a set of samples [3], under the assumption that there is an underlying common distribution for all those samples. A Q-Q plot can be used to check whether two datasets come from the same distribution by verifying that their quantiles line up on the diagonal. This suggests that one could give two disparate datasets the same distribution by transforming the quantiles of each dataset to have the same value, namely their average value. Extending this idea to N dimensions gives us a method for finding a common distribution from multiple data vectors. Let q_k = (q_{k,1}, ..., q_{k,N}) for k = 1, ..., p be the vector of the k-th quantiles of all N array samples of length p which compose the matrix X of dimensions p × N, where each sample is a column. The quantile normalization goes as follows:

1. Sort each column of X to give X_sort.
2. Take the means across the rows of X_sort and assign this mean to each element in the row to get X'_sort.
3. Produce X_norm as output by rearranging each column of X'_sort to have the same ordering as the original X.

This method forces the values of the quantiles to be equal, which may cause problems in the tails, where it is possible for a probe to have the same value across all the array samples. However, this case is unrealistic, since probeset expression measures are typically computed using the value of multiple probes.
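For reference, the three steps above fit in a few lines of sequential C. The sketch below is ours and only for illustration (the function name, the Cell structure and the column-wise layout are our own choices, not part of the original Q-norm code):

    #include <stdlib.h>

    /* Illustrative sequential sketch of the three Q-norm steps (ours, not
     * the paper's GPU code). X[j][i] is gene i of sample (column) j.      */
    typedef struct { int value; int row; } Cell;

    static int cmpCell(const void *a, const void *b)
    {
        int va = ((const Cell *)a)->value, vb = ((const Cell *)b)->value;
        return (va > vb) - (va < vb);
    }

    void qnorm(int **X, double **Xnorm, int p, int N)
    {
        Cell *col = malloc(p * sizeof(Cell));
        double *rowMean = calloc(p, sizeof(double));
        int **rank = malloc(N * sizeof(int *));  /* rank[j][k]: original row
                                                    of the k-th smallest value */
        /* Step 1: sort each column, remembering the permutation. */
        for (int j = 0; j < N; j++) {
            rank[j] = malloc(p * sizeof(int));
            for (int i = 0; i < p; i++) { col[i].value = X[j][i]; col[i].row = i; }
            qsort(col, p, sizeof(Cell), cmpCell);
            for (int k = 0; k < p; k++) {
                rank[j][k] = col[k].row;
                rowMean[k] += (double)col[k].value / N;  /* Step 2: row means */
            }
        }
        /* Step 3: write each quantile mean back in the original ordering. */
        for (int j = 0; j < N; j++)
            for (int k = 0; k < p; k++)
                Xnorm[j][rank[j][k]] = rowMean[k];

        for (int j = 0; j < N; j++) free(rank[j]);
        free(rank); free(col); free(rowMean);
    }

Sorting clearly dominates this sketch, which is why the GPU effort in Section 4 concentrates on the sorting kernel.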
Fig. 1. The CUDA hardware interface for the GeForce 9800 GX2 GPU from Nvidia used during our experiments.

3. THE CUDA PROGRAMMING MODEL AND HARDWARE INTERFACE

Modern GPUs are powerful computing platforms that reside at the extreme end of the design space of throughput-oriented architectures, making hardware scheduling of a parallel computation practical. Their increased core counts and hardware multithreading are two appealing features that CPUs are quickly adopting, but until both models converge, we face a transition period of heterogeneous computing where a PC is seen as a bi-processor platform with the GPU acting as an accelerator of data-parallel code shipped from the CPU. This way, the CPU plays the role of a host holding a C program, where I/O is commanded, GPU transfers are controlled and GPU kernels are launched. Such kernels are developed in CUDA (Compute Unified Device Architecture) [8], a programming model for general-purpose computing on GPUs.

As a hardware interface, CUDA started by transforming the G80 microarchitecture underlying the GeForce 8 and 9 series from Nvidia into a parallel SIMD architecture endowed with up to 128 cores where a collection of threads run in parallel. Figure 1 outlines the block diagram of this GPU architecture, particularized for the GeForce 9800 GX2 model which we use throughout our experiments. For more details on hardware components, see Table 1.

As a programming interface, CUDA consists of a set of C language library functions, and the CUDA-specific compiler generates the executable for the GPU from a source code where the following elements meet (see Figure 2):
Table 1. Summary of hardware features for the set of processors used during our CPU vs. GPU comparison. They belong to a similar time frame in the commodity PC marketplace as of 2009.

    Processor type        CPU                  GPU
    Processor model       Q9450 (4 cores)      GeForce 9800 GX2
    Architecture          Intel Core 2         Nvidia G92
    Processor speed       2.66 GHz             600 MHz
    Memory speed          2×900 MHz            2×1 GHz
    Memory bus width      128 bits             256 bits
    Memory bandwidth      28.8 GB/sec.         64 GB/sec.
    Memory size (type)    4096 MB (DDR3)       512 MB (GDDR3)
• A program is decomposed into blocks that run logically in parallel (physically, only if there are resources available). Assembled by the developer, a block is a group of threads that is mapped to a single multiprocessor, where the threads can share 16 KB of memory.

• All threads of the concurrent blocks on a single multiprocessor divide the available resources equally amongst themselves. The data are also divided amongst all of the threads in SIMD fashion, with a decomposition explicitly managed by the developer.

• A kernel is the code to be executed by each thread. Conditional execution of different operations can be achieved based on a unique thread ID.

Fig. 2. The CUDA programming model.

In the CUDA model, all of the threads can access all of the GPU memory, but, as expected, there is a performance boost when threads access data resident in shared memory, which is explicitly managed. In order to make the most efficient use of the GPU's computational resources, large data structures are stored in global memory, and the shared memory should be prioritized for storing strategic, often-used data structures.
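As a side illustration of this last point (ours, unrelated to the Q-norm code; the kernel name and tile size are arbitrary), the following minimal kernel stages a tile of its input into shared memory before operating on it:

    // Minimal CUDA sketch: each block stages a tile of its input into the
    // explicitly managed shared memory, then works on the on-chip copy.
    #define TILE 256

    __global__ void tiledScale(float *out, const float *in, float factor, int n)
    {
        __shared__ float tile[TILE];           // on-chip, shared by the block
        int idx = blockIdx.x * blockDim.x + threadIdx.x;

        if (idx < n)
            tile[threadIdx.x] = in[idx];       // one coalesced global read
        __syncthreads();                       // wait until the tile is resident

        if (idx < n)
            out[idx] = tile[threadIdx.x] * factor;  // fast shared-memory access
    }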
Table 2. The limitations for CUDA programming with respect to our GeForce 9800 GX2 GPU, together with their impact on performance.

    Parameter                         Limit     Impact
    Multiprocessors per GPU           16        Low
    Cores (SPs) / Multiprocessor      8         Low
    Thread Blocks / Multiprocessor    8         Medium
    Threads / Block                   512       Medium
    Threads / Multiprocessor          768       High
    Registers / Multiprocessor        8192      High
    Shared Memory / Multiprocessor    16 KB     High
4. PARALLELIZATION OF Q-NORM ON GPUS USING CUDA

4.1. Mapping Q-norm onto the GPU

Figure 3 shows an outline of our Q-norm implementation using CUDA. We illustrate the code running on the host (CPU side), where all major elements are present.

Previous implementations of Q-norm were carried out on distributed-memory and shared-memory multiprocessors [9], where a preliminary stage was required to perform a dynamic distribution over the set of columns, which are concurrently sorted and then partially row-averaged. Even though nearly optimal performance was attained following this idea on modern supercomputers [9], we decided to change the parallelization strategy on the GPU to favor finer-grain parallelism. We believe this was a key decision behind our success in being competitive with a single graphics card against a hardware infrastructure whose value exceeds 300,000 €, that is, more than a thousand times the budget of our GPU. The GPU speed-up is also remarkable versus a counterpart quad-core CPU version running on the same PC, as Section 5 demonstrates.

Our GPU implementation was developed as follows. Every column is processed in parallel by as many threads and blocks as possible using the first CUDA kernel, QSortGPU, which allocates all cores and multiprocessors available on the GPU. Note that this kernel deals with more than six million gene expression values. This guarantees enough data for all cores, even on future GPUs like Fermi, where the number of cores reaches 512. In contrast, had we performed the data partitioning over the set of columns, the number of columns (currently in the range 500-1000 and growing) might prevent a good workload balance in the context of an increasing number of cores, which entails poor scalability on future architectures.

Once the QSortGPU kernel is over, the RowAccum kernel allocates all GPU resources again, and rows are partially accumulated with the results from the columns obtained by the previous kernel invocations.
    __global__ void QSortGPU(int *dataIn, int *dIndex)
    {
        // Index to the element to be computed by each thread
        int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
        // Sort the dataIn array and return dIndex as output permutation
    }

    __host__ void main()
    {
        // N is the number of input samples; p is the number of gene expression values
        // Bi (threads/block) and NBi (number of blocks) determine parallelism
        int N = LoadProject(fname, fList);   // p is read from file similarly
        for (i = 0; i < N; i++) {
            // Read the i-th sample from disk and copy it to GPU memory (dataGPU)
            QSortGPU<<<NB1, B1>>>(dataGPU, dIndexGPU);   // sort the i-th column
            RowAccum<<<NB2, B2>>>(rowAccGPU, dataGPU);   // partial row accumulation
            // Copy the output permutation in dIndexGPU back to the CPU
        }
        GlobalAvg<<<NB3, B3>>>(rowAccGPU);   // final row averages (parallel reduction)
        // Rearrange the averages with the stored permutations to produce the output
    }

Fig. 3. Outline of our Q-norm implementation using CUDA (host side).

A <<<P,B>>> call is a parallel kernel invocation that will launch P thread blocks of B threads each. These values can be set independently to tune parallelism on the different kernel launches, and an implicit global barrier guarantees the conclusion of one set of thread blocks before the next one initiates.
Hardware and software limitations in CUDA are listed in Table 2 for the case of our GPU model, where we have ranked them according to their impact on the implementation and overall performance, based on our own experience. When developing applications for GPUs with CUDA, the management of registers becomes an important limiting factor for the amount of parallelism we can exploit. Each multiprocessor of a G80 GPU contains 8192 registers, which are split evenly among all the threads of the blocks assigned to that multiprocessor. Hence, the number of registers needed by the computation affects the number of threads able to execute simultaneously, given the constraints outlined in Table 2. For example, if a kernel (and therefore each of its threads) consumes 16 registers, only 512 threads can be assigned to a single multiprocessor, which can be achieved with one block of 512 threads, two blocks of 256 threads, and so on. However, if each thread only consumes 10 registers, a multiprocessor may reach its maximum of 768 threads (768 × 10 is less than 8192, the maximum number of registers available), and three blocks of 256 threads then become a feasible combination to be scheduled onto a single multiprocessor.

Our QSortGPU kernel consumes 14 registers, and therefore the maximum parallelism was achieved by running a single block of 512 threads on each GPU multiprocessor. This is also consistent with our use of 8320 bytes of shared memory, as reported by our implementation, which impedes the execution of two concurrent blocks on each multiprocessor (8320 × 2 exceeds the 16 KB limit of shared memory available per multiprocessor).
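The arithmetic behind these limits can be summarized in a small helper. The function below is our own illustration of the G80 constraints collected in Table 2 (it is not part of the CUDA toolkit, and it ignores register allocation granularity):

    /* Our illustration of the G80 occupancy arithmetic from Table 2:
     * resident threads per multiprocessor are bounded by registers,
     * shared memory, and the block and thread limits.                */
    int maxResidentThreads(int regsPerThread, int smemPerBlock, int threadsPerBlock)
    {
        const int REGS = 8192, SMEM = 16384, MAX_THREADS = 768, MAX_BLOCKS = 8;

        int byRegs = REGS / regsPerThread;        /* register-limited threads      */
        int blocks = byRegs / threadsPerBlock;    /* register-limited blocks       */
        int bySmem = SMEM / smemPerBlock;         /* shared-memory-limited blocks  */

        if (bySmem < blocks) blocks = bySmem;
        if (MAX_BLOCKS < blocks) blocks = MAX_BLOCKS;
        if (MAX_THREADS / threadsPerBlock < blocks)
            blocks = MAX_THREADS / threadsPerBlock;

        return blocks * threadsPerBlock;          /* resident threads per SM       */
    }

With our figures, maxResidentThreads(14, 8320, 512) yields 512 threads, matching the single 512-thread block reported above.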
The number of registers could not be decreased due to the intrinsic nature of the algorithm, and optimizing the use of the shared memory reported a small benefit (around 5%), given that all remaining constraints were already preventing us from attaining a higher degree of parallelism.

QSortGPU was implemented using the GPU Quicksort Library [10]; for code details, see [11]. Quicksort has long been considered one of the fastest sorting algorithms on CPUs, and it has recently outperformed other GPU-based sorting algorithms such as GPUSort and radix sort. The implementation takes advantage of the high bandwidth of GPUs by minimizing the amount of bookkeeping and inter-thread synchronization needed. It can also take advantage of the atomic synchronization primitives found on newer hardware to improve performance, and it promises good scalability in the future.

The other two GPU kernels are less relevant for the execution time. In RowAccum, rows are accumulated in parallel, but columns are serialized within the loop: we found it was not worth rearranging the code into an intricate version that moved these accumulations outside such a short loop only to apply parallelism to a small sequence. Finally, the GlobalAvg kernel was implemented in parallel as a typical reduction operator requiring log2(N) steps [12].
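For reference, a typical shared-memory tree reduction in the spirit of [12] looks roughly as follows. This is a generic sketch under simplifying assumptions (power-of-two block size; per-block partial sums must still be combined), not our exact GlobalAvg kernel:

    // Generic tree reduction sketch (in the spirit of [12]); names are
    // illustrative. Halving the active threads each step gives the
    // log2(N) behavior. Launch with shared memory of blockDim.x floats:
    //   reduceSum<<<blocks, threads, threads * sizeof(float)>>>(...)
    __global__ void reduceSum(const float *data, float *result, int n)
    {
        extern __shared__ float sdata[];
        unsigned int tid = threadIdx.x;
        unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

        sdata[tid] = (idx < n) ? data[idx] : 0.0f;  // one element per thread
        __syncthreads();

        // Halve the active threads at every step: log2(blockDim.x) iterations
        for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                sdata[tid] += sdata[tid + s];
            __syncthreads();
        }

        if (tid == 0)
            result[blockIdx.x] = sdata[0];          // partial sum of this block
    }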
5. PERFORMANCE EVALUATION

To demonstrate the effectiveness of our parallelization techniques, we have conducted a number of experiments on a regular PC (see Table 1 for hardware details). The stream processors (cores) outlined in Figure 1 are built on a hardwired design with a much faster clock frequency than the rest of the GPU silicon area (1.50 GHz versus 600 MHz), leading to a peak processing power exceeding half a TFLOPS. The following considerations also apply:

1. CPU times are measured in Windows normal run mode, that is, time sharing is not disabled, but no other application is running except those processes required by the operating system, whose cost is accounted for.

2. The CPU was programmed using C++ with multimedia extensions enabled directly through the HAL layer, without any specific library in between.

3. The GPU was programmed using CUDA version 1.1.

5.1. The CUDA execution

Table 3 reports GPU execution times for different block sizes. In general, the higher the number of threads per block, the lower the execution time, but this is not always the case: in our experience implementing codes on GPUs over the past five years, we have more often seen 256 threads per block emerge as the winner.
Table 3. Processing times (in seconds) for the Q-norm algorithm when using the GPU as co-processor, depending on the CUDA block size selected for the QSortGPU kernel. The problem size is 470 samples and 6,553,600 gene expression values.

    Threads per CUDA block     32        64       128      256      512
    GPU time                   204.38    170.24   141.21   122.13   114.92
    Partial improvement        (ref.)    17%      18%      13%      6%
    Accumulated improvement    (ref.)    17%      31%      40%      44%
Table 4. I/O and communication times (in seconds) required by the Q-norm algorithm when it was partially executed on the GPU.

    Communication type                 Baseline   Optimal
    1. Disk to CPU (2 HDs, RAID 0)     218.06     162.32
    2. CPU to GPU through PCI-e        17.06      13.89
    3. GPU results back to CPU         14.14      11.51
    4. CPU writes back to disk         16.70      16.70
    Overall I/O time (1+4)             234.76     179.02
    Overall CPU-GPU comm. (2+3)        31.20      25.40
    Total transfer time (1+2+3+4)      265.96     204.42
The reason why Q-norm departs from this pattern lies in the use that the QSortGPU kernel makes of shared memory: by allocating 8320 bytes, we prevent a second block from being launched on the same multiprocessor. With a smaller allocation, two 256-thread blocks could run concurrently, equalling the amount of parallelism of the 512 threads/block setup and, additionally, preventing a total of 16384 − 8320 = 8064 bytes of shared memory from being wasted, which would surely boost performance.

5.2. Input/output concerns

Our application is fed with data from disk, where each genetic probe holds its data in a single file of 25 MB. Since we compute 470 probes, the total data volume read from disk is around 12 GB. Our I/O system consists of two Western Digital Raptor disks of 72 GB at 10000 RPM mounted in a RAID 0 configuration to deliver a peak bandwidth of 164 MB/s. The average bandwidth achieved by our application was 60.58 MB/s for all read operations, which is acceptable considering that the reading process (input for each probe) is interfered with by the writing operations of the previous probe (output results). The entire read time was initially 1023 seconds when using the original text format maintained by the application, but we managed to reduce it by almost a factor of five (to 218 seconds) by converting the data to binary numeric datatypes instead.
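The conversion itself is simple. A minimal sketch of the idea in plain C (the function and file names are our own, not the application's actual code) parses each text probe file once and rewrites it as raw integers, so that later runs can read a whole column with a single fread:

    #include <stdio.h>
    #include <stdlib.h>

    /* One-off conversion sketch (ours): parse a text probe file of integer
     * intensities and rewrite it as raw binary, so later runs can read the
     * whole column in one fread instead of millions of fscanf calls.      */
    int convertToBinary(const char *txtName, const char *binName, int p)
    {
        FILE *txt = fopen(txtName, "r");
        FILE *bin = fopen(binName, "wb");
        int *buf = malloc(p * sizeof(int));
        int ok = (txt && bin && buf);

        for (int i = 0; ok && i < p; i++)
            ok = (fscanf(txt, "%d", &buf[i]) == 1);   /* slow text parsing */

        if (ok)
            fwrite(buf, sizeof(int), p, bin);          /* one sequential write */

        free(buf);
        if (txt) fclose(txt);
        if (bin) fclose(bin);
        return ok ? 0 : -1;
    }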
Table 5. Elapsed time (in seconds) when running a C version of the Q-norm algorithm on a CPU versus our GPU-assisted version using CUDA. Improvement factors are depicted with and without considering optimal transfer times.

    Task performed      Involving the GPU   Without the GPU   Improvement factor
    Processing time     114.92              807.60            7.03x
    Optimal I/O time    179.02              179.02
    CPU-GPU comm.       25.40               0.00
    Total time          319.34              986.62            3.09x
Table 4 already reflects this optimization in its baseline column, which acts as a departure point for further enhancements. These improvements lead to the optimal times shown in the last column of the table, where the following actions were taken:

1. First, the size of the interleaved disk blocks was tuned for the RAID 0 configuration of two devices, reducing the I/O time by a remarkable 24% (see the "baseline" to "optimal" gap in the first row).

2. Second, the communication time was decreased by 19% in both directions (see the second and third rows) when the motherboard of our PC was replaced to upgrade the CPU-GPU link from the original PCI-express slot to a PCI-express 2 slot (the second generation of the interface, which presumably doubles the peak bandwidth).
5.3. Overall performance

Table 5 compares our GPU times with the CPU times obtained on a quad-core CPU without using the GPU. For a fair comparison, we have selected a CPU of the same year and budget as our GeForce 9800 GX2 GPU (around 300 € as of 2009). Our algorithm actively involves the I/O system, which plays a decisive role in performance despite our efforts at the hardware and programming levels to minimize this effect. Nevertheless, the improvement factor when using the GPU exceeds 7x during the computational phase of Q-norm, due to the higher degree of parallelism exploited through a fine-grain strategy. On the other hand, the coarse-grain parallelism used by the CPU allows it to benefit from larger caches, which extend to 12 MB versus the tiny 16 KB used on the GPU side. At this final stage of our work, we realize that Q-norm is a memory-bound problem, and that the efficiency of the entire memory hierarchy composed of caches, main memory and disk constitutes a limiting factor in the pursuit of higher accelerations.
6. SUMMARY AND CONCLUSIONS

We have presented methods for computing a quantile normalization of high-density oligonucleotide array data on GPUs. Our approach relies on CUDA to exploit the parallelism and processing power of a GPU, leading to speed-up factors exceeding 7x versus counterpart methods implemented on CPUs. This is extraordinarily valuable in the biomedical field, where large-scale data sets and time constraints meet in a very demanding computation that is often unfeasible on a sequential computer. Current solutions of Q-norm based on the R-Bioconductor software [13] fail to execute on these huge data volumes, so we believe that our contribution represents a step forward in providing computational support to Q-norm, as well as a high-performance alternative to supercomputers.

Compared with other biomedical applications, we benefit from a data-intensive application but, at the same time, pay a penalty on the GPU for the low arithmetic intensity of Q-norm. This is partially overcome through an efficient handling of the memory hierarchy and, above all, by deploying a strategy that enhances fine-grain parallelism. This makes our implementation competitive against coarse-grain alternatives such as distributed-memory and shared-memory multicomputers.

GPUs are highly scalable and are evolving towards general-purpose architectures [8], and we envision bioinformatics as one of the most exciting fields able to benefit from them. Additionally, tools like CUDA [14] may assist non-computer scientists with a friendlier interface for adapting general-purpose applications to GPUs.

7. FUTURE WORK

Within the same time frame in which this work was developed, OpenCL [15] was born to provide a more general way of programming graphics cards from all vendors. The initiative, supported by all major GPU manufacturers, is expected to establish itself as a standard for general-purpose GPU programming. Our implementation can easily be ported to OpenCL, as the CUDA features we have used have direct OpenCL counterparts. Therefore, we do not expect a significant change in the behavior of our code, which will soon be available to run on many more graphics platforms, such as the Radeon series or FireStream models from ATI/AMD.

Overall, this effort is part of the development of a library devoted to biomedical applications and accelerated using the GPU as co-processor. Future work includes the implementation of a whole set of building blocks so that codes can be entirely executed on GPUs without incurring communication penalties from/to the CPU. Our plan also includes porting the code to Tesla nodes (high-end GPUs grouped in rack-mounted chassis) and CPU/GPU clusters on the road to high-performance computing.
Given that each generation of GPUs adds flexibility to previous high-throughput designs, software developers in many fields are likely to take interest in the extent to which CPU/GPU architectures and programming systems ultimately converge.
8. REFERENCES

[1] Biolib: libraries for the bio* languages. (Nov. 2009). [Online]. Available: http://biolib.open-bio.org

[2] A web site dedicated to bioinformatics tools, links, resources and tutorials. (Nov. 2009). [Online]. Available: http://www.roseindia.net/bioinformatics

[3] B. Bolstad, R. Irizarry, M. Astrand, and T. Speed, "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias," Bioinformatics, vol. 19, no. 2, pp. 185–193, 2003.

[4] J. Warrington, S. Dee, and M. Trulson, Large-Scale Genomic Analysis Using Affymetrix GeneChip, ser. Microarray Biochip Technologies. New York, USA: BioTechniques Books, 2000, ch. 6, pp. 119–148.

[5] Affymetrix, "Statistical algorithms reference guide," Affymetrix, Technical Report, 2001.

[6] R. Irizarry, B. Hobbs, F. Colin, Y. Beazer-Barclay, K. Antonellis, U. Scherf, and T. Speed, "Exploration, normalization and summaries of high density oligonucleotide array probe level data," Biostatistics, vol. 4, no. 2, pp. 249–264, 2003.

[7] The GeneChip Human Mapping 500K Array dataset submitted to GEO by Affymetrix. [Online]. Available: http://www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GPL3718

[8] GPGPU. (2009). General-purpose computation using graphics hardware. [Online]. Available: http://www.gpgpu.org

[9] Hidden reference to preserve paper anonymity; it will be provided upon paper acceptance.

[10] D. Cederman and P. Tsigas, "A practical quicksort algorithm for graphics processors," in Proc. 16th Annual European Symposium on Algorithms (ESA 2008), ser. LNCS, D. Halperin and K. Mehlhorn, Eds., vol. 5193, 2008, pp. 246–258.

[11] The GPU Quicksort Library. (Dec. 2007). [Online]. Available: http://www.cs.chalmers.se/~dcs/gpuqsortdcs.html

[12] M. Harris, S. Sengupta, and J. Owens, "Parallel Prefix Sum with CUDA," ser. GPU Gems 3. Addison-Wesley, Aug. 2007.

[13] The R-Bioconductor web site. [Online]. Available: http://www.bioconductor.org

[14] Nvidia. (Nov. 2009). CUDA home page. [Online]. Available: http://developer.nvidia.com/object/cuda.html

[15] The Khronos Group. (2009). The OpenCL Core API Specification, Headers and Documentation. [Online]. Available: http://www.khronos.org/registry/cl