
2009 International Conference on Signal Processing Systems

Hierarchical Agglomerative Clustering Using Graphics Processor with Compute Unified Device Architecture

S.A. Arul Shalom, Asst. Prof. Manoranjan Dash

Minh Tue, Nithin Wilson

School of Computer Engineering
Nanyang Technological University
50 Nanyang Avenue, Singapore
[email protected]; [email protected]

NUS High School of Mathematics and Science
20 Clementi Avenue 1, Singapore
[email protected]; [email protected]

Abstract— We explore the use of today's high-end graphics processing units on desktops to perform hierarchical agglomerative clustering with NVIDIA's Compute Unified Device Architecture (CUDA). Although advances in graphics cards have enabled the gaming industry to flourish, there is much more to be gained in the fields of scientific computing and high performance computing and their applications. Previous works have demonstrated considerable speed gains in computing pair-wise Euclidean distances between vectors, which is the fundamental operation in hierarchical clustering. We have used CUDA to implement the complete hierarchical agglomerative clustering algorithm and show almost double the speed gain using a much cheaper desktop graphics card. In this paper we briefly explain the highly parallel and internally distributed programming structure of CUDA. We explore CUDA capabilities and propose methods to efficiently handle data within the graphics hardware for data-intense, data-independent, iterative or repetitive general-purpose algorithms such as hierarchical clustering. We achieved speed gains of about 30 to 65 times over the CPU implementation using microarray gene expressions.


Keywords- CUDA, hierarchical clustering, high performance computing, GPGPU, acceleration of computations, parallel computing

I. INTRODUCTION


Today's Graphics Processing Unit (GPU) on commodity desktops, gaming consoles and video processing workstations has become the most powerful and affordable computational hardware in the computer world. The hardware architecture of these processors, which are traditionally meant for graphics applications, inherently enables massively parallel vector processing with high memory bandwidth and low memory latency. The processing stages within these GPUs are programmable. Such characteristics make the GPU effective and cost efficient for executing highly repetitive, arithmetically intensive computational algorithms. A modern GPU such as the NVIDIA GeForce 8800 is an extremely flexible, highly programmable and powerful precision processor with 128 parallel stream processors, and it is increasingly used for general-purpose computations. Over the past few years the computational power of the programmable GPU has grown tremendously [1]. The use of GPUs for general-purpose computation (GPGPU) is seen as a significant force that is changing the nature of graphics in the enterprise. The phenomenal growth in the computing power of the GPU, measured in floating-point operations per second (FLOPS) [2], is shown in Fig. 1. The internal memory bandwidth of the NVIDIA GeForce 8800 GTX GPU is 86 GB/s, whereas that of a dual-core 3.0 GHz Pentium IV CPU is 8 GB/s. The GPU has a peak performance of about 1300 GFLOPS with 128-bit floating-point precision, compared to approximately 25 GFLOPS for the CPU. The field of GPGPU continues to grow and benefit from the raw computational power of the GPU in a desktop [12].

Figure 1. Comparison of Computational Growth (Courtesy: NVIDIA).

A. Recent Advancements and Challenges in GPGPU

The programming model of the GPU is harsh, constrained and heavily tied to computer graphics. Computational algorithms must be carefully designed and ported to suit the graphics environment effectively. It remains a challenge for scientists and researchers to harness the power of the GPU for applications based on general-purpose computations; porting CPU code to the GPU is neither straightforward nor simple to realize. Researchers have had to learn graphics-dedicated programming platforms such as OpenGL or DirectX and convert the computational problem into a graphics problem [9], [10], [11], which requires tedious and time-consuming effort. The new graphics API from NVIDIA, the Compute Unified Device Architecture (CUDA), lightens the task of researchers interested in GPGPU. Its standard functions allow developers to directly access the GPU's hardware components such as processors, memories and registers.
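As a flavor of this programming model, the following is a minimal sketch (our illustration, not code from the paper) of how a CUDA kernel gives the developer direct control over the GPU's threads and global memory; the kernel name, array size and scaling factor are arbitrary.

    #include <cuda_runtime.h>
    #include <cstdio>

    // Each thread scales one element of an array held in GPU global memory.
    __global__ void scale(float *data, float factor, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            data[i] *= factor;
    }

    int main()
    {
        const int n = 1024;
        float host[n];
        for (int i = 0; i < n; ++i) host[i] = (float)i;

        float *dev;
        cudaMalloc(&dev, n * sizeof(float));            // allocate GPU global memory
        cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);

        scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);  // grid of 256-thread blocks
        cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(dev);

        printf("host[10] = %f\n", host[10]);            // expect 20.0
        return 0;
    }

The <<<grid, block>>> launch syntax and the cudaMalloc/cudaMemcpy calls are examples of the standard functions referred to above; no OpenGL or DirectX graphics constructs are involved.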


B. Motivations

Recent research in computer architecture shows a trend towards streaming computations, multi-core architectures, heterogeneous architectures, distributed and grid computing, and polymorphous computing architectures. The architecture of the GPU has much in common with these active research topics. It is possible to offload arithmetically intense computations from the CPU to the GPU to process massive data sets, even on desktops, for applications such as scientific computing, physical simulations, image processing, computer vision and data mining [3].

Clustering has become a common technique in statistical data analysis, with widespread applications in fields such as machine learning, data mining, text mining, pattern recognition, image processing and bioinformatics. The five major categories of clustering are k-partitioning, hierarchical, density-based, grid-based and model-based algorithms. In the most popular k-partitioning methods the clusters are assumed to be spherical and the required number of partitions is pre-determined, which may not be optimal; GPU implementations of such algorithms are available [10], [11]. In hierarchical clustering a complete hierarchical decomposition of the objects is created, either by agglomeration or by division, from which the required number of clusters can be obtained [5]. Although hierarchical agglomerative clustering (HAC) has been widely applied in various fields, it is predominantly used for document clustering and retrieval, cluster identification from microarray gene expressions, and in image compression, searching and clustering, where computationally intense, time-consuming, high-throughput data processing is involved. The time complexity of the HAC algorithm is at least O(N²) and the overall complexity of the algorithm is O(N² log N), where N is the number of objects to be clustered. Hierarchical clustering algorithms have been implemented on the GPU using OpenGL and CUDA in the past [6], [7], [9]. The purpose of this research work is not to reduce the complexity of the algorithm but to look into simpler ways of using CUDA for the complete HAC computation, and to understand the effects of CUDA programming and clustering parameters on scalability and speedup.

C. Previous Developments in using CUDA for Hierarchical Clustering and our Intention

An extensive literature search for CUDA based hierarchical clustering and distance computations yielded two related works with significant contributions to this topic. The first work [6] deals with the implementation of the pair-wise distance computation, which is one of the fundamental operations in HAC. The NVIDIA 8800 GTX GPU is employed and CUDA programs are used to speed up the computations. Gene expression data is used, and the standard HAC is implemented using the half matrix for pair-wise distance calculation. Experiments evaluate two threading schemes: one thread per row of the output matrix, and one thread per entry of the output matrix. Moreover, the GPU shared memory is used and the threads are synchronized at block level. Results show that a speedup of 20 to 44 times is achieved on the GPU compared to the CPU implementations. It is important to note that this speedup is achieved only for the pair-wise distance computations and not for the complete HAC algorithm. In the second work [7], a CUDA based HAC algorithm on the NVIDIA 8800 GTX GPU is compared with commercial bioinformatics clustering applications running on the CPU. Based on the cluster parameters and setup used, a speedup of about 10 to 14 times is achieved, demonstrating the effectiveness of using CUDA on the GPU to cluster high dimensional vectors. This is accomplished at the expense of organized memory 'micromanagement' within the GPU.

In this paper, we exploit the parallel processing architecture, the large global memory space, and the programmability of the GPU to implement the complete traditional HAC algorithm efficiently. We use a relatively cheaper graphics card, the NVIDIA 8800 GTS GPU, which has lower specifications than the GTX version: the 8800 GTX costs from $500 to $700, whereas the 8800 GTS costs from $100 to $400. We use NVIDIA's CUDA, the latest graphics programming API, to implement the HAC computations on the GPU. We mostly utilize the global memory rather than the shared memory of the GPU; it is simpler to program, though slower to access. We present the novelties of our research and recommend criteria for selecting the CUDA programming parameters for a given type of computation. We explore the relations between CUDA parameters such as block size and number of blocks versus data size and dimensionality, intending to find the optima where a given GPU configuration peaks in performance. We have implemented the single-link method of HAC and tested it using microarray gene expression data of yeast. We achieved results on the order of 30 to 65 times faster than the CPU, depending on gene sizes and CUDA parameters.
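To make the one-thread-per-entry scheme concrete, the following is a minimal sketch (our illustration, not the code of [6]) of a CUDA kernel in which each thread computes one pair-wise Euclidean distance; the data layout (n vectors of dimension d, stored row-major in global memory) and the launch parameters are assumptions.

    // One thread computes the Euclidean distance between vectors i and j.
    __global__ void pairwiseDist(const float *vecs, float *dist, int n, int d)
    {
        int i = blockIdx.y * blockDim.y + threadIdx.y;
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n && j < n && i < j) {                 // fill the upper half matrix only
            float sum = 0.0f;
            for (int k = 0; k < d; ++k) {
                float diff = vecs[i * d + k] - vecs[j * d + k];
                sum += diff * diff;
            }
            dist[i * n + j] = sqrtf(sum);
        }
    }

    // Launch with a 2-D grid of 16x16-thread blocks covering the n x n matrix:
    //   dim3 block(16, 16);
    //   dim3 grid((n + 15) / 16, (n + 15) / 16);
    //   pairwiseDist<<<grid, block>>>(devVecs, devDist, n, d);

This sketch reads directly from global memory, in line with the preference stated above for global memory over shared-memory micromanagement; a shared-memory variant would tile vecs into per-block buffers, as in [6].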

II. THE TRADITIONAL HIERARCHICAL AGGLOMERATIVE CLUSTERING ALGORITHM


In this section we provide a brief description of the traditional HAC algorithm, along with the steps for implementing it on the CPU, so that the GPU implementation steps in Section III can be related to them later.


A. Understanding the HAC Algorithm

The objective of HAC is to generate multiple levels of partitions over a given dataset; the groups of data vectors at each level denote sets of clusters. In this bottom-up approach, each vector is initially treated as a singleton cluster, and clusters are then successively merged in pairs (agglomeration) until all vectors have merged into one single large cluster. The agglomeration of clusters results in a tree-like structure called the dendrogram.


The combination similarity between the merging clusters is highest at the lowest level of the dendrogram and decreases as the clusters merge towards the final single cluster. This structure can then be used to extract partitions of the dataset by cutting the dendrogram at the required level of combination similarity or at the expected number of clusters. Fig. 2 shows such a dendrogram illustrating the HAC algorithm, where the singleton clusters p, q, r, s and t are progressively merged into one single large cluster. The fundamental assumption in HAC is that the merge is monotonic and descending, meaning that the combination similarities s1, s2, s3, ..., sn-1 of successive merges of the vectors occur only in descending order. The steps of HAC that result in such a monotonic merge are listed in Table I. Merging the minimum-distance cluster pair at each step is known as the single-linkage HAC method.
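As a compact illustration of these steps, the following sketch (ours, not the paper's listing) of the standard single-link loop repeatedly finds the closest pair of active clusters, merges them, and keeps the minimum of the two merged rows as the new distances; the upper-half storage convention (entries valid only for i < j) is an assumption.

    #include <vector>
    #include <limits>
    #include <algorithm>

    // Single-link HAC sketch: dist[i][j] (i < j) holds the distance between
    // clusters i and j. Each iteration merges the closest active pair; single
    // linkage keeps the minimum of the two old distances to every other cluster.
    void singleLinkHAC(std::vector<std::vector<double>> &dist, int n)
    {
        std::vector<bool> active(n, true);
        for (int merges = 0; merges < n - 1; ++merges) {
            int a = -1, b = -1;
            double best = std::numeric_limits<double>::max();
            for (int i = 0; i < n; ++i)             // scan for the closest active pair
                for (int j = i + 1; j < n; ++j)
                    if (active[i] && active[j] && dist[i][j] < best) {
                        best = dist[i][j]; a = i; b = j;
                    }
            for (int k = 0; k < n; ++k)             // merge b into a (single link)
                if (active[k] && k != a && k != b) {
                    double m = std::min(dist[std::min(a,k)][std::max(a,k)],
                                        dist[std::min(b,k)][std::max(b,k)]);
                    dist[std::min(a,k)][std::max(a,k)] = m;
                }
            active[b] = false;                      // record (a, b, best) as a dendrogram node
        }
    }

This naive scan costs O(N³) overall; priority queues or nearest-neighbor bookkeeping bring it down to the O(N² log N) complexity cited in Section I.B.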

B. HAC Implementation on the CPU

We implemented the half-similarity-matrix based HAC on the CPU and used its computational time as the reference against which the GPU computational time is compared. Assuming that the n vectors of dimension d are stored in an array, the code listed in Table II computes the half matrix of the pair-wise Euclidean distances and stores it in the array dist.

TABLE II.

    double** dist; double start, end;
    start = clock();
    dist = new double*[n-1];
    for (int i = 0; i < n-1; i++) {
        dist[i] = new double[n-1-i];          // row i holds distances to vectors i+1 .. n-1
        for (int j = i+1; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < d; k++) {     // accumulate squared differences over d dimensions
                double diff = data[i*d + k] - data[j*d + k];
                sum += diff * diff;
            }
            dist[i][j-i-1] = sqrt(sum);       // half matrix of pair-wise Euclidean distances
        }
    }
    end = clock();

III.
