Accelerating Biomedical Signal Processing Algorithms with Parallel Programming on Graphic Processor Units

Evdokimos I. Konstantinidis, Christos A. Frantzidis, Lazaros Tzimkas, Costas Pappas, Panagiotis D. Bamidis

Abstract—This paper investigates the benefits of adopting Graphics Processing Unit (GPU) parallel programming in the field of biomedical signal processing. Differences in execution time when computing the Correlation Dimension (CD) of multivariate neurophysiological recordings and the Skin Conductance Level (SCL) are reported across several common programming environments. Moreover, as indicated in this study, combining parallel programming with design techniques that address memory management issues, such as data transfer between device memory and the GPU, may further accelerate processing. With proper parallel architecture design, execution time was reduced by a factor of up to 29 compared with a pure C implementation. The parallel GPU programming environment may therefore be beneficial for numerous biomedical applications within the sphere of biosignal processing.

Index Terms— Parallel, GPU, biomedical processing

I. INTRODUCTION

SPEED and accuracy are key factors in the execution of signal processing algorithms [1]. In the biomedical field in particular, fast and accurate signal processing algorithms are considered vital. This stems from the fact that the analysis of biomedical signals, such as electrocardiograms and electroencephalograms, has undoubtedly affected human life through online monitoring of patients, non-invasive diagnosis, improved rehabilitation of the elderly, and sensory aids for the handicapped [2]. Moreover, real-time monitoring and decision making [3] are even more demanding in computational cost. To date, Moore's law has described a long-term trend in the history of computing hardware: the number of transistors that can be placed on an integrated circuit doubles approximately every two years. Such increases led to higher clock rates, larger caches, and increased utilization of instruction-level parallelism. Recently, however, such improvements have brought us to the era of multi-core CPUs [4]. What is more, graphics processing units (GPUs), specialized architectures traditionally designed for gaming applications, have recently transformed into powerful co-processors for general-purpose computing (GPGPU) [5]. Built from many multi-core processors, recent GPU architectures provide the capacity for general parallel processing. Although graphics programming demands special skills (OpenGL or DirectX), general-purpose programming technologies have now been published, such as the Compute Unified Device Architecture (CUDA), developed by GPU manufacturer NVIDIA. Based on an extension of the C language, CUDA facilitates parallel computing on GPUs for the programmer.

Tobias Preis et al. implemented a Monte Carlo simulation of the 2D and 3D Ising model taking advantage of GPU parallel programming; their results show a 60-fold speedup [6]. Michael Boyer et al. demonstrated how a systems biology application, the detection and tracking of white blood cells in video microscopy, can be accelerated by 200x using a CUDA-capable GPU [7]. Suchard and Rambaut implemented statistical phylogenetics algorithms on the GPU that sped up execution by 90 times [8]. Moreover, Liu et al. implemented fast blood flow visualization of high-resolution laser speckle imaging data using the GPU; their approach was accelerated 60-fold by parallel computing techniques [9]. The aforementioned papers demonstrate faster computational performance when compared with CPU implementations. On the other hand, computing on the GPU faces two main problems [10]. First, the programmer must master the device-specific architecture details and constraints in depth; programs written without much attention to these details are very likely to perform poorly.

Manuscript received July 14, 2009. E. I. Konstantinidis, C. A. Frantzidis, L. Tzimkas, C. Pappas, and P. D. Bamidis are with the Lab of Medical Informatics, Medical School, Aristotle University of Thessaloniki, PO Box 323, 54124, Thessaloniki, Greece (tel.: +30-2310-999310; fax: +30-2310-999263; e-mail: [email protected]).
More specifically, memory management plays an essential role in the overall efficiency of the system design (as reflected in the results of our approach). Second, data transfer between the CPU and GPU is an essential parameter in exploiting the GPU's efficiency; the main goal is to avoid this kind of data transfer as much as possible. In most CUDA applications, the bandwidth between device memory and GPU is the main performance bottleneck, whereas the computation itself runs much faster on the GPU than on the CPU [10].

In this paper, therefore, the focus is placed on the significance of contemporary parallel programming for the biomedical engineering field, especially for applications that require long-term monitoring, such as that of epileptic patients [11], or that extract emotional information used in emotion-aware computing [12], [13]. For these applications, the computational cost of the proposed techniques is a crucial factor. Moreover, there is a need for time-efficient processing of vast amounts of data [14], which means that parallel processing methods may be beneficial. Thus, the aim of this paper is, on the one hand, to provide the feasibility grounds for CUDA and GPU architectures in biomedical signal processing and, on the other hand, to test the boundaries of the efficiency obtained when implementing algorithms of interest to the authors' research activities. The remainder of this paper is organized as follows. Section II briefly introduces the GPU and CUDA architectures. Section III describes the algorithms and their implementations. Results of the algorithms' implementation are presented in Section IV. Finally, the discussion appears in Section V.

II. CUDA AND GPU ARCHITECTURE

A. The NVIDIA GPU architecture

Unlike the CPU, the internal structure of a GPU is developed in such a way that more transistors are devoted to data processing rather than data caching and flow control (Fig. 1). A modern Nvidia GPU contains many multiprocessors, each composed of several cores (Stream Processors, SPs), as illustrated in Fig. 2. For example, the GPU in the Tesla C1060 contains 30 multiprocessors, each with 8 SPs; as a consequence, a total of 240 cores are available to service code execution. Within a multiprocessor, cores are allocated local registers and have access to a fast shared memory. In addition, each multiprocessor provides two small read-only caches, a constant cache and a texture cache, to speed up global device reads. The main storage on the card is the device memory, which is shared among all multiprocessors, is accessible by the CPU, and has a relatively high latency [4]. At any given clock cycle, each processor of a multiprocessor executes the same instruction but operates on different data.

B. CUDA (Compute Unified Device Architecture)

CUDA is a technology for GPU computing from NVIDIA. As mentioned in the introduction, it is based on an extension of the C language. Its parallelization model is a slight abstraction of NVIDIA's G80 hardware [15]. A block consists of threads and is executed on a single multiprocessor (Fig. 3). Given that a thread block may consist of more threads than the number of processors in a multiprocessor, the hardware is responsible for scheduling the threads. Therefore, additional thread context and synchronization options exist within a block, whereas no global synchronization is available between blocks. CUDA provides internal commands that help the programmer synchronize the threads within a block, allowing the threads to share data through the shared memory. A set of blocks, called a grid (Fig. 3), constitutes a SIMD (single instruction, multiple data) compute kernel. Kernel calls themselves are asynchronous to the host CPU: they return immediately after issuance [15].

Fig. 1 The GPU devotes more transistors to Data Processing [source: [20]]

Fig. 2 Visualization of the Nvidia graphic card's architecture. Each multiprocessor has 8 stream processors sharing the fast shared memory.

The thread-block model means the programmer does not have to be aware of the number of multiprocessors and stream processors from within the CUDA kernel, insofar as the block grid is partially serialized into batches [15]. Within the kernel thread code, each block is assigned a one- or two-dimensional identifier, while each thread within a block is assigned a one-, two-, or three-dimensional identifier, which is convenient for this kind of numerical grid application since it avoids additional (costly) modulo operations inside the thread [15]. Threads support local variables and access to the shared memory, which is common to all threads within a block [15]. A special syntax is used by the host code to call the kernel; this syntax specifies the block grid size and the threads per block, and execution can be synchronized back to the host via a runtime function. While the GPU operates asynchronously of the CPU code execution, the host CPU is able to perform independent computations or issue additional kernels [15]. This can be useful for delayed analysis and output processes performed on the data, although the bandwidth for transferring from the frame buffer to the host memory has to be taken into account [15].

Fig. 3 Every block consists of threads, while a set of blocks composes a grid [source: [20]].

III. ALGORITHM IMPLEMENTATION

The first algorithm is the multichannel D2 [16][17], which is calculated from an EEG epoch consisting of 19 electrodes with 1250 samples per channel. The sampling frequency was set at 500 Hz. A series of vectors x(i), i = {1…1250}, is constructed from these data; the coordinates of each vector are the sample values of the 19 channels at one of the 1250 discrete time points. The multichannel D2 is calculated as follows:
• One of the vectors is taken as a reference point.
• A distance r (taking values from 1 to 100) is chosen.
• The number of points that lie closer to the reference point than the distance r is computed and then divided by the total number of points.
• The same procedure is repeated for all of the points of the attractor and for a range of values of r.
• Thus, for each value of r, an average value for the fraction of points lying closer together than r can be calculated. This fraction is called the correlation integral Cr, given by:

Cr(r) = (1 / N²) · Σ_{i=1..N} Σ_{j=i+1..N} H(r − ||x(i) − x(j)||)

where N is the number of vectors and H is the Heaviside function, which is 1 if the distance between the two vectors is smaller than r and 0 if the distance is larger. It is known that D2 can be calculated from the correlation integral as the slope of log(Cr) versus log(r):

D2(r) = Δ[log(Cr(r))] / Δ[log(r)]

The second algorithm computes the skin conductance level, a tonic measure of autonomic arousal [18]. This marker can be extracted from the electrodermal activity by applying a smoothing procedure with a relatively long moving-average window, where y is the smoothed waveform representing the skin conductance level and x is the raw electrodermal data:

y[i] = (1 / (2N + 1)) · Σ_{j=−N..N} x[i + j]

IV. RESULTS

As mentioned in Section III, the two algorithms were implemented in C, C# (C Sharp), MatLab, and CUDA. The CUDA implementation was run on three CUDA-enabled devices: a GeForce 8600GT with 32 SPs, a GeForce 9600GT with 64 SPs, and a GeForce GTS250 with 128 SPs. All executions were hosted by the same PC, with a dual-core CPU at 3.17 GHz and 4 GB of RAM (Table 1).

Table 1 Execution time for each algorithm and programming language.

Programming language   Multichannel D2 (ms)   Smoothing SCR (ms)
C                      2254.48                2603.89
C#                     2954.00                3504.00
MatLab                 78000.04               627000.30
CUDA 8600GT            815.90                 1769.98
CUDA 9600GT            236.37                 469.78
CUDA GTS250            219.63                 397.92

Especially for the smoothing algorithm of the SCR signal, we implemented the algorithm both with and without careful CUDA memory management design. As depicted in Table 2, both implementations execute faster than the non-CUDA versions, although the one with the careful CUDA memory management design is considerably faster still.

Table 2 Execution time for the Smoothing SCR algorithm with and without careful memory management.

Device         SmoothingSCR (ms)   SmoothingSCR, CUDA memory management (ms)
CUDA 8600GT    1769.98             420.09
CUDA 9600GT    469.78              173.46
CUDA GTS250    397.92              89.75

V. DISCUSSION

This study indicates that GPU parallel programming may accelerate the execution of biomedical signal processing algorithms. For the algorithms implemented [19], execution time may be decreased by up to a factor of 29 compared with the C implementation when a proper parallel design is used (e.g., 2603.89 ms / 89.75 ms ≈ 29 for the SCR smoothing on the GTS250). According to the Results section, there are significant differences in execution time among the three conventional programming platforms and CUDA. The C language is faster than C# and MatLab but falls behind CUDA in execution time. Although the main aim of the MatLab environment is not fast execution but an easy way of technical computing, it is included in the comparison since it is widely used in biomedical signal analysis. Although the benefits stemming from the use of GPU programming in biomedical data analysis are meaningful, special design techniques must accompany the memory management design, since it greatly affects the overall system's time efficiency. As shown in Table 2, when these techniques are not adopted, the overall time is at least doubled. The data transfer between device memory and the GPU should therefore be minimized by properly designing the programming architecture in order to optimize CUDA time performance. Limitations of GPU parallel programming include the need for special hardware equipment. Moreover, the vast majority of programmers face severe difficulties in adopting this programming style, since parallel thinking is unfamiliar to many of them. Besides that, some algorithms cannot be parallelized at all. However, it has to be mentioned that the CUDA environment nowadays provides much convenience compared with classic parallel programming. In this context, the rationale and achievements of this paper are important: aiming to contribute to the widespread use and development of parallel programming, this paper demonstrates the feasibility and suitability of a parallel CUDA-based hardware platform in the field of biomedical signal processing.
The overall speedup is considered remarkable in the case of batch processing of large amounts of data.

REFERENCES
[1] J. Y. Hong, M. A. Wang, H. Wallace, "High speed processing of biomedical images using programmable GPU," International Conference on Image Processing (ICIP '04), 2004, vol. 4, pp. 2455-2458.
[2] J. Pan, W. J. Tompkins, "A real-time QRS detection algorithm," IEEE Transactions on Biomedical Engineering, vol. BME-32, no. 3, 1985.
[3] C. A. Frantzidis et al., "On the classification of emotional biosignals evoked by affective pictures: an integrated data mining based approach," IEEE Transactions on Information Technology in Biomedicine, under review.
[4] K. Barros, R. Babich, R. Brower, M. A. Clark, C. Rebbi, "Blasting through lattice calculations using CUDA," PoS LATTICE2008:045, 2008.
[5] B. He, K. Yang, R. Fang, M. Lu, N. K. Govindaraju, Q. Luo, P. V. Sander, "Relational joins on graphics processors," SIGMOD '08, Vancouver, BC, Canada.
[6] T. Preis, P. Virnau, W. Paul, J. J. Schneider, "GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model," Journal of Computational Physics, vol. 228, no. 12, July 2009, pp. 4468-4477.
[7] M. Boyer, D. Tarjan, S. T. Acton, K. Skadron, "Accelerating leukocyte tracking using CUDA: a case study in leveraging manycore coprocessors," 23rd International Parallel and Distributed Processing Symposium, Rome, Italy, May 2009.
[8] M. A. Suchard, A. Rambaut, "Many-core algorithms for statistical phylogenetics," Bioinformatics, vol. 25, no. 11, pp. 1370-1376, Jun. 2009 (Epub Apr. 15, 2009).
[9] S. Liu, P. Li, Q. Luo, "Fast blood flow visualization of high-resolution laser speckle imaging data using graphics processing unit," Optics Express, vol. 16, pp. 14321-14329, 2008.
[10] H. Jang, A. Park, K. Jung, "Neural network implementation using CUDA and OpenMP," Digital Image Computing: Techniques and Applications (DICTA '08), Dec. 2008, pp. 155-161.
[11] G. E. Polychronaki et al., "Comparison of fractal dimension estimation algorithms for epileptic seizure onset detection," in Proceedings of the 8th IEEE International Conference on BioInformatics and BioEngineering (BIBE 2008), Athens, Greece, pp. 1-6.
[12] C. A. Frantzidis et al., "Towards emotion aware computing: a study of arousal modulation with multichannel event-related potentials, delta oscillatory activity and skin conductivity responses," in Proceedings of the 8th IEEE International Conference on BioInformatics and BioEngineering (BIBE 2008), Athens, Greece, pp. 1-6.
[13] P. D. Bamidis, A. Luneski, A. Vivas, C. Papadelis, N. Maglaveras, C. Pappas, "Multi-channel physiological sensing of human emotion: insights into emotion-aware computing using affective protocols, avatars and emotion specifications," Studies in Health Technology and Informatics, vol. 129, pt. 2, pp. 1068-1072, 2007.
[14] E. Konstantinidis, P. D. Bamidis, D. Koufogiannis, "Development of a generic and flexible human body wireless sensor network," in Proceedings of the 6th European Symposium on Biomedical Engineering (ESBME), 2008.
[15] B. Zink, "A general relativistic evolution code on CUDA architectures," CCT Technical Report Series, Center for Computation and Technology, Louisiana State University.
[16] K. J. Stam, D. L. Tavy, B. Jelles, H. A. Achtereekte, J. P. J. Slaets, R. W. M. Keunen, "Non-linear dynamical analysis of multichannel EEG: clinical applications in dementia and Parkinson's disease," Brain Topography, vol. 7, no. 2, pp. 141-150, 1994.
[17] C. J. Stam, "Nonlinear dynamical analysis of EEG and MEG: review of an emerging field," Clinical Neurophysiology, vol. 116, pp. 2266-2301, 2005.
[18] R. W. Picard, J. Scheirer, "The Galvactivator: a glove that senses and communicates skin conductivity," in Proceedings of the 9th International Conference on Human-Computer Interaction, New Orleans, Aug. 2001.
[19] http://lomiweb.med.auth.gr/gan, accessed 11/07/2009.
[20] NVIDIA CUDA Programming Guide. Available: http://www.nvidia.com/object/cuda_home.html
