Accepted Manuscript

Survey of using GPU CUDA programming model in medical image analysis

T. Kalaiselvi, P. Sriramakrishnan, K. Somasundaram

PII: S2352-9148(17)30045-X
DOI: 10.1016/j.imu.2017.08.001
Reference: IMU 56
To appear in: Informatics in Medicine Unlocked
Received Date: 10 May 2017
Revised Date: 7 July 2017
Accepted Date: 4 August 2017
Survey of using GPU CUDA Programming Model in Medical Image Analysis

T. Kalaiselvi, P. Sriramakrishnan and K. Somasundaram
Department of Computer Science and Applications, The Gandhigram Rural Institute - Deemed University, Gandhigram, Tamilnadu, India
[email protected],
[email protected],
[email protected]
Abstract
With the technological development of the medical industry, the volume of data to be processed has exploded and computation times have increased due to factors such as 3D and 4D treatment planning, the increasing sophistication of MRI pulse sequences and the growing complexity of algorithms. The graphics processing unit (GPU) addresses these problems through features such as high computational throughput, high memory bandwidth, support for floating-point arithmetic and low cost. Compute unified device architecture (CUDA) is a popular GPU programming model introduced by NVIDIA for parallel computing. This review briefly discusses the need for GPU CUDA computing in medical image analysis. The GPU performance of existing algorithms is analyzed and the computational gains are discussed, along with open issues, hardware configurations and optimization principles of existing methods. The survey also summarizes optimization techniques for medical imaging algorithms on the GPU. Finally, the limitations and future scope of GPU programming are presented.
Keywords: GPU CUDA, Medical Imaging, Parallel Computing, Denoising, Segmentation, Visualization

1. INTRODUCTION
Computed tomography (CT), magnetic resonance imaging (MRI), positron emission tomography (PET) and ultrasound are well-known medical modalities that produce 2D, 3D and 4D images used to guide diagnosis and treatment planning. Medical image processing and analysis become computationally expensive as the dimensionality of the imaging data increases [1]. A conventional CPU with a limited number of cores is not sufficient to process such huge data volumes. The graphics processing unit (GPU) is a technology capable of solving computational problems across engineering and medical fields. In the medical industry, the GPU is well suited to processing higher-dimensional data, and GPU computation provides a huge edge over the central processing unit (CPU) with respect to computation speed. A GPU is a highly parallel, multithreaded, many-core processor with high memory bandwidth [2]. The main driver for the evolution of powerful GPUs has been the constant demand for greater realism in computer games. Over the past few decades, the computational performance of GPUs has increased much more quickly than that of conventional CPUs, so GPUs play a major role in modern industrial research and development and have already achieved significant speedups (2x-1000x) over CPU implementations in various fields [3][4][5].
The GPU is well suited to executing the same program over many different data elements. This is called data parallelism: data elements are mapped to the parallel threads available on the GPU [6]. Data parallelism gives high gains when the processing of the data elements is independent. The prime areas of data parallelism are 3D rendering, stereo vision, pattern recognition, image and video processing, and medical applications.
A large performance gap exists between the GPU and the general-purpose multi-core CPU. An architecture-level comparison of the CPU and GPU is given in Fig. 1. The design of a CPU is optimized for sequential programming: it uses sophisticated control logic to allow instructions from a single thread of execution to execute in parallel, or even out of their sequential order, while maintaining the appearance of sequential execution. Modern CPU microprocessors typically have four large processor cores designed to deliver strong sequential code performance, which is not enough to process huge data. A basic GPU model has a large number of processor cores, ALUs, control units and various types of memory. In general, heterogeneous CPU-GPU computation is preferable to a standalone CPU or GPU implementation: dependent processes are best kept on the CPU, while independent processes can be accelerated by the GPU. A GPU running a high number of threads gives better performance.
This paper reviews the use of the GPU programming model in medical image analysis and illustrates some applications with examples. The general framework of the medical image analysis pipeline is given in Fig. 2. The computational complexity of all these stages increases rapidly when handling higher-dimensional data. This paper analyzes existing GPU-based medical imaging algorithms that have reduced processing time in recent years and, finally, elaborates the optimization techniques that help to achieve additional speedup on the GPU.
The rest of the paper is organized as follows: Section 2 gives an overview of the GPU computing model in terms of its architectural paradigm and software. Section 3 covers GPU computing in medical image analysis and the results of some major algorithms. Section 4 discusses optimization techniques for medical imaging algorithms and Section 5 concludes the article.
Fig.1. Architecture overview of CPU and GPU
Fig.2. Medical image analysis pipeline (medical image volume → image denoising → registration → segmentation → visualization → rendered image)
2. OVERVIEW OF GPU COMPUTING MODEL – CUDA
The rapid development of NVIDIA GPUs across architecture generations is summarized in Table 1. NVIDIA introduced its massively parallel architecture, the compute unified device architecture (CUDA), in 2006, marking an evolution in GPU programming models. CUDA is a parallel programming model and instruction set architecture that uses the parallel compute engine of NVIDIA GPUs to solve large computational problems. CUDA is an extension of the C programming language. A CUDA program contains two phases that execute on either the host (CPU) or the device (GPU). The host code has no data parallelism; the phases that exhibit a rich amount of data parallelism are implemented in the device code. A CUDA program uses the NVIDIA C compiler (NVCC), which separates the two phases during the compilation process. The host code is ANSI C code, and the device code is ANSI C code with extended keywords. Windows users can compile CUDA programs with Microsoft Visual Studio 2008 onwards using NVIDIA Nsight; the Eclipse IDE supports the Linux and Mac platforms. Other languages and programming interfaces that support parallel computing are OpenCL (Open Computing Language), DirectX Compute and FORTRAN [2].

Table 1. Evolution of NVIDIA GPU architectures

Configuration | GeForce GT 320 | GeForce GTX 590 | GeForce GTX 690
Micro architecture | Tesla | Fermi | Kepler
Number of processors | 72 | 1024 | 3072
Global memory size (GB) | 1 | 3 | 4
Memory bandwidth (GB/s) | 25.3 | 327.7 | 384
Processor clock (GHz) | 1.3 | 1.2 | 1.0
Memory clock (GHz) | 0.8 | 1.7 | 6.0
2.1 GPU - CUDA architecture

The GPU-CUDA model is described here at three levels: the programming model, the memory model and the CUDA work flow.

2.1.1 GPU - CUDA programming model
GPU-CUDA hardware is organized around three main levels that must be used effectively to exploit the full computational capability of the GPU [2]. Grids, blocks and threads build up the CUDA architecture as shown in Fig. 3. CUDA is capable of executing a large number of parallel threads. Threads are grouped into blocks, and blocks are grouped into a grid. In this three-level hierarchical architecture, execution is independent among the entities of the same level. A grid is a set of thread blocks that may each be executed independently. A block is organized as a 3D array of threads and each block has a unique block ID (blockIdx). Threads execute the kernel function and each thread has a unique thread ID (threadIdx). The total size of a block is limited to 1024 threads.
Fig. 3 Programming model of GPU- CUDA model
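As a minimal sketch of this programming model (not code from the paper), the kernel below maps one thread to one pixel of a 2D image using blockIdx, blockDim and threadIdx; the kernel name, image layout and launch configuration are illustrative assumptions.

```c
// One thread per pixel: invert an 8-bit image. Names are illustrative.
__global__ void invertKernel(const unsigned char *in, unsigned char *out,
                             int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column index of this thread
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row index of this thread
    if (x < width && y < height)                     // guard against partial blocks
        out[y * width + x] = 255 - in[y * width + x];
}

// A typical launch: 16x16 = 256 threads per block (well under the 1024-thread
// limit) and enough blocks to cover the whole image.
// dim3 block(16, 16);
// dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
// invertKernel<<<grid, block>>>(d_in, d_out, width, height);
```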
2.1.2 GPU - CUDA memory model

The GPU device provides multiple memory spaces for thread execution. Fig. 4 shows the CUDA device memory organization, and the CUDA memory types and their properties are given in Table 2 [4]. A GPU has M streaming multiprocessors (SMs) with N streaming processor cores (SPs) per SM. Each thread can access variables in local memory and registers. Registers have the largest bandwidth, so frequently accessed variables are stored in registers. Each block has its own shared memory of 16 KB or 48 KB that can be accessed by all threads within the block.
The constant memory supports short-latency, high-bandwidth, read-only access by the device when all threads simultaneously access the same location; the constant cache size is limited to 64 KB. Texture memory can be used as a form of cache to avoid global memory bandwidth limitations and to handle small irregular memory accesses; it is used in visualization and its size is 32 KB per multiprocessor. The global memory is the largest device memory and supports read and write operations with low bandwidth. The global and constant memories can be written by the host code, which transfers data to and from the device, as illustrated by the bidirectional arrows between these memories and the host in Fig. 4. All threads can access the global memory simultaneously.
Fig.4 GPU – CUDA memory model
A unified level 2 (L2) cache was introduced with compute capability 2.0 and higher for fast data storage shared by all the multiprocessors. The L2 cache helps to avoid bottleneck problems in the GPU, and any thread can update or access values in it. The level 1 (L1) cache has 64 KB of on-chip memory per multiprocessor that can be partitioned to favor either L1 cache or shared memory for read/write operations.
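The short sketch below (an illustration under stated assumptions, not taken from the paper) shows how the memory spaces of Table 2 typically appear in a kernel: a convolution mask in constant memory, a per-block tile with a one-pixel halo in shared memory, per-thread indices in registers and the image itself in global memory. The kernel and buffer names are assumptions.

```c
#define TILE 16

__constant__ float c_mask[9];                    // constant memory: read-only, cached

__global__ void conv3x3(const float *in, float *out, int width, int height)
{
    __shared__ float tile[TILE + 2][TILE + 2];   // shared memory: one tile per block + halo

    int x = blockIdx.x * TILE + threadIdx.x;     // registers: per-thread scalars
    int y = blockIdx.y * TILE + threadIdx.y;

    // Stage the tile (clamped at the image border) from global into shared memory.
    for (int dy = threadIdx.y; dy < TILE + 2; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2; dx += TILE) {
            int gx = min(max((int)blockIdx.x * TILE + dx - 1, 0), width  - 1);
            int gy = min(max((int)blockIdx.y * TILE + dy - 1, 0), height - 1);
            tile[dy][dx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x < width && y < height) {
        float sum = 0.0f;
        for (int j = 0; j < 3; j++)
            for (int i = 0; i < 3; i++)
                sum += c_mask[j * 3 + i] * tile[threadIdx.y + j][threadIdx.x + i];
        out[y * width + x] = sum;                // result written back to global memory
    }
}
```

Each input pixel is read from global memory roughly once per block instead of up to nine times, which is the kind of access-pattern optimization revisited in Section 4.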
Table 2. CUDA memory types and properties (R - read access, W - write access, N/A - not applicable)

Memory | Access | Cached | Scope | Lifetime | Bandwidth | Location
Register | R/W | N/A | One thread | Thread | ~8.0 TB/s | On-chip
Local | R/W | No | One thread | Thread | - | Off-chip
Shared | R/W | N/A | All threads in a block | Block | ~3.4 TB/s | On-chip
Global | R/W | No | All threads + host | Application | ~336 GB/s | Off-chip
Constant | R | Yes | All threads + host | Application | - | Off-chip
Texture | R | Yes | All threads + host | Application | - | Off-chip

2.1.3 CUDA work flow model
The CUDA execution flow is shown in Fig. 5. GPU threads are much lighter weight than CPU threads. A CUDA program starts with host execution. Before a kernel is started, all the necessary data are transferred from the host to allocated device memory. The CPU then launches the kernel function, which generates a large number of threads to exploit data parallelism, and the execution flow moves to the device. The resulting data are finally transferred back to the host for further processing.
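A hedged host-side sketch of this work flow is given below (allocate device memory, copy in, launch, copy back, free); error checking is omitted and the function, kernel and buffer names are assumptions rather than code from any of the surveyed works.

```c
#include <cuda_runtime.h>

void runOnGpu(const unsigned char *h_in, unsigned char *h_out, size_t n)
{
    unsigned char *d_in = NULL, *d_out = NULL;

    cudaMalloc((void **)&d_in,  n);                        // allocate device memory
    cudaMalloc((void **)&d_out, n);
    cudaMemcpy(d_in, h_in, n, cudaMemcpyHostToDevice);     // host -> device transfer

    int threads = 256;                                     // threads per block
    int blocks  = (int)((n + threads - 1) / threads);      // enough blocks to cover n elements
    // someKernel<<<blocks, threads>>>(d_in, d_out, (int)n); // kernel launch (kernel defined elsewhere)
    cudaDeviceSynchronize();                               // wait for the device to finish

    cudaMemcpy(h_out, d_out, n, cudaMemcpyDeviceToHost);   // device -> host transfer
    cudaFree(d_in);                                        // release device memory
    cudaFree(d_out);
}
```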
2.2 GPU - CUDA Real Time Software
Worldwide, a great deal of simulation and accelerated real-time software using CUDA has been developed for the medical domain. AxRecon is an image reconstruction solution for medical imaging that eliminates image-reconstruction bottlenecks using CUDA in CT scanners [7]. ELEKS helped a medical device manufacturer reduce patient assessment time by accelerating MRI scanner post-processing software with CUDA [8]; by parallelizing the singular value decomposition on the MRI scanner, the reconstruction time was reduced by up to 155x. EGSnrc is a well-known Monte Carlo simulation package for coupled electron-photon transport that is widely used in medical physics applications [9]; running EGSnrc with CUDA achieved speedup ratios of 20x - 40x. The Aetina M3N970M-MN is 4D ultrasound equipment that uses CUDA cores to perform advanced 3D visualization of ultrasound data using the latest phase-shift harmonic imaging [10].

3. GPU COMPUTATION FOR MEDICAL IMAGE ANALYSIS
Nowadays, the modern medical industry produces large quantities of data and processes them with complex algorithms. Generally, 2D, 3D and 4D volumes are generated by the medical imaging modalities for diagnosis and surgical planning. These factors motivate the need for high-performance computing systems with huge computational power and suitable hardware configurations [11]. The major techniques involved in medical image analysis are denoising, registration, segmentation and visualization, as shown in Fig. 2. Filter design and registration are common preprocessing steps. Segmentation simplifies and changes the representation of the image into something that is more meaningful and easier to analyze and diagnose. Visualization covers the main post-processing methods of medical image representation. Some of the latest GPU computations on these medical image analysis tasks are given in Table 3 and elaborated in the following sections.
Fig. 5 CUDA work flow
3.1 Image Denoising
Medical images obtained from MRI are generally affected by random noise that arises during image acquisition, measurement and transmission. Addressing this problem can improve diagnosis and surgical procedures. Image denoising is an important task in medical imaging applications in order to enhance and recover hidden details in the data [12], and image registration and segmentation algorithms reach their expected accuracy when denoising is applied first. The most commonly used denoising algorithms in the medical domain are adaptive filtering, anisotropic diffusion, bilateral filtering and the non-local means filter [13]. All these algorithms partially or fully support data parallelism using a pixel-per-thread or voxel-per-thread scheme [14][15][16].

3.1.1 Adaptive filtering
This denoising approach uses an adaptive filter introduced by Knutsson et al. in 1983 [17]. The adaptive filter is a self-modifying digital filter that adjusts its coefficients in an attempt to minimize an error function, which measures the distance between the reference signal and the output of the adaptive filter [18]. The input-output relationship of the adaptive filter is described by

y(k) = \sum_{i=0}^{N} w_i \, x(k-i)    (1)

where N is the filter order, x(k) is a vector, x(k) = [x(k), x(k-1), ..., x(k-N)]^T, and w is the filter coefficient vector, w = [w_0, w_1, ..., w_N].
Adaptive filtering is a direct method that does not need to be iterated. Eklund et al. proposed a parallel denoising technique on 4D data using adaptive filters [19]. The GPU algorithm was applied to a 4D CT heart dataset with a resolution of 512×512×445×20. The GPU implementations of spatial filtering and fast Fourier transform (FFT) based filtering took 25 minutes and 8 minutes respectively, whereas the multithreaded CPU implementation took a few days for spatial filtering and about 50 minutes for FFT-based filtering.
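A minimal sketch of Eq. (1) on the GPU (not the implementation of [19]) assigns one thread to one output sample y(k); the kernel and parameter names are illustrative.

```c
__global__ void firFilter(const float *x, const float *w, float *y,
                          int length, int order)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= length) return;

    float sum = 0.0f;
    for (int i = 0; i <= order; i++)       // y(k) = sum_{i=0..N} w_i * x(k - i)
        if (k - i >= 0)
            sum += w[i] * x[k - i];
    y[k] = sum;
}
```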
Table 3. Existing works using the GPU CUDA model on medical image analysis

Topic | Authors, Year & Ref | Method | CPU Name | Speed | GPU Name | Cores | Materials | Speed Up Gain (X)
Denoising | Eklund et al., 2011 [19] | Adaptive filter | Intel Xeon | 2.4 GHz + 4 cores | NVIDIA GTX 580 | 512 | 4D CT heart dataset | 14 - 230
Denoising | Wang, 2013 [21] | Anisotropic diffusion | Intel Core i5-3210M | 2.5 GHz | NVIDIA GT 635 | 96 | 3D angiogram images | 27 - 74
Denoising | Attia et al., 2015 [22] | Anisotropic diffusion | AMD Athlon II X2 240 | 2.8 GHz | NVIDIA GTS 250 | 128 | 2D mammogram images | 20 - 26
Denoising | Jiang et al., 2011 [25] | Bilateral filter | - | - | NVIDIA Quadro Plex 2200-S4 | 960 | 2D ultrasound images | 42 - 55
Denoising | Howison, 2010 [26] | Bilateral filter & anisotropic diffusion | - | - | NVIDIA Tesla S2050 | 448 | 3D MRI brain images | 5
Denoising | Cuomo et al., 2014 [30] | Non local means | Intel Xeon | 2.8 GHz | NVIDIA Tesla C2050 | 448 | 3D MRI brain images | 6 - 103
Denoising | Nguyen et al., 2016 [31] | Non local means | Intel Xeon E5520 | 2.3 GHz | NVIDIA Tesla C1060 & Tesla C2070 | 240 & 448 | 3D MRI brain images | 148 - 510
Registration | Massanes et al., 2011 [35] | Block matching algorithm | AMD Phenom II X6 1055T | 2.8 GHz | NVIDIA GTX 275 | 240 | High definition television images | 200 - 1110
Registration | Li et al., 2015 [36] | Block matching algorithm | Intel Xeon E5620 | 2.4 GHz | NVIDIA Tesla K20 | 2496 | 3D lungs CT dataset | -
Registration | Tamaki et al., 2010 [37] | Rigid transform | Intel Core 2 Quad | 2.4 GHz + 4 cores | NVIDIA GT 8800 | 112 | 3D point sets | 60
Segmentation | Olmedo et al., 2012 [38] | Threshold | AMD Phenom II Quad-core | 3.2 GHz | NVIDIA GT 430 | 96 | Digital images | 65
Segmentation | Park et al., 2014 [40] | Region growing | Intel Core i5-3570 | 3.4 GHz + 4 cores | NVIDIA GTX 285 | 240 | 3D CT datasets | 7 - 32
Segmentation | Westhoff et al., 2014 [41] | Region growing | Intel Xeon X5650 | 2.7 GHz | NVIDIA Tesla M2050, M2070 | 448 | PLI human brain data | 20
Segmentation | Kalaiselvi et al., 2016 [44] | Morphology | Intel i5 2500 | 2.9 GHz | NVIDIA Quadro K5000 | 1536 | Binary images | 2 - 46
Segmentation | Koay et al., 2016 [45] | Morphology | Intel Xeon E5-2650 | 2.3 GHz | NVIDIA Tesla K20M | 2496 | Binary images | 70
Segmentation | Pan et al., 2008 [47] | Watershed | Intel Pentium D | 2.8 GHz | NVIDIA GT 8500 | 16 | 3D abdomen and brain dataset | -
Segmentation | Vitor et al., 2009 [48] | Watershed | AMD Phenom II X3 | 2.6 GHz | NVIDIA GTX 295 | 480 | - | 2
Visualization | Smistad et al., 2011 [53] | Marching cubes | Intel i5 | 2.7 GHz | NVIDIA GTX 460 | 336 | 3D angiography and brain dataset | -
Visualization | Weinlich et al., 2008 [55] | Ray casting | Intel Xeon E5410 | 2.3 GHz | NVIDIA GTX 8800, Quadro FX 5600 | 575 | CT images | 148
Visualization | Zhang et al., 2009 [56] | Ray casting | Intel Pentium D 820 | 2.8 GHz | NVIDIA 6800 GS, 7900 GTO & 8800 GTS | 17, 32 & 96 | 4D cardiac datasets | 127 & 179
3.1.2 Anisotropic diffusion

The anisotropic diffusion filter is an iterative algorithm introduced by Perona and Malik in 1987. The algorithm aims at reducing image noise without affecting significant parts of the image content such as edges, regions, lines or other details [20]. The diffusion equation reduces to the heat equation for removing noise. It encourages diffusion within regions and inhibits it across strong edges, so edges are preserved while noise is removed. This process is called Perona-Malik diffusion, or inhomogeneous and nonlinear diffusion. Perona and Malik defined the following two diffusion functions:

g(\nabla I) = \frac{1}{1 + \left( \frac{\|\nabla I\|}{K} \right)^{2}}    (2)

g(\nabla I) = e^{-\left( \frac{\|\nabla I\|}{K} \right)^{2}}    (3)

The partial differential equation (PDE) for anisotropic diffusion is defined as:

I_{t+1}(S) = I_{t}(S) + \frac{\lambda}{|\eta_{S}|} \sum_{P \in \eta_{S}} g(|\nabla I_{S,P}|) \, \nabla I_{S,P}    (4)

where I is the input noisy image, g is the diffusion coefficient, K is the gradient threshold parameter, S denotes the pixel position in the discrete 2D grid, t is the iteration step, \lambda \in (0, 1) is the rate of diffusion, \nabla is the gradient operator with \nabla I_{S,P} = I_{t}(P) - I_{t}(S), and \eta_{S} is the 4-neighborhood of pixel S with P \in \eta_{S} = {East, West, North, South}.

The PDE used in the diffusion process requires a large number of time steps. Here the GPU can be used efficiently to process each element in parallel, with one pixel or voxel per thread. Wang et al. observed that the anisotropic diffusion algorithm requires high computational power and suffers from low execution rates on high-resolution medical imaging volumes [21]. They proposed a vessel-enhancing diffusion algorithm for angiogram images using GPU-CUDA, which reduced the computing time by up to 27x. Attia et al. presented a method that examined the effect of memory optimization on the performance of a CUDA-accelerated anisotropic diffusion algorithm on mammogram images [22]. They obtained high computational gains when the texture and shared memory were used effectively.
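As a hedged illustration (not the implementation of [21] or [22]), the kernel below performs one Perona-Malik iteration of Eq. (4) with the diffusion function of Eq. (2) using one thread per pixel; the host would launch it once per iteration step t, swapping the input and output buffers between launches.

```c
__global__ void diffusionStep(const float *in, float *out,
                              int width, int height, float K, float lambda)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float c = in[y * width + x];
    // Gradients toward the 4-neighborhood (clamped at the image border).
    float gE = (x + 1 < width  ? in[y * width + (x + 1)] : c) - c;
    float gW = (x - 1 >= 0     ? in[y * width + (x - 1)] : c) - c;
    float gS = (y + 1 < height ? in[(y + 1) * width + x] : c) - c;
    float gN = (y - 1 >= 0     ? in[(y - 1) * width + x] : c) - c;

    // Diffusion coefficient g of Eq. (2) applied to each directional gradient.
    float sum = gE / (1.0f + (gE / K) * (gE / K))
              + gW / (1.0f + (gW / K) * (gW / K))
              + gS / (1.0f + (gS / K) * (gS / K))
              + gN / (1.0f + (gN / K) * (gN / K));

    out[y * width + x] = c + (lambda / 4.0f) * sum;        // |eta_S| = 4
}
```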
3.1.3 Bilateral filtering
The bilateral filter is a non-linear, edge-preserving and noise-reducing filter introduced by Tomasi and Manduchi in 1998 [23]. The intensity value of each pixel depends on a weighted average of the values of neighboring pixels. The weight is determined by a Gaussian distribution of both the spatial and the intensity distance. The spatial and intensity distances are handled by the domain and range filters respectively; the term domain refers to the pixel location and range refers to the pixel value in the image. Mathematically, at a pixel location x, the output of a bilateral filter is a combination of a shift-invariant domain filter with a Gaussian range filter [24]. The Gaussian function is defined as

G_{\sigma}(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\left( -\frac{x^{2} + y^{2}}{2\sigma^{2}} \right)    (5)

The bilateral filter is defined as

I(x) = \frac{1}{C} \sum_{y \in N(x)} G_{\sigma_d}(x, y) \, G_{\sigma_r}(|I(x) - I(y)|) \, I(y)    (6)

where \sigma_d is the domain variance, \sigma_r is the range variance, N(x) is a spatial neighborhood of pixel I(x) and C is a normalization constant.

Jiang et al. proposed a method for parallel speckle detection and mask production with a bilateral filter using CUDA [25]. They used one thread per row and per column for speckle detection with shared memory in kernels 1 and 2; a thread-per-pixel scheme is implemented for mask production in kernels 3 and 4, and the adaptive bilateral filtering is done by kernels 5 and 6. This method is 44 times faster than the unoptimized CPU implementation in C. Howison compared GPU implementations of bilateral and anisotropic diffusion filters for 3D MRI brain datasets [26]; the noisy MRI brain images were gathered from the simulated BrainWeb repository [27]. The comparison covers total runtime, memory bandwidth, computational throughput and mean squared error. For a similar result, the bilateral approach is about five times faster than the diffusion filter on the GPU.
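The kernel below is a straightforward hedged sketch of Eq. (6) (not the multi-kernel scheme of [25]): one thread per pixel, a square neighborhood N(x) of the given radius, and the normalization constant C accumulated on the fly.

```c
__global__ void bilateralFilter(const float *in, float *out, int width, int height,
                                int radius, float sigmaD, float sigmaR)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float center = in[y * width + x];
    float sum = 0.0f, norm = 0.0f;                         // norm plays the role of C

    for (int dy = -radius; dy <= radius; dy++)
        for (int dx = -radius; dx <= radius; dx++) {
            int nx = min(max(x + dx, 0), width  - 1);      // clamp at the border
            int ny = min(max(y + dy, 0), height - 1);
            float v = in[ny * width + nx];
            float wSpace = expf(-(dx * dx + dy * dy) / (2.0f * sigmaD * sigmaD));
            float wRange = expf(-(v - center) * (v - center) / (2.0f * sigmaR * sigmaR));
            sum  += wSpace * wRange * v;                   // domain * range weighted sum
            norm += wSpace * wRange;
        }
    out[y * width + x] = sum / norm;
}
```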
3.1.4 Non local means
Non-local means (NLM) is a nonlinear filter that combines two important attributes of denoising filters: noise removal and edge preservation. NLM, introduced by Buades et al., uses the similarity of local patches in the image to determine the pixel weights [28][29]. Each pixel p of the output image is computed from the weighted average of all pixel intensities in the noisy image v:

NL(p) = \sum_{q} w(p, q) \, v(q)    (7)

where v is the noisy image and w(p, q) is a weight that satisfies 0 \le w(p, q) \le 1 and \sum_{q} w(p, q) = 1. The weights are computed as

w(p, q) = \frac{1}{Z(p)} \, e^{-\frac{d(p, q)}{h^{2}}}    (8)

The weights w(p, q) reflect the similarity between the neighborhoods N_p and N_q. The distance d(p, q) between the patches N_p and N_q is calculated using the Euclidean definition

d(p, q) = \| v(N_p) - v(N_q) \|_{2,F}^{2}    (9)

where N_k denotes a square neighborhood of fixed size centered at a pixel k, and F > 0 is the standard deviation of the Gaussian kernel. Z(p) is the normalizing constant given in Eq. (10), and h is the weight-decay control parameter:

Z(p) = \sum_{q} e^{-\frac{d(p, q)}{h^{2}}}    (10)
Cuomo et al. implemented a 3D non-local means parallel approach on GPU and multi-GPU architectures [30]. They observed that optimal results were obtained when the number of threads per block is between 128 and 256. Nguyen et al. addressed the computational cost of non-local means and proposed a parallel denoising algorithm for 3D MRI brain volumes [31]. This method enhances the performance of non-local means by combining voxel preselection and symmetric weight computation using multi-threaded CPU and multi-GPU implementations with a voxel-per-thread scheme. Their implementation showed that a limited number of threads per block (128) gave optimal gain and increased memory occupancy; they used 18944 registers per block and utilized the hardware only up to 50 percent. The optimized multicore (32 cores) CPU and multi-GPU (4 GPUs) implementations reduced the computation time by up to 148 and 520 times respectively compared with the unoptimized CPU implementation. Among these four filters, non-local means gives the best quality image but takes the most computation time.

3.2 Registration
Medical image registration determines the spatial alignment between a reference image and a spatially transformed image [32]. The reference image and the transformed image may be acquired from the same or different modalities. Image registration uses interpolation to determine each voxel in the transformed image from the corresponding intensity in the reference image. The GPU is commonly used in medical image registration because GPU hardware supports linear interpolation [33]; GPUs were initially built to accelerate interpolation operations in computer graphics games. Two popular registration algorithms are the block matching algorithm (BMA) and rigid transformation estimation (RTE).
3.2.1 Block Matching Algorithm

The block-matching algorithm (BMA) is the most popular method for motion estimation from an image sequence. The method splits an image into blocks and estimates the displacement of each block. It works by performing a local brute-force search for the block B_{i,j} in the reference image that best matches each block B'_{r,s} within the moving image [34]. BMA computes the similarity criterion C_{i,j}^{r,s} between B_{i,j} and B'_{r,s} and gives the best match. The similarity criterion C_{i,j}^{r,s} is

C_{i,j}^{r,s} = \sum_{a=r-\frac{N}{2}}^{r+\frac{N}{2}} \; \sum_{b=s-\frac{N}{2}}^{s+\frac{N}{2}} \frac{\overline{B}_{i,j} \times \overline{B}'_{a,b}}{\sigma(B_{i,j}) \times \sigma(B'_{a,b})}    (11)

where \overline{B}_{i,j} is the mean of B_{i,j} and \sigma is the standard deviation.

BMA gives a list of matches, i.e. a list of vectors with their associated scores. The kth match gives the (i, j)_k and (r, s)_k vectors and the similarity criterion score (C_{i,j}^{r,s})_k. The steps of BMA are shown in Fig. 6.
Fig.6 Steps of block matching algorithm (a) reference image (b) moving image (c) matches overview
Massanes et al. introduced a parallel BMA for motion estimation on high-definition television images using a multi-GPU CUDA computing engine [35]. Their BMA uses the summed absolute difference (SAD) error criterion and a full grid search (FS) to find the optimal block displacement. The results showed that the GPU reduces the computing time by a factor of about 200 for an integer search grid and 1000 for a non-integer search grid. In the qualitative analysis, small differences occur between the CPU and GPU output images because that GPU did not follow the IEEE 754 floating-point standard; GPUs from the Fermi architecture onwards support IEEE 754. Li et al. presented a parallel block matching algorithm for lung CT image registration using the GPU-CUDA model [36], achieving a fast implementation and accurate registration.
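A hedged sketch of block matching on the GPU is given below (it is not the implementation of [35] or [36] and uses the simpler SAD criterion rather than Eq. (11)): one thread evaluates one candidate displacement (dx, dy) of a single N×N block, and the host picks the minimum afterwards.

```c
__global__ void sadSearch(const float *ref, const float *mov, int width, int height,
                          int bx, int by,          // top-left corner of the block in ref
                          int N, int range,        // block size and search range
                          float *sadOut)           // (2*range+1)^2 SAD scores
{
    int dx = (int)(blockIdx.x * blockDim.x + threadIdx.x) - range;
    int dy = (int)(blockIdx.y * blockDim.y + threadIdx.y) - range;
    if (dx > range || dy > range) return;

    int idx = (dy + range) * (2 * range + 1) + (dx + range);

    // Displacements that push the block outside the moving image get a large score.
    if (bx + dx < 0 || by + dy < 0 || bx + dx + N > width || by + dy + N > height) {
        sadOut[idx] = 3.4e38f;
        return;
    }

    float sad = 0.0f;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sad += fabsf(ref[(by + j) * width + (bx + i)]
                       - mov[(by + dy + j) * width + (bx + dx + i)]);
    sadOut[idx] = sad;                             // minimum picked on the host
}
```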
3.2.2 Rigid Transformation Estimation
Rigid transformation estimation (RTE) is one of the simplest forms of image registration in medical imaging. A rigid-body registration has six degrees of freedom: the transformation in three dimensions includes three translations and three rotations. RTE finds the transformation between the reference and moving images with the support of the vectors given by the BMA. A rigid transformation T represents the linear and/or angular displacement of a rigid body; it can be formally defined as

T : V \rightarrow R \cdot V + t    (12)
subject to R^{T} = R^{-1} and \det(R) = 1, where V is a vector, T(V) is the transformed vector, R is a rotation matrix and t is a translation vector.
Note that the second constraint excludes reflections, which are represented by orthogonal matrices with determinant −1. The rigid transformation therefore has six degrees of freedom: three translations and three rotations. The similarity transformation is not included in the rigid transformation. Tamaki et al. introduced CUDA-based implementations of two 3D point-set registration algorithms, Softassign and EM-ICP [37]. Their EM-ICP aligns the point sets in less than 7 seconds on a GeForce 8800GT, whereas an optimized CPU implementation in OpenMP on an Intel Core 2 Quad takes about 7 minutes.

3.3 Segmentation
Many segmentation methods are computationally expensive when running on the large datasets produced by medical modalities. Segmentation of image data before or during an operation has to be fast and accurate in the clinical environment. Image segmentation in medical imaging is often used to segment brain structures, blood vessels, tumors and bones. Popular segmentation approaches include thresholding, region growing, morphology and the watershed transform.

3.3.1 Thresholding
Thresholding is a process that segments each pixel or voxel using one or more threshold values. Thresholding is the simplest technique for data parallelism, using one voxel per thread in a 3D image or one pixel per thread in a 2D image. The simplest binary threshold is given as

S(x, y) = \begin{cases} 1 & \text{if } I(x, y) > T \\ 0 & \text{otherwise} \end{cases}    (13)

where T is the threshold value, I is the input image, S is the output image, and x and y are coordinate positions.
Fig. 7 Simple thresholding on MRI (a) Sample image (b) Thresholded image (T=128)
From Eq. (13) we see that the thresholding decision at each pixel or voxel is completely independent of the others, and the GPU can create as many threads as there are pixels or voxels in the image to support this data parallelism. A sample MRI brain image and its thresholded image (T=128) are shown in Fig. 7. Olmedo et al. compared the thresholding technique implemented with CUDA and with OpenCV (Open Source Computer Vision) [38]; CUDA gives better performance than OpenCV in most cases for 2D images of various sizes.
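Eq. (13) maps directly onto a one-thread-per-pixel kernel; the sketch below is illustrative (names and data layout are assumptions), not code from [38].

```c
__global__ void thresholdKernel(const unsigned char *in, unsigned char *out,
                                int numPixels, unsigned char T)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numPixels)
        out[i] = (in[i] > T) ? 1 : 0;   // S = 1 if I > T, 0 otherwise (Eq. 13)
}
```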
3.3.2 Region Growing
Region growing is a commonly used medical image segmentation technique [39]. It starts from an initial seed point inside the object, given either manually or automatically using prior knowledge. The operation starts from the seed point and connects neighboring pixels that are similar to the seed point according to some criterion, for example pixel intensity, grayscale texture or color. The computation time of seeded region growing is directly proportional to the size of the segmented region in the 3D volume. A sample brain MRI and gray matter extractions for a seed point after different numbers of iterations are shown in Fig. 8.
Fig.8. Skull stripping and gray matter extraction using the region growing algorithm (a) Sample T1 brain MRI (b) Initial seed point (c) Iteration=200 (d) Iteration=1000 (e) Iteration=2000 (f) Iteration=5000 (g) Iteration=10000
Park et al. proposed a novel method for parallelizing seeded region growing using CUDA [40]. Its performance was compared with single- and quad-core CPU implementations using OpenMP on lung and colon images. In the CUDA implementation, information from neighboring voxels is gathered using eight threads due to the limited number of available threads. When the segmented-region size increased, the single- and quad-core CPU methods required considerably more computation time, whereas the CUDA implementation exhibited nearly constant computation time. Westhoff presented a parallel seeded region growing algorithm for medical images obtained from polarized light imaging (PLI) [41]. Due to the very high resolution at sub-millimeter scale, an immense amount of image data has to be reconstructed three-dimensionally before it can be analyzed. They chose region growing for segmentation and accelerated the algorithm by a factor of about 20 using CUDA, achieving the highest gain when creating 448 threads per block.
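A minimal frontier-propagation sketch of parallel seeded region growing is shown below; it is a hedged illustration under simple assumptions (4-connectivity, intensity tolerance around the seed value), not the exact scheme of [40] or [41].

```c
// One iteration: every unlabeled pixel joins the region if a 4-neighbor is already
// labeled and its own intensity is within tol of the seed value. The host clears
// *changed, launches the kernel, and repeats until *changed stays 0.
__global__ void growStep(const float *img, unsigned char *label,
                         int width, int height, float seedValue, float tol,
                         int *changed)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    if (label[y * width + x]) return;              // already part of the region

    bool neighborIn =
        (x > 0          && label[y * width + (x - 1)]) ||
        (x < width - 1  && label[y * width + (x + 1)]) ||
        (y > 0          && label[(y - 1) * width + x]) ||
        (y < height - 1 && label[(y + 1) * width + x]);

    if (neighborIn && fabsf(img[y * width + x] - seedValue) <= tol) {
        label[y * width + x] = 1;
        *changed = 1;                              // another pass is needed
    }
}
```

Each launch grows the region by at least one layer of pixels, so the number of launches is bounded by the diameter of the segmented region.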
3.3.3 Morphology
Morphological image processing is a structure-based analysis method used in combination with some segmentation methods [42]. These operations are based on set theory over binary images and were introduced by Serra [43]. The fundamental morphological operations are dilation and erosion, which expand and shrink image regions respectively. They use a small matrix mask called a structuring element, filled with ones and zeros, in various shapes such as diamond, disk, octagon, square, line or arbitrary. Morphological operations are fully independent pixel-based operations and are therefore well suited to parallel processing with CUDA.
0 1 0
1 1 1
0 1 0

Fig. 9. Morphological operations on brain MRI (a) Sample MRI (b) Ridler thresholding (c) Dilation (d) Erosion (e) Disk structuring element
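A hedged one-thread-per-pixel sketch of binary erosion with the 3×3 cross structuring element shown above is given below; it is an illustration, not the implementation of [44] or the bitmap/vHGW method of [45].

```c
__global__ void erodeCross3x3(const unsigned char *in, unsigned char *out,
                              int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // A pixel survives erosion only if it and its 4-neighbors are all foreground;
    // pixels outside the image are treated as background (0).
    unsigned char v = in[y * width + x];
    if (x > 0)          v &= in[y * width + (x - 1)]; else v = 0;
    if (x < width - 1)  v &= in[y * width + (x + 1)]; else v = 0;
    if (y > 0)          v &= in[(y - 1) * width + x]; else v = 0;
    if (y < height - 1) v &= in[(y + 1) * width + x]; else v = 0;
    out[y * width + x] = v;
    // Dilation is the dual operation (take the OR over the structuring element
    // instead of the AND).
}
```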
A sample MRI brain gray-level image, the Ridler-thresholded binary image, the dilated image, the eroded image and the corresponding structuring element are shown in Fig. 9. Various works have implemented the morphological operations using CUDA. Kalaiselvi et al. implemented parallel morphological operations using CUDA on general images [44]. They took images of various sizes for their experimental study, implementing the operations in C++ and MATLAB on the CPU and in CUDA on the GPU, and concluded that the GPU implementation improved performance as the image size increased.
Koay et al. proposed a parallel implementation of morphological operations on binary images using CUDA [45]. The method combines a bitmap representation with the van Herk/Gil-Werman (vHGW) algorithm and uses a separable structuring element to reduce the time complexity from n·x·y to n·(x+y), where n is the number of pixels in the input image and x and y are the structuring element dimensions. They achieved a 70-times speedup for the horizontal erosion operation and a 20-times speedup for the vertical erosion operation compared with existing methods.

3.3.4 Watershed
In watershed segmentation, the grayscale image is viewed as a topographic surface and treated as a three-dimensional object, the third dimension being the intensity value of the pixel [46]. High intensities are considered hills (watershed ridge lines) and low intensities valleys (catchment basins). The watershed transform searches for the ridge lines that divide neighboring catchment basins. The watershed algorithm is particularly useful for segmenting objects that touch one another; one drawback is that it leads to over-segmentation when the image valleys are affected by noise.
Pan et al. implemented several medical image segmentation algorithms using CUDA [47], including a multi-degree watershed segmentation of abdomen and brain images. Vitor et al. proposed two parallel algorithms for the watershed transform aimed at fast image segmentation on off-the-shelf GPUs [48]. The algorithms combine serial and parallel techniques in a heterogeneous implementation and showed good results, with the performance gain reaching up to 14% as the image size increases.

3.4 Visualization
Medical image processing combined with visualization opens new ways to diagnose and to evaluate the effect of treatment given to the patient more accurately and reliably using computers. In many medical imaging applications, visualization is essential for medical diagnosis and surgical planning in order to mine the important information contained in the 2D/3D imaging datasets produced by the various modalities [49]. Creating 3D visualizations of large medical datasets with serial processing on the CPU is very time consuming and inefficient. The GPU supports this diagnostic process, which depends heavily on volumetric imaging methods that must be visualized in real time. Image visualization is categorized into two groups: surface rendering and volume rendering.

3.4.1 Surface Rendering
Surface rendering constructs a polygonal surface from the given medical dataset and renders the surfaces [50]. Surface rendering techniques require contour extraction to define the surface of the structure to be visualized. A surface rendering model created by contour extraction of edges using 3D Doctor on the BRATS tumor dataset is shown in Fig. 10 (a) [51]. An algorithm is applied to place surface patches or tiles at each contour point, and the surface is rendered after shading and hidden surface removal. Standard computer graphics processes can be applied for object shading.
The GPU accelerates the geometric transformation and rendering processes. GPUs were originally made to speed up the memory-intensive calculations in demanding 3D computer games, and these devices are now increasingly used to accelerate numerical computations such as texture mapping, polygon rendering and coordinate transformation.
The marching cubes (MC) algorithm was introduced by Lorensen and Cline for creating a 3D surface consisting of triangles from a volumetric dataset of scalars [52]. The algorithm uses a parameter called the iso-value to classify the points in the dataset as either inside or outside the surface. The dataset is divided into a grid so that a number of cubes are formed, with the corners of each cube given by the data points. Smistad et al. proposed a data-parallel marching cubes algorithm for surface rendering on the GPU [53]. They used both OpenCL and CUDA for the GPU programming; OpenCL gave better performance than CUDA because the largest dataset caused memory exhaustion.
3.4.2 Volume Rendering
Volume rendering resolves the issue of accurately representing detected surfaces and is used to visualize three-dimensional data. It visualizes the three spatial dimensions with the help of a 2D projection through a semi-transparent volume, and its major application area is medical imaging. One of the most popular volume rendering techniques is the ray-casting algorithm. Ray casting does not rely on any geometric structure and so avoids the limitations of surface extraction; in particular, it solves a major limitation of surface extraction, namely the failure to project a thin shell in the acquisition plane [54]. It needs random access into the three-dimensional dataset, which requires a large amount of computational power and bandwidth. CUDA offers a solution to these problems.
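As a hedged sketch of GPU ray casting (not the renderers of [55] or [56]), the kernel below casts one axis-aligned ray per output pixel through the volume and keeps the maximum intensity, the simplest maximum-intensity-projection form of ray casting; a full renderer would additionally transform the rays by the camera pose, sample with (texture-based) trilinear interpolation and composite opacities.

```c
__global__ void mipRayCast(const float *volume, float *image,
                           int nx, int ny, int nz)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float maxVal = 0.0f;
    for (int z = 0; z < nz; z++) {                         // march along the ray
        float v = volume[(size_t)z * nx * ny + y * nx + x];
        if (v > maxVal) maxVal = v;
    }
    image[y * nx + x] = maxVal;                            // projected pixel value
}
```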
Fig.10 The post-processing of the BRATS tumor MRI dataset (a) Surface rendering using contour extraction (b) Volume rendering using ray tracing
A parallel ray casting method for forward projection was proposed by Weinlich et al. using CUDA and OpenGL (Open Graphics Library) [55]. They used two GPUs (GeForce 8800 GTX and Quadro FX 5600) for the CUDA and OpenGL implementations. The results show that OpenGL performs better than CUDA 1.1, and that CUDA 2.0 is three times faster than the older CUDA 1.1; the final outcome of this work is a time reduction of up to 148 times compared with the unoptimized CPU. Zhang et al. presented a new algorithm that synchronizes the phases of the dynamic heart to clinical ECG signals in order to calculate and compensate for latencies in the visualization pipeline using the GPU [56]. They implemented 4D cardiac visualization using three algorithms, 3D texture mapping (3DTM), software-based ray casting (SOFTRC) and hardware-accelerated ray casting (HWRC), and compared the performance gain on three GPUs. Kalaiselvi et al. proposed a method for tumor boundary extraction and produced the corresponding volume rendering using 3D Doctor, as shown in Fig. 10 (b) [57][58].

4. DISCUSSION
We have presented a compendious survey of medical image analysis tasks that require GPU computation. The main findings of this review are summarized in Table 3, which covers the four areas of the medical imaging pipeline on the GPU. The open issues, limitations, system configurations, materials and speedup gains of each method have been discussed. This section collects findings and suggestions for optimizing medical imaging algorithms and their GPU implementations to achieve additional gains.
Optimized GPU implementations of medical imaging algorithms give additional speedup and attractive results. Denoising algorithms process each pixel independently and access data from a small neighborhood through a convolution mask. Every convolution mask of rank one is separable. Separable filtering decomposes the mask into a 1D filter along the x axis (horizontal pass) and a 1D filter along the y axis (vertical pass). The separable filter needs on the order of 2N operations per pixel instead of N² for a non-separable 2D convolution, and the saving grows with larger convolution masks. Medical image denoising algorithms should use this separable filtering technique when optimizing higher-dimensional data on the GPU.
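The sketch below illustrates the separable-filtering idea under simple assumptions (a 1D mask of length 2·radius+1, border clamping); it is not taken from any of the surveyed papers. The host runs the row pass into a temporary buffer and then the column pass, giving on the order of 2N multiply-adds per pixel instead of N².

```c
__global__ void convolveRows(const float *in, float *out, const float *mask1d,
                             int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -radius; k <= radius; k++) {
        int xx = min(max(x + k, 0), width - 1);            // clamp at the border
        sum += mask1d[k + radius] * in[y * width + xx];
    }
    out[y * width + x] = sum;                              // horizontal pass
}

__global__ void convolveCols(const float *in, float *out, const float *mask1d,
                             int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    float sum = 0.0f;
    for (int k = -radius; k <= radius; k++) {
        int yy = min(max(y + k, 0), height - 1);
        sum += mask1d[k + radius] * in[yy * width + x];
    }
    out[y * width + x] = sum;                              // vertical pass
}
```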
Programmers need to optimize their memory access patterns carefully to achieve high GPU performance. Shared memory latency is about 100x lower than global memory latency. The shared memory is very limited (48 KB per multiprocessor) but fast (~1.7 TB/s), which helps to reduce register pressure and global memory accesses; GPU implementations that use shared memory yield high throughput. In contrast, optimization has generally not been a focus of medical image registration research on the GPU because of its low computational intensity and the lack of computations that can be parallelized [33].
All segmentation methods use medical imaging data ranging from 2D to 4D. CUDA supports three-dimensional thread creation but not four-dimensional, so 4D or higher-dimensional medical imaging data require an efficient 3D thread-indexing approach for the segmentation process. NVIDIA continuously develops highly optimized libraries for the CUDA SDK, and it is important that new libraries continue to be produced and that existing ones are continuously improved; debugging and optimization tools have likewise improved significantly. Modern CUDA can also support deep learning for medical images [59]: libraries such as cuDNN, GIE, cuBLAS, cuSPARSE and NCCL support deep learning in CUDA, and some CUDA libraries (CUFFT and NPP) support 4D processing.
From our implementations we observed some facts about measuring computation time on the CPU and GPU. The computation speed of a CPU implementation is strongly language dependent (MATLAB is much slower than C), and optimized (multithreaded) CPU implementations are the proper baseline for comparison with the GPU. Computation time should be averaged over at least 10 executions of the algorithm, with timing functions placed at appropriate points in the serial and parallel code. Unaccounted factors such as console operations, file operations, dynamic array allocations, unnecessary loop execution and unwanted thread creation make an algorithm slower. Data transfer time should be included in the GPU time measurements. A choice of 128 or 256 threads per block is usually suggested as a good balance between memory latency, registers and threads. A heterogeneous CPU and GPU implementation is appropriate for time-complex algorithms.
CUDA programming has a few limitations. CUDA supports NVIDIA GPUs only, and its errors can be hard to understand and debug. Floating-point operations degrade performance compared with integer operations. The shared memory is very limited (48 KB per multiprocessor) and global memory has high access latency. A data transfer bottleneck also occurs between the CPU and GPU due to the bandwidth and latency of the PCI-Express bus.
5. CONCLUSION

In this review, we discussed the most common areas of GPU computing in medical image analysis. Existing works on medical image analysis were investigated and the performance gains obtained with CUDA programming were discussed. This investigation highlights the importance of GPU computing in the medical industry. A few optimization concepts were suggested for medical imaging algorithms, and some practical points for calculating the speedup ratio between CPU and GPU were noted, along with the limitations and future scope of GPU programming.

In the future, one of the emerging GPU fields is content-based image retrieval (CBIR). CBIR is one of the available solutions for automatically finding images similar to a query image in large image collections, and GPU-based hardware acceleration can improve retrieval efficiency in medical imaging when huge amounts of data have to be compared [60]. CUDA can also provide solutions in data mining for discovering associations between biological entities; in this field, a method called the artificial immune system (AIS) uses CUDA to refine explanatory models and successfully identify highly complex associations between genotypes and diseases [61].
REFERENCES
[1] Rodger J A. Discovery of medical Big Data analytics: Improving the prediction of traumatic brain injury survival rates by data mining Patient Informatics Processing Software Hybrid Hadoop Hive, Informatics in Medicine Unlocked, vol. 1, pp. 17–26, 2015.
[2] CUDA C Programming Guide, Technical report, Version 8.0, NVIDIA, 2017.
[3] Ghorpade J, Parande J, Kulkarni M and Bawaskar A. GPGPU Processing in CUDA Architecture, Advanced Computing: An International Journal (ACIJ), vol. 3, no. 1, pp. 105-120, 2012.
[4] Farber R. CUDA Application Design and Development, Elsevier, 1st Ed., pp. 1-336, 2011.
[5] https://streamhpc.com/our-experience/medical-technology, Last accessed on 21 June 2017.
[6] Kirk D B and Hwu W W. Programming Massively Parallel Processors: A Hands-on Approach, 2nd Ed., Morgan Kaufmann/Elsevier, pp. 1-514, 2012.
[7] http://www.acceleware.com/node/529, Last accessed on 21 June 2017.
[8] http://eleks.com/pdf/accelerated-image-processing-for-healthcare.pdf, Last accessed on 21 June 2017.
[9] Lippuner J and Elbakri I A. A GPU Implementation of EGSnrc's Monte Carlo Photon Transport for Imaging Applications, Physics in Medicine and Biology, vol. 56, no. 22, pp. 7145–7162, 2011.
[10] http://www.innodisk.com, Last accessed on 21 June 2017.
[11] Deserno T M, Handels H, Maier-Hein K H, Mersmann S, Palm C, Tolxdorff T, Wagenknecht G and Wittenberg T. Viewpoints on Medical Image Processing: From Science to Application, Current Medical Imaging Reviews, vol. 9, no. 2, pp. 79-88, 2013.
[12] Ouahabi A. A Review of Wavelet Denoising in Medical Imaging, Proceedings of the 8th International Workshop on Systems, Signal Processing and their Applications, IEEE, pp. 19-26, 2013.
[13] Eklund A, Dufort P, Forsberg D and LaConte S M. Medical Image Processing on the GPU – Past, Present and Future, Medical Image Analysis, vol. 17, no. 8, pp. 1073–1094, 2013.
[14] Li C Y and Chang H H. CUDA-Based Acceleration of Collateral Filtering in Brain MR Images, Eighth International Conference on Graphic and Image Processing, International Society for Optics and Photonics, vol. 10225, 2017.
[15] Jaros M, Strakos P, Karasek T, Riha L, Vasatova A, Jarosova M and Kozubek T. Implementation of K-means Segmentation Algorithm on Intel Xeon Phi and GPU: Application in Medical Imaging, Advances in Engineering Software, vol. 103, pp. 21–28, 2017.
[16] Keceli A S, Can A B and Kaya A. A GPU-Based Approach for Automatic Segmentation of White Matter Lesions, IETE Journal of Research, vol. 63, no. 3, pp. 461-472, 2017.
[17] Knutsson H E, Wilson R and Granlund G H. Anisotropic Non-Stationary Image Estimation and its Applications - Part I: Restoration of Noisy Images, IEEE Transactions on Communications, vol. 31, no. 3, pp. 388–397, 1983.
[18] Apolinario J A and Netto S L. Introduction to Adaptive Filters, QRD-RLS Adaptive Filtering, Springer, Chapter 2, pp. 1-27, 2009.
[19] Eklund A, Andersson M and Knutsson H. True 4D Image Denoising on the GPU, International Journal of Biomedical Imaging, pp. 1-16, 2011.
[20] Perona P and Malik J. Scale-Space and Edge Detection using Anisotropic Diffusion, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, no. 7, pp. 629-639, 1990.
[21] Wang N, Chen W and Feng Q. Angiogram Images Enhancement Method Based on GPU, World Congress on Medical Physics and Biomedical Engineering, vol. 39, pp. 868–871, 2012.
[22] Attia M H, Elshehaby S A and Elmaghraby A S. Implementation of Edge-Enhancement Nonlinear Anisotropic Diffusion Filtering Using Different CUDA Memory Models, Proceedings of the International Symposium on Signal Processing and Information Technology (ISSPIT), IEEE, pp. 501-504, 2016.
[23] Tomasi C and Manduchi R. Bilateral Filtering for Gray and Colour Images, Proceedings of the International Conference on Computer Vision, IEEE, pp. 839-846, 1998.
[24] Staal L K. Bilateral Filtering with CUDA, University of Aarhus, 2012.
[25] Jiang F, Shi D and Liu D C. Fast Adaptive Ultrasound Speckle Reduction with Bilateral Filter on CUDA, Proceedings of the International Conference on Bioinformatics and Biomedical Engineering, IEEE, 2011.
[26] Howison M. Comparing GPU Implementations of Bilateral and Anisotropic Diffusion Filters for 3D Biomedical Datasets, SIAM Conference on Imaging Science, 2010.
[27] McConnell Brain Imaging Center, http://www.bic.mni.mcgill.ca/brainweb, Last accessed on 21st June 2017.
[28] Bovik A. The Essential Guide to Video Processing, 1st Ed., USA: Academic Press, pp. 1-778, 2009.
[29] Buades A, Coll B and Morel J M. Image Denoising Methods. A New Nonlocal Principle, SIAM Review, vol. 52, no. 1, pp. 113-147, 2010.
[30] Cuomo S, De Michele P and Piccialli F. 3D Data Denoising via Nonlocal Means Filter by Using Parallel GPU Strategies, Computational and Mathematical Methods in Medicine, pp. 1-14, 2014.
[31] Nguyen T, Nakib A and Nguyen H. Medical Image Denoising via Optimal Implementation of Non-Local Means on Hybrid Parallel Architecture, Computer Methods and Programs in Biomedicine, vol. 129, pp. 29-39, 2016.
[32] Hill D L G, Batchelor P G, Holden M and Hawkes D J. Medical Image Registration, Physics in Medicine and Biology, vol. 46, no. 3, pp. R1-R45, 2001.
[33] Fluck O, Vetter C, Wein W, Kamen A, Preim B and Westermann R. A Survey of Medical Image Registration on Graphics Hardware, Computer Methods and Programs in Biomedicine, vol. 104, no. 3, pp. e45–e57, 2011.
[34] Coatelen J, Qin Y, Dowson N, Barra V and Caux J. Image Registration on GPU, ISIMA University of Blaise Pascal - CSIRO, Technical report, pp. 1-47, 2011.
[35] Massanes F, Cadennes M and Brankov J G. Compute-Unified Device Architecture Implementation of a Block-Matching Algorithm for Multiple Graphical Processing Unit Cards, Journal of Electronic Imaging, vol. 20, no. 3, pp. 1-10, 2011.
[36] Li M, Xiang Z, Xiao L, Castillo E, Castillo R and Guerrero T. GPU-Accelerated Block Matching Algorithm for Deformable Registration of Lung CT Images, Proceedings of the International Conference on Progress in Informatics and Computing, pp. 292-295, 2016.
[37] Tamaki T, Abe M, Raytchev B and Kaneda K. Softassign and EM-ICP on GPU, Proceedings of the International Conference on Networking and Computing, IEEE, pp. 179-183, 2010.
[38] Olmedo E, Calleja J, Benitez A and Medina M A. Point to Point Processing of Digital Images using Parallel Computing, IJCSI International Journal of Computer Science Issues, vol. 9, no. 3, pp. 1-10, 2012.
[39] Pratt W K. Digital Image Processing, 4th Ed., John Wiley & Sons, Inc., Los Altos, California, 2007.
[40] Park S, Lee J, Lee H, Shin J, Seo J, Lee K H, Shin Y and Kim B. Parallelized Seeded Region Growing Using CUDA, pp. 1-10, 2014.
[41] Westhoff A M. Hybrid Parallelization of a Seeded Region Growing Segmentation of Brain Images for a GPU Cluster, Proceedings of the International Conference on Architecture of Computing Systems, 2014.
[42] Ravi S and Khan A M. Morphological Operations for Image Processing: Understanding and its Applications, Proceedings of the National Conference on VLSI, Signal Processing & Communications, pp. 17-19, 2013.
[43] Serra J. Introduction to Mathematical Morphology, Computer Vision, Graphics, and Image Processing, vol. 35, no. 3, pp. 283-305, 1986.
[44] Kalaiselvi T, Sriramakrishnan P and Somasundaram K. Performance Analysis of Morphological Operations in CPU and GPU for Accelerating Digital Image Applications, International Journal of Computational Science & Information Technology, vol. 4, no. 1, pp. 15-27, 2016.
[45] Koay J M, Chang Y C, Tahir S M and Sreeramula S. Parallel Implementation of Morphological Operations on Binary Images Using CUDA, Advances in Machine Learning and Signal Processing, vol. 387, pp. 163-173, 2016.
[46] Vincent L and Soille P. Watersheds in Digital Spaces: an Efficient Algorithm based on Immersion Simulations, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 13, no. 6, pp. 583–598, 1991.
[47] Pan L, Gu L and Xu J. Implementation of Medical Image Segmentation in CUDA, Proceedings of the International Conference on Technology and Applications in Biomedicine, IEEE, pp. 82–85, 2008.
[48] Vitor G, Ferreira J and Korbes A. Fast Image Segmentation by Watershed Transform on Graphical Hardware, Proceedings of the 30th CILAMCE, 2009.
[49] Shi L, Liu W, Zhang H, Xie Y and Wang D. A Survey of GPU-Based Medical Image Computing Techniques, Quantitative Imaging in Medicine and Surgery, vol. 2, no. 3, pp. 188–206, 2012.
[50] Udupa J K, Hung H and Chuang K. Surface and Volume Rendering in Three-Dimensional Imaging: A Comparison, Journal of Digital Imaging, vol. 4, no. 3, pp. 159-168, 1991.
[51] Kalaiselvi T, Sriramakrishnan P and Nagaraja P. Brain Tumor Boundary Detection by Edge Indication Map using Bi-Modal Fuzzy Histogram Thresholding Technique from MRI T2-Weighted Scans, International Journal of Image, Graphics and Signal Processing, vol. 8, no. 9, pp. 51-59, 2016.
[52] Lorensen W and Cline H. Marching Cubes: A High Resolution 3D Surface Construction Algorithm, Proceedings of the 14th Annual Conference on Computer Graphics and Interactive Techniques, vol. 21, no. 4, pp. 163-169, 1987.
[53] Smistad E, Elster A C and Lindseth F. Fast Surface Extraction and Visualization of Medical Images using OpenCL and GPUs, Workshop on High Performance and Distributed Computing for Medical Imaging, 2011.
[54] Ling T and Zhi-Yu Q. An Improved Fast Ray Casting Volume Rendering Algorithm of Medical Image, Proceedings of the International Conference on Biomedical Engineering and Informatics, IEEE, pp. 109-112, 2011.
[55] Weinlich A, Keck B, Scherl H, Kowarschik M and Hornegger J. Comparison of High-Speed Ray Casting on GPU using CUDA and OpenGL, Proceedings of the International Workshop on New Frontiers in High-performance & Hardware-aware Computing, pp. 25-30, 2008.
[56] Zhang Q, Eagleson R and Peters T M. Dynamic Real-Time 4D Cardiac MDCT Image Display using GPU-Accelerated Volume Rendering, Computerized Medical Imaging and Graphics, vol. 33, no. 6, pp. 461–476, 2009.
[57] BRATS 2012 database, http://www2.imm.dtu.dk/projects/BRATS2012/, Last accessed on 21st June 2017.
[58] 3D Doctor, Software purchased under DST project sanction, Principal Investigator, Kalaiselvi T, Department of Computer Science and Applications, The Gandhigram Rural Institute.
[59] https://developer.NVIDIA.com/deep-learning-software, Last accessed on 21st June 2017.
[60] Zhu L. Accelerating Content-Based Image Retrieval via GPU-Adaptive Index Structure, Scientific World Journal, pp. 1-12, 2014.
[61] Sinnott-Armstrong N A, Granizo-Mackenzie D and Moore J H. High Performance Parallel Disease Detection: an Artificial Immune System for Graphics Processing Units, Computational Genetics Laboratory, Dartmouth Medical School, Lebanon, 2010.