IMPLEMENTATION OF USUAL COMPUTERIZED TOMOGRAPHY METHODS ON GPU USING THE COMPUTE UNIFIED DEVICE ARCHITECTURE (CUDA)

Benoit Recur, Pascal Desbarats, Jean-Philippe Domenger
LaBRI, Bordeaux 1 University, 351 cours de la libération, 33400 Talence, France

Abstract: CUDA (Compute Unified Device Architecture) is an efficient architecture developed by NVIDIA for computing parallel algorithms on the Graphics Processing Unit (GPU). Using the API associated with this architecture, we develop fast parallel algorithms implementing standard computerized tomography methods. Computation times are compared to those of similar implementations on CPU to illustrate the efficiency of the GPU implementation. Some limitations are highlighted, and we develop different GPU computation strategies induced by the size of the input and computed data.

1 Introduction

          Nθ × Nρ   W × H   T (sec)
   BFP       %        %      4.081
   FST       %        ≈      7.51
   SART      %        %      7.156

Table 1: Computation time in seconds to reconstruct a 256² image from a 180 × 512 sinogram (performed in parallel on a quad-core CPU clocked at 3 GHz). The time increases significantly (%) or not (≈) according to Nθ × Nρ or W × H.

Most of the methods used in tomography are time consuming, which prevents them from being used in real-time applications. In recent years, parallel computation has become a usual solution to increase their computational speed. Following this trend, we describe in this paper parallel implementations of standard tomography methods that exploit the latest generation of consumer graphic hardware. A recent API called CUDA (Compute Unified Device Architecture) [10], developed since 2007 by NVIDIA, is optimized for parallel computation on the GPU and is particularly well suited to most standard reconstruction methods. We therefore implement them on the GPU using CUDA to check whether they achieve a significant speedup compared to similar parallel implementations available on the CPU. Methods such as the Backprojection of Filtered Projections (BFP), techniques based on the Fourier Slice Theorem (FST) [15] and iterative methods like the Simultaneous Algebraic Reconstruction Technique (SART) [4] are all parallelisable, and thus compatible with the computation model proposed by the GPU [7, 8, 13]. A method comparison is proposed in table 1, which denotes, for each method, the influence of the acquisition size (Nθ projections × Nρ samples per projection) and of the image size (W × H) on the CPU computation time. The results (figure 2) correspond to 256² pixel image reconstructions from a sinogram containing 180 projections of 512 samples (on the right in figure 1). The acquisition is done from the original Shepp Logan phantom [14] (on the left in figure 1). The results shown in figure 2 qualitatively differ.

Figure 1: On the left: original representation of the Shepp Logan phantom. On the right: acquisition of 180 projections with 512 samples per projection, represented as a sinogram.

Figure 2: 256² pixel images reconstructed from the previous sinogram. From left to right: BFP, FST and SART results.

However, the method chosen for reconstruction depends on the user requirements: the expected quality, the computation time, and the data available for the computation. Consequently, in this article, we do not focus on a particular method; we present an overview of a GPU implementation of the BFP, FST and SART methods. We first detail the different methods. In a second step, we explain the GPU functionalities. Then, we implement the methods on the graphic device using CUDA. We finally show the interest of implementing tomographic methods on GPU, with a discussion based on the computation time.

2 Methods

2.1 Radon Transform

The Radon transform R maps a 2D function f into a 1D projection for a given angle θ and a module ρ [12, 15]. This transform is defined by:

    R_\theta(\rho) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} f(x, y) \, \delta(\rho - x\cos\theta - y\sin\theta) \, dx \, dy    (1)

where θ and ρ are respectively the angular and radial coordinates of the projection line (θ, ρ), and δ(·) is the Dirac impulse function. The sinogram, denoted R, is made of a set of projection lines; each projection line corresponds to an acquisition using equation 1. For an image I of size W × H, the discrete Radon transform R is [15]:

    R_\theta(\rho) = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} I(i, j) \, pk(\rho - i\cos\theta - j\sin\theta)    (2)

where pk defines the weight value between a pixel and a projection. In the following, the weight value depending on pixel p = (i, j) and projection line l = (θ, ρ) is denoted:

    \alpha_{ij\theta\rho} = \alpha_{pl} = pk(\rho - i\cos\theta - j\sin\theta)    (3)

The inverse Radon transform (retroprojection) recovers the original domain from the projections; it is the founding theory of computerized tomography. From an acquisition R of size Nθ × Nρ, the discrete inverse Radon transform R⁻¹ is:

    R^{-1}(i, j) = I(i, j) = \sum_{i_\theta=0}^{N_\theta-1} \sum_{i_\rho=0}^{N_\rho-1} R_\theta(\rho) \, \alpha_{ij\theta\rho}    (4)

where iθ and iρ are respectively the indexes of angle θ and position ρ.

2.2 Backprojection of Filtered Projections

The Radon transform and its inverse are exact in the continuous domain. In computerized tomography, we have to use their discrete, approximate implementations (equations 2 and 4) to compute a discrete image from a subset of continuous projection lines. This process acts like a low-pass filter on the reconstructed image.

Usually, a high-pass filter denoted |ν| [15] is therefore applied in the 1D Fourier domain of each projection to counteract the low-pass effect. The filtered projections are then used to reconstruct with R⁻¹. This method is called the backprojection of filtered projections (BFP). It gives the left image in figure 2. The corresponding algorithm is summarized in figure 3.

Figure 3: BFP reconstruction algorithm.

2.3 Fourier Slice Theorem

The Fourier slice theorem (FST) states that the 1D Fourier transform F1D(Rθ) of a projection Rθ of a 2D domain f for an angle θ is the line of the 2D Fourier transform F2D(X, Y) of f along the same angle θ. This property allows the recovery of f from the projections. In the discrete domain, the 1D Fourier transforms of the projections give a discrete polar grid. This grid has to be mapped onto the cartesian grid representing the 2D Fourier domain of the reconstructed image, so an interpolation step is needed between the two grids. The overall FST algorithm is summarized in figure 4.

Figure 4: FST reconstruction algorithm.

The interpolation step is the trickiest part of the algorithm and many solutions have been developed [15, 1, 5]. In order to compare the CPU and GPU implementations fairly and to show the interest of the GPU, we use bilinear interpolation.

2.4 SART Method

The SART method [4] is an iterative reconstruction over k ∈ [0 · · · Niter[. Each sub-iteration s, 0 ≤ s < Nθ, updates every pixel of the reconstructed image I^{k,s} by comparing the original projection R_{θs} with the projection R^k_{θs} computed from I^{k,s−1}. A super-iteration k is done when all the projections have been treated. The SART update is:

    I^{k,s}(i, j) = I^{k,s-1}(i, j) + \lambda \, \frac{\sum_{i_\rho=0}^{N_\rho-1} \alpha_{\theta_s \rho ij} \, \frac{R_{\theta_s}(\rho) - R^k_{\theta_s}(\rho)}{D_{\theta_s}(\rho)}}{\sum_{i_\rho=0}^{N_\rho-1} \alpha_{\theta_s \rho ij}}    (5)

where:

• λ is a relaxation parameter, valued 1 by default,

• D_\theta(\rho) = \sum_{i=0}^{W-1} \sum_{j=0}^{H-1} \alpha_{\theta\rho ij} is a normalization coefficient. We denote by D the matrix that contains these coefficients for an image. It can be computed only once, in the same way the sinogram R is obtained, by using an image where each pixel has value 1.

The order in which the projections are used within a super-iteration is determined by a Projection Access Scheme (PAS). Usual PAS are the sequential access scheme (SAS), the random permutation scheme (RPS) and the multilevel scheme (MLS) [2, 3]. SART needs an initial image I⁰, usually initialized with the average value R̄ of the original sinogram. The SART result given in figure 2 and its CPU computation time (table 1) are obtained after 5 k-iterations using MLS. We denote θs = PAS(s) the function giving the projection according to the chosen PAS. The SART algorithm is summarized in figure 5.


Figure 5: SART reconstruction algorithm.

2.5 CPU parallelization of the algorithms

The inverse Radon transform used in the BFP algorithm can be computed in parallel because each reconstructed pixel is independent of the others. Similarly, the interpolation step of the FST algorithm is done in parallel. In the same way, in the SART algorithm, the projection line values R^k_{θs}(ρ) (resp. the pixel values I^{k,s}(p)) are computed in parallel because they are independent of each other, as illustrated in the sketch below.
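As an illustration, here is a minimal CPU sketch of the backprojection of equation 4, parallelized over pixels with an OpenMP-style loop. The weight function pk below is a hypothetical nearest-neighbour band, standing in for the projection kernel that the paper leaves unspecified; the centering of ρ is also glossed over.

    #include <cmath>
    #include <vector>

    // Hypothetical weight function pk: a simple nearest-neighbour band,
    // only a stand-in for the (unspecified) projection kernel of eq. (2).
    static float pk(float d) { return std::fabs(d) < 0.5f ? 1.0f : 0.0f; }

    // Inverse Radon transform of equation (4), parallelized over pixels:
    // each reconstructed pixel is independent of the others.
    void backproject_cpu(const std::vector<float>& sino, // Ntheta * Nrho samples
                         float* image, int W, int H, int Ntheta, int Nrho)
    {
        #pragma omp parallel for  // one thread per row of pixels
        for (int j = 0; j < H; ++j) {
            for (int i = 0; i < W; ++i) {
                float acc = 0.0f;
                for (int it = 0; it < Ntheta; ++it) {
                    float theta = it * 3.14159265f / Ntheta;
                    // radial coordinate of the line through pixel (i, j)
                    // (centering of rho omitted for brevity)
                    float rho = i * std::cos(theta) + j * std::sin(theta);
                    for (int ir = 0; ir < Nrho; ++ir)
                        acc += sino[it * Nrho + ir] * pk(ir - rho);
                }
                image[j * W + i] = acc;
            }
        }
    }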

3 GPU Functionalities

We now introduce the functionalities available on the GPU through CUDA that we use to implement the tomographic methods on the graphic device.

3.1 GPU generalities

Implementation of the methods on the GPU is done with CUDA [10]. The GPU is viewed as a computation unit using threads for parallel computing. These threads execute the same code, defined in a kernel. When it is launched, the kernel is organized into a grid of blocks of threads. A set of threads composes a block; each thread has an identity number (threadIdx), and the size of a block (blockDim) is decided by the programmer. A set of blocks composes the kernel execution grid, and each block of the grid is accessed by a block identifier (blockIdx).

The NVIDIA Geforce 9800 GT used for our implementation is composed of 112 processors organized in 14 multiprocessors of 8 processors. An overview of its architecture is represented in figure 6 (inspired from [11]). Each processor is clocked at 1500 MHz and accesses four specific on-chip memories detailed in [11]. The device manages a global memory (GPU RAM) of 512 MB, clocked at 900 MHz on a 256-bit interface. This 256-bit interface allows simultaneous access to 32-bit, 64-bit (or wider) words, which increases the throughput. However, a read/write access takes non-negligible time, so the memory access pattern is usually essential to optimize the computation time. Global data get an optimized read-only access if they are cached through the texture memory (cf. figure 6).
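As a minimal sketch of this organization (the grid and block geometry below is our illustrative choice, not one prescribed by the paper), a kernel recovers the pixel it must handle from these identifiers:

    // Minimal CUDA kernel showing how a thread finds the pixel it must handle.
    __global__ void pixelKernel(float* image, int W, int H)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // pixel column
        int j = blockIdx.y * blockDim.y + threadIdx.y;  // pixel row
        if (i < W && j < H)
            image[j * W + i] = 0.0f;  // placeholder per-pixel work
    }

    // Host-side launch: 16x16 blocks, enough blocks to cover the W x H image.
    // dim3 block(16, 16);
    // dim3 grid((W + 15) / 16, (H + 15) / 16);
    // pixelKernel<<<grid, block>>>(d_image, W, H);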

Figure 6: NVIDIA Geforce 9xxx architecture restricted to the needs of our implementations. On the Geforce 9800 GT, M = 14 and N = 8. Inspired from the NVIDIA specifications given in [11].

3.2 Global memory usage

A weight value αpl (see equation 3) is constant for a given projection line l and a given pixel p. So, instead of recomputing each weight when needed, we can compute or load the precomputed weights once and store them in a matrix A. Suppose an acquisition with 180 projections of 512 samples is used to reconstruct a 256² image. Each weight αpl is given with float precision, stored on 4 bytes. The total size needed to store A densely is then Nθ × Nρ × W × H × 4 bytes = 22.5 GB. Since most of the values are null, we optimize the memory cost using a sparse matrix: only non-null values and their corresponding indexes are kept. For example, the matrix

         ( 0.8   0    0  )
    A =  (  0    0    1  )   becomes: {(0, 0.8), (5, 1), (8, 0.4)}
         (  0    0   0.4 )

A is then stored as an array of couples (a, α), where a is the 1D coordinate of the weight and α its value. With this implementation, the memory needed to store A in the previous example decreases to 215 MB (roughly 100× less costly). Since the A values are read only once with BFP and once per k-iteration with SART, the matrix A is not cached in texture memory, to avoid penalizing other, more frequently accessed data.

At the hardware level, the GPU memory interface is much wider than a single A value, so we can read several A values at once in order to decrease the computation time. For example, on the GeForce 9800 GT, the available interface allows reading four A values at once.
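A sketch of this storage scheme and of a batched read follows, under our own assumptions about the pair layout (the paper does not detail it); the index k must be even so that the 16-byte load stays aligned.

    // One sparse entry of A: 1D coordinate a and weight value alpha (8 bytes).
    struct WeightPair { int a; float alpha; };

    // Reading two 8-byte pairs through one 16-byte load. Casting the array to
    // int4 makes the hardware issue a single 128-bit transaction, the kind of
    // wide access the 256-bit memory interface favours.
    __device__ void loadTwoPairs(const WeightPair* A, int k, // k must be even
                                 WeightPair& p0, WeightPair& p1)
    {
        int4 raw = *reinterpret_cast<const int4*>(&A[k]); // A is 16B aligned
        p0.a = raw.x; p0.alpha = __int_as_float(raw.y);
        p1.a = raw.z; p1.alpha = __int_as_float(raw.w);
    }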

3.3 Texture memory usage

The SART computation accesses the matrix D and the original sinogram R multiple times in read-only mode. They are stored in global memory and, in order to reduce access time, they are also cached in the texture memory. The R^k_{θs} computation accesses the image I^{k,s−1} multiple times in read-only mode; similarly, the I^{k,s} computation accesses the R^k_{θs} values multiple times in read-only mode. So, it is interesting to cache I^{k,s−1} (resp. R^k_{θs}) in texture memory during the R^k_{θs} (resp. I^{k,s}) computation.

A bilinear interpolation between the polar and cartesian grids is needed for FST. The texture memory allows fetches at floating point indexes. In this case, the returned value is the linear interpolation of the two (for a 1D texture), four (2D texture) or eight (3D texture) values whose texture coordinates are closest to the given floating point indexes [11]. The FST bilinear interpolation is therefore done by texturing the polar grid to compute the cartesian one.
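A sketch of such a cached, hardware-interpolated access, written with the texture object API of current CUDA releases (the original implementation relied on the texture reference API of CUDA 2.0):

    #include <cuda_runtime.h>

    // Bind a 2D float array to a texture with hardware linear filtering, so
    // that tex2D() at a floating point position returns the bilinear
    // interpolation of the four nearest texels -- exactly what the FST
    // interpolation step needs.
    cudaTextureObject_t makeLinearTexture(cudaArray_t array)
    {
        cudaResourceDesc res = {};
        res.resType = cudaResourceTypeArray;
        res.res.array.array = array;

        cudaTextureDesc tex = {};
        tex.filterMode = cudaFilterModeLinear;   // hardware bilinear interpolation
        tex.addressMode[0] = cudaAddressModeClamp;
        tex.addressMode[1] = cudaAddressModeClamp;
        tex.readMode = cudaReadModeElementType;

        cudaTextureObject_t obj = 0;
        cudaCreateTextureObject(&obj, &res, &tex, nullptr);
        return obj;
    }

    // Device-side fetch at a non-integer position (x, y):
    // __global__ void k(cudaTextureObject_t t) { float v = tex2D<float>(t, x, y); }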

3.4 CUFFT Library

CUFFT is the NVIDIA CUDA Fast Fourier Transform library. It provides an efficient implementation and a simple interface for computing parallel FFTs on the GPU. In particular, it supports the 1D and 2D transforms that are necessary for computing BFP and FST.
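A minimal usage sketch, assuming the projections are stored contiguously as complex samples on the device; error checking is omitted for brevity.

    #include <cufft.h>

    // Batched 1D complex-to-complex FFT of all Ntheta projections at once,
    // as needed by BFP and FST.
    void fft_all_projections(cufftComplex* d_sino, int Nrho, int Ntheta)
    {
        cufftHandle plan;
        cufftPlan1d(&plan, Nrho, CUFFT_C2C, Ntheta);       // Ntheta transforms of size Nrho
        cufftExecC2C(plan, d_sino, d_sino, CUFFT_FORWARD); // in-place forward FFT
        // ... filter or resample in Fourier space here ...
        cufftExecC2C(plan, d_sino, d_sino, CUFFT_INVERSE); // note: unnormalized inverse
        cufftDestroy(plan);
    }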

4 Implementation on GPU

4.1 BFP GPU Implementation

The kernel computing the inverse Radon transform on the GPU is shown in algorithm 1. Each of its executions computes only one pixel. The pixel to handle is obtained from the kernel identifiers threadIdx, blockDim and blockIdx; the reader can refer to the NVIDIA CUDA programming guide [11] to see how to obtain pixel indexes from thread identifiers. We denote by T → X the data X accessed through the texture memory.

Algorithm 1: GPU_R⁻¹(R)
  p = (i, j) ← p(threadIdx, blockDim, blockIdx);
  foreach l = (θ, ρ) such that l traverses p do
    I(i, j) ← I(i, j) + α_{pl} · (T → R_θ(ρ));
  end

Suppose GPU_F is the kernel computing the high-pass filter on each Fourier projection. The BFP algorithm then performs as detailed in algorithm 2. The values in brackets define the number of kernel executions launched on the GPU; the cufft functions denote Fourier computations using the CUFFT library.

Algorithm 2: GPU_BFP(R)
  load R and A on the device;
  foreach θ do
    R_θ ← cufft⁻¹_1D(GPU_F[N_ρ](cufft_1D(R_θ)));
  end
  GPU_R⁻¹[W × H](R);

The weight coefficients can be computed when necessary or stored in the GPU memory. If the A matrix is used, several values are read at once in algorithm 1 according to the available GPU memory interface.
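A possible CUDA realization of algorithm 1 is sketched below. It recomputes the line coordinate on the fly and lets the texture hardware interpolate along ρ; the geometry and centering are simplified, and the variant reading precomputed A values is omitted.

    // Sketch of the GPU inverse Radon kernel (algorithm 1): one thread per
    // pixel. T->R is modelled by a texture object over the filtered sinogram.
    __global__ void gpuInverseRadon(cudaTextureObject_t sino, float* image,
                                    int W, int H, int Ntheta, int Nrho)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i >= W || j >= H) return;

        float acc = 0.0f;
        for (int it = 0; it < Ntheta; ++it) {
            float theta = it * 3.14159265f / Ntheta;
            // projection line through pixel (i, j); centering simplified
            float rho = i * cosf(theta) + j * sinf(theta);
            // texture fetch with hardware linear interpolation along rho
            acc += tex2D<float>(sino, rho + 0.5f, it + 0.5f);
        }
        image[j * W + i] = acc;
    }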

4.2 FST GPU Implementation

The FST GPU implementation simply consists in computing the 1D Fourier transform (with CUFFT) of each projection and storing the results in texture memory. The interpolation is directly done using the real-valued point positions and the linear interpolation provided by the texture memory. The 2D inverse CUFFT then gives the result image. The corresponding algorithm is summarized in figure 7.

Figure 7: FST reconstruction algorithm on GPU.
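A sketch of the central interpolation kernel, tying the CUFFT and texture pieces together, follows. The frequency-domain shifts and the normalization of the inverse transform are deliberately simplified, and polarTex is an assumed float2 texture (with linear filtering) holding the 1D FFTs of the projections.

    #include <cufft.h>

    // Map each cartesian Fourier sample to polar coordinates and let the
    // texture unit perform the bilinear interpolation of section 3.3.
    __global__ void polarToCartesian(cudaTextureObject_t polarTex,
                                     cufftComplex* cart, int N, int Ntheta)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= N || y >= N) return;

        // cartesian frequency -> polar coordinates (angle, radius)
        float fx = x - N / 2.0f, fy = y - N / 2.0f;
        float r = sqrtf(fx * fx + fy * fy);
        float t = atan2f(fy, fx);                   // in [-pi, pi]
        if (t < 0) { t += 3.14159265f; r = -r; }    // fold onto [0, pi)

        float2 v = tex2D<float2>(polarTex, r + N / 2.0f + 0.5f,
                                 t * Ntheta / 3.14159265f + 0.5f);
        cart[y * N + x].x = v.x;
        cart[y * N + x].y = v.y;
    }

    // Host side: a batched cufftPlan1d + cufftExecC2C on the projections,
    // copy to the polar texture, launch polarToCartesian, then a cufftPlan2d
    // inverse C2C gives the reconstructed image (after normalization).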

4.3 SART GPU Implementation

Two kernels are needed to compute the SART algorithm on the GPU. GPU_SinoK (algorithm 3) is used to compute R^k_{θs} from I^{k,s−1}; each of its executions computes only one projection line value. GPU_ImageK (algorithm 4) computes I^{k,s} from R^k_{θs} and handles one pixel value per execution. In both cases, the projection line l or the pixel p to handle is obtained from the kernel identifiers.

Algorithm 3: GPU_SinoK(θs)
  l = (θs, ρ) ← l(threadIdx, blockDim, blockIdx);
  R^k_{θs}(ρ) ← 0;
  read several A values at once;
  foreach p such that p is traversed by l do
    R^k_{θs}(ρ) ← R^k_{θs}(ρ) + (T → I^{k,s−1}(p)) · α_{pl};
  end

Algorithm 4: GPU_ImageK(θs)
  p ← p(threadIdx, blockDim, blockIdx);
  up ← 0, norm ← 0;
  read several A values at once;
  foreach l = (θs, ρ) such that l traverses p do
    up ← up + α_{pl} · (T → R_{θs}(ρ) − T → R^k_{θs}(ρ)) / (T → D_{θs}(ρ));
    norm ← norm + α_{pl};
  end
  I^{k,s}(p) ← max{I^{k,s−1}(p) + λ · up/norm, 0};

The main iterations of the algorithm are managed by the CPU, which alternately launches on the GPU the kernels described by algorithms 3 and 4. The program execution is done as shown in algorithm 5. The number of executions of each kernel launch is given in brackets.

Algorithm 5: GPU_SART
  load R, D and A on the device;
  Texture(D) and Texture(R);
  I ← R̄ and load I on the device;
  for iter = 0 to Niter − 1 do
    for s = 0 to Nθ − 1 do
      θs ← PAS(s);
      Texture(I^{k,s−1});
      GPU_SinoK[N_ρ](θs);
      UnTexture(I^{k,s−1}), Texture(R^{k,s});
      GPU_ImageK[H × W](θs);
      UnTexture(R^{k,s});
    end
  end

The available memory of the Geforce 9800 GT is enough to store the A matrix of the example used in this paper. If necessary, using multiple graphic devices in parallel (SLI), we can increase the global memory size to 2 GB, which is enough in most reconstruction cases. However, some particular acquisitions need more than 2 GB. In that case, we can iterate the kernel computations: each iteration uses a subset of the matrix A, loaded onto the GPU memory when it is required. For example, using four subdivisions, algorithm 4 does not compute the entire image at once but in four passes based on four subsets of A. This allows the SART algorithm to run on the GPU without memory limitation, but it is less efficient because it implies reloading the sub-matrices at each iteration.
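A condensed host-side sketch of algorithm 5 follows; the kernel names and the PAS helper are assumed declarations standing for the kernels of algorithms 3 and 4, not the paper's exact API, and the texture (un)binding steps are elided.

    #include <cuda_runtime.h>

    __global__ void gpuSinoK(int thetaS);      // algorithm 3, defined elsewhere
    __global__ void gpuImageK(int thetaS);     // algorithm 4, defined elsewhere
    int projectionAccessScheme(int s);         // SAS, RPS or MLS

    // Host-side SART driver: R, D and A are assumed already loaded on the
    // device and I initialized to the sinogram mean.
    void gpuSART(int Niter, int Ntheta, int Nrho, int W, int H)
    {
        dim3 blk(256);
        dim3 gridSino((Nrho + 255) / 256);     // one thread per sample
        dim3 gridImg((W * H + 255) / 256);     // one thread per pixel
        for (int k = 0; k < Niter; ++k)        // super-iterations
            for (int s = 0; s < Ntheta; ++s) { // one sub-iteration per projection
                int thetaS = projectionAccessScheme(s);
                gpuSinoK<<<gridSino, blk>>>(thetaS);  // simulate R^k_{thetaS}
                gpuImageK<<<gridImg, blk>>>(thetaS);  // update I^{k,s}
            }
        cudaDeviceSynchronize();  // launches on one stream already serialize;
                                  // wait only before reading the result back
    }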

5 Results and Discussion

We now detail the global GPU computation time results. We also discuss the SART implementation on the GPU and the efficiency of using the weight matrix. Indeed, knowing that the GPU memory suffers from access latency while the GPU processors are very efficient at arithmetic computation, one can wonder whether it is not more effective to recompute each weight when necessary (contrary to the generally accepted idea) than to use precomputed weights stored in global memory.

5.1 GPU vs CPU computation time

                 BFP       FST      SART
    T (sec) CPU  4.081     7.51     7.156
    T (sec) GPU  0.0289    0.076    0.691
    gain         141.21×   98.81×   10.35×

Table 2: Computation time T (seconds) of the algorithms executed on GPU, compared with the CPU execution time (done in parallel on a quad-core clocked at 3 GHz), and the gain between CPU and GPU.

The GPU results shown in table 2 highlight the efficiency of GPU computation whatever the method: the implementation of tomographic algorithms on graphic devices using CUDA obtains better execution times than on CPU. With BFP, each pixel is reconstructed independently of the others. This allows an efficient parallelism on the GPU, and the algorithm computes 141× faster than on CPU. The FST method is computed similarly; moreover, CUDA includes the usual interpolations and a very efficient FFT implementation. The SART gain shows the efficiency of the GPU for time-costly iterative methods. Globally, we remark that the most time-costly method on GPU (SART) is still faster than the fastest implementation on CPU (BFP).

5.2 About the GPU memory

On the GPU, the SART computation can suffer from the memory latency and the available memory size. So, we now discuss the strategy to use according to the original data size and the available GPU memory. The graph in figure 8 gives the computation time needed to perform each of ten consecutive iterations of the SART methods on the GPU.

Figure 8: Computation time of each iteration of the SART methods (for ten iterations).

Three SART variants are compared. SART (cf. figure 8) does not use the weight storage and recomputes each weight when necessary. SART Matrix (star points in figure 8) uses the A matrix. SART Sub-Matrices computes using the matrix while supposing that the memory size is not large enough to store the whole matrix at once; sub-matrices are reloaded onto the device when necessary, at each iteration. Firstly, we note that the iteration index does not influence the iteration computation time: each iteration of each variant is quasi-constant in time during a process.

Consequently, as on the CPU, the global computation time of each variant is proportional to the number of k-iterations.

The SART Matrix method obtains the best iteration (and consequently the best global) computation time. The difference between the SART and SART Matrix curves shows that an efficient memory access pattern to the data stored in global memory can hide the memory latency. So, even if memory access time is a bottleneck on the GPU, it nevertheless allows a better computation time than the method which recomputes the weights.

Inversely, SART Sub-Matrices suffers from the loading time needed to copy each sub-matrix onto the device. This loading time is represented by the difference between the SART Matrix and SART Sub-Matrices curves. Since the loading is repeated for each iteration, it is added iteratively to the total computation time. Consequently, if the matrix cannot be completely stored on the device, it is preferable to perform SART by recomputing the weights.

Moreover, kernel 3 needs a lot of weight values because a projection line is computed from a large number of pixels. Inversely, kernel 4 computes a pixel from a small number of projection lines. So, the first one can take advantage of the memory bandwidth whereas the second cannot. For example, the best SART computation time obtained (given in table 2) is in fact a hybrid SART method using the weight matrix in kernel 3 and recomputing the values in kernel 4. In comparison, the worst computation time obtained for the same reconstruction is 0.93 seconds (≈ +33%).

6 Conclusion and Perspectives

We have highlighted the interest of implementing algorithms for computerized tomography on GPU with CUDA. The comparison of computation times between CPU and GPU for several methods confirms the efficiency of the graphic device implementation. However, iterative methods such as SART show the limitations of consumer GPU memory space and latency: the size of the data used in the computation determines the computation strategy of the iterative method.

References

[1] Markus Fenn, Stefan Kunis, and Daniel Potts. On the Computation of the Polar FFT. Applied and Computational Harmonic Analysis, 22(2):257–263, March 2007.

[2] Huaiqun Guan and Richard Gordon. A projection access order for speedy convergence of ART (algebraic reconstruction technique): a multilevel scheme for computed tomography. Physics in Medicine and Biology, 39:2005–2022, 1994.

[3] Huaiqun Guan and Richard Gordon. Computed tomography using algebraic reconstruction techniques (ARTs) with different projection access schemes: a comparison study under practical situations. Physics in Medicine and Biology, 41:1727–1743, 1996.

[4] Ming Jiang and Ge Wang. Convergence of the Simultaneous Algebraic Reconstruction Technique (SART). IEEE Transactions on Image Processing, 12(8):957–961, August 2003.

[5] Ronald Jones, Tristrom Cooke, and Nicholas J. Redding. Implementation of the Radon Transform Using Nonequispaced Discrete Fourier Transforms. DSTO Information Science Laboratory, April 2004.

[6] Klaus Mueller and F. Xu. Accelerating popular tomographic reconstruction algorithms on commodity PC graphics hardware. IEEE Transactions on Nuclear Science, 2005.

[7] Klaus Mueller and F. Xu. Practical considerations for GPU-accelerated CT. IEEE International Symposium on Biomedical Imaging, 2006. Washington D.C.

[8] Klaus Mueller, F. Xu, and N. Neophytou. Why do commodity graphics hardware boards (GPUs) work so well for accelerating computed tomography? SPIE Medical Imaging, 2007.

[9] Klaus Mueller and Roni Yagel. On the Use of Graphics Hardware to Accelerate Algebraic Reconstruction Methods. SPIE Medical Imaging Conference, Physics of Medical Imaging, 1999.

[10] NVIDIA. NVIDIA CUDA home page. http://www.nvidia.com/object/cuda.html.

[11] NVIDIA. NVIDIA CUDA Compute Unified Device Architecture - Programming Guide, July 2008. Version 2.0.

[12] Johann Radon. Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten. Ber. Ver. Sächs. Akad. Wiss. Leipzig, Math.-Phys. Kl., 69:262–277, April 1917. In German; an English translation can be found in S. R. Deans: The Radon Transform and Some of Its Applications.

[13] Holger Scherl, Benjamin Keck, Markus Kowarschik, and Joachim Hornegger. Fast GPU-based CT reconstruction using the Common Unified Device Architecture (CUDA). IEEE Nuclear Science Symposium Conference Record, 6:4464–4466, November 2007.

[14] L. Shepp and B. Logan. The Fourier Reconstruction of a Head Section. IEEE Transactions on Nuclear Science, 21(2):21–43, 1974.

[15] Peter Toft. The Radon Transform: Theory and Implementation. PhD thesis, Department of Mathematical Modelling, Section for Digital Signal Processing, Technical University of Denmark, 1996.