
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 59, NO. 9, SEPTEMBER 2012

GPU Accelerated Generation of Digitally Reconstructed Radiographs for 2-D/3-D Image Registration Osama M. Dorgham, Stephen D. Laycock, and Mark H. Fisher∗

Abstract—Recent advances in programming languages for graphics processing units (GPUs) provide developers with a convenient way of implementing applications which can be executed on the CPU and GPU interchangeably. GPUs are becoming relatively cheap, powerful, and widely available hardware components, which can be used to perform intensive calculations. The last decade of hardware performance developments shows that GPU-based computation is progressing significantly faster than CPU-based computation, particularly if one considers the execution of highly parallelisable algorithms. Predictions suggest that this trend is likely to continue. In this paper, we introduce a way of accelerating 2-D/3-D image registration by developing a hybrid system which executes on the CPU and utilizes the GPU for parallelizing the generation of digitally reconstructed radiographs (DRRs). Based on the advancements of the GPU over the CPU, it is timely to exploit the benefits of many-core GPU technology by developing algorithms for DRR generation. Although some previous work has investigated the rendering of DRRs using the GPU, this paper investigates approximations which reduce the computational overhead while still maintaining a quality consistent with that needed for 2-D/3-D registration with sufficient accuracy to be clinically acceptable in certain applications of radiation oncology. Furthermore, by comparing implementations of 2-D/3-D registration on the CPU and GPU, we investigate current performance and propose an optimal framework for PC implementations addressing the rigid registration problem. Using this framework, we are able to render DRR images from a 256 × 256 × 133 CT volume in ∼24 ms using an NVidia GeForce 8800 GTX and in ∼2 ms using an NVidia GeForce GTX 580. In addition to applications requiring fast automatic patient setup, these levels of performance suggest image-guided radiation therapy at video frame rates is technically feasible using relatively low cost PC architecture.

Index Terms—2-D/3-D image registration, CUDA, digitally reconstructed radiographs, GPU acceleration.

I. INTRODUCTION

A digitally reconstructed radiograph (DRR) is a 2-D simulated approximation of an X-ray image, rendered from medical datasets, such as those created by computed tomography (CT). Rendering DRR images is important in many medical applications where 3-D data are used to plan and augment surgical procedures such as keyhole and radio-surgery [1]. In radiation oncology, DRR images have traditionally been used for patient positioning using portal images [2]–[4], cone-beam CT [5], and digital tomosynthesis [6]. Ionizing radiation is damaging to both healthy and malignant tissue, so accurately targeting radiation is critical to a successful treatment outcome. To spare healthy organs, radiation therapy has traditionally been delivered in fractions over a period of weeks by multiple beams that intersect at the tumor site. X-ray images are needed to accurately reposition a patient between treatment sessions and are also used by a number of recently developed techniques, collectively known as image-guided radiotherapy (IGRT), that aim to compensate for patient movement during each treatment fraction. 2-D/3-D image registration is key to both of these techniques as it allows patient and (with some uncertainty) tumor position [7] to be determined by aligning the 3-D CT data used to plan the treatment with the X-ray images. 2-D/3-D registration involves solving an optimization problem which in turn may require many DRR images to be rendered. Rendering accurate DRR images at high speed is important for the efficient delivery of conventional radiotherapy and eliminates the need for precomputation of DRRs for IGRT. DRRs are generated from the CT data by computing the attenuation of a monoenergetic beam due to different anatomic materials (e.g., bone, muscle tissue, etc.) using Beer's Law [8]

    I = I_0 \exp\left( -\int_0^D \mu(x)\, dx \right)    (1)

Manuscript received November 9, 2011; revised February 27, 2012; accepted June 24, 2012. Date of publication July 11, 2012; date of current version August 16, 2012. This work was supported by the European Commission (EU) Sixth Framework Programme (FP6) under Project LSHT-CT-2004-503564, Methods and Advanced Equipment for Simulation and Treatment in Radio-Oncology (MAESTRO). Asterisk indicates corresponding author.
O. M. Dorgham is with the Department of Computer Information Systems, Al-Balqa Applied University, Al Salt 19117, Jordan (e-mail: [email protected]).
S. D. Laycock is with the School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, U.K. (e-mail: [email protected]).
∗M. H. Fisher is with the School of Computing Sciences, University of East Anglia, Norwich, NR4 7TJ, U.K. (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TBME.2012.2207898

where I_0 is the initial X-ray intensity, μ is the linear attenuation coefficient of the material through which the ray is cast, and D is the length of the X-ray path. Discretising (1) in the context of a CT volume gives [8]

    I = I_0 \exp\left( -\sum_i \mu_i x_i \right)    (2)

where I_0 is the initial X-ray intensity, μ_i is the linear attenuation coefficient of the voxel (material) through which the ray is cast, x_i is the length of the X-ray path within that voxel, and the subscript i denotes the voxel index along the path of the ray.
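To make (2) concrete, a minimal sketch of the per-pixel computation follows: the exponent is accumulated voxel by voxel along the ray and exponentiated once at the end. The function and parameter names are our assumptions for illustration, not the paper's implementation.

    #include <cmath>
    #include <cstddef>

    // Discretised Beer's law (2): I = I0 * exp(-sum_i mu_i * x_i), where
    // mu[i] is the attenuation coefficient of the i-th voxel crossed by the
    // ray and len[i] is the length of the ray segment inside that voxel.
    double drrPixelIntensity(double I0, const double* mu,
                             const double* len, std::size_t n) {
        double lineIntegral = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            lineIntegral += mu[i] * len[i];   // the exponent accumulates linearly
        return I0 * std::exp(-lineIntegral);
    }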




Fig. 1. Illustration of the difference between visualising a 3-D data volume and a DRR rendered from a 3-D data volume using a ray casting method. (a) illustrates a ray casting method for rendering a surface from a 3-D volume, (b) illustrates a ray casting method for rendering a DRR from 3-D volume data, (c) illustrates an example of a surface rendered from volume data [9], and (d) illustrates an example of a DRR.

The attenuation coefficient of the material comprising each voxel can be recovered from the stored CT number by [8]

    \mathrm{CT}_{\mathrm{number}} = 1000 \times \frac{\mu_i - \mu_w}{\mu_w}    (3)

where μ_w is the linear attenuation coefficient of water for the average energy in the CT beam.
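Inverting (3) gives μ_i = μ_w (CT_number / 1000 + 1). As a concrete illustration, a minimal conversion helper (our own naming, not code from the paper):

    // Recover a voxel's linear attenuation coefficient from its CT number
    // by inverting (3): mu_i = mu_w * (CT_number / 1000 + 1).
    float muFromCtNumber(float ctNumber, float muWater) {
        return muWater * (ctNumber / 1000.0f + 1.0f);
    }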

Since DRR rendering is a special case of volume rendering, techniques for efficiently rendering DRR images can be grouped into two categories, known as image-based (backward-projective) or object-based (forward-projective). Ray casting is a technique which belongs to the former category. When ray casting is used to render simulated X-ray images, as opposed to visualizing 3-D surfaces, it is a relatively expensive process because all the voxel coefficients contribute to the pixel opacity (see Fig. 1). As a result, rendering DRR images is seen as a bottleneck in 2-D/3-D registration [10] and this has motivated research into more efficient algorithms. Shear-warp factorization [11] and texture mapping [12] represent early examples of methods classified as forward-projective. Attenuation fields [13] and transform methods [14] are examples of preprocessing optimizations that have been applied within backward- and forward-projective frameworks, respectively, but are of limited use in the context of 2-D/3-D registration. Recent advancements in programmable graphics processing unit (GPU) architectures have motivated new implementations of both backward- and forward-projective volume rendering [15]–[17]. In this context, it is particularly timely to exploit the benefits of the many-core GPU architecture by developing parallel DRR generation algorithms. To achieve this, a parallel computing architecture such as the Compute Unified Device Architecture (CUDA [18]) programming environment or OpenCL may be used. Previous work has investigated the rendering of DRRs using the GPU [19]–[23]; however, our work focuses on approximate methods which reduce the computational overhead while still maintaining a quality that is sufficient for 2-D/3-D registration with accuracies that are clinically acceptable in radiation oncology. The next section discusses the related work, focusing particularly on algorithms which have been implemented on the GPU. Following from this, we present our CUDA-based rendering algorithm for the generation of DRRs. Section IV analyzes the performance of our algorithms in terms of speed and quality. Finally, Section V details the structure of our hybrid registration system and evaluates its performance using full resolution (FR) and reduced resolution (RR) DRRs.

II. RELATED WORK

One of the earliest GPU-based acceleration studies for 2-D/3-D image registration was introduced by LaRose [24] for iterative 2-D/3-D registration. The technique relies on an OpenGL extension implemented on an NVidia GeForce 3 graphics card. Using an accumulating algorithm, they were able to accelerate the generation of DRRs, rendering them at a size of 512 × 512 pixels from a CT volume of size 256 × 256 × 256 voxels at a speed of approximately 14 Hz (71 ms/DRR). During the last ten years there have been significant enhancements in the GPU industry. Grabner et al. [25], Ino et al. [26], and Yan et al. [21] exploit programmable vertex and fragment processors in the GPU to achieve accelerated performance and, more recently, Lu et al. [27] and Mori et al. [28] use CUDA to accelerate the rendering of DRR images within a traditional ray casting paradigm. Bethune and Stewart [29] address the rendering of DRRs for image registration, planning of orthopedic surgery, and viewing intraarticular features which are invisible in surface-shaded CT images. Theirs is a forward-projective technique that uses the hardware capabilities of the GPU to render volumetric data stored as a 3-D texture on the graphics card. Empty voxels (defined as those that contain attenuation coefficients that are of no interest) in the CT volume are mapped to zero and not processed (it is assumed this is done during preprocessing), and their experimental results showed that this strategy accelerates the process by a factor of up to 2.9 times without any loss in DRR quality. They demonstrate a DRR of size 512 × 512 pixels rendered from a CT volume of size 512 × 512 × 128 voxels at a frame rate of 5.4 frames/s from the entire volume and at a frame rate of 22.5 frames/s (approximately 44 ms per DRR) from a sparse volume (accelerated method), using a 2.6 GHz Pentium PC with an NVidia GeForce 5900 containing 256 MB of video memory. More recent work by Spoerk et al. [22], [30] presents a GPU-based high speed rendering algorithm using a forward projection technique called wobbled splatting [31] for DRR generation. Results showed that DRRs could be rendered from a CT volume of size 256 × 196 × 196 voxels in about 13 ms using space skipping and subsampling to reduce the number of voxels.


Prior to the launch of CUDA, most GPU-accelerated DRR rendering adopted an object-based paradigm, but since 2006 the CUDA API has popularized multithreaded ray casting. Object-based GPU implementations often outperform equivalent ray casting approaches because they make use of space-skipping approximations and achieve speedup due to subsampling. Subsampling and preintegration are popular optimizations approximating the rendering integral which have been the subject of more formal mathematical analysis [32], [33]. Although experimental results have demonstrated that high quality surface rendering can be achieved with relatively low-resolution volume data, noise and aliasing artifacts are a problem when composing DRR images. This observation motivates an investigation of approximations which may be applied in the context of ray casting. The next section describes a number of approximate algorithms implemented on the CUDA platform for generating DRRs on the GPU.

Fig. 2. 3-D illustration of the incremental grid algorithm which ensures voxels are only sampled once.

III. CUDA-BASED RENDERING FOR DIGITALLY RECONSTRUCTED RADIOGRAPHS

In recent years several methods have become available for programmers to harness the power of the advancements in GPU technology. CUDA was introduced by NVidia in November 2006 as a general purpose parallel computing architecture. It is a scalable parallel programming model: programs written using CUDA can be executed on any number of processors without recompilation. The core of CUDA consists of three abstractions: thread groups, shared memory, and barrier synchronization, which provide different levels of data and thread parallelism. These levels of parallelism make it possible to decompose a programming problem into subproblems which can be solved in parallel by blocks of threads, and then into finer pieces that can be solved cooperatively in parallel by all threads within a block, which preserves CUDA's scalability. The main reason for choosing CUDA for the implementation of the algorithms presented in this paper is that CUDA supports general purpose programming [34], unlike Cg or GLSL, which are mainly useful for graphics applications. An additional motivation for choosing CUDA is the ability to switch between executing code on the CPU and the GPU. Some modifications must be considered when implementing algorithms for the GPU. The next section details an accurate method of rendering DRRs using CUDA. Following from this is an approximate rendering algorithm which can be computed more efficiently without incurring a significant loss in quality, ensuring the results can be used within a registration algorithm.

A. Accurate Method of Rendering DRRs Using CUDA

This section aims to establish a benchmark approach for rendering DRR images which we will subsequently use as a reference against which to compare subsampled approximations. The problem of traversing a uniform grid or volume (i.e., a regular structure) has been addressed by Siddon [35] and by Amanatides and Woo [36]. Both determine which voxels are intersected by a ray moving through a given volume, but the cost of grid traversal in the latter has been shown to be linear with respect to the number of voxels visited [37]. Fig. 2 illustrates the concept of the voxel traversal algorithm.

A single ray, emanating from an X-ray source with a given direction, is shown cutting the voxel grid. The algorithm uses some simple steps to compute the voxels the ray passes through, shown by bold lines in the same figure. The algorithm by Amanatides and Woo consists of two phases, initialization and incremental traversal. The initialization phase calculates the parameter values which will cause the ray to move into a new slice of the grid. The variables tMaxX, tMaxY, and tMaxZ are used to store the parameter values for movements in the x-, y-, and z-axes, respectively. Δx, Δy, and Δz store the parameter increments required to move along the ray by a distance of one cell wide, one cell high, and one cell deep, respectively. The incremental traversal phase can then determine all the voxels that intersect with the X-rays by incrementing the exact movement along each ray. The algorithm starts from the first box the ray cuts and examines the smallest parameter value stored in tMaxX, tMaxY, and tMaxZ to determine the next box which must be chosen. The variables are updated and the algorithm repeats until the next selected box no longer resides in the voxel grid. This algorithm is one of the fastest for ray casting, as it requires very few floating point operations, and is the most accurate since it ensures each voxel is sampled only once. A framework for DRR generation using CUDA is presented in pseudocode in Algorithm 1. CUDA supports transfer of CT data between the main memory (host memory) and the local hardware memory (device memory). This operation is performed once during the DRR rendering process (line 2 in Algorithm 1), which eliminates the time required to move the data for each ray calculation. In the CUDA shared memory abstraction, threads can access data from multiple memory spaces during their execution. All threads have access to the same global memory, but each thread has its own private local memory. In addition, each block of a grid has shared memory which is visible to all threads of the block. Memory resources are limited for each core.
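Since the Algorithm 1 listing is not reproduced here, the following is a hedged CUDA sketch of the framework it describes: one thread is launched per DRR pixel and each thread walks its ray with the Amanatides-Woo incremental traversal. All names, the traversal bookkeeping, and the volume layout (axis-aligned, unit voxels spanning [0,nx] × [0,ny] × [0,nz]) are our assumptions, not the authors' code.

    #include <cuda_runtime.h>

    // Hypothetical sketch of the Algorithm 1 framework (not the authors' code).
    // One thread per DRR pixel; each thread accumulates the exponent of (2)
    // along its ray, weighting each voxel by the exact segment length inside it.
    __global__ void drrKernel(const float* mu, int nx, int ny, int nz,
                              float3 src, const float3* rayDirs,
                              float I0, float* drr, int w, int h)
    {
        int px = blockIdx.x * blockDim.x + threadIdx.x;
        int py = blockIdx.y * blockDim.y + threadIdx.y;
        if (px >= w || py >= h) return;
        float3 d = rayDirs[py * w + px];              // unit direction, host-computed

        // Slab-method ray-box intersection for the entry/exit parameters.
        float tIn = 0.0f, tOut = 3.0e38f;
        float org[3] = {src.x, src.y, src.z}, dir[3] = {d.x, d.y, d.z};
        float ext[3] = {(float)nx, (float)ny, (float)nz};
        for (int a = 0; a < 3; ++a) {
            float inv = 1.0f / dir[a];                // assumes no exactly-zero component
            float t0 = -org[a] * inv, t1 = (ext[a] - org[a]) * inv;
            tIn  = fmaxf(tIn,  fminf(t0, t1));
            tOut = fminf(tOut, fmaxf(t0, t1));
        }
        float sum = 0.0f;
        if (tIn < tOut) {
            // Amanatides-Woo initialization at the entry point.
            float ex = src.x + tIn * d.x, ey = src.y + tIn * d.y, ez = src.z + tIn * d.z;
            int ix = min(max((int)ex, 0), nx - 1);
            int iy = min(max((int)ey, 0), ny - 1);
            int iz = min(max((int)ez, 0), nz - 1);
            int sx = d.x > 0 ? 1 : -1, sy = d.y > 0 ? 1 : -1, sz = d.z > 0 ? 1 : -1;
            float dtx = fabsf(1.0f / d.x), dty = fabsf(1.0f / d.y), dtz = fabsf(1.0f / d.z);
            float tmx = tIn + (d.x > 0 ? ix + 1 - ex : ex - ix) * dtx;
            float tmy = tIn + (d.y > 0 ? iy + 1 - ey : ey - iy) * dty;
            float tmz = tIn + (d.z > 0 ? iz + 1 - ez : ez - iz) * dtz;
            // Incremental traversal: each voxel is visited exactly once.
            for (float t = tIn; t < tOut; ) {
                float tNext = fminf(fminf(tmx, tmy), fminf(tmz, tOut));
                sum += mu[(iz * ny + iy) * nx + ix] * (tNext - t);  // mu_i * x_i
                t = tNext;
                if (tmx <= tmy && tmx <= tmz) { ix += sx; tmx += dtx; if (ix < 0 || ix >= nx) break; }
                else if (tmy <= tmz)          { iy += sy; tmy += dty; if (iy < 0 || iy >= ny) break; }
                else                          { iz += sz; tmz += dtz; if (iz < 0 || iz >= nz) break; }
            }
        }
        drr[py * w + px] = I0 * expf(-sum);           // Beer's law (2)
    }

On the host, the CT volume and ray directions would be uploaded once with cudaMemcpy before the first launch (line 2 of Algorithm 1), and the kernel launched with small 2-D thread blocks (e.g., 16 × 16), reflecting the per-block resource limits discussed above.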


Fig. 3. 2-D illustration of the artifact which occurs when X-rays in a cone beam intersect with a CT volume. The ellipse in the figure highlights where sample points are located within the same voxel.

The number of threads per block is therefore limited as well (line 5 in Algorithm 1), since all threads of a block are expected to be located on the same processor core. In practice, for our beam geometry, we found that the test used by Amanatides and Woo to ensure voxels are sampled only once could be relaxed without incurring a significant loss in quality. The next section explains this simplification and introduces two sampling strategies which reduce the computational load while producing approximate DRR results that are of acceptable quality.

B. Sampling Methods for DRR Generation

A faster approach to ray-voxel traversal can be achieved by simply sampling the rays at regular intervals as they pass through the voxel grid. This is achieved by sampling along the X-ray path u + t v, 0 ≤ t ≤ 1, where u is the position of the start of the ray and v is the direction of the ray. The sampling is initialized at the first point where the ray cuts the CT volume, t_in, and terminates where it exits the CT volume, t_out. By incrementing from t_in by Δt, all the intersection points (voxels) along the X-ray path can be obtained, assuming all the rays are aligned to the voxel grid. However, in our application the X-rays are fired in a cone beam, and this causes sampling errors. The problem lies in the fact that for some X-rays, which are not parallel to the direction of the voxels inside the CT volume, incrementing by Δt may result in intersecting some voxels more than once, as illustrated in Fig. 3. Perspective rendering of DRRs with fixed sampling suffers from this artifact; however, as shown in Section IV, the artifact does not cause a significant loss in quality or registration accuracy. Our DRRs are implemented in two steps. The first step is a robust ray-box intersection, based on the algorithm by Williams et al. [38], which is used to determine which rays intersect the CT volume and to determine the points of intersection of each parameterised ray (in terms of in/out). For efficiency we compute t_in and t_out only for the ray used for the center pixel in the DRR; the same parameter values are then used for the other rays. The second step is a fixed-interval sampling algorithm based on a point-based rendering algorithm for DRRs originally developed by Shen and Luo [39].
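The paper's Algorithm 2 listing is likewise not reproduced in this text, so the device-side sketch below only illustrates the fixed-interval idea under our own naming; nearest-neighbor voxel lookup is used, consistent with the authors' later remark that no subvoxel interpolation is performed.

    #include <cuda_runtime.h>

    // Hypothetical sketch of fixed-interval sampling (the Algorithm 2 idea).
    // The ray u + t*v is sampled every dt between tIn and tOut; each sample
    // reads the nearest voxel, so oblique cone-beam rays may sample the same
    // voxel more than once (the artifact illustrated in Fig. 3).
    __device__ float sampleRay(const float* mu, int nx, int ny, int nz,
                               float3 u, float3 v, float tIn, float tOut,
                               float dt)
    {
        float sum = 0.0f;
        for (float t = tIn; t < tOut; t += dt) {
            int ix = (int)(u.x + t * v.x);
            int iy = (int)(u.y + t * v.y);
            int iz = (int)(u.z + t * v.z);
            if (ix < 0 || ix >= nx || iy < 0 || iy >= ny || iz < 0 || iz >= nz)
                continue;                    // sample fell just outside the grid
            sum += mu[(iz * ny + iy) * nx + ix] * dt;  // mu_i * x_i with x_i = dt
        }
        return sum;                          // exponent of (2); pixel = I0 * expf(-sum)
    }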

Fig. 4. Illustration of the different methods for DRR generation.

Our implementation is very similar to the conventional method of ray casting using the sampling method, except that the sampling distances are set according to the required DRR resolution. The approach is presented as Algorithm 2. Considerable gains in speed can be made by reducing the sampling frequency and rendering DRRs using fewer samples. Section IV explores 2-D/3-D registration using RR-DRRs and presents results in terms of speed and accuracy. For comparison, we implement a full resolution DRR (FR-DRR), computed with all n × m rays in the rendering field (where n and m are the dimensions of the detector plane) and using a fixed sampling interval along each ray, chosen to sample each voxel. The RR-DRR is computed by reducing the sampling in one of two ways. The first method, denoted RR1-DRR, reduces the number of rays in the rendering field; the second method, denoted RR2-DRR, works by reducing the number of sampling points inside the CT volume (by moving in increments greater than the voxel size). Fig. 4 provides an illustration of the different DRR generation methods. When generating RR1-DRRs, only a subset of the rays required for FR-DRRs is cast, and the missing pixel values are calculated based on a nearest neighbor interpolation scheme [40]. A balance can be struck between the quality of the resulting DRR and the efficiency by controlling the number of rays which intersect the CT volume. For example, one might generate an RR1-DRR by using 25% of the rays that are used to generate the FR-DRR. In this case each ray's intensity value is used to draw 4 pixels in the RR1-DRR, as illustrated in Fig. 5(a). The resulting RR1-DRR can be seen in Fig. 5(c).
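A minimal host-side sketch of this nearest-neighbor fill, assuming (as in the 25% example above) a 2× ray reduction in each axis and even image dimensions; the names are ours, not the paper's:

    // Hypothetical RR1-DRR fill: ray values computed on a (w/2) x (h/2) grid
    // are replicated into 2x2 pixel blocks of the full w x h image, i.e.
    // nearest-neighbor interpolation of the pixels whose rays were not cast.
    void fillRR1(const float* rayValues, float* drr, int w, int h)
    {
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x)
                drr[y * w + x] = rayValues[(y / 2) * (w / 2) + (x / 2)];
    }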


Fig. 5. Example of rendering an RR1-DRR using interpolated rays, where (a) shows a sample of an FR-DRR, (b) shows a grid of pixels (the green and gray areas indicate calculated and interpolated pixels in the DRR; 25% of the rays needed to generate an FR-DRR are used for the nearest neighbor interpolation), (c) shows an RR1-DRR from the same direction as the FR-DRR, and (d) shows a (rescaled) difference image formed by pixel-by-pixel subtraction between images (a) and (c) (white pixel = no difference).

Fig. 6. Example of RR2-DRR generation using reduced sampling, where (a) shows a sample of a lung FR-DRR, (b) and (c) show RR2-DRR images generated by reducing the sampling points by 50% and 75%, respectively, (d) shows a (rescaled) difference image formed by pixel-by-pixel subtraction between images (a) and (b) (white pixel = no difference), and (e) shows a (rescaled) difference image formed by pixel-by-pixel subtraction between images (a) and (c) (white pixel = no difference).

The error incurred by this approximation can be quantified using the peak signal-to-noise ratio (PSNR) between the maximum power in the original image and the power in the noise artifacts present in its approximation (4). Another method of rendering the RR-DRR (RR2-DRR) involves reducing the number of intersection points, which is achieved by changing the value of the variable Δt to double or triple the voxel size, depending on the required sampling percentage of the DRRs. Resulting RR2-DRRs are presented in Fig. 6. The vertical artifacts in the difference images of Fig. 6(d) and (e) are relatively larger error terms due to samples in voxels within the metal couch, clustered due to the cone beam geometry.
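In terms of the sampleRay sketch given earlier, RR2 generation only changes the step size; a hypothetical wrapper (our naming) makes the relationship explicit:

    // RR2-DRR sampling reuses the same ray sampler with a larger step, e.g.
    // reduction = 2 or 3 to double or triple the per-step distance and so
    // reduce the number of sample points along each ray.
    __device__ float sampleRayRR2(const float* mu, int nx, int ny, int nz,
                                  float3 u, float3 v, float tIn, float tOut,
                                  float dtFull, float reduction)
    {
        return sampleRay(mu, nx, ny, nz, u, v, tIn, tOut, reduction * dtFull);
    }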

Fig. 7. Illustration of the spherical motion (directions) of the ray source and detector plane around the CT Volume. The diagram incorporates possible locations for the ray source and the detector plane.

Fig. 8. Comparison of the speed of the GPU-based method versus the CPU-based method of rendering DRRs using different sizes of CT volumes. CPU-based DRRs were rendered using a Precision Workstation T5500 dual quad-core, Intel 2.3 GHz. GPU-based DRRs were generated using an NVidia GeForce 8800 GTX.

Our implementation geometry renders DRRs from an X-ray source located on the surface of a sphere centered at the patient isocenter. The motion of the detector plane is coupled to the motion of the ray source but offset by 180°, as illustrated in Fig. 7. This geometry simulates movement of both the couch (usually adjustable in 6 degrees of freedom) and the X-ray C-arm.

IV. PERFORMANCE EVALUATION OF GPU-BASED RENDERING OF DRRS

A. Rendering Time (Speed)

To compare CPU and GPU implementations, the same algorithm for rendering DRRs is implemented in C++ for the CPU and in CUDA for the GPU. Results show that the rendering time is significantly reduced for the GPU-based method compared to that of the CPU, as illustrated in Fig. 8. This result is not too surprising given the parallel CUDA implementation of the DRR generation algorithm. However, it is interesting to see how close the times come to interactive rates, enabling DRR generation to be integrated within an interactive registration tool for CT datasets of modest size. Table I shows the timing results of rendering DRRs of different sizes and resolutions using a GPU (NVidia GeForce 8800 GTX).


TABLE I RESULTS IN MILLISECONDS OF RUNNING THE GPU-BASED RENDERING ALGORITHM FOR RR DRRS USING THE LUNG AND PELVIS CT DATASETS. DRRS WERE RENDERED WITH AN ACCURATE METHOD AND REDUCTION IN THE SAMPLED POINTS BY 50% AND 75% OF THE FR-DRR USING AN NVIDIA GEFORCE 8800 GTX, WHERE S REFERS TO SAMPLING


TABLE III SPEEDUP RATIO FOR THE GPU-BASED METHOD OVER THE CPU-BASED METHOD, AND THE PSNR RATIO BETWEEN DRR IMAGES RENDERED USING BOTH METHODS. CPU-BASED DRR IMAGES WERE RENDERED USING A PRECISION WORKSTATION T5500 DUAL QUAD CORE, INTEL 2.3 GHZ. GPU-BASED DRR IMAGES WERE RENDERED USING AN NVIDIA GEFORCE GTX 580

TABLE II RESULTS IN MILLISECONDS OF RUNNING THE GPU-BASED RENDERING ALGORITHM OF RR DRR IMAGES USING DIFFERENT CT VOLUMES (PELVIS AND LUNG). DRR IMAGES WERE RENDERED WITH REDUCTION IN THE SAMPLED POINTS BY 50% AND 75% OF THE FR-DRR IMAGE USING NVIDIA GEFORCE GTX 580, WHERE S REFERS TO SAMPLING

Fig. 9. Sample of DRRs rendered from pelvis CT volume data, where (a) shows an accurate DRR rendered using Amanatides and Woo [36], (b) shows a sampled FR-DRR, and (c) shows a (rescaled) difference image (PSNR = 32.6 dB; white pixel = no difference).

The GeForce 8800 GTX contains 128 streaming processor cores running at a frequency of 575 MHz and has a total memory of 768 MB. Table II shows the timing results of rendering DRRs of different sizes and resolutions using a more recent GPU (NVidia GeForce GTX 580). The GeForce GTX 580 contains 512 streaming processor cores running at a frequency of 1544 MHz and has a total memory of 1536 MB. We are able to render a DRR from a 512 × 512 × 267 CT volume in approximately 184 ms. This is a worst case figure since, for our geometry, all of the rays intersect the CT volume, which means we must find all the intersection points for each ray. The DRR rendering times presented in Tables I and II include the time required for the other internal operations in the DRR rendering pipeline, such as copying memory from the host to the device, calculating ray directions, and transforming the detector plane. When rendering small DRRs from small CT volumes we observed that the computation time becomes clamped at 11 ms for the NVidia 8800 GTX and 1 ms for the NVidia GTX 580, because these other processes tend to dominate. We also note that the GPU-based rendering required approximately 2 ms to render an FR-DRR from a 256 × 256 × 133 CT volume, which gives a speed-up ratio of 98 times over an equivalent CPU-based approach (using OpenMP with eight cores). More results for the speed-up ratio between GPU-based and CPU-based rendering of DRRs are presented in Table III. The structure of the whole registration process is presented in Section V.

B. Image Quality

By visual examination of the DRRs in Figs. 5 and 6 it is hard to see the difference between the FR-DRRs and the RR-DRRs, but a quantitative comparison can be obtained by computing the PSNR

    \mathrm{PSNR} = 20 \log_{10} \frac{I_{\mathrm{MAX}}}{\sqrt{\mathrm{MSE}}}    (4)

where I_MAX is the largest pixel value that can be represented and √MSE is the RMS difference between an accurate DRR computed by [36] and its approximation. Fig. 9 illustrates an example of an accurate DRR compared to an FR-DRR. The images are created from CT volume data of a human pelvis. Results show that the rendered DRRs are very similar in quality, as the PSNR between them is above 32 dB. Given the speed improvement of the sampling approach over the accurate one, our further tests focus on comparing the RR methods against the full resolution sampling method. A set of PSNRs was computed between the FR-DRR and each RR-DRR approach, shown in Table IV. CT datasets for a human lung and a human pelvis were used in the generation of a set of DRRs at different resolutions. Differences between FR-DRRs and RR-DRRs in Figs. 5 and 6 are visually imperceptible, as their PSNR is well above that considered by experts in image compression as indicative of excellent image quality [41]. Furthermore, our experiments confirm these DRRs can be used successfully within a registration algorithm. The results in Table IV show that using 25% of the total number of rays (RR1-DRR) on voxel grids containing fewer than 256 × 256 × 172 voxels reduces the quality.
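For reference, (4) is straightforward to evaluate; a small helper of our own (with I_MAX passed explicitly, e.g., 255 for 8-bit images):

    #include <cmath>
    #include <cstddef>

    // PSNR of an approximate DRR against a reference image, per (4).
    double psnr(const float* ref, const float* approx, std::size_t n, double iMax)
    {
        double mse = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double d = (double)ref[i] - (double)approx[i];
            mse += d * d;
        }
        mse /= (double)n;
        return 20.0 * std::log10(iMax / std::sqrt(mse));  // +infinity when mse == 0
    }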


TABLE IV SET OF PSNRS BETWEEN REDUCED RESOLUTION DRRS AND FR DRRS, WHERE P: PELVIS VOLUME AND L: LUNG VOLUME

In addition, if the number of sample points inside the CT volume is reduced to 25% of the FR-DRR for voxel grids with a resolution of less than 256 × 256 × 133 voxels, then the approximations no longer adequately represent the FR-DRR. A PSNR analysis confirmed that there is no difference at all between the DRRs generated on the CPU and those generated on the GPU, as the PSNR ≡ ∞ (MSE ≡ 0). In contrast, Mori et al. [28] reported a difference in the quality of DRRs rendered using GPU-based and CPU-based methods in single and double precision, respectively. In our method, we did not face this problem as we are not performing any interpolation at a subvoxel level to render a DRR. The next section describes how our DRR generation methods may be used within a hybrid algorithm for 2-D/3-D registration involving components which execute on the CPU and GPU.

V. HYBRID APPROACH TO 2-D/3-D REGISTRATION

Bearing in mind the differences between CPU and GPU architectures, and the difference in the complexity of the applications, we found that the best way to achieve optimum 2-D/3-D image registration performance was a hybrid (CPU/GPU-based) system. This approach achieves the maximum benefit from both GPU and CPU, taking into consideration the registration process and its internal subprocesses (optimization, similarity measure, and DRR rendering). In this section, we describe the structure and performance of a hybrid 2-D/3-D image registration system in which processes are split between the CPU and GPU.

A. Structure of the Registration Using a Hybrid Approach

We propose a structure for the registration process that reflects the complexity of the subprocesses and the capabilities of the CPU and GPU. To evaluate the complexity of the internal processes, we give the registration process a mathematical description: f(n) = α × (n³ + n²) + ζ, where α is the number of transformation operations (translation and rotation) required by the optimization process, n represents a dimension of the CT volume, and ζ is a constant representing the total number of support operations, such as computing the direction of X-rays and transforming the detector plane. From this equation, the registration process must render α DRRs of complexity O(n³) and perform similarity measure operations of complexity O(n²), so the overall complexity of the registration process is O(n³).
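The O(n²) similarity term referred to here is, in our system, normalized cross-correlation (NCC, see Section V-B); a minimal CPU sketch of it (our own implementation, not the paper's code):

    #include <cmath>
    #include <cstddef>

    // Normalized cross-correlation between two images of n pixels each;
    // returns a value in [-1, 1], with 1 indicating a perfect linear match.
    double ncc(const float* a, const float* b, std::size_t n)
    {
        double meanA = 0.0, meanB = 0.0;
        for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
        meanA /= (double)n;
        meanB /= (double)n;
        double num = 0.0, varA = 0.0, varB = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            double da = a[i] - meanA, db = b[i] - meanB;
            num  += da * db;
            varA += da * da;
            varB += db * db;
        }
        return num / std::sqrt(varA * varB);
    }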

The high complexity of the DRR rendering process is due to the huge number of rays that intersect the voxels inside the CT volume. In spite of this high complexity, an important characteristic can be exploited, namely that the ray intersections can be computed independently. Therefore, each ray can be processed individually on the GPU, benefiting from the large number of processor cores available. The similarity measure process, on the other hand, can be implemented as either a CPU-based or a GPU-based process; the decision depends mainly on the way the DRR rendering process is implemented. In the first case, implementing the similarity measure as a CPU-based process, the DRRs must be copied back from the device memory (GPU memory) to the host memory (CPU memory), which could lead to a synchronization problem on the GPU device: it is busy rendering a DRR and at the same time must return the previous DRR to the host memory to be evaluated by the similarity measure process. From the mathematical description of the registration process above, we know that the similarity measure is not a high complexity process and normally does not take long to execute on the CPU; for example, it requires about 2 ms to measure the similarity between two images of size 512 × 267 pixels using a dual quad-core Intel 2.3-GHz processor. Transferring the DRRs from the device memory to the host memory requires almost the same time. This first case, with the similarity measure implemented as a CPU-based process, is illustrated in Fig. 10. In the second case, implementing the similarity measure as a GPU-based process, no data transfer between the device and the host memory is required, as the DRRs are stored in the device memory, but time is required to perform the similarity measure on the GPU. In this case there will not be a delay or synchronization problem, but more device memory will be required to store the DRR. For a DRR containing 512 × 267 pixels, where each pixel requires 4 bytes, a total of 0.52 MB is needed, which by modern GPU standards does not present a significant issue. This second case is illustrated in Fig. 11. Storing the CT volume on the GPU presents a limiting factor. The required memory space on the GPU varies according to the size of the CT volume and the rendered DRR. The total required size in bytes is sizeof(float) × (CT-dimX × CT-dimY × CT-dimZ + DetectorPlane-dimX × DetectorPlane-dimY + DRR-dimX × DRR-dimY), where sizeof(float) = 4. For example, the device memory required to render a DRR of size 512 × 267 pixels from a CT volume of size 512 × 512 × 267 voxels is 4 × 512 × 512 × 267 + 4 × 512 × 267 + 4 × 512 × 267 = 281,063,424 bytes ≈ 268 MB.
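This memory budget is easy to check programmatically; a small sketch (our own helper) reproducing the worked example:

    #include <cstddef>
    #include <cstdio>

    // Device memory needed to render one DRR: one float per CT voxel,
    // per detector-plane element, and per DRR pixel.
    std::size_t deviceBytes(int ctX, int ctY, int ctZ,
                            int detX, int detY, int drrX, int drrY)
    {
        return sizeof(float) * ((std::size_t)ctX * ctY * ctZ
                              + (std::size_t)detX * detY
                              + (std::size_t)drrX * drrY);
    }

    int main()
    {
        // Worked example from the text: 512 x 512 x 267 CT volume, 512 x 267 DRR.
        std::printf("%zu bytes\n", deviceBytes(512, 512, 267, 512, 267, 512, 267));
        // Prints 281063424 bytes (about 268 MB).
        return 0;
    }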


Fig. 10. Single iteration of the registration process, where the DRR rendering is implemented as a GPU-based rendering process and the similarity measure process is implemented as a CPU-based process.

B. Performance of the Registration Using a Hybrid Approach

In this section, we measure the performance of the hybrid registration system in terms of speed and accuracy, the main components of registration performance.
1) Speed: The speed of the registration process depends mainly on the speed of rendering the DRRs, as already illustrated in Section V-A.
2) Accuracy: The accuracy of the registration process is dependent on the quality and resolution of the DRRs, as there is no change in the other internal processes (similarity measure and optimisation). Results show there is no difference at all between DRRs rendered using the CPU and DRRs rendered using the GPU, as the PSNRs were ∞ in every case (see Table III). The accuracy of the similarity measure process (NCC) is likewise unaffected by whether it is implemented as a CPU-based or a GPU-based process, because it is the same mathematical operation in both cases. We therefore conclude that the accuracy of the hybrid 2-D/3-D image registration system is not affected by the system being split between the CPU and the GPU. Referring to the different methods of rendering reduced resolution DRRs introduced in Section III-B, we compared the results of performing the 2-D/3-D registration process using FR-DRRs and RR-DRRs, with different compression ratios obtained using the sampling and reduced-rays methods. Fig. 12 shows the results of performing the 2-D/3-D registration process between a ground truth reference, captured by an electronic portal imaging device with a kilovoltage X-ray source, and floating DRRs in the range 0° → +20°.


Fig. 11. Single iteration of the registration process, where the DRR rendering is implemented as a GPU-based rendering process and the similarity measure process is implemented as a GPU-based process.

Fig. 12. Results of performing the 2-D/3-D registration process between a kilovoltage reference image and the DRRs in the range 0° → +20°, where (a) shows results using FR-DRRs, (b) shows results using RR2-DRRs with the sampling method at 50%, (c) shows results using RR2-DRRs with the sampling method at 75%, and (d) shows results using RR1-DRRs with the method of reducing the number of rays at 75% compression.



Fig. 13. Accuracy for the performance of the 2-D/3-D registration using different methods of rendering the DRRs (full and reduced resolution images).

The DRRs are rendered in FR mode, at reduced resolution using the sampling method (with 50% and 75%), and at reduced resolution using the method of reducing the number of rays (with 75% compression). In Fig. 13, we have combined the four curves from Fig. 12 to compare the 2-D/3-D registration accuracy resulting from the different rendering methods. The results show that the performance is almost the same when using the FR-DRRs and the RR-DRRs with 50% compression using the sampling method. However, the performance using RR-DRRs with 75% compression did not provide an accurate result using either method. These results are unsurprising given the PSNR values (< 36 dB) for the pelvis RR-DRR presented in Table IV. At large angles (i.e., > 17°), the registration error becomes severe. We believe this is due to estimates we are forced to make concerning the beam geometry.

VI. CONCLUSION AND SUGGESTIONS FOR FURTHER WORK

In this paper, we accelerated the image registration process by enhancing the speed of rendering DRRs using the GPU, and we proposed a hybrid CPU/GPU-based registration algorithm that splits the registration process between the CPU and GPU to obtain the maximum performance in terms of speed and accuracy. Using CUDA together with an approximate sampling strategy for rendering DRR images improves the speed without significantly compromising the DRR image quality or registration accuracy. We are able to render a DRR from a 256 × 256 × 133 CT volume in about 24 ms using an NVidia GeForce 8800 GTX and in about 2 ms using an NVidia GeForce GTX 580. These figures compare favorably with results recently published using forward-projective (shader based) implementations on the GPU. Recently, Mensmann et al. [17] addressed volume visualization by ray casting on the GPU and concluded that only small gains in performance are possible unless factors such as the shared memory model are factored into the design. Our algorithm does not explicitly manage a shared device cache, but our experience is that in the context of ray-cast DRRs (no reflections, nearest neighbour interpolation) the benefits of such a strategy are more limited, as samples are only accessed once. Improvements in image quality by adopting more advanced interpolation would draw performance benefits from the device cache. This will form the basis of further work. Finally, although the results from this paper demonstrate that interactive 2-D/3-D registration using sparsely rendered DRRs can be achieved with an accuracy which is clinically acceptable for patient setup prior to delivery of radiation treatment and for IGRT of certain cancer sites, the off-the-shelf GPUs used do not have FDA approval. A more suitable platform for clinical evaluation would be the NVidia TESLA GPU.

ACKNOWLEDGMENT

The authors would like to acknowledge their collaboration with the Colney Oncology Centre, Norfolk and Norwich University Hospital, U.K., and would also like to thank them for providing the CT data.

REFERENCES

[1] N. Milickovic, D. Baltas, S. Giannouli, M. Lahanas, and N. Zamboglou, "CT imaging based digitally reconstructed radiographs and their application in brachytherapy," Phys. Med. Biol., vol. 45, pp. 2787–2800, 2000.
[2] A. Fielding, P. Evans, and C. Clark, "The use of electronic portal imaging to verify patient position during intensity-modulated radiotherapy delivered by the dynamic MLC technique," Int. J. Radiat. Oncol. Biol. Phys., vol. 54, no. 4, pp. 1225–1234, 2002.
[3] O. Dorgham and M. Fisher, "Performance of 2D/3D medical image registration using compressed volumetric data," in Proc. MIUA, Jul. 2008, pp. 261–265.
[4] O. Dorgham, M. Fisher, and S. Laycock, "Performance of a 2D-3D image registration system using (lossy) compressed X-ray CT," Ann. BMVA, vol. 2009, no. 3, pp. 1–11, Oct. 2009. [Online]. Available: http://www.bmva.org/annals/2009/2009-0003.pdf
[5] R. S. Brock, A. Docef, and M. J. Murphy, "Reconstruction of a cone-beam CT image via forward iterative projection matching," Med. Phys., vol. 37, no. 12, pp. 6212–6220, 2010.
[6] D. Godfrey, F. Yin, M. Oldham, and C. Willett, "Digital tomosynthesis with an on-board kilovoltage imaging device," Int. J. Radiat. Oncol. Biol. Phys., vol. 65, pp. 8–15, 2006.
[7] J. Wilkinson, "Geometric uncertainties in radiotherapy," Brit. J. Radiol., vol. 77, pp. 86–87, 2004.
[8] A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging. Piscataway, NJ: IEEE Press, 1988.
[9] J. Britten and W. Guan, Visualisation of reciprocal space, http://www.rhpcs.mcmaster.ca//guanw/gallery.html, Jul. 2010.
[10] A. Khamene, P. Bloch, W. Wein et al., "Automatic registration of portal images and volumetric CT for patient positioning in radiation therapy," Med. Image Anal., vol. 10, pp. 96–112, 2006.
[11] P. Lacroute and M. Levoy, "Fast volume rendering using a shear-warp factorization of the viewing transformation," in Proc. SIGGRAPH '94, Orlando, FL, Jul. 1994, pp. 451–458.
[12] L. Westover, "Footprint evaluation for volume rendering," in Proc. SIGGRAPH '90, 1990, pp. 367–376.
[13] D. B. Russakoff, T. Rohlfing, K. Mori et al., "Fast generation of digitally reconstructed radiographs using attenuation fields with application to 2D-3D image registration," IEEE Trans. Med. Imaging, vol. 24, no. 11, pp. 1441–1454, Nov. 2005.
[14] T. Malzbender, "Fourier volume rendering," ACM Trans. Graphics, vol. 12, no. 3, pp. 233–250, 1993.
[15] J. Krüger and R. Westermann, "Acceleration techniques for GPU-based volume rendering," in Proc. IEEE Conf. Vis., Seattle, WA, 2003, pp. 287–292.
[16] S. Lefebvre, S. Hornus, and F. Neyret, "Texture sprites: Texture elements splatted on surfaces," in Proc. ACM SIGGRAPH Symp. Interactive 3D Graphics (I3D), Apr. 2005. [Online]. Available: http://www-evasion.imag.fr/Publications/2005/LHN05
[17] J. Mensmann, T. Ropinski, and K. Hinrichs, "An advanced volume raycasting technique using GPU stream processing," in Proc. Int. Conf. Comput. Graphics Theory Appl., 2010, pp. 190–198.
[18] CUDA Programming Guide, NVIDIA, Santa Clara, CA, Feb. 2010.


[19] D. Laney, S. P. Callahan, N. Max, C. T. Silva, S. Langer, and R. Frank, "Hardware-accelerated simulated radiography," in Proc. IEEE Vis., Lawrence Livermore Nat. Lab., Berkeley, CA, Oct. 2005, pp. 343–350.
[20] C. Metz, "Digitally reconstructed radiographs," Master's thesis, Utrecht University, Utrecht, The Netherlands, 2005.
[21] H. Yan, L. Ren, D. J. Godfrey, and F. F. Yin, "Accelerating reconstruction of reference digital tomosynthesis using graphics hardware," Med. Phys., vol. 34, no. 10, pp. 3768–3776, 2007.
[22] J. Spoerk, H. Bergmann, W. Birkfellner, F. Wanschitz, and S. Dong, "Fast DRR splat rendering using common consumer graphics hardware," Med. Phys., vol. 34, pp. 4302–4309, 2007.
[23] D. Ruijters, B. M. H. Romeny, and P. Suetens, "GPU-accelerated digitally reconstructed radiographs," in Proc. BioMed, Best, The Netherlands, Feb. 2008, pp. 13–15.
[24] D. A. LaRose, "Iterative X-ray/CT registration using accelerated volume rendering," Ph.D. dissertation, Carnegie Mellon Univ., 2001.
[25] M. Grabner, T. Pock, T. Gross, and B. Kainz, "Automatic differentiation for GPU-accelerated 2D/3D registration," in Advances in Automatic Differentiation, C. H. Bischof, H. M. Bücker, P. D. Hovland, U. Naumann, and J. Utke, Eds. New York: Springer, 2008, pp. 259–269.
[26] F. Ino, J. Gomita, Y. Kawasaki, and K. Hagihara, "A GPGPU approach for accelerating 2-D/3-D rigid registration of medical images," in Proc. Parallel Distrib. Process. Appl., 2006, pp. 939–950.
[27] Y. Lu, W. Wang, S. Chen, Y. Xie, J. Qin, W.-M. Pang, and P.-A. Heng, "Accelerating algebraic reconstruction using CUDA-enabled GPU," in Proc. Int. Conf. Comput. Graph., Imag. Vis., 2009, pp. 480–485.
[28] S. Mori, M. Kobayashi, M. Kumagai, and S. Minohara, "Development of a GPU-based multithreaded software application to calculate digitally reconstructed radiographs for radiotherapy," Radiological Phys. Technol., vol. 2, no. 1, pp. 40–45, 2009.
[29] C. Bethune and A. J. Stewart, "Accelerated computation of digitally reconstructed radiographs," in Proc. Int. Congr. Series, vol. 1281, 2005, pp. 98–103.
[30] J. Spoerk, "High-performance GPU based rendering for real-time, rigid 2D/3D-image registration in radiation oncology," Ph.D. dissertation, Vienna Univ. Technol., 2010.
[31] W. Birkfellner, R. Seemann, M. Figl, J. Hummel, C. Ede, P. Homolka, X. Yang, P. Niederer, and H. Bergmann, "Wobbled splatting: a fast perspective volume rendering method for simulation of X-ray images from CT," Phys. Med. Biol., vol. 50, no. 9, pp. N73–N84, 2005.
[32] M. Kraus and T. Ertl, "Pre-integrated volume rendering," in The Visualization Handbook, C. Hansen and C. Johnson, Eds. New York: Elsevier, 2005, ch. 10, pp. 211–228.
[33] S. Bergner, T. Möller, D. Weiskopf, and D. Muraki, "A spectral analysis of function composition and its implications for sampling in direct volume visualization," IEEE Trans. Vis. Comput. Graph., vol. 12, no. 5, pp. 1353–1360, Sep./Oct. 2006.
[34] D. B. Kirk and W.-m. W. Hwu, Programming Massively Parallel Processors: A Hands-on Approach. New York: Elsevier, 2010.
[35] R. L. Siddon, "Fast calculation of the exact radiological path for a three-dimensional CT array," Med. Phys., vol. 12, no. 2, pp. 252–255, 1985.
[36] J. Amanatides and A. Woo, "A fast voxel traversal algorithm for ray tracing," in Proc. Eurographics, 1987, pp. 3–10.
[37] H. Nguyen, GPU Gems 3. Reading, MA: Addison-Wesley, 2007.
[38] A. Williams, S. Barrus, R. K. Morley, and P. Shirley, "An efficient and robust ray-box intersection algorithm," J. Graph., GPU, Game Tools, vol. 10, no. 1, pp. 49–54, 2005.


[39] A. Shen and L. Luo, "Point-based digitally reconstructed radiograph," in Proc. Int. Conf. Pattern Recog., 2008, pp. 1–4.
[40] T. M. Lehmann, C. Gönner, and K. Spitzer, "Survey: Interpolation methods in medical image processing," IEEE Trans. Med. Imag., vol. 18, no. 11, pp. 1049–1075, Nov. 1999.
[41] S.-C. Huang, L.-G. Chen, and H.-C. Chang, "A novel image compression algorithm by using log-exp transform," in Proc. IEEE Int. Symp. Circuits Syst., vol. 4, 1999, pp. 17–20.

Osama M. Dorgham received the B.Sc. degree in computer science from Princess Sumaya University for Technology, Amman, Jordan, the M.Sc. degree in computer science from Al Balqa Applied University, Al Salt, Jordan, and the Ph.D. degree from the University of East Anglia, Norwich, U.K. He is currently an Assistant Professor in the Department of Computer Information Systems, Al Balqa Applied University. His research interests include medical imaging and computer graphics.

Stephen D. Laycock received the B.Sc. degree in computing science in 2001 and the Ph.D. degree in the field of haptic rendering for virtual environments in 2005 both from the University of East Anglia (UEA), Norwich, U.K. Since 2005, he has been a Lecturer in computer graphics at the UEA. His research interests include haptic rendering, collision detection, molecular graphics, and applications of graphics to cultural heritage.

Mark H. Fisher received the B.Sc. degree in electrical and electronic engineering from Aston University, Birmingham, U.K., and the M.Sc. degree in microprocessor engineering and digital electronics and the Ph.D. degree both from the Department of Computation, Institute of Science and Technology, University of Manchester, Manchester, U.K. He is currently a Senior Lecturer in the School of Computing Sciences, University of East Anglia, Norwich, U.K. His research interests include biomedical imaging.
