Invited Paper

GPU programming for biomedical imaging

Luca Caucci^a and Lars R. Furenlid^a,b

^a Center for Gamma-Ray Imaging, The Univ. of Arizona, Tucson, AZ, USA
^b College of Optical Sciences, The Univ. of Arizona, Tucson, AZ, USA

ABSTRACT

Scientific computing is rapidly advancing due to the introduction of powerful new computing hardware, such as graphics processing units (GPUs). Affordable thanks to mass production, GPU processors enable the transition to efficient parallel computing by bringing the performance of a supercomputer to a workstation. We elaborate on some of the capabilities and benefits that GPU technology offers to the field of biomedical imaging. As practical examples, we consider a GPU algorithm for the estimation of the position of interaction from photomultiplier tube (PMT) data, as well as a GPU implementation of the MLEM algorithm for iterative image reconstruction.

Keywords: GPU, CUDA, parallel computing, medical imaging.

1. INTRODUCTION

Up until a few years ago, clock scaling was the dominant way to increase the performance of a computer. Electronic components were designed to operate at higher and higher frequency so that the frequency of the clock that drove them could be increased. Assuming that the execution of an instruction took a constant number of clock cycles, as the frequency increased, the time required to carry out an instruction shortened. However, about a decade ago, clock scaling started to show its limitations.1 Circuit technology poses a limit on the speed at which a logic gate can switch from one state to the other. Furthermore, as the frequency (clock rate) increases, power consumption increases at the same rate,2 and cooling becomes a major problem. Another problem is that the speed at which a signal propagates inside a processing unit is finite. For a clock frequency of, say, 3 GHz, the distance a signal can travel during a clock cycle is no more than a few centimeters. Building small circuits is not always possible, and, in any case, a small circuit is harder to cool than a larger one consuming the same amount of power.

A possible way to increase performance and overcome the problems mentioned above is to abandon serial computation (in which only one execution flow exists at any time and the instructions are executed sequentially) and consider hardware and software components capable of carrying out parallel computing. More specifically, we refer to parallel computing as the ability to carry out many calculations simultaneously by using multiple processing elements working independently.3-5 Parallel computing solves the problems with frequency scaling. Rather than designing a single, powerful processing element that carries out operations quickly, parallel computing allows us to break down a problem into subproblems that are solved concurrently by many slower processing elements. Collectively, the computational power of these slower processing elements surpasses that of the single, faster processing element.

One way to achieve parallel computation is by means of multi-core computers. A multi-core computer3,6 is equipped with a single CPU that supports two or more concurrent execution threads. The cores in a multi-core CPU execute instructions independently of each other. All the cores are implemented in the same physical package, so they might share some on-die resources, such as cache memory. The number of cores is usually rather small: dual- and quad-core architectures are currently the most common. The majority of the CPUs produced today are multi-core.

Further author information: (Send correspondence to L.C.)
L.C.: E-mail: [email protected], Telephone: 1 520 626 4162
L.R.F.: E-mail: [email protected], Telephone: 1 520 626 4256

Medical Applications of Radiation Detectors V, edited by H. Bradford Barber, Lars R. Furenlid, Hans N. Roehrig, Proc. of SPIE Vol. 9594, 95940G · © 2015 SPIE · CCC code: 0277-786X/15/$18 · doi: 10.1117/12.2195217


More recently, graphics processing units (GPUs) have become a viable alternative to multi-core CPUs. No longer confined to scene rendering for gaming, today's GPUs have become sophisticated enough that general-purpose parallel algorithms can be coded and run on a GPU device. GPU technology has found applications in many scientific fields, ranging from signal processing to medical imaging, and from life sciences to fluid dynamics. Thanks to mass production and constant hardware improvements, GPU technology offers state-of-the-art computational power at affordable prices. Because of their intended use in highly parallel, data-intensive applications, one of the key design strategies of GPU devices is to favor data manipulation and processing over flow control. Indeed, this is the case for most gaming and entertainment applications, in which the same operation is performed on many data elements without the need for complicated flow control. In a GPU, many transistors are employed to optimize and speed up floating-point operations and data manipulation, while support for advanced flow control (such as branch prediction and instruction pipelining) is generally lacking or very limited.

A variety of software development kits (SDKs) for GPU programming are available, one of them being the CUDA computing platform and programming model introduced in 2007 by NVIDIA. Thanks to its simple yet powerful model, CUDA has become extremely popular and successful in the scientific community. CUDA is a minimal extension to the C/C++ programming language that allows the development of parallel algorithms that can run on modern GPUs.

We present two illustrative examples of GPU programming related to the field of medical imaging. In the first example, we discuss a real-time maximum-likelihood algorithm for the estimation of gamma-ray photon parameters from photomultiplier tube (PMT) or silicon-photomultiplier (SiPM) data. In the second example, we examine a possible GPU implementation of the maximum-likelihood expectation-maximization (MLEM) algorithm. These examples have the dual intent of showing how highly parallelizable algorithms can be implemented in CUDA, and of illustrating the unique capabilities (such as the shared and texture memory spaces) of GPU devices. We conclude this paper by arguing that GPU technology has enabled the development of new detector concepts, such as photon-processing detectors, capable of performing maximum-likelihood estimation of parameters (such as position, energy, time of arrival, and direction of propagation) on a photon-by-photon basis.

In Sec. 2 we provide some details of the CUDA programming model, which are then built upon in Sec. 3 to discuss kernel launch and thread execution. The example applications of GPU programming in biomedical imaging are discussed in Sec. 4. In Sec. 5, we introduce some of the GPU technological innovations that will become available in the near future, and we discuss what kind of applications these new technologies will enable. Finally, in Sec. 6 we conclude this paper and summarize the benefits of GPU computing for biomedical imaging applications.

2. THE CUDA PROGRAMMING MODEL

In the CUDA language, the GPU unit is usually referred to as the device. The device acts as a coprocessor to the rest of the computer (which is usually referred to as the host), so that compute-intensive pieces of code and the data they operate on are offloaded to the GPU device for faster execution. Host memory is the conventional system memory, while device memory is physically installed on the GPU card along with the GPU cores and control logic (see Figure 1). The GPU can directly access only the memory installed on the GPU device. Specialized library functions are provided to copy blocks of data from host memory to device memory, from device memory to host memory, or even from one location in device memory to another.7

In order to access the parallel capabilities of a CUDA-enabled device, the programmer writes kernels. A kernel in execution is referred to as a thread. Thus, a kernel is a piece of code, while a thread is an abstract entity that represents a piece of code that is executing. A single kernel might therefore give rise to many threads, each of them working with different inputs and, potentially, starting to execute at different points in time. In CUDA, threads are grouped into 1D, 2D, or 3D blocks, and blocks are grouped into 1D, 2D, or 3D grids. When calling a kernel, a particular calling syntax is used to specify the size of blocks and grids. The maximum number of threads per block is a relatively small number; for current GPU hardware implementations,7 this number does not exceed 1024. The number of blocks in a grid can be much higher: current hardware supports grids that can contain millions of blocks. Figure 2 illustrates these concepts and presents an example of grouping threads into blocks and blocks into a grid.
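As one possible sketch of the data-transfer functions mentioned above (the buffer size and variable names are illustrative, and error checking is omitted), the following fragment copies data from host to device, from device to device, and back from device to host:

    #include <cuda_runtime.h>
    #include <stdlib.h>

    // Illustrative data transfers between host and device memory.
    void copy_example(void)
    {
        const int K = 1024;
        size_t bytes = K * sizeof(float);

        float *buf_host = (float *) malloc(bytes);
        float *buf_dev, *copy_dev;
        cudaMalloc((void **) &buf_dev, bytes);
        cudaMalloc((void **) &copy_dev, bytes);

        cudaMemcpy(buf_dev, buf_host, bytes, cudaMemcpyHostToDevice);     // host -> device
        cudaMemcpy(copy_dev, buf_dev, bytes, cudaMemcpyDeviceToDevice);   // device -> device
        cudaMemcpy(buf_host, copy_dev, bytes, cudaMemcpyDeviceToHost);    // device -> host

        cudaFree(buf_dev);
        cudaFree(copy_dev);
        free(buf_host);
    }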


Figure 1. Diagram of a computer equipped with a GPU device; the host side comprises the CPU, host memory, and disk I/O, while the device side comprises the GPU cores, GPU logic, and device memory (adapted from Ref. 8)

Figure 2. Thread and block hierarchies in CUDA: threads are grouped into blocks, and blocks are grouped into a grid (adapted from Ref. 8)


Each GPU device is characterized by a compute capability, which is a number of the form X.Y, in which X is called the major version number and Y is called the minor version number. The major version number refers to the core architecture, while the minor version number corresponds to an incremental improvement to the core architecture. Overall, the compute capability identifies the set of features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the present GPU.7 As an example, floating-point operations in double precision are supported by GPU devices with compute capability 1.3 or higher. GPU devices currently available typically have compute capability 5.0 or higher and support a wide range of features, one of them being dynamic parallelism, which we discuss later.

A GPU device is equipped with a few different memory spaces. Device memory can be local, shared, or global. Local memory has a scope of one thread: only the thread to which a portion of this memory is associated can access it. Local memory is used to store the content of variables declared inside a kernel function. This type of memory is automatically handled by the CUDA compiler and, when possible, mapped to GPU registers to increase performance.

Shared memory is a particular type of on-chip memory that can be shared among all the threads in a thread block. This type of memory is very fast, but only 16 kB or 48 kB (depending on the compute capability) of shared memory are available to each block. Shared memory is used to share variables and data with other threads in the same block. Because all the threads in the block have read and write access to shared memory, care must be taken to avoid the case in which a thread reads the content of a shared memory variable before another thread has finished writing to the same variable. In computing, this situation is called a race condition and might lead to final results that depend on the actual order in which threads are scheduled for execution. The CUDA language provides synchronization mechanisms to avoid race conditions.

Global memory is visible to all threads and to the host, and its lifetime extends to the whole application. Global memory is off-chip and it is not cached. Because of this, accessing global memory usually requires hundreds of clock cycles. Global memory is typically used to share input and output data between the device and the host. This is accomplished by means of special function calls that the host uses to copy data from host memory to global memory and vice versa. As in the case of shared memory, the user must avoid race conditions when accessing global memory. The same synchronization barrier mechanism mentioned for shared memory is used to prevent race conditions when accessing global memory.

While all the types of memory spaces described above are read/write memories, a GPU device is equipped with read-only memory spaces as well. These memory spaces are referred to as constant and texture memory. To make use of these memory spaces, the host code has to set their content, which then becomes available to threads during the execution of kernel code. In contrast with the memory spaces discussed before, the constant and texture memory spaces are cached, resulting in higher performance if the same datum is accessed multiple times during the execution of kernel code. Furthermore, because these memory spaces can only be read, no race conditions between threads can arise. The way in which texture memory is accessed is somewhat peculiar. As its name suggests, texture memory has been designed to facilitate rendering of complex surfaces and other tasks that are common in graphics applications and games. This is the reason why texture memory supports features such as linear interpolation and referencing via normalized floating-point coordinates. Further details about constant and texture memory spaces are available in the CUDA documentation.7
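A minimal sketch of how shared memory and barrier synchronization work together is given below. The kernel is purely illustrative (it is not taken from the applications of Sec. 4): each thread loads one element into a shared-memory tile, and a call to __syncthreads() guarantees that the whole tile has been written before any thread reads a neighboring element, thereby avoiding the race condition described above.

    // Illustrative kernel: each block loads a tile of data into shared memory,
    // synchronizes, and then each thread reads its right-hand neighbor.
    #define BLOCK_SIZE 256

    __global__ void neighbor_kernel(const float *in_dev, float *out_dev, int N)
    {
        __shared__ float tile[BLOCK_SIZE];     // shared among threads in the block

        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if(i < N) {
            tile[threadIdx.x] = in_dev[i];     // each thread writes one element
        }
        __syncthreads();                       // barrier: avoid read-before-write races

        // Threads at the block boundary are skipped in this simplified sketch.
        if(i < N - 1 && threadIdx.x < blockDim.x - 1) {
            out_dev[i] = tile[threadIdx.x] + tile[threadIdx.x + 1];
        }
    }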

3. KERNEL LAUNCH AND THREAD EXECUTION

The CUDA programming language provides the __global__ declaration to specify that a function being defined is a kernel. Within the kernel, built-in variables are available to the programmer to determine the grid and block sizes, the block index within the grid, and the thread index within the block. To allow grids and blocks of dimensionality up to 3, the type of these built-in variables is a C struct with three integer fields, denoted, respectively, .x, .y, and .z. The hardware initializes these variables automatically, and blocks and threads are indexed in unit steps starting from 0. When calling a kernel, a particular syntax is used to specify the grid and block sizes and to start parallel thread execution on the GPU device.
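For grids and blocks of dimensionality greater than one, the built-in variables are combined field by field. The fragment below is a generic sketch (the kernel name, matrix layout, and block size are illustrative) of 2D indexing with one thread per matrix element:

    // Illustrative 2D indexing: one thread per matrix element.
    __global__ void scale_matrix_kernel(float *m_dev, int width, int height, float s)
    {
        int col = blockDim.x * blockIdx.x + threadIdx.x;
        int row = blockDim.y * blockIdx.y + threadIdx.y;
        if(col < width && row < height) {
            m_dev[row * width + col] *= s;
        }
    }

    // Host-side launch with 2D blocks and a 2D grid:
    //   dim3 block(16, 16);
    //   dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    //   scale_matrix_kernel<<<grid, block>>>(m_dev, width, height, 2.0f);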


Figure 3 summarizes the basic steps that the programmer needs to carry out in a CUDA application. First of all, the CPU instructs the GPU device to allocate a block of device memory for the input data. This is accomplished via the library function cudaMalloc(...). Data are copied to the device memory via the library function cudaMemcpy(...). Thread execution on the GPU device is started by means of a kernel call. The syntax of this call is of the form my_kernel<<<N, M>>>(...), in which my_kernel is the name of the kernel, N is the grid size, and M is the block size. Parameters to be passed to the kernel are enclosed in the parentheses. As Figure 3 suggests, threads start executing asynchronously with respect to the CPU. This implies that a kernel call such as my_kernel<<<N, M>>>(...) typically returns immediately, and the CPU and GPU can work in parallel. When data need to be copied from the device memory to the host memory, execution on the CPU is suspended until all of the threads on the GPU have terminated. Once the CPU has copied the results from the device memory to the host memory, the portion of device memory previously allocated is no longer needed and can be released via cudaFree(...).

Figure 3. Workflow of a CUDA application: cudaMalloc(...), cudaMemcpy(...), kernel launch, cudaMemcpy(...), and cudaFree(...) calls issued by the CPU, with thread execution on the GPU

As an example, we consider a kernel that takes as input two arrays of K single-precision floating-point numbers and returns a third array that is the component-by-component sum of the two input arrays. An implementation of this kernel is:

    __global__ void add_kernel(float *a_dev, float *b_dev, float *c_dev, int K)
    {
        int k = blockDim.x * blockIdx.x + threadIdx.x;
        if(k < K) {
            c_dev[k] = a_dev[k] + b_dev[k];
        }
        return;
    }

The arrays are passed to the kernel as pointers to device memory (previously allocated). Within the kernel, built-in variables are used to calculate a number, k, which denotes the position of the elements within the input arrays. In our implementation we have decided to use 1D grids and 1D blocks, so only the .x fields of the built-in variables blockDim, blockIdx, and threadIdx are of interest. Because the values of blockIdx.x and threadIdx.x always start from 0 and are incremented by one for the next block or thread, this way of calculating the index k guarantees that no two threads will use the same value of k. Furthermore, if the grid is large enough, all the elements of the output array will be calculated.
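A host-side driver for add_kernel, following the workflow of Figure 3, might look like the sketch below (the block size of 256 is an illustrative choice, and error checking is omitted):

    // Illustrative host code: allocate, copy, launch add_kernel, copy back, free.
    void add_on_gpu(const float *a_host, const float *b_host, float *c_host, int K)
    {
        float *a_dev, *b_dev, *c_dev;
        size_t bytes = K * sizeof(float);

        cudaMalloc((void **) &a_dev, bytes);
        cudaMalloc((void **) &b_dev, bytes);
        cudaMalloc((void **) &c_dev, bytes);

        cudaMemcpy(a_dev, a_host, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(b_dev, b_host, bytes, cudaMemcpyHostToDevice);

        int M = 256;                   // threads per block
        int N = (K + M - 1) / M;       // number of blocks: enough to cover K elements
        add_kernel<<<N, M>>>(a_dev, b_dev, c_dev, K);

        // cudaMemcpy waits for the kernel in the default stream to finish.
        cudaMemcpy(c_host, c_dev, bytes, cudaMemcpyDeviceToHost);

        cudaFree(a_dev);
        cudaFree(b_dev);
        cudaFree(c_dev);
    }

The grid size N is chosen so that at least K threads are launched; the bounds check inside add_kernel discards the few extra threads in the last block.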


Thread blocks are automatically scheduled by the hardware, without the intervention of the operating system or the programmer. In a GPU device, threads have little context, and so their scheduling is extremely efficient: a thread ready for execution can be selected and scheduled at a cost of just a few clock cycles. Because of the large number of cores, hundreds of threads can be executing concurrently. This is a very different architectural design from that of conventional multi-core CPUs, which usually have no more than eight or sixteen cores and do not typically require many threads to keep the hardware busy.

In a GPU device, the order in which threads are scheduled for execution is undefined. Therefore, the programmer cannot rely on any particular order. There are cases, however, in which it is necessary to know whether a thread has reached a particular point in its execution. This can be accomplished via synchronization barriers. In CUDA, when a thread encounters a synchronization barrier, its execution is suspended until all the threads in the same block have reached a synchronization barrier. Synchronization is possible only among the threads in the same block; no synchronization mechanism is provided for threads that belong to different blocks.

A GPU device performs thread scheduling in hardware, with minimal overhead. Furthermore, the hardware is able to detect when a thread is waiting for data to be read from memory. This makes it possible to temporarily suspend the execution of threads that are waiting for data and select for execution threads that already have data available. By scheduling out threads that are waiting for data, it is possible to hide memory-access latencies with actual computation.7 This makes GPU programming very efficient, as data transfers and thread execution are performed concurrently by two separate pieces of circuitry on the GPU device. The programmer does not need to worry about data transfers and thread scheduling, and this improves code readability and reduces the chances of a programming mistake.

The GPU model of parallel thread execution leads to a high degree of scalability.4 With the term scalability, we refer to the ability of a system to be enlarged to accommodate growing amounts of computation or advances in technology. In a GPU device, only the threads in the same block can cooperate with each other by sharing the same portion of shared memory. By design, the maximum number of threads in a thread block is a relatively small number, such as 1024, which is not expected to grow as technology improves. On the other hand, as technology evolves, a larger and larger number of blocks can be run in parallel. Because the programmer cannot make any assumptions about block scheduling and can only use barrier synchronization among threads that belong to the same block, the number of blocks that can be executed in parallel can be increased without the need to redesign old code to avoid performance penalties due to underutilization.
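Because barrier synchronization is confined to a block, computations that require a device-wide synchronization point are commonly split into consecutive kernel launches; the boundary between two launches issued to the same stream then acts as a global barrier. The fragment below is an illustrative sketch of this pattern (kernel names and the computation itself are hypothetical):

    // Phase 1: per-element work, one thread per element.
    __global__ void phase_one_kernel(float *d_dev, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if(i < N) {
            d_dev[i] *= 2.0f;
        }
    }

    // Phase 2: each thread may safely read elements written by any block of
    // phase_one_kernel, because the previous launch has fully completed.
    __global__ void phase_two_kernel(const float *in_dev, float *out_dev, int N)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if(i > 0 && i < N) {
            out_dev[i] = in_dev[i] + in_dev[i - 1];
        }
    }

    // Host code: kernels issued to the same stream execute in order, so the
    // second launch sees all results of the first.
    //   phase_one_kernel<<<num_blocks, threads_per_block>>>(d_dev, N);
    //   phase_two_kernel<<<num_blocks, threads_per_block>>>(d_dev, out_dev, N);
    //   cudaDeviceSynchronize();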

3.1 Dynamic Parallelism in CUDA

GPU devices of compute capability 3.5 and above support so-called dynamic parallelism. Dynamic parallelism refers to the capability of the GPU to generate new work for itself, synchronize the results, and control the scheduling of that work via dedicated hardware, all without involving the CPU.7 Besides improving performance, dynamic parallelism expands the type of algorithms that can be implemented on a GPU device and makes it easier for developers to optimize recursive and/or data-dependent execution patterns by allowing the GPU hardware to orchestrate kernel launches, without the need for host-CPU interaction. In simpler terms, dynamic parallelism enables parent threads to call kernels, thus creating child blocks and threads in a nested way. The parent threads can then utilize the results calculated by the child blocks and threads without CPU involvement. Thus, dynamic parallelism enables dynamic load balancing by generating child blocks and threads according to data-driven decisions or workloads.

To better illustrate the concept of dynamic parallelism, we consider a simple example, depicted in Figure 4. In this example, we use the symbol "<<<...>>>" to denote that a kernel is being launched. The first kernel launch is initiated in the CPU code, in which a grid with three blocks (B1, B2, and B3) is created. As required by the CUDA programming model, each block in this grid is of the same size. In our example, available GPU resources (i.e., GPU processing cores and total amount of shared memory) are very limited, so that only the threads in blocks B1 and B2 can start executing immediately. Dynamic parallelism makes it possible for the last thread in block B2 to generate new blocks, populate them with threads, and start their execution. This is accomplished with the same CUDA formalism we depicted in Figure 3. This new grid (denoted as "Child grid 1" in Figure 4) and the blocks it contains can be of any valid size, and the kernel being executed by the child threads can be any kernel; this includes the possibility for a kernel to call itself recursively. When all the threads in block B1 have completed execution, enough GPU resources become available and the threads in block B3 can be started.


Figure 4. Example of dynamic parallelism in CUDA: a CPU thread launches a parent grid with blocks B1, B2, and B3; threads within the parent grid launch "Child grid 1" (with blocks C1, C2, and C3) and "Child grid 2" (adapted from Ref. 7)

As before, one of the threads in one of these blocks can perform a kernel launch, thus generating a new grid ("Child grid 2") consisting of one single block with four threads. Only when all of the child threads generated by a parent thread have terminated can the parent thread safely continue executing instructions. A thread block is not considered complete until all child grids created by its threads have completed. The CUDA language provides appropriate function calls to implement this type of parent-child synchronization. Parent and child threads have access to the same global memory space, and child threads use global memory to return results to the parent thread. For each child grid launch, there are only two points in time at which global memory as seen by the child threads is consistent with global memory as seen by the parent threads: 1) when the child grid is created and its kernel invoked by the parent; and 2) when the parent uses a synchronization barrier to wait for the child grid to complete. In other words, all the memory operations in the parent thread prior to the child grid's invocation are visible to the child grid, and all memory operations of the child grid are visible to the parent thread after the parent thread has synchronized on the child grid's completion.7
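A minimal sketch of this mechanism is shown below. The kernel names, the single launching thread, and the block size are illustrative choices, not the implementation used in Sec. 4: a parent kernel launches a child grid from device code, sized according to a data-driven work count, and synchronizes on its completion before using the results.

    // Illustrative parent/child kernels using dynamic parallelism
    // (compile with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true).
    __global__ void child_kernel(float *data_dev, int n)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        if(i < n) {
            data_dev[i] = sqrtf(data_dev[i]);   // some per-element work
        }
    }

    __global__ void parent_kernel(float *data_dev, const int *work_size_dev)
    {
        // One thread decides, based on a data-driven work size, how many child
        // threads to create (the block size of 128 is an illustrative choice).
        if(threadIdx.x == 0 && blockIdx.x == 0) {
            int n = work_size_dev[0];
            int threads = 128;
            int blocks = (n + threads - 1) / threads;

            child_kernel<<<blocks, threads>>>(data_dev, n);   // device-side launch
            cudaDeviceSynchronize();   // wait for the child grid to complete

            // Results written by the child grid are now visible to this thread.
            data_dev[0] += 1.0f;
        }
    }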

4. APPLICATIONS IN BIOMEDICAL IMAGING

4.1 Position Estimation from PMT or SiPM Signals

In this first example, we consider an application of GPU computing to gamma-ray imaging. A typical problem in gamma-ray imaging is the estimation of the position at which a gamma-ray photon interacts with a camera's crystal. For each photon that interacts with the camera, a set of photomultiplier tube (PMT) or silicon-photomultiplier (SiPM) outputs is recorded. With a statistical model for the output signals as described in Refs. 9-13, a GPU algorithm is ideal for estimating the position of interaction from PMT or SiPM outputs, since the data from different events are independent. Various methods have been proposed to carry out this estimation step, including those described in Refs. 14-18. The method we present here15,16 is based on maximum-likelihood (ML) estimation; thus, it enjoys all the properties of maximum-likelihood estimates discussed in Ref. 19.

Assume that the vector g represents noisy PMT outputs for a photon-crystal interaction that occurred at location r in the camera's crystal. In our treatment, r = (x, y) will represent a 2D location. It can be argued9 that the K components $g_1, \ldots, g_K$ of g are independent and follow Poisson statistics with parameters (means) $\bar{g}_1(r), \ldots, \bar{g}_K(r)$, respectively.


We will refer to the vector $\bar{g}(r)$ of parameters as the mean detector response function (MDRF) vector for the location of interaction r. As our notation implies, the MDRF vector $\bar{g}(r)$ depends on the location of interaction r within the crystal. The ML estimation problem is mathematically formalized as
$$ \hat{r}_{\mathrm{ML}} = \arg\max_{r_0 \in D} \, \mathrm{pr}(g \mid r_0), $$
in which D denotes the crystal space and $\mathrm{pr}(g \mid r_0)$ is the probability density function of the measured data g conditioned on the assumption that the location of interaction was $r_0$. Equivalently,
$$ \hat{r}_{\mathrm{ML}} = \arg\max_{r_0 \in D} \, \ln \mathrm{pr}(g \mid r_0). \tag{1} $$

In the problem outlined above, r is the parameter we want to estimate and g is the observed data. Thus, our goal is to use g to calculate an estimate $\hat{r}_{\mathrm{ML}}$ of r. Because the K PMT outputs $g_1, \ldots, g_K$ obey Poisson statistics and they are assumed statistically independent, we can write the probability density function $\mathrm{pr}(g \mid r_0)$ in (1) as a probability:
$$ \Pr(g \mid r_0) = \prod_{k=1}^{K} \frac{[\bar{g}_k(r_0)]^{g_k}}{g_k!} \, e^{-\bar{g}_k(r_0)}. $$
If we take the logarithm of $\Pr(g \mid r_0)$, we get
$$ \ln \Pr(g \mid r_0) = \sum_{k=1}^{K} \left\{ g_k \ln[\bar{g}_k(r_0)] - \ln(g_k!) - \bar{g}_k(r_0) \right\}. $$
Inserting this expression in (1) gives:15
$$ \hat{r}_{\mathrm{ML}} = \arg\max_{r_0 \in D} \left[ \sum_{k=1}^{K} \left\{ g_k \ln[\bar{g}_k(r_0)] - \ln(g_k!) - \bar{g}_k(r_0) \right\} \right] = \arg\max_{r_0 \in D} \left[ \sum_{k=1}^{K} \left\{ g_k \ln[\bar{g}_k(r_0)] - \bar{g}_k(r_0) \right\} \right], \tag{2} $$
where the last form was obtained from the previous by discarding the $\ln(g_k!)$ term, which does not depend on $r_0$ and hence does not influence the outcome of the arg max function.

To implement the maximum-likelihood search efficiently, we can take advantage of the fact that $\Pr(g \mid r_0)$ is a smooth function of $r_0$ for fixed g, and we can consider an algorithm that performs multiple iterations to refine the maximum-likelihood estimate $\hat{r}_{\mathrm{ML}}$. This particular approach is well suited for a GPU implementation in which the MDRF data $\bar{g}(r_0)$ (typically tabulated only for a discrete set of points $r_0$) is stored in GPU texture memory. Our GPU implementation of the estimation algorithm uses the same algorithm of Ref. 15, in which $\Pr(g \mid r_0)$ is evaluated for points $r_0$ on a regular grid. The point of the grid that attains the largest value of $\Pr(g \mid r_0)$ is retained and used as the center of another grid, finer than the previous one. As shown in Figure 5, this process is repeated until a fixed number of iterations has been performed.

Texture fetching was used to speed up our implementation: we represented the mean detector response $\bar{g}_k(r)$ as a 2D layered texture7 in which the k-th layer is associated to the k-th PMT and r is the 2D texture coordinate (properly scaled to perform texture fetching correctly). The GPU cubic B-spline interpolation library20 was also used. Texture zero-padding, automatically performed by the hardware, provided an extra bonus: we no longer had to explicitly deal with boundary conditions at the detector edges, and that resulted in more elegant and much faster code. Another benefit of textures is that the grid of points can be contracted by any factor α throughout the iterations. This improves flexibility, as we can adapt the contracting factor to the "smoothness" of the likelihood function (the smoother the likelihood function is, the larger α can be).
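To make Eq. (2) concrete, a device-side sketch of the log-likelihood evaluation at one candidate grid point is given below. For simplicity, the MDRF is looked up in a plain device array rather than through the layered-texture and B-spline machinery described above, and the array layout is an assumption of this sketch.

    // Sketch: evaluate the log-likelihood of Eq. (2) at one candidate point.
    // mdrf_dev is assumed to store the MDRF on an Nx-by-Ny grid, one layer per
    // PMT, with layout mdrf_dev[k * Nx * Ny + iy * Nx + ix].
    __device__ float log_likelihood(const float *g, const float *mdrf_dev,
                                    int ix, int iy, int Nx, int Ny, int K)
    {
        float L = 0.0f;
        for(int k = 0; k < K; k++) {
            float gbar = fmaxf(mdrf_dev[k * Nx * Ny + iy * Nx + ix], 1.0e-30f);
            L += g[k] * logf(gbar) - gbar;    // the ln(g_k!) term is dropped (Eq. 2)
        }
        return L;
    }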


Figure 5. Illustration of the contracting-grid algorithm over six iterations, shown in panels (a) through (f) (adapted from Refs. 8, 15)

The application we consider in this section also provides a good example of the benefits of shared memory and of the advantages of using a block and/or grid size and dimensionality that best fit the data and the algorithm we want to develop. In our implementation, we used a 1D grid of size J × 1 × 1, so that each PMT data vector $g_1, \ldots, g_J$ (corresponding to J distinct interaction events) can be indexed via the built-in variable blockIdx.x. In other words, each block of threads was assigned to one and only one PMT data vector $g_j$, for j = 1, ..., J. We further used shared memory to load PMT data vector $g_j$ once and share it with all the threads in the same block. This requires a synchronization mechanism among the threads in the same block to ensure that no thread starts accessing the shared memory before the PMT data vector has been copied to shared memory. The size and dimensionality of the blocks were chosen to match the size of the contracting grid used in the maximum-likelihood search. This makes it possible to use the built-in variables threadIdx.x and threadIdx.y to quickly and conveniently calculate the point of the contracting grid at which to evaluate the likelihood. Thus, the number of points in the contracting grid can be changed by simply making the blocks larger or smaller. Finally, after the point of the contracting grid at which the likelihood attains its maximum has been found, this information is shared among all the threads in the block via shared variables (a structural sketch of this organization is given after Table 1).

We compared our GPU implementation of the algorithm discussed above with a CPU implementation of the same algorithm. Because of the lack of support for textures on conventional CPUs, extra code had to be developed to emulate the behavior of texture fetching on a CPU. For our tests, we considered a gamma-ray camera equipped with K = 9 PMTs, and the MDRF data $\bar{g}_k(x_p, y_q)$ consisted of a collection of nine 153 × 153 grids of floating-point values. In our tests, we used a list of J = 10^5 noisy PMT data vectors $g_1, \ldots, g_J$ obtained during an 18F-NaF bone scan of a mouse. In our algorithm, we used a contracting grid of 8 × 8 points. After each iteration, the grid was contracted by a factor α = 2.50. Finally, the algorithm performed M = 6 iterations for the calculation of each estimate $\hat{r}_{\mathrm{ML},j}$, with j = 1, ..., J. Performance results are reported in Table 1. Our GPU implementation outperforms our CPU implementation by a factor ranging from about 150 to more than 300.

    Hardware platform                     Events/s      Speedup
    ------------------------------------------------------------
    Intel Xeon CPU E5540, 2.53 GHz          7679.97         --
    NVIDIA Tesla C2075                   1168121.50      152.10
    NVIDIA Tesla K20m                    2449966.25      319.01

Table 1. Performance results for the 2D ML estimation algorithm
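The structural sketch below illustrates the block-per-event organization described above. It is a simplified outline rather than the production code: the MDRF lookup is abstracted behind the log_likelihood() helper sketched earlier, candidate coordinates are expressed directly in MDRF grid (pixel) units, and the array names are assumptions.

    // Structural sketch of the contracting-grid search: one block per event,
    // one thread per contracting-grid point (an 8 x 8 block, i.e. 64 threads).
    #define K_PMT    9
    #define NUM_ITER 6

    __global__ void contracting_grid_kernel(const float *g_all_dev, float *r_hat_dev,
                                            const float *mdrf_dev, int Nx, int Ny,
                                            float x0, float y0, float half_span,
                                            float alpha)
    {
        __shared__ float g_sh[K_PMT];          // PMT data vector for this event
        __shared__ float best_L[64];           // per-thread likelihood values
        __shared__ int   best_idx[64];
        __shared__ float cx, cy, span;         // current grid center and half-width

        int j = blockIdx.x;                    // event index
        int t = threadIdx.y * blockDim.x + threadIdx.x;

        // Load the event's PMT data into shared memory once per block.
        if(t < K_PMT) g_sh[t] = g_all_dev[j * K_PMT + t];
        if(t == 0) { cx = x0; cy = y0; span = half_span; }
        __syncthreads();

        for(int it = 0; it < NUM_ITER; it++) {
            // Each thread evaluates the likelihood at its own grid point.
            float x = cx + (threadIdx.x - 0.5f * (blockDim.x - 1)) * (2.0f * span / blockDim.x);
            float y = cy + (threadIdx.y - 0.5f * (blockDim.y - 1)) * (2.0f * span / blockDim.y);
            int ix = min(max(__float2int_rn(x), 0), Nx - 1);
            int iy = min(max(__float2int_rn(y), 0), Ny - 1);
            best_L[t] = log_likelihood(g_sh, mdrf_dev, ix, iy, Nx, Ny, K_PMT);
            best_idx[t] = t;
            __syncthreads();

            // Shared-memory reduction to find the grid point of maximum likelihood
            // (assumes a power-of-two number of threads per block).
            for(int s = blockDim.x * blockDim.y / 2; s > 0; s >>= 1) {
                if(t < s && best_L[t + s] > best_L[t]) {
                    best_L[t] = best_L[t + s];
                    best_idx[t] = best_idx[t + s];
                }
                __syncthreads();
            }

            // Thread 0 re-centers and contracts the grid around the best point.
            if(t == 0) {
                cx += ((best_idx[0] % blockDim.x) - 0.5f * (blockDim.x - 1))
                      * (2.0f * span / blockDim.x);
                cy += ((best_idx[0] / blockDim.x) - 0.5f * (blockDim.y - 1))
                      * (2.0f * span / blockDim.y);
                span /= alpha;
            }
            __syncthreads();
        }

        if(t == 0) {
            r_hat_dev[2 * j]     = cx;   // estimated x
            r_hat_dev[2 * j + 1] = cy;   // estimated y
        }
    }

With this organization, processing J events with an 8 × 8 contracting grid amounts to a launch of the form contracting_grid_kernel<<<J, dim3(8, 8)>>>(...).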


4.2 Image Reconstruction via the MLEM Algorithm

In this section, we provide some implementation details regarding GPU code that has been developed for image reconstruction of FastSPECT II data. FastSPECT II is a SPECT imaging system developed at the Center for Gamma-Ray Imaging (University of Arizona) for small-animal imaging. It consists of sixteen stationary (i.e., non-moving) modular gamma-ray cameras. Each camera has an input face measuring about 120 × 120 mm^2. Dedicated circuitry interfaces the cameras to a computer station for data acquisition of the photomultiplier tube (PMT) signals generated by each camera. The size of the field of view is approximately 42 × 42 × 54 mm^3.

GPU code was developed to speed up the pre-processing of the acquired data and the actual reconstruction via the maximum-likelihood expectation-maximization (MLEM) algorithm. Pre-processing of the data consisted of 2D maximum-likelihood estimation of the position of interaction for each detected gamma-ray photon via the contracting-grid algorithm we discussed in the previous section. Our reconstruction code implements the MLEM algorithm in the following form:21,22
$$ \hat{f}_n^{(k+1)} = \frac{\hat{f}_n^{(k)}}{s_n} \sum_{m=1}^{M} \frac{g_m h_{m,n}}{\sum_{n'=1}^{N} h_{m,n'} \hat{f}_{n'}^{(k)}}, $$
in which $\hat{f}_n^{(k)}$ is the estimated activity for the n-th voxel at the k-th iteration, $s_n$ is the sensitivity for voxel n, $g_m$ is the bin count for the m-th bin (hence m encodes both the camera and the pixel on the camera's face), and $h_{m,n}$ is the (m, n)-th component of the system matrix. A plot of the sensitivity is shown in Figure 6.
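A single MLEM iteration in the form above maps naturally onto two GPU passes: a forward projection over bins followed by a back-projection and multiplicative update over voxels. The sketch below is a simplified, dense-matrix version (the actual code computes $h_{m,n}$ on the fly, as explained below); kernel and array names are illustrative.

    // Forward projection: q_m = sum_n h_{m,n} * f_n, one thread per bin m.
    __global__ void forward_project_kernel(const float *h_dev, const float *f_dev,
                                           float *q_dev, int M, int N)
    {
        int m = blockDim.x * blockIdx.x + threadIdx.x;
        if(m < M) {
            float q = 0.0f;
            for(int n = 0; n < N; n++) q += h_dev[m * N + n] * f_dev[n];
            q_dev[m] = q;
        }
    }

    // Back-projection and multiplicative update, one thread per voxel n:
    // f_n <- (f_n / s_n) * sum_m h_{m,n} * g_m / q_m.
    __global__ void update_kernel(const float *h_dev, const float *g_dev,
                                  const float *q_dev, const float *s_dev,
                                  float *f_dev, int M, int N)
    {
        int n = blockDim.x * blockIdx.x + threadIdx.x;
        if(n < N) {
            float acc = 0.0f;
            for(int m = 0; m < M; m++) {
                if(q_dev[m] > 0.0f) acc += h_dev[m * N + n] * g_dev[m] / q_dev[m];
            }
            f_dev[n] *= acc / s_dev[n];
        }
    }

One MLEM iteration then consists of launching forward_project_kernel followed by update_kernel in the same stream, so that the update sees the completed forward projection.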

Figure 6. Plot of the sensitivity across three planes (transverse, sagittal, and coronal), showing almost uniform sensitivity over a large portion of the system's field of view. The volume for which the sensitivity exceeds 5% of its maximum value is about 56.60 cm^3

Our parallel implementation of the MLEM algorithm takes advantage of the capabilities of modern GPU devices. For example, instead of storing the values of $h_{m,n}$ in memory, we calculate those elements on the fly, as they are needed in the reconstruction algorithm. By moving a point source inside the field of view, we collected point spread function (PSF) data and we performed a Gaussian fit of the PSF. The function we used in the fit has the following form:
$$ h_n(x, y) = p_0 \exp\left\{ -\frac{1}{2(1 - p_5^2)} \left[ \frac{(x - p_1)^2}{p_3^2} + \frac{(y - p_2)^2}{p_4^2} - \frac{2 p_5 (x - p_1)(y - p_2)}{p_3 p_4} \right] \right\}, $$
in which $p_0, \ldots, p_5$ are fitting coefficients. From $p_3$, $p_4$, and $p_5$, the size of a rectangular area centered around the point $(p_1, p_2)$ was calculated to ensure that $h_n(x, y) \approx 0$ if the point (x, y) is outside this box. Because the full width at half maximum (FWHM) of the function $h_n(x, y)$ depends on the values of $p_3$, $p_4$, and $p_5$, and these values might vary greatly for different values of n, dynamic parallelism was used to efficiently evaluate $h_n(x, y)$ only for points uniformly spaced inside this box. Our implementation in CUDA has the following basic structure:


    __global__ void forward_projection_parent_kernel(...)
    {
        ...
        ...
        if((2 * box_size.x + 1) * (2 * box_size.y + 1)
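The parent kernel above uses dynamic parallelism to launch child threads only over the bounding box of each fitted PSF. As a complement, a device function for evaluating the fitted Gaussian at one point, which such a child kernel could call, might look like the following sketch (the coefficient layout in p is an assumption of this sketch):

    // Sketch: evaluate the fitted 2D Gaussian h_n(x, y); p[0..5] is assumed to
    // hold the fitting coefficients p_0, ..., p_5 for the current voxel.
    __device__ float gaussian_psf(float x, float y, const float *p)
    {
        float dx = x - p[1];
        float dy = y - p[2];
        float q  = dx * dx / (p[3] * p[3])
                 + dy * dy / (p[4] * p[4])
                 - 2.0f * p[5] * dx * dy / (p[3] * p[4]);
        return p[0] * expf(-q / (2.0f * (1.0f - p[5] * p[5])));
    }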