Journal of
Imaging Article
GPU Accelerated Image Processing in CCD-Based Neutron Imaging Jonathan Schock 1, * , Michael Schulz 2 1 2 3
*
and Franz Pfeiffer 1,3
Chair of Biomedical Physics, Department of Physics and Munich School of BioEngineering, Technical University of Munich, 85748 Garching, Germany Heinz Maier-Leibnitz Zentrum, Technical University of Munich, 85748 Garching, Germany;
[email protected] Department of Diagnostic and Interventional Radiology, Klinikum rechts der Isar, Technical University of Munich, 81675 München, Germany;
[email protected] Correspondence:
[email protected]; Tel.: +49-89-289-10846
Received: 17 July 2018; Accepted: 9 August 2018; Published: 21 August 2018
Abstract: Image processing is an important step in every imaging path in the scientific community. Especially in neutron imaging, image processing is very important to correct for image artefacts that arise from low light and high noise statistics. Due to the low global availability of neutron sources suitable for imaging, the development of these algorithms is not in the main scope of research work and once established, algorithms are not revisited for a long time and therefore not optimized for high throughput. This work shows the possible speed gain that arises from the usage of heterogeneous computing platforms in image processing along the example of an established adaptive noise reduction algorithm. Keywords: neutron imaging; image processing; OpenCL; parallel processing; noise reduction
1. Introduction Non-destructive testing is a term comprising several important techniques to probe specimen for certain characteristics without the need to physically alter them and has a vast field of applications. One area inside these non-destructive testing methods is radiographic as well as tomographic imaging of samples with e.g., X-rays or neutrons. In those imaging techniques, a sample is placed inside an incident beam. The characteristics of the beam are changed by the introduction of the sample and this change can be recorded by suitable detectors such as scintillation counters or cameras to gain information about the sample. In neutron imaging, the intensity of the incident beam is decreased inside the sample by several different interactions such as absorption, coherent and incoherent nuclear scattering and magnetic dipole interaction. As the interaction cross-sections depend on the composition of the atomic nucleus, they vary wildly over the periodic table and can even have significant differences for different isotopes of the same element as can be seen in Table 1.
J. Imaging 2018, 4, 104; doi:10.3390/jimaging4090104
www.mdpi.com/journal/jimaging
J. Imaging 2018, 4, 104
2 of 11
Table 1. Interaction cross-sections of neutrons for some selected elements. All cross-sections are given in barn with 1 b = 10 × 10−24 cm2 . The scattering cross section σsc and the absorption cross section σab are tabulated for v = 2200 m/s (E = 25.30 meV) and taken from [1] and given without the error. Isotope
σsc,N
σab,N
1H 2H 1
82.02 7.64 5.24 1.503 180 1044
0.3326 0.000519 767 0.231 49,700 259,000
5B
13 Al
64 Gd 157 Gd 64
Therefore, the attenuation of a sample is dependent on its material composition and is pictured in a 2D image in projections or in a 3D array of voxels in a tomographic acquisition. Neutrons are mostly produced in research reactors by nuclear fission or in spallation sources and are characterised by their temperature which is linked to their energy via E(meV) = 0.086T (K).
(1)
Mostly thermal and cold neutrons with temperatures around 25 K to 300 K are used for imaging [2]. Compared to X-ray imaging, a typical neutron imaging setup has a lower total flux and the global availability of neutron beam-time is magnitudes lower [3,4]. As the cross-sections of neutron interactions are strongly energy dependent, in general the spectrum of the incident beam has to be well known in order to being able to quantify the results. Therefore, not all neutron sources are equally well suited for imaging applications. This is why it is essential to maximise usable beam-time and to get the most of the acquired images with suitable image processing. Furthermore, a fast image-processing work-flow allows scientist to perform dynamic measuring rows and to directly assure the quality of the acquired images on-site. Especially in medical imaging in the X-ray domain, GPU based image processing is well established and has shown potential for speed increase in several orders of magnitudes [5]. Noise reduction is an important first step of all image pipelines and therefore chosen as an example in this work. GPU-based video denoising has also been demonstrated with speed gains [6]. However, the application to neutron imaging has not been shown. Furthermore, the chosen implementation shows some optimisations that are transferable to a wide range of image-processing tasks not only in neutron imaging but in general to all pixel based imaging techniques. 2. Materials and Methods 2.1. Setup The measurements for this work were carried out at the Advanced Neutron Tomography and Radiography Experimental System (ANTARES) inside the research reactor neutron source facility Forschungs–Neutronenquelle Heinz–Maier–Leibnitz (FRM II). The cold neutron source used in the experiment has a temperature of about 25 K [7]. The shielding hutch of the ANTARES has three separated compartments, of which the first is used for beam- and spectrum shaping devices, amongst them being a mechanical velocity selector, a double crystal monochromator or interference gratings that can be used for grating based phase contrast imaging [8]. The detection system consists of a shielded detector box holding an Andor Ikon L CCD camera by Oxford Instruments, Abingdon, UK, with 2048 pixels in each dimension. A two-mirror beam-path allows the camera to be placed outside the direct neutron beam-path. A schematic drawing of the setup and the detector is shown in Figure 1. Measurements were done with an 18 mm pinhole, resulting in an L/D ratio of 500 at the sample position and an effective pixel size of about 27 µm.
J. Imaging 2018, 4, 104
3 of 11
Figure 1. A schematic sketch of the experimental hutch of the ANTARES setup is shown in (a) with the source-detector distance being approximately 9 m. In (b), the approximately 50 cm wide camera setup with a two-mirror beam-path and the scintillator is shown.
2.2. Noise Removal The images taken with the setup above suffer from noise stemming from various sources. Besides electronic and thermal noise as well as readout noise from the camera, the main noise source in neutron imaging experiments that deteriorate image quality are high energy γ-particles that are produced in nuclear reactions of neutrons inside material in and around the experimental setup. Those high energy particles lead to bright spots when they hit the detector, which are referred to in this work as gamma spots. An example of the resulting artefact of such gamma spots is shown in Figure 2.
800
270
760 260
720 680
250
640
240
600 560
230
520 480
220
440 210 420
430
440
450
460
470
480
490
400
Figure 2. Direct hits of two high energy particles on the detector of an Andor Ultra EMCCD camera by Oxford Instruments, Abingdon, UK, with a pixelsize of 16 µm. The camera is rotated 90◦ counter-clockwise; the readout direction of the camera is on the right side of the image.
J. Imaging 2018, 4, 104
4 of 11
Two bright spots with a bright surrounding can be seen in this image. These spots results from an excess charge production on the CCD sensor, that can no longer be contained in the potential well of one single pixel. Furthermore a streak in the readout direction of the camera can be seen emerging from the primary interaction position. This is due to the fact, that charge transfers from pixel to pixel in the readout direction are facilitated due to the inner workings of a CCD camera. This streak therefore is more prominent at higher readout frequencies and lower potential well depths of the pixels. The problem with these gamma spots is, that a global correction such as Gaussian blurring or taking the median of the affected area decreases the resolution of the image due to the averaging nature of these filters. Adaptive filters, such as described by [9] circumvent this problem by adapting the size of the filter kernel to the strength of noise in the image. Their algorithm will be shortly summarized in the following. A Laplacian of Gaussian (LoG) filter is a well-known edge finding algorithm and is a convolution with the LoG kernel, that can be written in the discrete form h ( n1 , n2 ) =
(n21 + n22 − 2σ2 )hg (n1 , n2 ) , 2πσ6 ∑ ∑ hg
(2)
n1 n2
with the abbreviation
2
2
hg (n1 , n2 ) = e−(n1 +n2 )/(2σ
2)
(3)
where n1 and n2 are denoting the position of the kernel element in each dimension and σ is the standard deviation. The resulting edge-image shows high intensities at pixels, where there is a fast change in image intensity in the raw image. As higher noise levels lead to sharper edges, the edge-image can be thresholded in several levels, to adjust the size of a subsequent median kernel. In that way, the median filter only affects areas, in which noise above a certain threshold is present and can be tuned to account for a larger neighbourhood when the gamma spots affect a larger area. To account for very bright and big spots, the threshold image for the median with a kernel size of 7 can additionally be binary-dilated. 2.3. Sample To allow for reproducible results, three dimensional test arrays with a discrete uniform random gray value distribution in the open interval [0, 1000) were created with a size of 1024 pixels in each dimension to simulate an image stack. These images where used to profile both the running time of the benchmark implementation as well as our optimised parallel implementation. The demonstration of the noise-removal abilities of the presented implementation is shown on images of polymer electrolyte membrane fuel cells (PEMFCs) acquired at the ANTARES and explained in more detail in [10]. 2.4. Implementation A reference implementation of the described algorithm was done in Python 2.7 to allow for detailed profiling and identification of the most timely image processing steps needed which have been identified to be the convolution with the precomputed LoG kernel and the median filtering. In general, a convolution in the image domain can be expressed as a multiplication in the Fourier domain and the needed algorithms to do a Fast Fourier Transform (FFT) are well known and constantly improved [11–13]. However, for small kernels convolution is faster and has less memory overhead than FFT [14]. Furthermore, there is no convolution kernel that defines the median, as the median is based on order statistics. Therefore a pixel-wise implementation is chosen both for the reference and the optimized version. The optimised implementation is done in OpenCL with the Python API PyOpenCL [15]. OpenCL was chosen because of it being an open royalty-free standard that is portable across a lot of different computing architectures such as CPUs and GPUs as well as other programmable hardware such as FPGAs and DSPs [16]. OpenCL implements an image object, which has a dedicated underlying memory model that allows the use of samplers for reading. As these samplers allow for automatic bounds-checking, the algorithm gets a lot simpler as manual bounds-checking has to be
J. Imaging 2018, 4, 104
5 of 11
done no longer. Further more, the image model allows caching of pixel values, so that subsequent access to pixels in the neighbourhood of a lately read-out pixel are accelerated. Memory transfer from host to device is done with PyOpenCL and each transfer as well as each kernel enqueueing is enriched with an OpenCL event that allows to extract exact timing information for memory transfers as well as kernel executions on the device. The source code of the OpenCL kernels is shown in Appendix A. 3. Results and Discussion 3.1. Profiling The profiling of the implementation in python shows that the running time of the convolution and the median scales linearly with the pixel count of the input array as can be seen in Figure 3. Furthermore, the median additionally scales with the footprint or region size. 400 Median 3 Median 5 Median 7 Convolution
350
running time (s)
300 250 200 150 100 50 0 0
32 64
128
200
z - size of input array
400
512
Figure 3. Running time of the convolution and the median compared with the z-size of the three-dimensional array [z, 1024, 1024] on which they are acting. All implementations are single-threaded and scale approximately linear. The median runs are two-dimensional median filters with the size being the x and y dimensions of the lookup-window. The convolution is done with the LoG Kernel with a size of 9 × 9 and σ = 0.8. Profiling implementations are done with scipy.ndimage [17].
From this measurements, the time needed for an 2048 × 2048 pixel image can be extrapolated. To get an estimate, for the running time in a real application, the running time of the median kernels has to be multiplied with the noise density in each threshold bin, as the median only has to be executed on these values, which reduces running time. This is not the case for the convolution. Li et al. mention noise densities of 13.35%, 1.68% and 0.33% for the corresponding median sizes of 3, 5 and 7 [9]. With these noise densities, the running time of the complete algorithm is dominated by the time taken in the convolution step and is around 1 s for one image compared to 5 s mentioned in the same source. 3.2. Optimisation In the OpenCL implementation, the input data, namely the array holding the LoG coefficients and the raw input images, has to be transferred from host to device. Subsequently the convolution,
J. Imaging 2018, 4, 104
6 of 11
thresholding, dilation and the adaptive median is run on the device. The timing of these steps on the same random images used before is done on one Tesla K10 G2 by Nvidia, Santa Clara, CA, USA, and shown in Figure 4.
Queue2
Coefficients 0 ms
gaussian_laplacian 228 ms
threshold_dilate 6 ms
Input_copy 108 ms
Queue1
0.0
adaptive_median 133 ms
threshold 6 ms
0.1
0.2 0.3 Running time (s)
0.4
Figure 4. Running time of an optimised version of the LoG adaptive median filter. Queue 1 lists all input timings, Queue 2 shows all numeric calculations. Each coloured block is one separate OpenCL kernel. Shown is the timing for one complete run of the algorithm on a 3D array with a size of 64 × 1024 × 1024 random values between 0 and 1000 and thresholds at 50, 100 and 400 for median sizes of 3, 5 and 7.
This timing suggests a possible speed increase of 160 times compared to the original implementation. Transfer time back to the host is not included in the calculation as it strongly depends on the speed of the desired output storage but in general should be in the order of the time needed for the input. The time needed in this implementation is not dependent on the noise density or the threshold levels as the kernel is run on every input pixel in parallel as long as the input can be mapped to the available number of work items on the GPU device. If the size of input data is increased, the running time increases discretely with each new memory block. As an example, an image stack with a z-size between 65 and 128 would take double the time as the shown timing. The running time can further be decreased by interleaving the memory operations with the calculation. On modern graphics hardware, the input copy of a subsequent stack can happen while the calculations are done for the previous input. Conversely, the output of the previous result can be done during the calculation of the subsequent array. If implemented that way, the running time for a stack of 640 images is calculated to be around 4 s. As this implementation is highly parallelised, if higher speed is needed the algorithm can be run on several OpenCL devices at parallel. The application of the filter to an in-situ radiography of a fuel-cell is shown in Figure 5.
J. Imaging 2018, 4, 104
7 of 11
(a)
(b)
(c)
Figure 5. One projection of an in-situ neutron radiography of a fuel-cell. Subfigure (a) shows the uncorrected radiography, (c) shows the same radiography after correction. Red boxes indicate the regions shown in the magnification in (b) with the corrected version being below the uncorrected version. The fuel-cell is approximately 2.5 mm wide and the effective pixel size of the measurement was about 27 µm.
This work shows, that efficient parallel programming on GPUs is a suitable technique in image processing for CCD-based neutron imaging as a large range of image processing algorithms can be implemented in that way. If the processing-chain gets longer, the overhead of memory operations from host-device transfers can be more and more neglected against the speed gained from parallel execution. We showed, that even pixel-based algorithms and algorithms which have no formulation as a convolution kernel can be efficiently implemented on GPU to allow beam-line users a real-time feedback regarding image quality. Further work in this direction could be done in the GPU implementation of the complete imaging toolchain from background subtraction and flat-field correction up to implementation of dynamic image statistics such as pixel-wise standard deviations to allow e.g., for real-time corrections of intensity fluctuations of neutron sources. Author Contributions: Software, Data Curation, Visualization and Writing — Original Draft Preparation J.S.; Resources M.S.; Project Administration, Funding Acquisition, review and editing M.S. and F.P.; Supervision F.P. Funding: We acknowledge financial support through the DFG (Gottfried Wilhelm Leibniz program) and the BMBF (project 05K16WO8, NeuRoFast). Acknowledgments: The authors would like to thank Simon Thiele, Riko Moroni and Matthias Breitwieser from IMTEK, Freiburg for the collaboration with the fuel-cell measurements. Conflicts of Interest: The authors declare no conflict of interest. The founding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, and in the decision to publish the results.
J. Imaging 2018, 4, 104
Abbreviations The following abbreviations are used in this manuscript: API CPU DSP FFT FPGA GPU LoG
Application Programming Interface Central Processing Unit Digital Signal Processor Fast Fourier Transform Field Programmable Gate Array Graphics Processing Unit Laplacian of Gaussian
Appendix A. Source Code This section lists the OpenCL kernels developed in this work and used for benchmarking. Appendix A.1. LoG Convolution 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
_ _ k e r n e l void LoG ( __read_only image3d_t d_image , __read_only image3d_t d _ c o e f f i c i e n t s , _ _ w r i t e _ o n l y image3d_t d_out_image , i n t s i z e , / / The d i m e n s i o n o f t h e c o n v o l u t i o n k e r n e l int offset_x , / / o p t i o n a l i f o r i g i n i s not at ( 0 , 0 ) int offset_y ) { / / Get t h e work e l e m e n t : int x_i = get_global_id (0)+ offset_x ; int y_i = get_global_id (1)+ offset_y ; int z = get_global_id ( 2 ) ; / / I n i t i a l i s e values : float gauss_laplacian = 0 . 0 ; f l o a t weight = 0 . 0 ; int get_x = 0 ; i n t get_y = 0 ; i n t o f f s e t = ( s i z e − 1)/2; / / Convolution : f o r ( i n t y = y_i − o f f s e t ; y