BIOMEDICAL ENGINEERING INTERNATIONAL CONFERENCE (BIOMEIC’12) OCTOBER 10-11,2012, TLEMCEN (ALGERIA)


Efficient Exploitation of Heterogeneous Platforms for Vertebra Detection in X-Ray Images

Sidi Ahmed Mahmoudi, Fabian Lecron, Pierre Manneback, Mohammed Benjelloun and Saïd Mahmoudi
University of Mons, Faculty of Engineering, Computer Science Department
20, Place du Parc, 7000 Mons, Belgium
Email: {Sidi.Mahmoudi, Fabian.Lecron, Pierre.Manneback, Mohammed.Benjelloun, Said.Mahmoudi}@umons.ac.be

Abstract—Back problems are often related to an abnormal condition of the spine. In this context, conventional X-ray radiography is the most common modality used in emergency rooms since it is relatively inexpensive and fast. In this paper, we develop a method for detecting and extracting vertebræ in X-ray images. In a medical context, it is crucial to develop efficient applications with a reduced execution time, especially for urgent diagnoses. We therefore propose to accelerate the method by exploiting the high computing power of parallel and heterogeneous (Multi-CPU/Multi-GPU) platforms. Our approach first estimates the complexity of each vertebra detection step in order to select the phases that are well suited to GPU parallelization. These phases are implemented on hybrid platforms that exploit CPUs and GPUs simultaneously and use efficient scheduling strategies. We also overlap data transfers with kernel (GPU function) executions using the CUDA streaming technique on multiple GPUs. Experiments conducted on a set of high-resolution X-ray images show a global speedup ranging from 4 to 28 compared with the CPU implementation.

Index Terms—Heterogeneous architectures, Vertebra detection, Complexity estimation, Scheduling, CUDA streaming

I. INTRODUCTION

Back problems are a recurrent cause of daily life complaints. Many of these problems are the consequence of an abnormal condition of the spine. Medical image analysis makes it possible to determine clinical indices that help spine diagnosis and treatment. In this context, conventional X-ray radiography is the most common modality used in emergency rooms since it is relatively inexpensive and fast. There is therefore a clinical need for automatic algorithms for radiographic image analysis. One way to extract quantitative data is to detect and extract biomedical objects in images; clinical indices can then be computed for medical purposes. In this paper, we are interested in a method for detecting and extracting vertebræ in X-ray images. We propose an efficient solution for vertebra detection with a reduced execution time thanks to the exploitation of multiple CPUs and GPUs.

Graphics processing units (GPUs) are an effective tool for improving performance when processing a single image. Indeed, the result (output image) can be visualized directly from the GPU using graphics libraries for image rendering such as OpenGL [1]. However, our vertebra extraction method is applied to multiple X-ray images, so two additional constraints arise. The first is that many output images cannot be visualized through a single video output, which requires transferring the results from GPU to CPU memory. The second is the high computational cost of processing large sets of noisy medical images.

In order to overcome these constraints, we propose a heterogeneous implementation of vertebra detection that exploits the full computing power of the machine. The proposed implementation is based on a complexity estimation of the steps of our method using different parameters. Based on this estimation, we apply hybrid (Multi-CPU/Multi-GPU) processing to the computationally intensive steps and Multi-CPU processing to the less intensive ones. Within our heterogeneous implementation, we employ efficient scheduling strategies in order to maximize the exploitation of resources. Moreover, we exploit the CUDA streaming technique, which makes it possible to overlap data transfers with kernel executions on multiple GPUs. Thanks to this approach, we accelerate vertebra detection on larger sets of X-ray images.

The remainder of the paper is organized as follows. Related work is described in Section II. Section III presents the CPU implementation of the proposed vertebra extraction method. In Section IV, we describe our heterogeneous implementation of this method, based on the complexity estimation of its steps and using efficient scheduling strategies. Section V presents the results obtained with our approach on large sets of medical images, followed by a comparison between the CPU, GPU and hybrid implementations. Finally, Section VI concludes and proposes further work.

II. RELATED WORK

On the one hand, few works on automatic vertebra extraction from radiographs exist in the literature. Zamora et al. tried to take advantage of the Generalized Hough Transform (GHT) in [2], but they report a segmentation rate of 47% for lumbar vertebræ without providing information about the detection rate.
Recently, Dong and Zheng [3] proposed a method combining the GHT with minimal user intervention (only 2 clicks in the image). A fully automatic approach was developed by Casciaro and Massoptier in [4]. They use a shape constraint characterization by looking for every shape that could be an inter-vertebral disc.

On the other hand, the majority of image processing algorithms contain sections that consist of similar computations over many pixels. This makes these algorithms well suited to acceleration on GPU, which exploits its processing units in parallel. In this context, and for single image processing, Yang et al. [5] implemented several classic image processing algorithms on GPU with CUDA [6]. The OpenVIDIA project [7] implemented different computer vision algorithms running on graphics hardware such as single or multiple graphics processing units. Some GPU works are also dedicated to medical imaging, for new volume rendering algorithms [8] and magnetic resonance (MR) image reconstruction [9].

Furthermore, different works deal with the exploitation of hybrid platforms of multicores and GPUs. OpenCL [10] provides a framework for writing programs that execute across hybrid platforms consisting of both CPUs and GPUs. Ayguadé et al. presented a flexible programming model for multicores [11]. StarPU [12] provides a unified runtime system for heterogeneous multicore architectures that makes it possible to develop effective scheduling strategies.

In previous work, we proposed in [13] a partial GPU implementation of a semi-automatic vertebra detection method, and we developed in [14] a heterogeneous edge detection used for vertebra segmentation. However, efficient exploitation of hybrid platforms requires effective memory management, task scheduling and complexity estimation. Our main contributions are threefold:
• we propose an efficient heterogeneous implementation of the edge detection step using the Deriche-Canny method [15], which provides better noise removal and requires fewer operations than the Canny method [16]. The edge detection step is the most intensive phase of the proposed vertebra extraction method.
• we propose a complexity estimation of the vertebra detection steps using several metrics. Based on this estimation, we apply Multi-CPU processing to the less intensive steps and Multi-CPU/Multi-GPU processing to the highly intensive ones.
• we exploit an effective scheduling strategy across multiple CPUs and GPUs that takes into account the average transfer and computing times of previous tasks (previously processed images). We also overlap data transfers with kernel (GPU function) executions on multiple GPUs using the CUDA streaming technique.

III. CPU IMPLEMENTATION OF VERTEBRA DETECTION

This section describes our sequential framework for detecting and extracting vertebræ in a radiograph. This step is crucial since it can be used to determine clinical indices about the spine. The vertebra detection can also be used to initialize a segmentation process. In this work, we propose to locate the vertebræ by detecting the position of their anterior corners. The framework is composed of four steps (Fig. 1): a contrast-limited adaptive histogram equalization to improve the image contrast, an edge detector, an anterior corner detection and finally the vertebra localization. This framework is described in more detail in [14].

A. Contrast-Limited Adaptive Histogram Equalization

The X-ray images we deal with have very poor contrast. Before any further processing, we need to improve the image quality. A simple histogram equalization has no impact on the radiographs. Therefore, we propose to use a specific method: the Contrast-Limited Adaptive Histogram Equalization [17]. The principle is to divide the image into contextual regions, in each of which the histogram is equalized.

B. Edge Detection

The Canny [16] and Deriche-Canny [15] detectors detect edges in an image by taking advantage of the information given by the intensity gradient. The first step is to reduce the noise by removing isolated pixels. To this end, the image is convolved with a Gaussian filter. Next, the Sobel operator is applied to the resulting image in order to determine the gradient at each pixel. Additional information about the gradient orientation is also computed. Once the gradient has been evaluated for every pixel, only local maxima are retained, since a high gradient intensity indicates a high probability of edge presence. Finally, the last phase performs a hysteresis binarization. High and low thresholds are defined in such a way that, for each pixel, if the gradient intensity is:
• lower than the low threshold, the point is rejected;
• greater than the high threshold, the point is part of the edge;
• between the low and the high thresholds, the point is accepted only if it is connected to an accepted point.

C. Corner Detection

To detect interest points in a radiograph, we base our approach on the geometrical definition of a corner, i.e. a point is considered a corner if it is located at the intersection of two line segments. The idea is to perform a polygonal approximation of the edges. The Canny algorithm provides the edges in the image but only acts on pixel values. In order to carry out the polygonal approximation, we need to define the contours as sets of points. Therefore, a simple contour tracking approach has been developed. The algorithm used for the polygonal approximation is the one proposed by Douglas and Peucker in [18].
This approach is based on the principle that a polyline represented by n points can be reduced to a representation with 2 points if the distance between the line segment joining the extremities of the polyline and the point farthest from this segment is lower than a given threshold.

D. Vertebra Localization

Among the corners detected by the edge polygonal approximation, the vertebra corners need to be distinguished. The first stage of our procedure is to build a statistical model of the spine curvature in order to extract the mean shape. The landmarks of the model are the anterior vertebra corners. In the next step, a user marks the upper anterior corner of the C3 vertebra and the lower anterior corner of the C7 vertebra to define a region of interest (ROI). Then, we align these two particular points with the mean shape of the spine curvature model. Finally, for each landmark, we search for the closest corner detected by the approach based on the edge polygonal approximation.
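
The hysteresis binarization of Section III-B can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the implementation used in this work: the function name `hysteresis`, the use of 8-connectivity, and the handling of values exactly equal to a threshold are all hypothetical choices.

```python
from collections import deque


def hysteresis(grad, low, high):
    """Binarize a 2D grid of gradient magnitudes with two thresholds.

    Pixels above `high` are accepted as edges. Pixels between `low` and
    `high` are accepted only if they are 8-connected (an assumption) to an
    already accepted pixel. Pixels below `low` are rejected.
    """
    rows, cols = len(grad), len(grad[0])
    edges = [[False] * cols for _ in range(rows)]

    # Seed the search with every strong pixel.
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if grad[r][c] > high:
                edges[r][c] = True
                queue.append((r, c))

    # Grow the accepted set through neighboring weak pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                inside = 0 <= nr < rows and 0 <= nc < cols
                if inside and not edges[nr][nc] and grad[nr][nc] >= low:
                    edges[nr][nc] = True
                    queue.append((nr, nc))
    return edges
```

A weak pixel that touches a strong one is kept, while an isolated weak pixel with the same magnitude is discarded, which is what distinguishes hysteresis from a single global threshold.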

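
The polygonal approximation of Section III-C can be illustrated with a short recursive sketch of the Douglas-Peucker principle. It is written under assumptions: the names `point_segment_distance` and `douglas_peucker`, the parameter `epsilon` (the distance threshold), and the sample contour in the usage note are hypothetical and do not come from the implementation used in this work.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment [a, b]."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, epsilon):
    """Simplify a polyline: keep only points farther than epsilon from the chord."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]

    # Find the point farthest from the segment joining the extremities.
    index, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > dmax:
            index, dmax = i, d

    # Below the threshold, the whole polyline collapses to its two extremities.
    if index == 0 or dmax < epsilon:
        return [first, last]

    # Otherwise split at the farthest point and simplify both halves.
    left = douglas_peucker(points[:index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right
```

For example, a slightly noisy L-shaped contour such as `[(0, 0), (1, 0.05), (2, 0), (3, 0.02), (4, 0), (4, 1), (4, 2), (4.03, 3), (4, 4)]` with `epsilon = 0.5` reduces to its two extremities plus the point `(4, 0)`, which sits at the intersection of two line segments and is therefore a corner candidate.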

IV. HETEROGENEOUS IMPLEMENTATION OF VERTEBRA DETECTION

We presented in the previous section a sequential (CPU) implementation that detects vertebræ in several steps. However, this method is hampered by its high computing time, which increases significantly when processing large sets of high-resolution images. In this section, we propose a heterogeneous implementation that exploits the full computing power of the machine in order to accelerate our approach. The proposed implementation is based on three main steps: complexity estimation of each step, hybrid computing, and optimization within heterogeneous platforms.

[...] step of vertebra localization is not analyzed since it presents a high dependency between its phases, which is an important constraint for parallelization.

Vertebra detection       Substeps                 Comp-img/substep   Comp-img/step   f      Estimated complexity
Histogram equalization   Histogram equalization   < 4 HL             < 4 HL          68%    Low
Contours detection       Grad comp                25 HL              [...]
                         Grad direction           HL
                         Grad magnitude           HL
                         Non-max supp             2 HL
                         Hysteresis               [...]

[...]

    [...]                                // Create task
    task->cl = &cl;                      // The codelet
    task->handle = queue->img_handle;    // The buffer
    task->handle.mode = STARPU_RW;       // Read/Write
    starpu_task_submit(task);            // Submit task
    queue = queue->next;                 // Next image
    }

Fig. 2. Heterogeneous computing for vertebra detection

C. Optimization within heterogeneous platforms

For a better exploitation of parallel and heterogeneous platforms, we employ three techniques to improve performance: exploitation of the GPU texture and shared memories, overlapping of data transfers, and efficient scheduling.

1) Exploitation of texture and shared memories: for GPU processing, we load the input image into GPU texture memory for fast access to pixels. The neighbors of each pixel are loaded into GPU shared memory so that pixels can be processed quickly using their neighbors' values [14].

2) Overlapping of data transfers: for GPU processing, we exploit CUDA streams in order to overlap kernel executions with data transfers (between CPU and GPU memories). Each subset of images is processed on its own stream. Each stream consists of three instructions (Fig. 3(a)):
• copy of the image subset from host to GPU memory;
• computations performed by CUDA kernels;
• copy of the output images from GPU to host memory.

In our case, we use four CUDA streams per GPU, so that each GPU can effectively overlap data transfers with kernel executions. The choice of four streams was based on our experimental results: Fig. 3(b) shows the evolution of the GPU computing time of the edge detection method (200 X-ray images) with respect to the number of streams. Four streams provided the fastest solution in our case.
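
The benefit of overlapping can be illustrated with a simple timing model. The sketch below is not CUDA code and not a measurement: it assumes a GPU with one host-to-device copy engine, one compute engine and one device-to-host copy engine, an even split of the images across streams, and hypothetical per-image costs. The function name `makespan` and its parameters are illustrative.

```python
def makespan(n_images, n_streams, t_h2d, t_kernel, t_d2h):
    """Estimated time to process n_images split evenly over n_streams.

    Each stream runs three stages in order (host-to-device copy, kernel,
    device-to-host copy). Stages of different streams may overlap as long
    as they use different engines; each engine serves one stream at a time.
    """
    per_stream = n_images / n_streams
    h2d_free = kernel_free = d2h_free = 0.0  # time at which each engine is free
    for _ in range(n_streams):
        h2d_end = h2d_free + per_stream * t_h2d
        kernel_end = max(h2d_end, kernel_free) + per_stream * t_kernel
        d2h_end = max(kernel_end, d2h_free) + per_stream * t_d2h
        h2d_free, kernel_free, d2h_free = h2d_end, kernel_end, d2h_end
    return d2h_free


# Hypothetical per-image costs (arbitrary time units), 200 images.
serial = makespan(200, 1, t_h2d=1.0, t_kernel=2.0, t_d2h=1.0)      # no overlap
overlapped = makespan(200, 4, t_h2d=1.0, t_kernel=2.0, t_d2h=1.0)  # four streams
```

With these assumed costs the model gives 800 units for a single stream and 500 units for four streams, because copies of one subset proceed while kernels of another are running. The model ignores per-stream launch and synchronization overheads, so it cannot reproduce the optimum at four streams reported in Fig. 3(b).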


Fig. 3. GPU optimization of multiple image processing: (a) multiple image processing within four CUDA streams; (b) GPU acceleration vs. number of CUDA streams.

3) Efficient scheduling: to achieve efficient scheduling, we estimate the duration of each task. We therefore provide the codelets with a performance model (Listing 2, line 6), assuming that performance will be very close for input/output images of similar size. StarPU then records the average time of previous executions on the different processing units and uses it as an estimate. Our performance model is thus based on the history of task execution times. Tasks are submitted asynchronously, and we employ the dmda (deque model data aware) scheduler, which takes both the task execution performance models and the data transfer times into account. It schedules each task where its termination time will be minimal.

V. EXPERIMENTAL RESULTS

We propose to qualitatively evaluate the effectiveness of the vertebra detection. The sequential framework has already been validated in a previous publication. The data used in this work are radiographs focused on the cervical vertebræ C3 to C7. The sample of radiographs comes from a freely available NLM database. These images are digitized versions of X-ray films collected during the years 1976-1980; persons aged 25 through 74 were examined. Fig. 4 gives a visual illustration of a detection applied to a radiograph.
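
The history-based scheduling principle of Section IV-C.3 (assign each task to the processing unit where its estimated termination time is minimal) can be sketched with a toy model. This is not StarPU code and does not reproduce the internals of the dmda scheduler: the class `HistoryScheduler`, its methods, the worker names and all durations below are hypothetical.

```python
from collections import defaultdict


class HistoryScheduler:
    """Toy scheduler using averages of past executions as duration estimates."""

    def __init__(self, workers, default_estimate=1.0):
        self.ready_at = {w: 0.0 for w in workers}  # time at which each worker is free
        self.history = defaultdict(list)           # worker -> measured durations
        self.default_estimate = default_estimate   # used before any measurement

    def record(self, worker, duration):
        """Store the measured duration of a finished task."""
        self.history[worker].append(duration)

    def estimate(self, worker):
        """Average duration of previous executions on this worker."""
        runs = self.history[worker]
        return sum(runs) / len(runs) if runs else self.default_estimate

    def submit(self, transfer_time):
        """Place one task; transfer_time maps worker -> data transfer cost."""
        def finish(worker):
            transfer = transfer_time.get(worker, 0.0)
            return self.ready_at[worker] + transfer + self.estimate(worker)

        best = min(self.ready_at, key=finish)
        self.ready_at[best] = finish(best)
        return best


# Hypothetical history: the GPU is faster per image but pays a transfer cost.
sched = HistoryScheduler(["cpu0", "gpu0"])
sched.record("cpu0", 8.0)
sched.record("gpu0", 1.0)
placements = [sched.submit({"gpu0": 0.5}) for _ in range(6)]
```

In this example the first five tasks go to the GPU; the sixth goes to the CPU, because by then the GPU queue would finish it later than the idle CPU would. This is the kind of load balancing that a termination-time criterion provides.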

On the one hand, the quality of vertebra extraction remains identical since the procedure has not changed; only the architecture and the implementation did. On the other hand, the exploitation of Multi-CPU/Multi-GPU platforms significantly accelerated vertebra detection in X-ray images. This made it possible to apply the proposed approach to large sets of medical images in order to obtain more precise vertebra extraction results.

The accelerations obtained thanks to the proposed optimizations (Section IV-C) are presented in Table II. These optimizations are applied to the edge detection phase, which is implemented heterogeneously. With 4 GPUs and 8 CPUs, the exploitation of texture and shared memories improves performance (speedup of 19.79x) compared with the basic hybrid implementation (16.93x). Multi-CPU/Multi-GPU processing improves further (22.39x) when we employ efficient scheduling of tasks that takes into account the transfer and computing times of previous tasks. Overlapping data transfers with kernel executions on multiple GPUs accelerates the computation further still (28.06x). These results are obtained using a set of 200 X-ray images with a resolution of 1476×1680 each.

Platforms    Basic implementation   SM & TM exploitation   Efficient scheduling   CUDA streaming
1GPU/2CPU    8.43x                  9.26x                  10.33x                 11.65x
2GPU/4CPU    13.11x                 14.20x                 14.96x                 15.81x
4GPU/8CPU    16.93x                 19.79x                 22.39x                 28.06x

TABLE II
OPTIMIZATIONS OF HETEROGENEOUS COMPUTING OF EDGE DETECTION
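
To see how much each optimization contributes on its own, the cumulative speedups of Table II can be converted into step-by-step gains. The short sketch below only performs this arithmetic on the published values; the names `speedups` and `relative_gains` are illustrative.

```python
# Cumulative speedups from Table II, in the order: basic implementation,
# SM & TM exploitation, efficient scheduling, CUDA streaming.
speedups = {
    "1GPU/2CPU": [8.43, 9.26, 10.33, 11.65],
    "2GPU/4CPU": [13.11, 14.20, 14.96, 15.81],
    "4GPU/8CPU": [16.93, 19.79, 22.39, 28.06],
}


def relative_gains(row):
    """Gain of each optimization over the previous column, rounded to 2 decimals."""
    return [round(after / before, 2) for before, after in zip(row, row[1:])]


gains_4gpu = relative_gains(speedups["4GPU/8CPU"])  # [1.17, 1.13, 1.25]
```

On the 4GPU/8CPU platform, each optimization adds between 13% and 25% over the previous configuration, with CUDA streaming giving the largest single step.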

Fig. 4. Qualitative results: (a) original image; (b) detection of the vertebræ.

Table III presents a comparison between the CPU, GPU and heterogeneous implementations of the proposed vertebra detection steps. Note that the GPU and hybrid implementations are applied to the edge detection step (the computationally intensive step) only, while histogram equalization and corner detection (the less intensive steps) are run on the multiple CPUs. Table III shows that parallel and hybrid platforms provide a high acceleration compared with the GPU-only or Multi-CPU solutions.

Vertebra detection steps   1CPU      8CPU                1GPU                4GPU/8CPU
                           T (s)     T (s)    Acc (x)    T (s)    Acc (x)    T (s)    Acc (x)
Histogram equalization     62.10     15.44    4.02       15.44    4.02       15.44    4.02
Edge detection             135.8     39.06    3.48       15.80    8.60       4.84     28.06
Corner detection           46.12     11.51    4.01       11.51    4.01       11.51    4.01
Total time                 244.02    66.01    3.70       42.75    5.71       31.79    7.68

In every parallel configuration, histogram equalization and corner detection run on 8 CPUs; the total of the 1GPU column therefore corresponds to a 1GPU/8CPU platform.

TABLE III
MULTI-CPU/MULTI-GPU BASED VERTEBRA DETECTION USING 200 IMAGES (1476×1680)
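
The accelerations of Table III follow from the measured times as Acc = T(1CPU) / T(platform). The sketch below recomputes the totals and the global speedups from the per-step times. The grouping of the times into three configurations (named here `8CPU`, `1GPU` and `4GPU/8CPU`) is an assumption made when reading the table, and the variable names are illustrative.

```python
# Per-step times in seconds for 200 images, taken from Table III.
t_1cpu = {"histogram": 62.10, "edge": 135.8, "corner": 46.12}
t_parallel = {
    "8CPU":      {"histogram": 15.44, "edge": 39.06, "corner": 11.51},
    "1GPU":      {"histogram": 15.44, "edge": 15.80, "corner": 11.51},
    "4GPU/8CPU": {"histogram": 15.44, "edge": 4.84,  "corner": 11.51},
}


def total(times):
    """Total detection time: the steps run one after the other."""
    return sum(times.values())


def speedup(reference, times):
    """Global acceleration with respect to the reference (1 CPU) total."""
    return total(reference) / total(times)


global_acc = {name: round(speedup(t_1cpu, times), 2)
              for name, times in t_parallel.items()}
# global_acc == {"8CPU": 3.7, "1GPU": 5.71, "4GPU/8CPU": 7.68}
```

The gap between the 28.06x obtained on edge detection alone and the 7.68x obtained on the whole pipeline comes from the two steps that stay on the CPUs: once edge detection takes 4.84 s, histogram equalization and corner detection account for most of the remaining 31.79 s.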

VI. CONCLUSION

In this paper, we presented a method to detect vertebræ in radiographs. The novelty of our approach is to propose an efficient solution for vertebra extraction with a reduced execution time thanks to the exploitation of multiple CPUs and GPUs. The general framework is composed of four steps: a contrast-limited adaptive histogram equalization to improve the image contrast, an edge detector, an anterior corner detection and finally the vertebra localization. We proposed a complexity estimation of each step in order to select the phases that are well suited to parallel (GPU) and heterogeneous implementations. We thus applied Multi-CPU processing to the less intensive steps and Multi-CPU/Multi-GPU processing to the highly intensive steps. We optimized the heterogeneous processing by exploiting the GPU texture and shared memories, and by using an efficient scheduling strategy that takes into account the duration and transfer times of previous tasks. We also exploited the CUDA streaming technique to overlap data transfers with kernel executions. Experiments showed a global speedup ranging from 4 to 28 using heterogeneous processing compared with the CPU versions.

As future work, we plan to develop a fast, fully automatic segmentation approach based on a learning method such as the Support Vector Machine (SVM). We also plan to improve the scheduling strategy by taking into account more parameters (number of operations, dependency factor, etc.) in order to exploit resources better.

REFERENCES

[1] OpenGL, “OpenGL Architecture Review Board: ARB vertex program. Revision 45,” 2004. [Online]. Available: http://oss.sgi.com/projects/ogl-sample/registry/
[2] G. Zamora, H. Sari-Sarraf, and L. R. Long, “Hierarchical segmentation of vertebrae from x-ray images,” in Medical Imaging 2003: Image Processing, vol. 5032. Proceedings of the SPIE, 2003, pp. 631–642.
[3] X. Dong and G. Zheng, “Automated Vertebra Identification from X-Ray Images,” in Image Analysis and Recognition, ser. Lecture Notes in Computer Science, 2010, vol. 6112, pp. 1–9.
[4] S. Casciaro and L. Massoptier, “Automatic Vertebral Morphometry Assessment,” in 28th Annual International Conference of the IEEE Engineering in Medicine & Biology Society, 2007, pp. 5571–5574.
[5] Z. Yang, Y. Zhu, and Y. Pu, “Parallel Image Processing Based on CUDA,” International Conference on Computer Science and Software Engineering. China, pp. 198–201, 2008.
[6] NVIDIA, “NVIDIA CUDA,” 2007. [Online]. Available: http://www.nvidia.com/cuda
[7] J. Fung, S. Mann, and C. Aimone, “OpenVIDIA: Parallel GPU computer vision,” in Proc. of ACM Multimedia, pp. 849–852, 2005.
[8] M. Smelyanskiy, D. Holmes, J. Chhugani, A. Larson, et al., “Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures,” IEEE Transactions on Visualization and Computer Graphics, 15(6), pp. 1563–1570, 2009.
[9] T. Schiwietz, T. Chang, P. Speier, and R. Westermann, “MR image reconstruction using the GPU,” Image-Guided Procedures, and Display. Proceedings of the SPIE, pp. 646–655, 2006.
[10] Khronos-Group, “The Open Standard for Parallel Programming of Heterogeneous Systems,” 2009. [Online]. Available: http://www.khronos.org/opencl
[11] E. Ayguadé, R. M. Badia, F. D. Igual, J. Labarta, R. Mayo, and E. S. Quintana-Orti, “An Extension of the StarSs Programming Model for Platforms with Multiple GPUs,” Proceedings of the 15th International Euro-Par Conference on Parallel Processing. Euro-Par’09, pp. 851–862, 2009.
[12] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, “StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures,” in Concurrency and Computation: Practice and Experience, Euro-Par 2009, best papers issue, pp. 863–874, 2009.
[13] S. A. Mahmoudi, F. Lecron, P. Manneback, M. Benjelloun, and S. Mahmoudi, “GPU-Based Segmentation of Cervical Vertebra in X-Ray Images,” Proceedings of the High-Performance Computing on Complex Environments Workshop, in conjunction with the IEEE International Conference on Cluster Computing, pp. 1–8, 2010.
[14] F. Lecron, S. A. Mahmoudi, M. Benjelloun, S. Mahmoudi, and P. Manneback, “Heterogeneous Computing for Vertebra Detection and Segmentation in X-Ray Images,” International Journal of Biomedical Imaging. Volume 2011, pp. 1–12, 2011.
[15] R. Deriche, “Using Canny’s criteria to derive a recursively implemented optimal edge detector,” Internat. J. Vision, Boston, pp. 167–187, 1987.
[16] J. Canny, “A computational approach to edge detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 8, no. 6, pp. 679–698, 1986.
[17] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. Ter Haar Romeny, J. B. Zimmerman, and K. Zuiderveld, “Adaptive histogram equalization and its variations,” Computer Vision, Graphics, and Image Processing, vol. 39, no. 3, pp. 355–368, Sep. 1987.
[18] D. H. Douglas and T. K. Peucker, “Algorithms for the reduction of the number of points required to represent a digitized line or its caricature,” Cartographica: The International Journal for Geographic Information and Geovisualization, vol. 10, no. 2, pp. 112–122, 1973.
[19] A. Grama, A. Gupta, G. Karypis, and V. Kumar, “Introduction to Parallel Computing,” second ed. Pearson Education Limited, 2003.
