Efficient Exploitation of Heterogeneous Platforms for Images Features Extraction

Sidi Ahmed Mahmoudi and Pierre Manneback
University of Mons, Faculty of Engineering
20 Place du Parc, Mons, Belgium
e-mail: {Sidi.Mahmoudi, Pierre.Manneback}@umons.ac.be

Abstract—Image processing algorithms are a necessary tool for various domains related to computer vision, such as video surveillance, medical imaging and pattern recognition. However, these algorithms are hampered by their high consumption of both computing power and memory, which increases significantly when processing large sets of high-resolution images. In this work, we propose a development scheme enabling an efficient exploitation of parallel (GPU) and heterogeneous (Multi-CPU/Multi-GPU) platforms in order to improve the performance of single and multiple image processing algorithms. The proposed scheme allows a full exploitation of hybrid platforms based on efficient scheduling strategies. It also enables overlapping data transfers with kernel executions using the CUDA streaming technique within multiple GPUs. We also present parallel and heterogeneous implementations of several features extraction algorithms, such as edge and corner detection. Experiments have been conducted on a set of high-resolution images, showing a global speedup ranging from 5 to 30 compared with CPU implementations.

Index Terms—Heterogeneous architectures, GPU, Features extraction, Efficient scheduling, CUDA streaming

I. INTRODUCTION

In recent years, CPU clock frequency has been capped, essentially for thermal reasons, below 4 GHz. This limitation has been circumvented by a change of internal architecture, multiplying the number of integrated computing units. This evolution is reflected in both general purpose processors (CPU) and graphic processing units (GPU), as well as in recent accelerated processing units (APU), which combine CPU and GPU on the same chip [1]. Moreover, GPUs have a larger number of computing units, and their computing power has far exceeded that of CPUs. The advent of GPU programming interfaces (API) has thus encouraged many researchers to exploit them for accelerating algorithms initially designed for CPUs. As a result, graphic processors have become an effective way to improve the performance of image processing algorithms. These algorithms are prime candidates for acceleration on GPU, since they consist mainly of a common computation over many pixels that can be executed by the GPU processing units in parallel. This solution is very efficient in the case of single image processing [13], as the output image can be visualized directly from the GPU using graphic libraries for image rendering, such as OpenGL [16]. However, in the case of multiple image processing, two additional constraints arise. The first one is the inability to visualize many output images using only one video output, which requires a transfer of results from GPU to CPU memory.

The second constraint is the high computation intensity due to the treatment of large sets of images. In order to overcome these constraints, we propose a development scheme enabling a full exploitation of parallel (GPU) and heterogeneous architectures, using the runtime system StarPU [2], which makes it possible to exploit and develop efficient scheduling strategies. This scheme also allows overlapping data transfers with kernel executions using the CUDA streaming technique within multiple GPUs. Moreover, we developed parallel and hybrid implementations of several algorithms such as contour and corner detection methods. They are exploited in two applications: the first one is a medical method of vertebra segmentation [11], the second one is an application of research and navigation in image and video databases [18]. The remainder of the paper is organized as follows: related works are described in Section 2. Section 3 presents the proposed development scheme for image processing on GPU, while Section 4 describes our development scheme for single and multiple image processing on heterogeneous architectures. Section 5 presents the experimental results. Finally, Section 6 concludes and presents future works.

II. RELATED WORK

Most image processing algorithms contain sections that consist of similar computations over many pixels. This makes these algorithms well adapted for acceleration on GPU by exploiting its processing units in parallel. In this category, and in the case of single image processing, Yang et al. [20] implemented several classic image processing algorithms on GPU with CUDA [15]. The OpenVIDIA project [6] implemented different computer vision algorithms running on graphic hardware such as single or multiple graphic processing units. Luo et al. developed a GPU implementation of the Canny edge detector [12]. There are also GPU works dedicated to medical imaging, for new volumetric rendering algorithms [19] and magnetic resonance (MR) image reconstruction [17]. Furthermore, different works deal with the exploitation of hybrid platforms of multicores and GPUs. OpenCL [10] proposed a framework for writing programs that execute across hybrid platforms consisting of both CPUs and GPUs. Ayguadé et al. presented a flexible programming model for multicores [3]. StarPU [2] provided a unified runtime system for heterogeneous multicore architectures, enabling the development of effective scheduling strategies.

On the other hand, few works in the literature deal with multiple image processing on parallel (GPU) or hybrid (Multi-CPU/Multi-GPU) architectures. Our contribution focuses on the development of parallel and heterogeneous implementations of features (edges and corners) extraction algorithms that can be applied on single or multiple images. We also propose a complexity evaluation of these methods based on several metrics. Moreover, we propose a development scheme enabling a full exploitation of the computing power of the new parallel and hybrid platforms. The proposed scheme enables an efficient and dynamic scheduling of tasks within multiple CPUs and GPUs, and it exploits the CUDA streaming technique for overlapping data transfers with kernel executions. Another contribution of this work is the exploitation of our scheme for improving the performance of a medical application of vertebra segmentation [11], and of an application of research and navigation in image and video databases [18].

III. PARALLEL IMAGE PROCESSING ON GPU

As described in the previous section, graphic cards present an efficient tool for improving the performance of image processing methods. This section is presented in four parts: the first one describes our development scheme for single and multiple image processing on GPU. The second part presents the employed GPU optimization techniques. The third part describes our GPU implementations of edge and corner detection methods. The fourth part is devoted to analyzing the obtained results and to evaluating the complexity of the implemented algorithms based on different metrics.

A. Development scheme for image processing on GPU

The proposed scheme is based upon CUDA for the parallel constructs and OpenGL for visualization. It consists of three steps: loading of input images, CUDA parallel processing, and results presentation.

1) Loading of input images on GPU: first, the input images are loaded into GPU memory.

2) CUDA parallel processing: this step consists of two phases, threads allocation and CUDA processing (a code sketch is given after this list).
• Threads allocation: before launching the parallel treatments, the number of threads of the GPU grid has to be defined, so that each thread can perform its processing on one pixel or a group of pixels in parallel.
• CUDA processing: the CUDA functions (kernels) are executed N times using the N selected threads.

3) Results presentation: results can be presented using two different scenarios:
• OpenGL visualization: in the case of single image processing, the result (output image) can be visualized directly on screen through the video output of the GPU. Therefore, we propose to exploit the graphic library OpenGL, which enables fast visualization since it works with buffers already residing on the GPU.
• Transfer of results: OpenGL visualization is impossible in the case of multiple image processing using only one video output. In this case, a transfer of output images from GPU to CPU memory is required, which represents an additional cost for the application.
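To make the threads allocation concrete, the following sketch maps a 2D grid of thread blocks onto the pixels of an image, with one thread per pixel. It is a minimal illustration of the scheme above, not the authors' actual code: the kernel body (a simple grayscale inversion), the function names and the 16x16 block size are assumptions made for the example.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel: each thread inverts one grayscale pixel.
__global__ void invertKernel(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // pixel row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

void processOnGpu(unsigned char *h_img, int width, int height)
{
    unsigned char *d_img;
    size_t size = (size_t)width * height;

    cudaMalloc(&d_img, size);
    cudaMemcpy(d_img, h_img, size, cudaMemcpyHostToDevice);   // step 1: load the image on GPU

    // Step 2: threads allocation, one thread per pixel (16x16 threads per block assumed)
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    invertKernel<<<grid, block>>>(d_img, width, height);      // step 2: CUDA processing

    cudaMemcpy(h_img, d_img, size, cudaMemcpyDeviceToHost);   // step 3: transfer of results
    cudaFree(d_img);
}
```

In the single-image case, the final copy can be replaced by OpenGL rendering from a buffer shared with CUDA, as mentioned in step 3 above.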

B. GPU optimization

For a better exploitation of the GPU, we employ different optimization techniques that depend mainly on the type of images (single or multiple) to treat:
• Single image processing: we propose to load the input image into the GPU texture memory for a fast access to pixels. We also load the neighbors of each pixel into the GPU shared memory for a fast processing of pixels using their neighbors' values [13].
• Multiple image processing: we exploit the GPU shared memory for a fast access to the neighbors' values. Moreover, we exploit CUDA streams in order to overlap kernel executions with data transfers to/from the GPU. This enables each subset of images to be treated in its own stream. Each stream consists of three instructions (Fig. 1(a)), as sketched in the code after this list:
1) Copy of an images subset from host to GPU memory
2) Computations performed by CUDA kernels
3) Copy of output images from GPU to host memory
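The following sketch shows one way to organize this overlap with CUDA streams: the image set is split into subsets, and for each subset the host-to-device copy, the kernel launch and the device-to-host copy are issued asynchronously in a dedicated stream. It is a simplified illustration under assumed names (processKernel, contiguous pinned host buffers), not the exact implementation used in this work; pinned (page-locked) host memory is required for the asynchronous copies to actually overlap with computation.

```cuda
#include <cuda_runtime.h>

#define NB_STREAMS 4   // four streams, as used in our experiments

// Hypothetical kernel applied to one subset of images stored contiguously.
__global__ void processKernel(unsigned char *data, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = 255 - data[i];   // placeholder per-pixel operation
}

void processImageSet(unsigned char *h_data /* pinned host memory */,
                     unsigned char *d_data, size_t totalSize)
{
    cudaStream_t streams[NB_STREAMS];
    size_t chunk = totalSize / NB_STREAMS;

    for (int i = 0; i < NB_STREAMS; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < NB_STREAMS; ++i) {
        size_t offset = (size_t)i * chunk;
        // 1) Copy of the images subset from host to GPU memory (asynchronous)
        cudaMemcpyAsync(d_data + offset, h_data + offset, chunk,
                        cudaMemcpyHostToDevice, streams[i]);
        // 2) Computations performed by the CUDA kernel on this subset
        processKernel<<<(unsigned)((chunk + 255) / 256), 256, 0, streams[i]>>>(d_data + offset, chunk);
        // 3) Copy of the output images from GPU back to host memory (asynchronous)
        cudaMemcpyAsync(h_data + offset, d_data + offset, chunk,
                        cudaMemcpyDeviceToHost, streams[i]);
    }

    cudaDeviceSynchronize();
    for (int i = 0; i < NB_STREAMS; ++i)
        cudaStreamDestroy(streams[i]);
}
```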

Fig. 1. GPU optimization of multiple image processing: (a) multiple image processing with four CUDA streams; (b) GPU acceleration vs. number of CUDA streams.

C. GPU implementation of edges and corners detection

Based on the scheme described in Section III.A, we propose GPU implementations of edge and corner detection methods, providing both efficient results in terms of the quality of the detected contours and corners, and improved performance thanks to the exploitation of the GPU computing units in parallel.

1) Edge detection on GPU: we proposed a GPU implementation of the recursive contour detection method using the Deriche technique [5]. The noise truncature immunity and the reduced number of required operations make this method very efficient. However, it is hampered by the important increase of computing time when processing large sets of high-resolution images. Our GPU implementation of this method, based on the parallelization of its four steps on GPU, is described in [13].

2) Corner detection on GPU: we developed a GPU implementation of Bouguet's corner extraction method [4], based on the Harris detector [8]. This method is efficient thanks to its invariance to rotation, scale, brightness, noise, etc. Our GPU implementation of this method, based on the parallelization of its four steps on GPU, is described in [14].

D. Performance analysis and evaluation

A comparison of computing times between the CPU and GPU implementations of edge and corner detection is presented in Table I. Notice that the use of the GPU for single image processing enables a high acceleration (speedup of 18.3), thanks to the exploitation of the GPU computing units and to the fast visualization of results using OpenGL. Moreover, the use of texture and shared GPU memories improves performance further (speedup of 22) compared with the GPU solution using global memory (Table I), an acceleration due to the fast access to pixel values within these memories.

TABLE I
GPU PERFORMANCES OF SINGLE IMAGE PROCESSING: CORNERS + EDGES DETECTION (times in ms; percentages of total GPU time)

GPU GTX 280 (Global Memory):
Images     | CPU time | Load       | Kernels    | OpenGL Vis | Total time | Acc
256x256    | 7.88     | 3.2 (39%)  | 4.1 (50%)  | 0.9 (11%)  | 8.2        | 0.96
512x512    | 45       | 3.4 (39%)  | 4.5 (51%)  | 0.9 (10%)  | 8.8        | 5.11
1024x1024  | 141      | 3.7 (30%)  | 6.3 (52%)  | 2.2 (18%)  | 12.2       | 11.6
2048x2048  | 630      | 5.6 (15%)  | 23.1 (61%) | 9.0 (24%)  | 37.7       | 16.7
3936x3936  | 2499     | 14.4 (11%) | 90.2 (69%) | 27 (21%)   | 131.6      | 18.9

GPU GTX 280 (Texture & Shared Memory):
Images     | CPU time | Load       | Kernels    | OpenGL Vis | Total time | Acc
256x256    | 7.88     | 3.1 (39%)  | 4.0 (50%)  | 0.9 (11%)  | 8.0        | 0.99
512x512    | 45       | 2.5 (35%)  | 3.7 (52%)  | 0.9 (13%)  | 7.1        | 6.34
1024x1024  | 141      | 2.5 (23%)  | 6.1 (56%)  | 2.2 (20%)  | 10.8       | 13.1
2048x2048  | 630      | 4.1 (12%)  | 22 (65%)   | 8.0 (23%)  | 34.1       | 18.5
3936x3936  | 2499     | 8.8 (8%)   | 82.5 (73%) | 22 (19%)   | 113.3      | 22.1

TABLE II
GPU PERFORMANCES OF MULTIPLE IMAGE PROCESSING: CORNERS + EDGES DETECTION (times in s; percentages of total GPU time)

GPU GTX 280 (Texture & Shared Memory):
Images Nbr | CPU time | Load       | Kernels     | OpenGL Vis | Total time | Acc
10         | 6        | 0.08 (11%) | 0.50 (68%)  | 0.15 (21%) | 0.73       | 8.2
50         | 30.3     | 0.3 (8%)   | 2.63 (72%)  | 0.73 (20%) | 3.66       | 8.3
100        | 61.5     | 0.61 (9%)  | 4.84 (70%)  | 1.47 (21%) | 6.92       | 8.9
200        | 136.1    | 1.30 (8%)  | 10.6 (68%)  | 3.70 (24%) | 15.6       | 8.7

GPU GTX 280 (Streaming):
Images Nbr | CPU time | Load       | Kernels     | OpenGL Vis | Total time | Acc
10         | 6        | 0.03 (5%)  | 0.49 (84%)  | 0.06 (10%) | 0.58       | 10.34
50         | 30.3     | 0.12 (4%)  | 2.58 (85%)  | 0.33 (11%) | 3.03       | 9.99
100        | 61.5     | 0.28 (5%)  | 4.79 (85%)  | 0.55 (10%) | 5.62       | 10.95
200        | 136.1    | 0.50 (4%)  | 10.58 (85%) | 1.32 (11%) | 12.4       | 10.98

Notice also that the GPU implementations are slower than the CPU ones when treating low-resolution images, since we cannot benefit enough from the GPU power in this case. Therefore, we propose to estimate the algorithm complexity using five parameters: parallel fraction, computation per pixel, computation per image, shared-to-global memory access ratio and texture-to-global memory access ratio.

a) Parallel fraction (f): Amdahl's law [7] provides an estimation of the theoretical speedup using N processors (see the formula after this list). This law supposes that f is the part of the program that can be parallelized and (1 - f) is the part that cannot be executed in parallel (data transfers, dependent tasks, etc.). Hence, high values of f can provide better performance, and vice versa.

b) Computation per pixel (comp-pixel): graphic processors accelerate image processing algorithms thanks to the exploitation of the GPU computing units in parallel. These accelerations become more significant when the applied treatments are intensive, since the GPU is specialized for highly parallel computation. The number of operations per pixel is therefore a relevant factor to estimate the computation intensity.

c) Computation per image (comp-image): this metric is obtained by multiplying the image resolution by the computation per pixel (described above).

d) Shared-to-global memory access ratio (SM to GM): the use of shared memory offers a faster access (read/write) to pixels by comparison with global memory. Therefore, a high usage of shared memory provides better performance.

e) Texture-to-global memory access ratio (TM to GM): the use of texture memory offers a faster reading (read-only from the GPU) of pixels by comparison with global memory. A high usage of texture memory provides better performance.
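For reference, the speedup predicted by Amdahl's law for N processors, with f the parallel fraction defined above, is the standard formulation from [7]:

S(N) = \frac{1}{(1 - f) + \frac{f}{N}} \;\longrightarrow\; \frac{1}{1 - f} \quad \text{as } N \to \infty

As an illustration of the comp-image metric, with comp-pixel of about 6.1 operations and a 2048x2048 image, comp-image is approximately 6.1 x 2048 x 2048, i.e. about 2.6 x 10^7 operations, which matches the value reported in Table III below.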

Table III presents the measured metrics described above and the corresponding GPU accelerations. The parallel fraction f is the percentage of the parallelizable computing part relative to the total time, while the remaining part (1 - f) corresponds to the transfer (loading, visualization) operations. The computation per pixel is the average number of operations over the steps of contour and corner detection. The shared (or texture) to global memory access ratio is computed by dividing the number of accesses to shared (or texture) memory by the number of accesses to global memory. Table III shows significant accelerations when the metric values are high. Conversely, the use of low-resolution images yields low values of f and low computation per image,

and hence lower performance. The use of shared and texture memories does not significantly improve performance in this case, since the treatments are not intensive enough.

TABLE III
COMPLEXITY EVALUATION OF EDGES AND CORNERS DETECTION ON GPU

Images     | f   | comp-pixel | comp-image  | SM to GM | TM to GM | Acc
256x256    | 57% | 6.1        | 4.0 x 10^5  | 0.37     | 0.25     | 0.99
512x512    | 81% | 6.1        | 1.6 x 10^6  | 0.37     | 0.25     | 6.34
1024x1024  | 86% | 6.1        | 6.4 x 10^6  | 0.37     | 0.25     | 13.1
2048x2048  | 88% | 6.1        | 2.6 x 10^7  | 0.37     | 0.25     | 18.5
3936x3936  | 88% | 6.1        | 9.4 x 10^7  | 0.37     | 0.25     | 22.1

IV. IMAGE PROCESSING ON HETEROGENEOUS ARCHITECTURES

The GPU implementations described above considerably improve performance in the case of single image processing (speedup of 22). However, in the case of multiple image processing, the improvement is smaller (speedup of 10). Therefore, we propose an implementation that effectively exploits the full computing power of heterogeneous architectures and offers a faster solution for multiple image processing. This implementation is based on StarPU [2], which offers a runtime system for heterogeneous multicore platforms. This section is presented in three parts: the first one describes the proposed development scheme for multiple image processing on hybrid architectures. The second part presents the employed scheduling strategy within multiple CPUs and GPUs, while the third part is devoted to the optimization of our solution by exploiting the CUDA streaming technique.

A. Development scheme for multiple image processing

We propose a development scheme for multiple image processing on heterogeneous platforms, following three steps: loading of input images, hybrid processing, and results presentation.

1) Loading of input images: first, we load the input images into queues so that StarPU can apply treatments on these images.

2) Hybrid processing: once the input images are loaded, StarPU can launch the CPU and GPU functions on the heterogeneous processing units. StarPU is based on two main structures: the codelet and the task. The codelet defines the computing units that can be exploited (CPUs and/or GPUs) and the related implementations. Tasks apply this codelet on the set of images, so that each task is created and launched to treat one image in the queue using a CPU or a GPU (a sketch is given after this list).

3) Results presentation: once all tasks are completed, the results must be brought back into CPU buffers. This update is provided by a specific StarPU function that transfers data from GPU to CPU memory in the case of GPU treatments.
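The following sketch illustrates the codelet/task pattern with the StarPU C API. It is an assumption-laden illustration rather than the code of this work: the function names are hypothetical, the field names follow recent StarPU releases and may differ between versions, the images are assumed to be registered as StarPU vector handles (e.g. with starpu_vector_data_register), and the implementations are reduced to stubs. It also declares the history-based performance model used for scheduling (Section IV.B below).

```c
#include <starpu.h>

/* Simplified CPU implementation: the actual edge/corner extraction would go here. */
static void extract_features_cpu(void *buffers[], void *cl_arg)
{
    unsigned char *img = (unsigned char *) STARPU_VECTOR_GET_PTR(buffers[0]);
    size_t n = STARPU_VECTOR_GET_NX(buffers[0]);
    (void) img; (void) n; (void) cl_arg;
}

/* Simplified CUDA implementation: kernels would be launched on the
   stream returned by starpu_cuda_get_local_stream(). */
static void extract_features_cuda(void *buffers[], void *cl_arg)
{
    unsigned char *d_img = (unsigned char *) STARPU_VECTOR_GET_PTR(buffers[0]);
    (void) d_img; (void) cl_arg;
}

/* History-based performance model: StarPU records the execution times of
   previous tasks and uses their average as an estimation (Section IV.B). */
static struct starpu_perfmodel features_model = {
    .type   = STARPU_HISTORY_BASED,
    .symbol = "features_extraction"
};

static struct starpu_codelet features_cl = {
    .cpu_funcs  = { extract_features_cpu },
    .cuda_funcs = { extract_features_cuda },
    .nbuffers   = 1,
    .modes      = { STARPU_RW },
    .model      = &features_model
};

/* One task per image: StarPU decides whether it runs on a CPU core or a GPU. */
static int submit_image(starpu_data_handle_t image_handle)
{
    struct starpu_task *task = starpu_task_create();
    task->cl          = &features_cl;
    task->handles[0]  = image_handle;
    task->synchronous = 0;   /* asynchronous submission */
    return starpu_task_submit(task);
}
```

The scheduling policy discussed in the next part can then be selected at runtime, for instance through the STARPU_SCHED environment variable (e.g. STARPU_SCHED=dmda).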

B. Task scheduling within heterogeneous architectures

To achieve an efficient scheduling, we propose to estimate the duration of each task. Therefore, we provide the codelets with a performance model assuming that, for similar sizes of input/output images, the performance will be very close. StarPU then keeps a record of the average execution times observed on the different processing units, which is used as an estimation. In our case, tasks are submitted asynchronously, and the dmda (deque model data aware) scheduler is employed, which takes both the task execution performance models and the data transfer times into account. It schedules each task where its termination time will be minimal.

C. Optimization within heterogeneous architectures

For a better exploitation of the heterogeneous computing units, we employ the streaming technique within multiple GPUs. Thus, we propose to create four CUDA streams for each GPU, so that each GPU can effectively overlap data transfers with kernel executions. Our hybrid implementation is then composed of two steps only: loading of input images, and hybrid processing with results transfer.

1) Loading of input images: same as in Section IV.A.1.

2) Hybrid processing and results transfer: this step merges the phases of hybrid processing and results transfer (Sections IV.A.2 and IV.A.3) in order to benefit from overlapping data transfers with kernel executions. Here again, we selected four CUDA streams per GPU, since they offer the best performance, as shown in Fig. 1(b). A sketch of the per-GPU stream creation is given below.
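As an illustration of this optimization, the following sketch shows how four CUDA streams can be created on each available GPU; transfers and kernel launches would then be issued in these streams, as in the single-GPU case of Section III.B. This is an assumed structure, not the exact code of this work.

```cuda
#include <cuda_runtime.h>

#define STREAMS_PER_GPU 4

// Create STREAMS_PER_GPU CUDA streams on each detected GPU.
// 'streams' must point to an array of at least numGpus * STREAMS_PER_GPU entries.
void createGpuStreams(cudaStream_t *streams, int *numGpus)
{
    cudaGetDeviceCount(numGpus);
    for (int gpu = 0; gpu < *numGpus; ++gpu) {
        cudaSetDevice(gpu);   // streams are bound to the device that is current at creation time
        for (int s = 0; s < STREAMS_PER_GPU; ++s)
            cudaStreamCreate(&streams[gpu * STREAMS_PER_GPU + s]);
    }
}
```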

Table IV presents a comparison between the CPU, GPU and heterogeneous implementations of the edge and corner detection algorithms, applied on sets of high-resolution images. Notice that the use of multiple CPUs and GPUs accelerates the processing compared with the multi-GPU-only solutions, thanks to the efficient scheduling and memory management, and to the overlapping of data transfers with kernel executions.

TABLE IV
PERFORMANCE OF HETEROGENEOUS COMPUTING OF EDGES AND CORNERS ON MULTIPLE IMAGES (2048x2048), GPU TESLA C1060 (times in s)

Configuration | 10 images      | 50 images      | 100 images     | 200 images
              | T      Acc     | T      Acc     | T      Acc     | T       Acc
1 CPU         | 6.00   -       | 30.30  -       | 61.50  -       | 136.10  -
2 CPU         | 4.54   1.32x   | 19.61  1.54x   | 39.45  1.56x   | 89.19   1.54x
8 CPU         | 1.70   3.53x   | 8.91   3.40x   | 18.00  3.42x   | 40.06   3.40x
1 GPU         | 0.58   10.34x  | 3.03   9.99x   | 5.62   10.95x  | 12.40   10.98x
1 GPU-2 CPU   | 0.57   10.53x  | 3.00   10.09x  | 5.47   11.25x  | 12.01   11.34x
2 GPU         | 0.52   11.54x  | 2.51   12.06x  | 4.95   12.43x  | 10.45   13.03x
2 GPU-4 CPU   | 0.52   11.54x  | 2.32   13.05x  | 4.51   13.65x  | 9.29    14.66x
4 GPU         | 0.49   12.24x  | 1.92   15.77x  | 3.11   19.79x  | 6.24    21.82x
4 GPU-8 CPU   | 0.46   13.04x  | 1.77   17.11x  | 2.93   21.00x  | 5.40    25.21x

Experiments have been conducted under Ubuntu 11.04 on the following hardware:
• CPU: Dual Core 6600, 2.40 GHz, Mem: 2 GB
• GPU: GeForce GTX 280, 240 CUDA cores, Mem: 1 GB
• GPU: Tesla C1060, 240 CUDA cores, Mem: 4 GB

The proposed implementations are summarized in a general scheme for a better exploitation of parallel and heterogeneous architectures (Fig. 2). This scheme makes it possible to treat effectively both single and multiple images. For single image processing, we apply a complexity evaluation (Section III.D) in order to assign sequential treatment to low-intensity methods and parallel (GPU) treatment to highly intensive tasks. On the other hand, hybrid implementations are applied when treating multiple images, by exploiting both CPUs and GPUs.

Fig. 2. Efficient image processing on parallel and heterogeneous platforms.

V. EXPERIMENTAL RESULTS

The presented scheme is exploited in two applications: vertebra segmentation, and research in image and video databases.

A. Heterogeneous computing for vertebra segmentation

The context of this application is the cervical vertebra mobility analysis on X-ray images. The main objective is to detect vertebrae automatically. To deal with this issue, a sequential solution was developed in [13], based on several steps (Fig. 3(a)).

This application is characterized by the large sets of high-resolution images to treat, with a low grey-level variation. The computation time and the noise immunity represent the most important requirements for this application. Based on our scheme, we propose a hybrid implementation of the most intensive steps (contour and corner detection) of the method. On the one hand, the quality of vertebra segmentation remains identical since the process has not changed; only the architecture and the implementation did. On the other hand, the exploitation of heterogeneous platforms (Multi-CPU/Multi-GPU) accelerates the computation thanks to the hybrid implementation of the edge and corner detection steps. This makes it possible to apply the method on large sets of X-ray medical images, and thus to achieve more precision in the vertebra detection results. Fig. 3(b) presents the accelerations obtained with the parallel (GPU) and hybrid (Multi-CPU/Multi-GPU) implementations of the method involving the contour and corner detection steps. These accelerations yield a gain of 50% (1.5 min) with respect to the total time of the application (3 min) using a set of 200 images with a resolution of 1472x1760.

B. Heterogeneous computing for video indexation

The aim of this application is to provide a novel browsing environment for multimedia (image, video) databases. It consists in computing similarities between video sequences, based on extracting features of the images (frames) composing the videos [18]. The main disadvantage of this method is the sharp increase of computing time when enlarging the video sets and resolutions. Based on our scheme, we propose a heterogeneous implementation of the most intensive step of features extraction in this application. This step is the edge detection algorithm, which provides relevant information for detecting motion areas using Hu moments [9]. Fig. 3(c) shows the contours detected (using the hybrid implementation) in the video frames, together with the Hu moments computed from these edges. The hybrid implementation of the contour extraction step accelerates the process of similarity computation between video sequences. Fig. 3(d) presents the accelerations obtained with the parallel and hybrid implementations of the contour detection phase. These accelerations yield a gain of 60% (3 min) with respect to the total time of the application (about 5 min) when treating 800 frames of a video sequence (1080x720).

VI. CONCLUSION

We proposed in this work a development scheme for single and multiple image processing, exploiting effectively parallel (GPU) and heterogeneous (Multi-CPU/Multi-GPU) architectures. Our scheme ensures an efficient scheduling of jobs based on estimating the performance and the required transfer time of each task. It also enables overlapping data transfers with kernel executions within multiple GPUs. This solution was exploited for improving the performance of a medical application of vertebra segmentation, and of an application of research and navigation in video databases. Experiments showed a global speedup ranging from 5 to 30, which is due to two main factors:
• Low-level parallelization: GPU parallel processing between the pixels of an image (intra-image parallel processing).
• High-level parallelization: inter-image parallel processing, enabling the simultaneous exploitation of both CPU and GPU cores, so that each core efficiently treats a subset of images.
As future work, we plan to extend our scheme to a general framework for multimedia (single/multiple images, real-time/offline video) processing on parallel and hybrid platforms. We also plan to improve the scheduling strategy by taking into account more parameters (number of operations, dependency factor, etc.), in order to achieve a better exploitation of resources.

Fig. 3. Heterogeneous computing for vertebra segmentation and video indexation: (a) vertebra segmentation steps; (b) performance of edge and corner detection; (c) hybrid detection of contours (1) and detected Hu moments (2); (d) performance of edge detection.

ACKNOWLEDGMENT

The authors would like to thank the Communauté Française de Belgique for supporting this work under the ARC-OLIMP (AUWB-2008-12-FPMs11) research project. The authors also acknowledge the support of the European COST Action IC805.

REFERENCES

[1] AMD, "The future brought to you by AMD: introducing the AMD APU family," AMD Fusion Family of APUs, 2011. [Online]. Available: http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx/
[2] C. Augonnet et al., "StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures," Euro-Par conference, best papers issue, pp. 863-874, 2009.
[3] E. Ayguadé et al., "An Extension of the StarSs Programming Model for Platforms with Multiple GPUs," Proceedings of the 15th International Euro-Par Conference on Parallel Processing, pp. 851-862, 2009.
[4] J. Y. Bouguet, "Pyramidal Implementation of the Lucas Kanade Feature Tracker," Intel Corporation Microprocessor Research Labs, 2000.
[5] R. Deriche, "Using Canny's criteria to derive a recursively implemented optimal edge detector," International Journal of Computer Vision, Boston, pp. 167-187, 1987.
[6] J. Fung, S. Mann, and C. Aimone, "OpenVIDIA: Parallel GPU computer vision," in Proc. of ACM Multimedia, pp. 849-852, 2005.
[7] A. Grama, A. Gupta, G. Karypis, and V. Kumar, "Introduction to Parallel Computing," 2nd ed., Pearson Education Limited, 2003.
[8] C. Harris and M. Stephens, "A combined corner and edge detector," in The 4th Alvey Vision Conference, vol. 15, pp. 147-151, 1988.
[9] M. K. Hu, "Visual Pattern Recognition by Moment Invariants," IRE Transactions on Information Theory, IT-8, pp. 179-187, 1962.
[10] Khronos Group, "The Open Standard for Parallel Programming of Heterogeneous Systems," 2009. [Online]. Available: http://www.khronos.org/opencl
[11] F. Lecron, S. A. Mahmoudi, M. Benjelloun, S. Mahmoudi, and P. Manneback, "Heterogeneous Computing for Vertebra Detection and Segmentation in X-Ray Images," International Journal of Biomedical Imaging, vol. 2011, pp. 1-12, 2011.
[12] Y. Luo and R. Duraiswami, "Canny Edge Detection on NVIDIA CUDA," Workshop on Computer Vision on GPUs, CVPR, 2008.
[13] S. A. Mahmoudi et al., "GPU-Based Segmentation of Cervical Vertebra in X-Ray Images," IEEE International Conference on Cluster Computing, pp. 1-8, 2010.
[14] S. A. Mahmoudi et al., "Détection optimale des coins et contours dans des bases d'images volumineuses sur architectures multicoeurs hétérogènes," in 20èmes Rencontres Francophones du Parallélisme, France, 2011.
[15] NVIDIA, "CUDA," 2007. [Online]. Available: www.nvidia.com/cuda
[16] OpenGL, "OpenGL Architecture Review Board: ARB vertex program, Revision 45," 2004. [Online]. Available: http://oss.sgi.com/projects/ogl-sample/registry/
[17] T. Schiwietz et al., "MR image reconstruction using the GPU," Image-Guided Procedures, and Display, SPIE conference, pp. 646-655, 2006.
[18] X. Siebert et al., "MediaCycle: Browsing and performing with sound and image libraries," in QPSR of the numediart research program, vol. 2, pp. 19-22, 2009.
[19] M. Smelyanskiy et al., "Mapping high-fidelity volume rendering for medical imaging to CPU, GPU and many-core architectures," IEEE Transactions on Visualization and Computer Graphics, 15(6), pp. 1563-1570, 2009.
[20] Z. Yang et al., "Parallel Image Processing Based on CUDA," International Conference on Computer Science and Software Engineering, China, pp. 198-201, 2008.
