2015 12TH INTERNATIONAL CONFERENCE ON ELECTRICAL ENGINEERING, COMPUTING SCIENCE AND AUTOMATIC CONTROL (CCE), MEXICO CITY, MEXICO
A case study of OpenCL-based parallel programming for low-power remote sensing applications

A. Castro Angulo¹, R. Carrasco Alvarez², J. Ortegón Aguilar³, J. Vázquez Castillo³, O. Palma Marrufo¹, and A. Castillo Atoche¹,*

¹ Department of Mechatronics, Universidad Autónoma de Yucatán, Mérida, Yuc., México
² Department of Electronics, Universidad de Guadalajara, Guadalajara, Jal., México
³ Science and Engineering Division, Universidad de Quintana Roo, Chetumal, Quintana Roo, México
With the advent of high-performance embedded computing (HPEC) systems, many digital processing algorithms are now implemented on special-purpose massively parallel processors. In this paper, a low-power ARM/GPU co-design architecture is addressed using OpenCL-based parallel programming for implementing complex reconstructive signal processing operations. Such operations are accelerated using data-parallel functions on the GPU and the ARM processor, in a HW/SW co-design scheme via OpenCL API calls. Experimental results show the achieved computational performance and the effectiveness of the OpenCL standard, comparing the framework against traditional parallel embedded versions.
Index Terms—HPEC systems, OpenCL, remote sensing.
I. INTRODUCTION

Geospatial applications of remote sensing (RS) image enhancement/reconstruction now form a mature and well-developed research field, presented and detailed in several works ([1]–[5] and the references therein). In particular, many current and future applications of remote sensing in Earth, space, and soon in exploration science will require real-time or near real-time processing capabilities. In recent years, several efforts have been directed towards the incorporation of high-performance embedded computing (HPEC) models into remote sensing applications [2]. The utilization of HPEC systems in RS imaging has become more widespread in recent years [2]–[5]. Initially, the main idea of the computer science community consisted of clustering personal computers (PCs) together to work as a computational grid, which resulted in an attractive solution for remote sensing data processing. This strategy is referred to as cluster computing [9]. However, although parallel computing techniques have been employed in RS applications, a recent trend in the design of HPEC systems for RS data-intensive problems is to employ embedded heterogeneous computing resources [3]–[8]. A common approach in HPEC systems is the use of Field Programmable Gate Arrays (FPGAs); however, programming such architectures is a challenging task with respect to the FPGA design criteria, i.e., speed or area resource usage in terms of BRAM, FF, LUT, DSP, etc. On the other hand, the emergence of GPUs has allowed these systems to evolve from expensive application-specific units into highly parallel and programmable commodity components.

Manuscript received September 18, 2015. * Corresponding author: A. Castillo Atoche (email: [email protected]). 978-1-4673-7839-0/15/$31.00 © 2015 IEEE

The latest-generation GPU architectures from NVIDIA, the Tesla and Fermi series, now offer cards able to deliver up to 515 Gigaflops of
double-precision peak performance, which is several times the performance of the fastest quad-core processor available. However, in terms of power consumption for RS processing, the GPU alone is not the best option for HPEC solutions. In this regard, due to the very hard restrictions on energy consumption typical of HPEC systems, the scientific challenge consists of implementing parallel computing techniques in low-power embedded systems with the capability to perform computationally expensive operations for remote sensing applications. OpenCL provides an industry standard for parallel programming of heterogeneous computing platforms, and it is designed to meet the requirement of exposing the compute capability of devices such as an ARM SoC with a general-purpose GPU. Examples such as the Qualcomm Adreno, the ARM Mali, the NVIDIA GeForce ULP, Vivante's ScalarMorphic, and the Imagination Technologies PowerVR are available architectures of low-power ARM/GPU systems. In this study, an RS case study of multispectral image reconstruction using a low-power device based on a GPU paired with an ARM processor is presented. The main contribution of this work is the OpenCL parallel programming approach for the implementation of the complex reconstructive operations of a multispectral image reconstruction algorithm. The co-design methodology for implementing the computationally intensive framework on the ARM/GPU embedded system is also presented, together with a comparative processing-time and qualitative reconstruction analysis. Alternative RS implementations related to regularization-based techniques and hyperspectral/multispectral imaging have been developed in [4], [5] to achieve the near real-time implementation of geospatial applications. For example, an exhaustive comparison of FPGA and GPU platforms has recently been presented in [9]. Also, the GPU implementation of linear prediction for lossless compression of ultraspectral sounder data was described in [10].
On the other hand, the
emergence of small-size and relatively low-cost specialized hardware devices such as FPGAs now represents a real-time reconfigurable solution for HPEC in RS systems. In [6]–[8], hardware-level architectures based on a systolic array structure were previously implemented as co-processor units. In [6], 15.2 GFLOPS are reached with a Nexus 4 processor, which has an Adreno 320 GPU, using the OpenCL platform. In [7] and [8], implementations of frameworks for processing biomedical and medical images, respectively, using the OpenCL platform for GPU programming are presented. However, in the context of the design of a low-power HPEC system oriented towards real-time on-board multispectral image reconstruction, the addressed approach develops a new RS application-specific design of the WCLS technique using an OpenCL-based ARM/GPU embedded system. Finally, a significant reduction in the computational load is demonstrated, which admits the real-time implementation of large-scale real-world geospatial applications.
II. SUMMARY OF DESCRIPTIVE REGULARIZATION TECHNIQUES

In this section, we present a brief summary of the descriptive regularization method previously developed in [4]. Let us consider the measurement data wavefield u(y) = s(y) + n(y), modeled as a superposition of the echo signals s and additive noise n, which is assumed to be available for observations and recordings within the prescribed time-space observation domain Y ∋ y, where y = (t, p)^T defines the time-space points in the observation domain Y = T × P. The model of the observation wavefield u is specified by the linear stochastic equation of observation (EO) in operator form [4]: u = Sv + n; v ∈ V; u, n ∈ U; S : V → U. Next, we take into account the conventional finite-dimensional vector-form approximation [4] of the continuous-form EO

u = Sv + n,   (1)
where u, n and v define the vectors composed of the coefficients of the finite-dimensional approximation of the fields u, n and v, respectively. The average b = vect{<v_k, v_k*>; k = 1, ..., K} has the statistical meaning of the average power scattering function, traditionally referred to as the spatial spectrum pattern (SSP), where the asterisk indicates the complex conjugate. This SSP is a second-order statistic of the scattered field that represents the brightness reflectivity of the image scene B, represented in a conventional pixel format over the rectangular scene frame of each band of the multispectral image. The RS imaging problem is stated as follows: to find an estimate of the scene pixel-frame image via lexicographical reordering of the SSP vector estimate b̂, reconstructed from the data correlation matrix R_u pre-estimated by some means, e.g., via averaging the correlations over J independent snapshots [4]

R̂_u = aver{u_(j) u_(j)^+} = (1/J) Σ_{j=1}^{J} u_(j) u_(j)^+,   (2)
and by determining the solution operator, which we also refer to as the reconstructive signal operator (SO) W, such that

b̂ = {W R̂_u W}_diag.   (3)

A family of algorithms for estimating the SSP was derived as follows.

A. Constrained least squares (CLS) algorithm

Consider white zero-mean noise in the observations and no preference for any prior model information. The regularization parameter is adjusted as the inverse of the signal-to-noise ratio (SNR), and W is recognized to be the reconstructive signal operator

W_CLS = W^(1) = (S^T S + αI)^(-1) S^T,   (4)

where α is equal to N_0/b_0, b_0 is the prior average gray level of the SSP, and N_0 is the noise intensity.

B. Weighted constrained least squares (WCLS) algorithm

Let us consider the case of arbitrary zero-mean noise and the weight matrices M_v, M_u, which generate fading functions that help the cost function converge more quickly. In this case, the SO becomes the robust WCLS operator

W_WCLS = W^(2) = (S^T M_u S + αM_v)^(-1) S^T.   (5)
Now, we are ready to proceed with the hardware implementation in the OpenCL-based ARM/GPU embedded system.

III. HW/SW CO-DESIGN IMPLEMENTATION: AN OPENCL APPROACH

The HW/SW co-design is a hybrid method aimed at increasing the flexibility of the implementation and improving the overall design process. In this section, a specific HW/SW architecture for the designed CLS and WCLS algorithms is presented. Figure 1 illustrates the algorithmic design flow. The block units are designed to speed up the reconstructive signal processing operations of the algorithms so as to meet the real-time imaging system requirements. From the analysis of Figure 1, the system-level partitioning functions are specified for the HW/SW co-design, aimed at the definition of the computational tasks that can be implemented in parallel form. The design methodology is the following: in order to solve the linear inverse problem for the given model (1), involving statistical information about the signal v, the data observations u, and the additive white Gaussian noise (AWGN) n, a solution operator W : U → V is derived, producing an optimal estimate v̂ = Wu. In this regard, the SFO S is first constructed to proceed with the descriptive regularization technique of the WCLS method, together with the weight matrices M_v and M_u, which provide additional knowledge about the problem in the corresponding signal spaces V and U, respectively. Next, the S^T M_u S and αM_v products, related to the matched filtering and the control of a priori model statistics, are implemented in parallel; finally, a matrix inversion based on the LU decomposition is employed to solve the robust approximation of the statistically optimal WCLS method.
It is important to remark that (8) depends only on a priori values. In order to estimate the value of ρ, it is assumed that the processing time of both the ARM and the GPU is directly proportional to the number of pixels. In other words, ρ is given by the ratio between the times required to process the same image on the ARM and GPU cores. Figure 2 shows the image partition strategy used in this work.
Fig. 2. ARM/GPU partition strategy. GPUs 1 through 6 are the cores of the Mali-T628 MP6, and the CPU is an Exynos with a Cortex-A15 1.8 GHz quad core and a Cortex-A7 quad core.
Algorithm 1. WCLS estimation
Fig. 1. WCLS design flow approach.
A. Spatial-spectral partitioning

Let us consider a multispectral image of x × y pixels with z bands, and an embedded development platform of n GPUs connected to an ARM of m cores, where m = nτ ∀ τ ∈ N. The image ratio to be processed by one GPU (γ) is represented by

γ = (1 − mβ)/n,   (6)
where β ∈ [0, 1] represents the image ratio to be processed by one ARM core. The value of β is chosen so that the time required by the ARM cores to process their portion of the image equals the time required by the GPUs to process theirs. Also, ρ is defined as

ρ = γ/β,   (7)
which represents the proportion of pixels processed by the GPU with respect to the ARM. Now, substituting (6) in (7) and solving for β, the image ratio to be processed by the ARM core is defined as

β = 1/(nρ + m).   (8)
#define THREADS_PER_BLOCK 16

/* Triangular-solve stage of the LU-based matrix inversion. Each
   work-group computes one column h of the inverse: A holds the LU
   factors (L with unit diagonal), and I holds the right-hand-side
   columns, overwritten with the columns of the inverse. */
__kernel void W_estimator(int n, __global float *A, __global float *I)
{
    int h = get_group_id(0);   /* column of the inverse */
    int k = get_local_id(0);   /* row handled by this work-item */

    /* forward substitution with L */
    for (int j = 0; j < n; j++)
    {
        if (k > j)
            I[k*n + h] -= A[k*n + j] * I[j*n + h];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    /* back substitution with U */
    for (int j = n - 1; j >= 0; j--)
    {
        if (k == j)
            I[k*n + h] /= A[k*n + k];
        barrier(CLK_GLOBAL_MEM_FENCE);
        if (k < j)
            I[k*n + h] -= A[k*n + j] * I[j*n + h];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}