

IEEE JOURNAL OF SELECTED TOPICS IN APPLIED EARTH OBSERVATIONS AND REMOTE SENSING, VOL. 8, NO. 6, JUNE 2015

Massively Parallel GPU Design of Automatic Target Generation Process in Hyperspectral Imagery

Xiaojie Li, Bormin Huang, and Kai Zhao

Abstract—A popular algorithm for hyperspectral image interpretation is the automatic target generation process (ATGP). ATGP creates a set of targets from image data in an unsupervised fashion without prior knowledge. It can be used to search for specific targets in unknown scenes, even when a target's size is smaller than a single pixel. Its application has been demonstrated in many fields including geology, agriculture, and intelligence. However, the algorithm requires a long processing time due to the massive amount of data. To expedite the process, graphics processing units (GPUs) are an attractive alternative to traditional CPU architectures. In this paper, we propose a GPU-based massively parallel version of ATGP, which provides real-time performance for the first time in the literature. The HYDICE image data (307 × 307 pixels and 210 spectral bands) are used as the benchmark. Our optimization efforts on the GPU-based ATGP algorithm using one NVIDIA Tesla K20 GPU, with I/O transfer included, achieve a speedup of 362× with respect to its single-threaded CPU counterpart. We also tested the algorithm on the Airborne Visible/InfraRed Imaging Spectrometer (AVIRIS) WTC dataset (512 × 614 pixels and 224 spectral bands) and the Cuprite dataset (350 × 350 pixels and 188 spectral bands); the speedups were 416× and 320×, respectively, when the target number was 15.

Index Terms—Automatic target generation process (ATGP), CUDA, graphics processing unit (GPU), hyperspectral imagery.

I. INTRODUCTION

HYPERSPECTRAL imaging is concerned with the measurement, analysis, and interpretation of spectra acquired from a given scene at a given distance by an airborne or satellite sensor. Two systems currently active and operated from airborne platforms are the NASA Jet Propulsion Laboratory's Airborne Visible/InfraRed Imaging Spectrometer (AVIRIS) and the Naval Research Laboratory's HYDICE sensor; many more are under development. The HYDICE sensor was developed by Hughes Danbury Optical Systems. Hyperspectral sensors are widely used in many fields such as geology, agriculture, and intelligence. A significant number of researchers work on hyperspectral image processing, such as automatic spectral target recognition (ASTR), image classification, and image fusion [1]–[6].

Manuscript received May 26, 2014; revised July 18, 2014; accepted August 07, 2014. Date of publication September 03, 2014; date of current version July 30, 2015. This work was supported by the National Natural Science Foundation of China under Grant 41201335 and Grant 40971189. (Corresponding author: Bormin Huang.) X. Li and K. Zhao are with the Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China. B. Huang is with the Space Science and Engineering Center, University of Wisconsin, Madison, WI 53706 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSTARS.2014.2347299

The major advantage of a hyperspectral sensor is its significantly improved spectral and spatial resolution. These improvements also mean that many unknown signals can be uncovered as anomalies without prior knowledge, which has significantly expanded the domain of many analysis techniques. Hyperspectral sensors have also been shown to detect targets smaller than a single pixel. In order to detect such targets, one must rely on their spectral properties and identify them at the subpixel scale, a task that cannot be accomplished using traditional spatial-based image processing techniques. Real-time or nearly real-time processing of hyperspectral images is required for swift decisions, and it depends on fast data processing.

In recent years, the graphics processing unit (GPU) has evolved into a highly parallel, multithreaded, multicore processor with tremendous computational power and very high memory bandwidth. A single GPU can have hundreds of parallel processor cores executing tens of thousands of parallel threads. In comparison, traditional CPUs are designed to obtain high performance for single-threaded programs, and only a few cores can be packed into a single processor. GPUs, in contrast, are not built for single-threaded performance; they are optimized for executing a large number of threads simultaneously. Furthermore, GPUs possess several merits: unlike CPUs with their limited bandwidth, the large memory bandwidth available to GPUs allows high floating-point throughput, and the highly parallel structure of GPUs makes them more efficient than general-purpose CPUs for calculations over large volumes of data. Therefore, GPUs provide a low-cost, low-power, high-bandwidth, and high-performance alternative to conventional CPU microprocessors. GPUs have been used very successfully for numerous computational problems such as automatic electronic design, computer vision, financial analysis, drug design, image processing, engineering control, game playing, environmental data processing, and green computing [7]–[16].

GPUs thus appear to be an attractive platform to expedite the automatic target generation process (ATGP) algorithm. GPU implementations of the related automatic target detection and classification algorithm (ATDCA) have also been explored by Bernabé et al. [17], [18]. In our study, an efficient GPU-based ATGP implementation is developed. The remainder of this paper is organized as follows. The principle of the ATGP algorithm is described in Section II. In Section III, we present the implementation of the GPU-based ATGP scheme, which consists of the GPU hardware specification and CUDA computing basics, the algorithm implementation details, and optimizations in multiple steps. The study is concluded in Section IV.

1939-1404 © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.


II. ATGP ALGORITHM

In this section, we briefly describe two versions of the ATGP target detection algorithm. ATGP uses an orthogonal subspace projector (ATGP-OSP) [19], which does not require prior knowledge about target signatures. Assume that $t_0$ is an initial target signature. ATGP begins with the initial target signature $t_0$ by applying an OSP $P^{\perp}_{t_0}$ to all image pixel vectors and finds a target signature $t_1$ with the maximum orthogonal projection in the orthogonal complement space, denoted by $\langle t_0 \rangle^{\perp}$, which is orthogonal to the space linearly spanned by $t_0$. The reason for this selection is that the selected $t_1$ generally has the most distinct features from $t_0$ in the sense of orthogonal projection, because $t_1$ has the largest magnitude of the projection in $\langle t_0 \rangle^{\perp}$ produced by $P^{\perp}_{t_0}$. A second target signature $t_2$ can be found by applying an OSP $P^{\perp}_{[t_0, t_1]}$ to the original image; the target signature that has the maximum orthogonal projection in $\langle t_0, t_1 \rangle^{\perp}$ is selected as $t_2$. This procedure is repeated to find a third target signature $t_3$, a fourth target signature $t_4$, etc. In order to terminate ATGP, a stopping rule is required. If we let $U_i = [t_1\, t_2\, \ldots\, t_i]$ be the target signature matrix generated at the $i$th stage, we define an orthogonal projection correlation index (OPCI) as $\eta_i = t_0^T P^{\perp}_{U_i} t_0$, which can be used to measure the similarity between two consecutively generated target signatures. Since $U_{i-1} \subset U_i$, $\eta_i = t_0^T P^{\perp}_{U_i} t_0 \le \eta_{i-1} = t_0^T P^{\perp}_{U_{i-1}} t_0$, i.e., the sequence $\{t_0^T P^{\perp}_{U_i} t_0\}$ is monotonically decreasing in $i$. In other words, the OPCI sequence $\{\eta_i\}$ is monotonically decreasing in $i$. Using this property as the stopping criterion, ATGP can be summarized as follows.

A. Automatic Target Generation Process

1) Initial condition: Select an initial target signature of interest denoted by $t_0$. Let $\varepsilon$ be the prescribed error threshold. Set $i = 0$ and $U_0 = \emptyset$.
2) Apply $P^{\perp}_{t_0}$ via $P^{\perp}_{U} = I - U(U^T U)^{-1} U^T$ to all image pixel vectors $r$ in the image.
3) Find the first target signature, denoted by $t_1$, which has the maximum orthogonal projection
$$t_1 = \arg\max_{r} \left[ \left( P^{\perp}_{t_0} r \right)^T \left( P^{\perp}_{t_0} r \right) \right].$$

Set $i = 1$ and $U_1 = [t_1]$.
4) If $\eta_1 = t_0^T P^{\perp}_{U_1} t_0 < \varepsilon$, go to step 8). Otherwise, set $i = i + 1$ and continue.
5) Find the $i$th target $t_i$ generated at the $i$th stage by
$$t_i = \arg\max_{r} \left[ \left( P^{\perp}_{[t_0\, U_{i-1}]} r \right)^T \left( P^{\perp}_{[t_0\, U_{i-1}]} r \right) \right]$$
where $U_{i-1} = [t_1\, t_2\, \ldots\, t_{i-1}]$ is the target signature matrix generated at the $(i-1)$th stage.
6) Let $U_i = [t_1\, t_2\, \ldots\, t_i]$ be the $i$th target signature matrix, calculate the OPCI $\eta_i = t_0^T P^{\perp}_{U_i} t_0$, and compare $\eta_i$ to the prescribed threshold $\varepsilon$.


TABLE I
DEVICE PARAMETERS OF THE GPU AND CPU

7) Stopping rule: If $\eta_i > \varepsilon$, go to step 5). Otherwise, continue.
8) At this stage, ATGP is terminated. The target matrix $U_i$ generated at this point contains $i$ target signatures, which do not include the initial target signature $t_0$.

After ATGP is terminated, the ATGP-generated targets are fed to the target classification process (TCP), which is used for target classification. Depending upon whether or not partial knowledge is used to select the initial target $t_0$ in ATGP, two versions of the ASTR can be implemented; they are referred to as the desired target detection and classification algorithm (DTDCA) and ATDCA.

III. IMPLEMENTATION OF GPU-BASED ATGP SCHEME

A. GPU Hardware Specification and CUDA Computing Basics

The parallel implementation of the ATGP scheme is performed on an NVIDIA Tesla K20 [20]. The hardware specifications of the NVIDIA Tesla K20 GPU employed in our study are listed in Table I.

CUDA C is an extension to the C programming language that offers direct programming of NVIDIA GPUs and is designed to express data-level parallelism. A CUDA program is arranged into two parts: 1) a serial program running on the CPU; and 2) a parallel part running on the GPU, where the parallel part is called a kernel. The driver distributes a large number of copies of the kernel onto the available processors of the GPU and executes them simultaneously, so a CUDA program automatically utilizes the parallelism available on the given GPU. CUDA code consists of three computational phases: 1) transmission of data into the global memory of the GPU; 2) execution of the CUDA kernel; and 3) delivery of results from the GPU back to the memory of the CPU. Thus, the I/O in this paper means the transmission of data between the CPU and the GPU.

From the viewpoint of CUDA programming, a thread is the basic atomic unit of parallelism. Threads are organized into a three-level hierarchy: the highest level is a grid, which consists of thread blocks arranged in a three-dimensional array. Fig. 1 shows the three-level thread hierarchy of a device for one GPU. Thread blocks implement coarse-grained scalable data parallelism and are executed independently, which permits them to be scheduled in any order across any number of cores. This allows CUDA code to scale to the number of available processors. In the CUDA memory hierarchy, each thread has its own registers and local memory, the threads of a block share that block's shared memory, and all threads can access the global memory.
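To make the three computational phases described above concrete, a minimal host-side sketch is given below. It is only an illustration under assumed names and sizes; the kernel name, its placeholder body, and the use of the URBAN cube dimensions are ours, not the authors' code.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Placeholder kernel: copies input to output; the real ATGP kernels are
// described in the following sections.
__global__ void atgp_kernel(const float *d_in, float *d_out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) d_out[idx] = d_in[idx];
}

int main(void)
{
    const int n = 307 * 307 * 210;               /* e.g., the URBAN data cube */
    const size_t bytes = (size_t)n * sizeof(float);
    float *h_in  = (float *)calloc(n, sizeof(float));
    float *h_out = (float *)malloc(bytes);
    float *d_in, *d_out;

    cudaMalloc(&d_in, bytes);                    /* phase 1: allocate device  */
    cudaMalloc(&d_out, bytes);                   /* memory and transfer the   */
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);  /* data to the GPU */

    int block = 64;                              /* threads per block          */
    int grid  = (n + block - 1) / block;
    atgp_kernel<<<grid, block>>>(d_in, d_out, n);  /* phase 2: run the kernel  */

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost); /* phase 3: copy  */
                                                              /* results back  */
    cudaFree(d_in); cudaFree(d_out);
    free(h_in); free(h_out);
    return 0;
}
```

The two cudaMemcpy calls correspond to the CPU-GPU transfers that this paper counts as I/O.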



Fig. 1. Three-level thread hierarchy of a device for one GPU utilized in our study: threads, thread blocks, and grids of blocks.

On the hardware side, threads are grouped into sets of 32, called warps, which execute together. Global memory loads and stores are issued per half-warp (i.e., 16 threads). A CUDA core group inside a multiprocessor issues the same instruction to all threads of a warp. Separate global memory accesses are coalesced by the device into as few as one memory transaction when the starting address of the access is aligned and the threads access the data sequentially. Efficient use of the global memory is therefore essential for a high-performance CUDA kernel. The following sections demonstrate high-performance computing applied to the GPU-based implementation of the ATGP algorithm.

B. Implementation of ATGP Algorithm

Data measured by a HYDICE sensor contain three parts: x-coordinates, y-coordinates, and bands. The URBAN image (307 × 307 pixels and 210 spectral bands), collected with the HYDICE airborne sensor, is used to test the ATGP algorithm. The image set was downloaded from [21]; its location and time of acquisition are unknown. When a user inputs the target number, the pixel coordinates of the targets are obtained by the ATGP algorithm.

The current ATGP code is written in the MATLAB script language as a nonparallelized CPU-based implementation. To obtain our GPU-based parallelized implementation, the MATLAB code of the ATGP algorithm is first translated manually into a standard C implementation, followed by conversion of the C code to CUDA so that it can run efficiently on GPUs. The conversion from C to CUDA requires some caution. First, MATLAB arrays use 1 as the first index while C/CUDA arrays use 0, so the indices must be converted correctly. Second, attention must be paid to the handling of the grid point (i, j). In CUDA, the loops over spatial grid points (i, j) are replaced by index computations using thread and block indices

i = threadIdx.x + blockIdx.x * blockDim.x
j = threadIdx.y + blockIdx.y * blockDim.y

where threadIdx and blockIdx are the thread index and block index, respectively, and blockDim is the dimensional size of a block.
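As a small illustration of this mapping (the array name, operation, and bounds are ours, purely for illustration), a CPU double loop over grid points and the corresponding CUDA kernel might look like:

```cuda
// CPU version: nested loops over the spatial grid.
void scale_cpu(float *img, int rows, int cols, float s)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            img[i * cols + j] *= s;
}

// CUDA version: the loops disappear; each thread handles one (i, j).
__global__ void scale_gpu(float *img, int rows, int cols, float s)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int j = threadIdx.y + blockIdx.y * blockDim.y;
    if (i < rows && j < cols)
        img[i * cols + j] *= s;
}
```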

Fig. 2. Memory footprint of one thread on global memory where matrices A, B, and C are stored.

Each grid point (i, j) represents a thread in the CUDA code. Note that, in the C-to-CUDA conversion, when we calculate a matrix multiplication C = A * B, there is a naive implementation on the GPU that assigns one thread to compute one element of the matrix C: each thread loads one row of matrix A and one column of matrix B from global memory, calculates the inner product, and stores the result into matrix C in global memory. Fig. 2 shows the memory footprint of one thread on global memory where matrices A, B, and C are stored.

When compiling the code, the default compiler options were used for the C version of the ATGP algorithm; we used gcc -lstdc++ ATGP.C -O3 for the C code. The compiler is GNU GCC 4.4.7 20120313 (Red Hat 4.4.7-3) with GNU libc 2.12-1.107. We first verified that the C code produces the same output as the MATLAB code. Further improvements to the CUDA code of the ATGP algorithm will be presented in the following sections for several conditions and configurations. The CUDA code was compiled using nvcc (the NVIDIA CUDA compiler) version 5.5 and executed on an NVIDIA Tesla K20 GPU with compute capability 3.5. The compiler options were -O3 --gpu-architecture=sm_35 -fmad=false -m64 --maxrregcount=63 --restrict. The value 63 is the number of registers per thread; it is deliberately chosen of the form 2^n − 1. We set it to 63 temporarily and will choose the optimal value in Section III-I. In addition, the thread block size (i.e., threads per block) was chosen as 64. For comparison, the target number is set to 15 in this paper, the same as in [19].

Running the ATGP algorithm on one single-threaded CPU core and on one GPU, respectively, without the I/O transfer taken into account, our study showed that the original C code took 60.143 s, while the first version of the CUDA code without optimization (i.e., directly translated from the original MATLAB/C code) took 4.262 s. The runtime and speedups are summarized in Table II. In order to make full use of the parallel nature of the GPU, Sections III-C to III-I present how to obtain the highest performance of the ATGP algorithm on a GPU.
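For concreteness, a minimal sketch of the naive kernel described above (one thread per element of C, cf. Fig. 2) is given here; it is our own illustration with generic dimensions, not the authors' code.

```cuda
// Naive matrix multiplication C = A * B on the GPU.
// A is M x K, B is K x N, C is M x N, all stored row-major in global memory.
__global__ void matmul_naive(const float *A, const float *B, float *C,
                             int M, int K, int N)
{
    int row = threadIdx.y + blockIdx.y * blockDim.y;  // index into C
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)          // one row of A, one column of B
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;              // single store to global memory
    }
}
```

Each thread issues K loads from A, K loads from B, and one store to C, which is exactly the per-thread footprint sketched in Fig. 2.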



TABLE II
RUNTIME OF ATGP ON CPU/GPU BEFORE OPTIMIZATION

TABLE III
GPU RUNTIME AND SPEEDUP OF THE ATGP ALGORITHM AFTER THE L1 CACHE COMMAND "cudaFuncCachePreferL1" IS APPLIED

Fig. 3. Matrix tiling method, when the sizes of the matrix are not a multiple of 16.

C. Further Improvement With More L1 Cache Than Shared Memory

As shown in Table I, the total amount of L1 cache plus shared memory is 64 kB, and the shared memory can be set to 16 or 48 kB. From the GPU kernel code, we calculated the total size of the shared variables and found that only 2 kB of shared memory is needed. The shared memory is therefore configured to 16 kB, and the L1 cache is configured to 48 kB instead of 16 kB [22]. To achieve this setting, the command "cudaFuncCachePreferL1" is issued in our CUDA code, which requests more L1 cache than shared memory. In the first CUDA version of the ATGP algorithm, the computing performance with the L1 cache command was found to be better than without it, which suggests that using more L1 cache helps to speed up the CUDA implementation of this module. The GPU runtime and speedup after the L1 cache command "cudaFuncCachePreferL1" is applied are summarized in Table III. As can be seen, the running time was reduced from 4.262 to 4.088 s.
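For illustration, the cache preference can be requested through the CUDA runtime API roughly as follows; the kernel name and its body are placeholders of ours, not the authors' kernel.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel standing in for one of the ATGP kernels.
__global__ void atgp_projection_kernel(const float *in, float *out, int n)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < n) out[idx] = in[idx];
}

void prefer_l1_cache(void)
{
    // Device-wide preference: 48 kB L1 cache / 16 kB shared memory.
    cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);

    // Or per kernel, overriding the device-wide setting for this kernel only.
    cudaFuncSetCacheConfig(atgp_projection_kernel, cudaFuncCachePreferL1);
}
```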

Fig. 4. Main code of the matrix tiling method.

TABLE IV
GPU RUNTIME AND SPEEDUP WHEN WE USED THE TILING METHOD

D. Increased Computation-to-Memory Ratio Through Tiling

In the GPU storage architecture, each thread has its own local memory, each block has a shared memory, and global memory is used to exchange data between different grids. Global memory is large but slow, whereas shared memory is small but fast. A common strategy is to partition the data into subsets called tiles such that each tile fits into the shared memory. The term "tile" draws on the analogy that a large wall (i.e., the global memory data) can often be covered by tiles (i.e., subsets that each fit into the shared memory) [23]. The GPU kernel computes C in multiple iterations: in each iteration, one thread block loads one tile of A and one tile of B from global memory into shared memory, performs the computation, and keeps the partial result of C in registers. After all iterations are complete, the thread block stores one tile of C into global memory. The tile size is often 16 or 32, chosen to fit the warp size. In our program, the sizes of matrices A and B are 210 × 210 and 210 × 94 249, respectively, i.e., they are not multiples of 16. To solve this problem, we used zero padding. For example, we added 14 columns to the right end of matrix A, giving 210 + 14 = 224 columns, as shown in Fig. 3.

We also added 14 rows to the bottom of matrix A and are then able to place 16 × 16 tiles. With this configuration, before reading an element we first determine whether its index is beyond the original matrix rows and columns; if it is, the element value is set to zero. The main code is shown in Fig. 4. Table IV shows the GPU runtime and speedup when we used the tiling method.

E. Memory Coalescing

Regarding data storage, two-dimensional arrays in C/C++ are row-major: the elements of a row are stored at contiguous addresses, so reading along a row is fast, whereas reading along a column is slower because the elements are stored at discontiguous addresses. In the tiled implementation above, neighboring threads have coalesced access to matrix A but do not have coalesced access to matrix B. In column-major languages, the problem is the other way around. An obvious solution is to transpose matrix B on the CPU before offloading it to GPU memory [24].
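Putting the tiling and the zero-padding boundary checks together, a hedged sketch of such a kernel is given below. It assumes 16 × 16 tiles and row-major storage; the names and structure are ours for illustration, not the authors' Fig. 4 code.

```cuda
#define TILE 16

// Tiled matrix multiplication C = A * B with zero padding at the edges.
// A is M x K, B is K x N, C is M x N, row-major.
__global__ void matmul_tiled(const float *A, const float *B, float *C,
                             int M, int K, int N)
{
    __shared__ float sA[TILE][TILE];
    __shared__ float sB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Out-of-range elements are treated as zeros (the "padding").
        sA[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += sA[threadIdx.y][k] * sB[k][threadIdx.x];
        __syncthreads();
    }

    if (row < M && col < N)
        C[row * N + col] = acc;
}
```

In this row-major sketch both tile loads happen to be coalesced, since consecutive threads read consecutive addresses; when one operand must instead be traversed column-wise, as in the authors' case for matrix B, transposing it on the CPU first (Section III-E) restores coalesced access.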



TABLE VI
GPU RUNTIME AND SPEEDUP WHEN WE USED LOOP UNROLLING

Fig. 5. Main code before and after using memory coalescing.

Fig. 7. Main code using the CUBLAS functions.

TABLE V
GPU RUNTIME AND SPEEDUP WHEN WE USED MEMORY COALESCING

TABLE VII
GPU RUNTIME AND SPEEDUP WHEN WE USED THE CUBLAS FUNCTIONS


Fig. 6. Main code using loop unrolling.

We rewrote our code to meet the above conditions, as shown in Fig. 5. Using this method, the GPU runtime was reduced to 1.071 s and the speedup reached 56×, as displayed in Table V.

F. Loop Unrolling

Loop unrolling is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its size. The goal of loop unrolling is to increase a program's speed by reducing the instructions that control the loop, such as pointer arithmetic and "end of loop" tests on each iteration, by reducing branch penalties, and by hiding latencies, in particular the delay in reading data from memory. Loop unrolling sometimes has the side effect of increasing register usage, which may limit the number of concurrent threads; however, in our program loop unrolling does not increase register usage [25]. We rewrote our code using loop unrolling, as shown in Fig. 6. With the loop unrolling method, the GPU runtime was reduced to 0.906 s and the speedup reached 66×, as displayed in Table VI.
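As an illustration of the technique (our sketch, not the authors' Fig. 6 code), the inner-product loop over a 16-element tile can be unrolled simply by telling the compiler to do so, since the trip count is a compile-time constant:

```cuda
#define TILE 16

// Inner product of one row of an A-tile with one column of a B-tile.
// With #pragma unroll and a constant trip count, nvcc replicates the body
// 16 times and removes the loop counter, branch, and end-of-loop test.
__device__ float tile_dot_unrolled(const float sA[TILE][TILE],
                                   const float sB[TILE][TILE],
                                   int ty, int tx)
{
    float acc = 0.0f;
    #pragma unroll
    for (int k = 0; k < TILE; ++k)
        acc += sA[ty][k] * sB[k][tx];
    return acc;
}
```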

G. CUBLAS

The NVIDIA CUBLAS (CUDA Basic Linear Algebra Subroutines) library [26] is a GPU-accelerated implementation of the standard Basic Linear Algebra Subprograms (BLAS) API on top of the NVIDIA CUDA runtime, which exploits the resources of NVIDIA GPUs. The simplest way to use the CUBLAS library is to create matrix and vector objects in GPU memory, fill them with data, and then call CUBLAS functions in the required order. Using the functions defined in CUBLAS, the transpose multiplications of matrices (such as $B = A^T A$ or $B = A (A^T)^{-1}$) are implemented with CUBLAS functions, as shown in Fig. 7 (a hedged sketch of such a call is also given at the end of Section III-H). After using the CUBLAS functions, the runtime and speedup of the ATGP algorithm are shown in Table VII; the speedup rose from 66× to 308×.

Moreover, the implementation of the ATGP algorithm requires solving a matrix inversion problem. We used three methods to achieve this, as listed below.

1) CULA library [27] LU-based linear solvers: In the CULA library, computing the inverse matrix is divided into two steps: 1) LU decomposition of the matrix; and 2) solution for the inverse. The LU decomposition has to be performed before the equations are solved. Because the data already reside on the device, we use culaDeviceSgetrf() and culaDeviceSgetri() and avoid transfers between host and device.
2) Solving a linear system: For the equation AX = B, if B is the identity matrix E, then X is the inverse of matrix A. Concretely, we first define an identity matrix E with the same size as matrix A and then use the CULA function culaDeviceSgesv() to solve the linear system AX = E; the solution X is the inverse matrix A^{-1} of A.
3) MAGMA library [28]: The MAGMA project aims to develop a dense linear algebra library for heterogeneous/hybrid architectures.



TABLE VIII
SPEEDUP WHEN USING THE DIFFERENT INVERSE MATRIX METHODS

Fig. 9. GPU runtime of the CUDA version, where the number of registers per thread = 16–255.

Fig. 8. GPU runtime of the CUDA version, where block size = 8–1024.

TABLE IX
GPU RUNTIME AND SPEEDUP WHEN WE USED A BLOCK SIZE OF 32

When we calculate the inverse matrix using the MAGMA library, it also needs two steps: the matrix decomposition and then the inverse matrix generation. The function calls are as follows:

magma_spotrf_gpu('u', (m-1), upu, m-1, &result);
magma_spotrs_gpu('u', m-1, band, upu, m-1, up, m-1, &result);

The runtime and speedup obtained with the different matrix inversion methods are shown in Table VIII. From this table, we can see that the linear-equation method takes the least time to compute the inverse matrix; therefore, we choose this method for matrix inversion in the ATGP implementation. The speedup increased to 356×.

H. Effect of Block Size Per Grid on the GPU-Based Code

As noted above, the CUDA architecture automatically schedules execution across blocks, so the number of threads in one block, i.e., the block size, is optimized experimentally. In order to determine the influence of the thread block size on computing performance, we performed tests with different block sizes. A total of 11 runs are executed for each block size; the first three runs are excluded due to unstable computing performance, and the remaining eight runs are averaged for the speedup calculation. We varied the block size from 8 to 1024, and the results are presented in Fig. 8. It can be seen that the best performance is achieved when the block size is 32; hence, we set the block size to 32. The runtime and speedup are shown in Table IX, and the speedup reached 362× after setting the optimized block size.
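Relating back to Section III-G, a minimal sketch of the kind of CUBLAS call used for a transpose multiplication such as $B = A^T A$ is given here, assuming the cuBLAS v2 API; the dimensions, handle management, and variable names are ours for illustration, not the authors' Fig. 7 code.

```cuda
#include <cublas_v2.h>

// Compute B = A^T * A on the GPU, where A is an m x n column-major matrix
// already resident in device memory (dA), and dB holds the n x n result.
void gram_matrix(cublasHandle_t handle, const float *dA, float *dB,
                 int m, int n)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Result is n x n, inner dimension is m.
    cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                n, n, m,
                &alpha,
                dA, m,      // op(A) = A^T, A stored with leading dimension m
                dA, m,      // op(B) = A
                &beta,
                dB, n);     // B, leading dimension n
    // The resulting matrix can then be inverted on the device by solving
    // B X = E, e.g., with the CULA solver culaDeviceSgesv() named in the text.
}
```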

Fig. 10. URBAN image results of ATGP algorithm when the target number is 15.

I. Effect of Registers Per Thread on the GPU-Based Code

Keeping the block size at 32, the optimized CUDA version of the ATGP algorithm was used to study the runtime as the number of registers per thread is varied. Similar to the approach of the previous section, for a given number of registers per thread, 11 executions of the CUDA code are performed and only the last eight runs are used for the speedup calculation. The results are shown in Fig. 9; the best performance is achieved with 63 registers per thread, which is the same value we set previously. This completes the optimization of the GPU-based ATGP algorithm: the runtime is reduced from 60.143 s to 0.166 s, and the speedup achieved is 362×.

J. Results of ATGP Algorithm

Using the ATGP algorithm with the URBAN data as input, we can determine the pixel coordinates of the identified targets, which are shown in Fig. 10. In order to test the speedup of this algorithm with different target numbers, we also performed tests with target numbers from 10 to 20; the results are shown in Fig. 11.




Fig. 11. Speedup when the target number is 10–20.

TABLE X
GPU RUNTIME WITH I/O OF THE METHOD IN [17] AND OUR METHOD FOR AVIRIS CUPRITE DATA ON A TESLA C1060

TABLE XI
GPU RUNTIME AND SPEEDUP FOR DIFFERENT DATA SOURCES

As can be seen from Fig. 11, the speedups of this algorithm are more than 340× when the target number is between 10 and 20.

In order to demonstrate the effectiveness of our approach, we chose the same data, the same GPU card, and the same target number as in [17]: the AVIRIS Cuprite data (350 × 350 pixels and 188 spectral bands), a Tesla C1060, and 30 targets, respectively. We used the linear-equation method to calculate the inverse matrix. The runtimes with I/O are shown in Table X.

In order to test the applicability of our program, we also tested its performance on other data. One example is the AVIRIS WTC dataset, obtained by the National Aeronautics and Space Administration Airborne Visible/InfraRed Imaging Spectrometer (AVIRIS) over the World Trade Center (614 × 512 pixels and 224 spectral bands); the runtime and speedup achieved on these data are also shown in Table XI for a target number of 15.

IV. CONCLUSION

In this paper, we implemented the ATGP algorithm on an NVIDIA Tesla K20 GPU. Several optimizations were made to improve the computing performance: for example, we configured more L1 cache than shared memory, and CUBLAS and CULA functions were applied. In addition, the effects of the number of threads per block on the GPU-based implementation were studied. In the end, when we chose the same data, the same GPU card, and the same target number as in [17], the runtime was reduced from 9.003 s to 1.223 s. Using one NVIDIA Tesla K20 GPU with I/O (CPU to GPU and GPU to CPU), our optimization effort on the GPU-based ATGP algorithm achieves a speedup of 362×, 416×, and 320×, with respect to one CPU core, for the HYDICE URBAN, AVIRIS WTC, and AVIRIS Cuprite data, respectively, when the target number is 15.

REFERENCES

[1] C. I. Chang, "Three-dimensional receiver operating characteristics (3D ROC) analysis," in Hyperspectral Data Processing: Algorithm Design and Analysis, 1st ed. Hoboken, NJ, USA: Wiley, 2013.
[2] D. Manolakis, D. Marden, and G. A. Shaw, "Hyperspectral image processing for automatic target detection applications," Lincoln Lab. J., vol. 14, no. 1, pp. 79–116, 2003.
[3] S. Valero, P. Salembier, and J. Chanussot, "Hyperspectral image representation and processing with binary partition trees," IEEE Trans. Image Process., vol. 22, no. 4, pp. 1430–1443, Apr. 2013.
[4] C. Rodarmel and J. Shan, "Principal component analysis for hyperspectral image classification," Surv. Land Inf. Syst., vol. 62, no. 2, pp. 115–122, 2002.
[5] B. Mojaradi, H. Emami, M. Varshosaz, and S. Jamali, "A novel band selection method for hyperspectral data analysis," Int. Arch. Photogramm. Remote Sens. Spat. Inf. Sci., vol. 37, no. B7, pp. 447–451, 2008.
[6] C. I. Chang, "Automatic subpixel detection unsupervised subpixel detection," in Hyperspectral Imaging: Techniques for Spectral Detection and Classification, 2nd ed. New York, NY, USA: Kluwer/Plenum, 2013.
[7] M. H. Lv, W. Xiong, and L. Cai, "A GPU-based parallel processing method for slope analysis in geographic computation," Adv. Mater. Res., vol. 538, pp. 625–631, 2012.
[8] Y. Gao et al., "Optimization for viewshed analysis on GPU," in Proc. 19th Int. Conf. Geoinformat., Shanghai, China, 2011, pp. 1–5.
[9] P. Tsang, W. K. Cheung, T. C. Poon, and C. Zhou, "Holographic video at 40 frames per second for 4-million object points," Opt. Express, vol. 19, no. 16, pp. 15205–15211, 2011.
[10] J. Mielikainen, B. Huang, H. A. Huang, and M. D. Goldberg, "Improved GPU/CUDA based parallel weather and research forecast (WRF) single moment 5-class (WSM5) cloud microphysics," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 4, pp. 1256–1265, Aug. 2012.
[11] T. S. Muhammad, B. Huang, T. J. Hsieh, Y. L. Chang, and W. Y. Liang, "GPU acceleration of tsunami propagation model," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 5, no. 3, pp. 1014–1023, Jun. 2012.
[12] A. Plaza, Q. Du, Y. L. Chang, and R. L. King, "Foreword to the special issue on high performance computing in earth observation and remote sensing," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 3, pp. 503–507, Sep. 2011.
[13] C. A. Lee, S. D. Gasster, A. Plaza, C. I. Chang, and B. Huang, "Recent developments in high performance computing for remote sensing: A review," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 3, pp. 508–527, Sep. 2011.
[14] C. H. Song, Y. S. Li, and B. Huang, "A GPU-accelerated wavelet decompression system with SPIHT and Reed-Solomon decoding for satellite images," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 3, pp. 683–690, Sep. 2011.
[15] S. C. Wei and B. Huang, "GPU acceleration of predictive partitioned vector quantization for ultraspectral sounder data compression," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 4, no. 3, pp. 677–682, Sep. 2011.
[16] T. Balz and U. Stilla, "Hybrid GPU-based single- and double-bounce SAR simulation," IEEE Trans. Geosci. Remote Sens., vol. 47, no. 10, pp. 3519–3529, Oct. 2009.
[17] S. Bernabé, S. López, A. Plaza, and R. Sarmiento, "GPU implementation of an automatic target detection and classification algorithm for hyperspectral image analysis," IEEE Geosci. Remote Sens. Lett., vol. 10, no. 2, pp. 221–225, Mar. 2013.
[18] S. Bernabé et al., "Hyperspectral unmixing on GPUs and multi-core processors: A comparison," IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens., vol. 6, no. 3, pp. 1386–1398, Jun. 2013.
[19] C. I. Chang, "Orthogonal subspace projection (OSP) revisited: A comprehensive study and analysis," IEEE Trans. Geosci. Remote Sens., vol. 43, no. 3, pp. 502–518, Mar. 2005.
[20] NVIDIA Corporation, The CUDA C Best Practices Guide v4.1. Santa Clara, CA, USA: NVIDIA Corp., 2012.
[21] F. Li, M. K. Ng, and R. J. Plemmons, "Coupled segmentation and denoising/deblurring models for hyperspectral material identification," Num. Linear Algebra Appl., vol. 19, no. 1, pp. 153–173, 2012.


[22] NVIDIA Corporation, NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Santa Clara, CA, USA: NVIDIA Corp., 2012.
[23] D. B. Kirk and W. W. Hwu, "CUDA memories," in Programming Massively Parallel Processors: A Hands-on Approach, 2nd ed. San Mateo, CA, USA: Morgan Kaufmann, 2012.
[24] M. Q. Ma, Y. Liu, and S. T. Zeng, "Research of matrix multiplication based on CUDA architecture," Microcomput. Appl., vol. 30, no. 24, pp. 62–68, 2011.
[25] NVIDIA Corporation, CUBLAS Library 5.5 User Guide. Santa Clara, CA, USA: NVIDIA Corp., 2013.
[26] P. Benner, E. S. Quintana-Ortí, et al., "High performance matrix inversion of SPD matrices on graphics processors," presented at the Int. Conf. High Perform. Comput. Simul. (HPCS), 2011, pp. 640–646.
[27] EM Photonics. (2013). CULA Programmer's Guide. Newark, DE, USA: EM Photonics [Online]. Available: http://www.culatools.com/
[28] University of Sydney, Magma Computational Algebra System. NSW, Australia: School of Mathematics and Statistics, University of Sydney, 2006 [Online]. Available: http://icl.cs.utk.edu/magma/

Xiaojie Li received the Ph.D. degree in optical engineering from Tianjin University, Tianjin, China, in 2010. Currently, she is an Associate Professor with Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun, China. Her research interests include photoelectric detection technology, remote sensing data processing, and high-performance computing.

Bormin Huang received the M.S.E. degree in aerospace engineering from the University of Michigan, Ann Arbor, MI, USA, in 1992, and the Ph.D. degree in the area of satellite remote sensing from the University of Wisconsin-Madison, Madison, WI, USA, in 1998. Currently, he is a Research Scientist and Principal Investigator at the Space Science and Engineering Center, University of Wisconsin-Madison. His research interests include remote sensing science and technology, including satellite data compression and communications, remote sensing image processing, remote sensing forward modeling and inverse problems, and high-performance computing in remote sensing. Dr. Huang serves as a Chair for the SPIE Conference on Satellite Data Compression, Communications, and Processing and a Chair for the SPIE Europe Conference on High-Performance Computing in Remote Sensing, an Associate Editor for the Journal of Applied Remote Sensing, a Guest Editor for the special section on High-Performance Computing in the Journal of Applied Remote Sensing, a Guest Editor for the special issue on Advances in Compression of Optical Image Data from Space in the Journal of Electrical and Computer Engineering, and a Session Chair or Program Committee member for several IEEE or SPIE conferences.


Kai Zhao received the Master's degree from the Changchun Institute of Geographical Sciences, Changchun, China, in 1990. He is currently a Professor with the Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences (CAS), Changchun, China, a Ph.D. supervisor, and a part-time Professor with the Graduate University of CAS. His research interests include the development of microwave remote sensing instruments and the study of basic theories of microwave remote sensing. Dr. Zhao is a member of the Resource Information System Association of the China Society of Natural Resources and a member of the Jilin International Society for Photogrammetry and Remote Sensing (ISPRS).
