In Proceedings of the 21st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), Oct. 2009. Copyright by IEEE. Available from http://www.cos.ufrj.br/~monnerat
Accelerating Kirchhoff Migration by CPU and GPU Cooperation
Jairo Panetta, Thiago Teixeira, Paulo R. P. de Souza Filho, Carlos A. da Cunha Filho, David Sotelo, Fernando M. Roxo da Motta, Silvio Sinedino Pinheiro, Ivan Pedrosa Junior, Andre L. Romanelli Rosa, Luiz R. Monnerat, Leandro T. Carneiro and Carlos H. B. de Albrecht
Tecnologia Geofísica – Petróleo Brasileiro SA, PETROBRAS
[email protected], {thiagoxt, prps, ccunha, david, fernando.roxo, sinedino, ivanjr, aromanelli, luiz.monnerat, ltcarneiro, albrecht}@petrobras.com.br
Abstract
We discuss the performance of Petrobras' production Kirchhoff prestack seismic migration on a cluster of 64 GPUs and 256 CPU cores. Porting and optimizing the application hot spot (98.2% of single CPU core execution time) to a single GPU reduces total execution time by a factor of 36 on a control run. We then argue against the usual practice of porting the next hot spot (1.5% of single CPU core execution time) to the GPU. Instead, we show that cooperation of CPU and GPU reduces total execution time by a factor of 59 on the same control run. Remaining GPU idle cycles are eliminated by overloading the GPU with multiple requests originating from distinct CPU cores. However, increasing the number of CPU cores in the computation reduces the gain, due to the combination of enhanced parallelism in the runs without GPUs and GPU saturation in the runs with GPUs. We proceed by obtaining close to perfect speed-up on the full cluster over a homogeneous load obtained by replicating control run data. To cope with the heterogeneous load of real-world data, we present a dynamic load balancing scheme; with it, runs that use all GPUs and half of the cluster's CPU cores are 20 times faster than runs that use all CPU cores but no GPU.
1. Introduction
The oil industry is a voracious consumer of computing cycles worldwide. Petrobras has a large installed base of x86-based clusters dedicated to seismic processing production runs. As with any enterprise that relies on ever-growing computational power, growth has been restricted by the
overwhelming costs of energizing, cooling and hosting computing hardware. In this scenario, it is crucial to test new architectural trends that promise significant energy and space reductions per delivered computing power. Over the last four years, we have been conducting a research project at Petrobras to test the adequacy of innovative computing architectures for seismic processing. Early project experiments indicated that recently released GPUs deliver considerable gains over x86-based hardware in the speed per power, speed per area and speed per price metrics. Petrobras recently acquired a cluster of 36 quad-core, dual-processor 2.33GHz Xeon 5410 nodes, accelerated by two NVIDIA Tesla C1060 GPUs per node, to test its adequacy for the in-house developed Kirchhoff prestack seismic migration code. This paper reports the application porting strategy and the impressive performance achieved on 32 cluster nodes. The main contributions are the porting strategy, which privileges CPU and GPU cooperation instead of pure GPU speed, and the detailed performance data on a multi-GPU cluster. The presentation is organized as follows: Section 2 summarizes published work, Section 3 describes the application and establishes nomenclature, Section 4 describes the application porting strategy and presents its performance results over a control run on a single cluster node, Section 5 extends the results to 32 nodes and Section 6 reports performance results using 32 nodes on a typical production run. Conclusions and future work are presented in Section 7.
2. Related Work
Prestack Kirchhoff time migration has been used by the seismic industry for a long time. In a previous work
([9]) we showed the performance of Petrobras' production code on x86 clusters, Blue Gene and Cell architectures. In a related paper ([10]), we explore code structure characteristics that allow multi-algorithm support and summarize performance data (speed/core and speed/watt) on a large set of computer architectures including GPUs. There are many published reports on porting applications to GPUs, but they usually concentrate on extracting high performance from a single GPU, leaving to the companion CPU just IO and control operations. Very few reports extend the investigation to multiple GPUs within a node and to multiple nodes. Focusing on this scenario, we survey some of the most recent reports. In [1], Garland et al. survey the use of the NVIDIA Tesla GPU over a set of applications in molecular dynamics, numerical linear algebra, medical imaging, fluid dynamics and seismic processing. Performance data are limited to a single GPU, except for medical imaging (two GPUs) and fluid dynamics (eight GPUs). There is no mention of CPU – GPU cooperation, except in the linear algebra LU decomposition, where pivoting is kept on the CPU. At the recent Stanford Exploration Project meeting of May 2009, a few works (such as [13]) were presented on GPU usage in geophysics, but they all emphasize single GPU performance. Deschizeaux and Blanc ([11]) report the porting of another seismic migration algorithm to GPUs. It is a production code that routinely runs on large x86-based clusters. However, they report only the achieved speedup range (8x to 15x) of GPU code over optimized x86 code running on a single CPU core per cluster node, with no information on multi-GPU performance. Again, the CPU just executes IO and control loops. Micikevicius ([12]) reports porting to GPUs a 3D finite difference kernel that solves the wave equation. This kernel is the key component of Reverse Time Migration (RTM), a promising seismic algorithm. Performance data are presented on a single GPU in the RTM case and on four GPUs in the kernel case. CPU use seems to be limited to IO, control loops and communications.
3. Seismic Migration
We summarize the pertinent characteristics of the seismic method, Kirchhoff migration and the code structure. For a detailed presentation, see [2] and [3]. Figure 1 depicts data acquisition at sea. The source periodically generates waves, whose
propagation produces reflections at subsurface interfaces that are later collected by receivers. The set of reflections originated by a single wave and collected by a single receiver is called a trace. A trace is composed of amplitude values called samples, collected at discrete times.
Figure 1: Seismic data acquisition
The set of all samples collected by a survey is represented by S(t,x,y,o), where t is the signal travel time (from source to subsurface and back to receiver), x and y are the surface coordinates of the mid-point (halfway from source to receiver) and o is the distance from source to receiver, known as offset. Since the ship moves and periodically generates waves, data acquisition provides redundancy: a single mid-point is associated with multiple traces, corresponding to multiple offsets, originated by waves generated at distinct source positions and received by distinct receivers. Data redundancy is desired, since it can be used to improve signal quality. A typical survey contains tens to hundreds of terabytes of data. Seismic processing aims to extract subsurface information from surveys. The oil industry worldwide uses commercially available software packages (e.g. [4, 5]) containing up to tens of millions of source lines, comprising hundreds of seismic modules (e.g. signal to noise enhancement, reverberation suppression) and desired functionalities such as a project database manager, a tool to cascade seismic modules, etc. A seismic processing professional selects the modules that are most appropriate for the survey and the target area characteristics. Processing a survey demands months of team work, but its execution time is fully dominated by the seismic migration module. Seismic migration is the process of producing a reliable subsurface image (i.e., positioning reflecting surfaces) that is consistent with acquired data. It is an inverse problem, since it produces model parameters from observed data. As with many inverse problems, a consistent solution may require multiple executions,
since output-dependent information, such as the distribution of seismic wave velocity in the subsurface, is required as input. The migration algorithm may use a representative, simplified velocity field or a more complex field that accounts for wavefront deformation details. In the former case, the process is called time migration and produces image T(τ,x,y,o), where τ represents vertical travel time, while in the latter case the process is called depth migration and produces image T(z,x,y,o), where z represents depth. Kirchhoff migration uses the Huygens-Fresnel principle to collapse all possible contributions to an image sample. Wherever this sum causes constructive interference, the image is heavily marked, remaining blank or faintly marked where interference is destructive. A contribution to an image sample T(z,x,y,o) is generated by an input sample S(t,x',y',o) whenever the measured signal travel time t matches the computed travel time from the source to the (z,x,y,o) subsurface point and back to the receiver. The set of all possible input traces (x',y',o) that may contribute to an output trace (x,y,o) lies within an ellipse with axes ax and ay (apertures) centered at (x,y,o). Input traces are filtered to attenuate spatial aliasing. Sample amplitudes are corrected to account for energy spreading during propagation. Kirchhoff depth migration is the result, for each output sample (z,x,y,o), of the summation
T_{z,x,y,o} = \sum_{(x',y',o)} \beta \left( \alpha\, S^{f}_{t,x',y',o} + (1-\alpha)\, S^{f}_{t+\Delta t,x',y',o} \right)
where the sum of contributions is taken over all input traces (x',y',o) within the aperture ellipse, the superscript f denotes the selected filter, β denotes the amplitude correction, t is the travel time corresponding to z and α is the interpolation weight required to map the computed floating-point travel time to discrete samples. For each offset, the output volume surface is partitioned into output blocks of parameterized size, to reduce memory requirements and to enhance parallelism. Figure 2 depicts the basic steps of a Kirchhoff algorithm; the parenthesized loop names establish nomenclature for later use. The main algorithmic distinction between depth and time migration is travel time computation. Depth migration computes travel time by adding precomputed source-to-target and target-to-receiver travel times (requiring retrieval of large pre-computed matrices) and needs reliable estimates of the interval velocity field. On the other hand, time migration relies on analytical approximations of the travel time dependency on offset and only requires estimates of the average velocity field. Consequently, depth
migration requires substantially more memory and computing time than time migration. The migration source code comprises about 32K lines of Fortran 95 and about 1K lines of C (to speed up IO). The code is Fortran 95 standard-conforming except for flush calls and C interoperability.

For all offsets (offset loop)
  For all output blocks of current offset (output block loop)
    Clear output block volume
    For all input traces with this offset (input trace loop)
      Read input trace
      Filter input trace
      For all output traces in the current block (migration loop)
        For all output trace samples (contribution loop)
          Compute travel time
          Compute amplitude correction
          Select input sample and filter
          Accumulate input sample contribution to an output sample
        End For (contribution loop)
      End For (migration loop)
    End For (input trace loop)
    Dump output block volume
  End For (output block loop)
End For (offset loop)
Figure 2: Kirchhoff Algorithm
Each instance of the inner loop (contribution loop) requires about 160KB of memory, while one instance of the outer loop (migration loop) requires tens of MB. Inner loop memory references are the filtered input trace (load), the output trace (load, modify, store), the velocity (load) and a scratch area. The outer loop (migration loop) brings another output trace and velocity to the inner loop. In consequence, the application has a high computational intensity, defined as the number of floating point operations per memory reference. Built-in instrumentation outputs our own application-specific metric of computing speed: the number of sample contributions accumulated per second, defined as the execution rate of iterations of the contribution loop. An approximation of the usual flops rate requires multiplication by 74 (average floating-point operations per contribution). Parallel domain decomposition occurs in the output space. The output block is the parallel unit of work (parallel grain), with blocks dynamically distributed to MPI processes, parallelizing the output block loop. We restrict our work to time migration. Computation is fully performed in single precision.
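Since the rest of the paper deals with time migration, it helps to recall the kind of analytical travel-time approximation involved. A common form (shown here only for illustration; the paper does not specify the exact approximation used in the production code) is the double-square-root expression, in which the travel time from the source through the image point at vertical time τ and back to the receiver is

t \approx \sqrt{\left(\frac{\tau}{2}\right)^{2} + \frac{(x-x_s)^2 + (y-y_s)^2}{v_{\mathrm{rms}}^{2}}} + \sqrt{\left(\frac{\tau}{2}\right)^{2} + \frac{(x_r-x)^2 + (y_r-y)^2}{v_{\mathrm{rms}}^{2}}}

where (x_s,y_s) and (x_r,y_r) are the source and receiver positions and v_rms is the average (RMS) velocity at (τ,x,y). Only the average velocity field appears, which is why time migration avoids the large pre-computed travel time tables required by depth migration.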
4. Porting to a Single GPU
Porting and sequential optimization experiments were performed on a single, representative output block selected from a survey. Table 1 contains the
execution time profile of the migration of the selected block on a single core of a 2.33GHz Xeon 5410 (Clovertown).

Table 1: Single Block Execution Time on CPU
Component        Execution Time (s)
Filter                      636
Migration Loop           41,324
Other                        48
Total                    42,008
The execution time of the control run is fully dominated by the migration loop (98.2% of total execution time), followed by the filter computation (1.5% of total execution time). Remaining computations are negligible (0.1% of total execution time). Data show that the migration loop is the best candidate to be accelerated by the GPU.
4.1. Application port and optimization
The target GPUs (NVIDIA Tesla C1060) can be viewed as a set of 30 processors (either as vector processors or as scalar multiprocessors) that share a large (4GB) graphics memory ([6, 7]). Since bringing data from the graphics memory to a GPU processor takes 400 to 600 cycles [7], these cycles may be used by the processor to perform other, independent computations. Under the CUDA programming model ([7]), each GPU processor receives a set of scalar, independent threads and interleaves thread execution with memory accesses. Best performance occurs when the set of available threads performs identical operations, as in a vector machine (see [8] for details). Traffic between GPU and CPU memory should also be minimized due to its high cost. Consequently, attaining high GPU speed requires generating as many identical, independent threads as possible from one instance of the migration loop, as well as keeping data on GPU memory as much as possible. The selected strategy is to execute the migration loop for a filtered input trace on the GPU, while the CPU executes the input trace loop, reading and filtering input traces. The CPU sends the filtered input trace to the GPU and dispatches the migration loop computation to the GPU. To reduce traffic between CPU and GPU memory, each output block is created in GPU memory at the beginning of the output block loop and resides on the GPU until the end of the output block loop, when it is sent back to the CPU just for output. There are no race conditions within a single migration loop, since each output sample receives at most a
single contribution from an input trace. Consequently, migration loop iterations are fully independent and may be transformed into threads. A thread could be defined as the computation of the contribution of a single input trace to a single output sample. It turns out that this definition is inconvenient, since it ignores the fact that some of the innermost loop computations are valid for a set of consecutive output samples (loop index) and can therefore be factored out of the loop. To factor these computations, we define a thread as the computation of the contributions of a single input trace to a set of consecutive output samples. Even with this restriction, the number of threads per typical output block exceeds one million. The innermost computation (accumulate input sample contributions to an output sample) requires a bi-linear interpolation of input samples – a graphics texture operation, performed by specialized GPU hardware. This operation combines the bi-linear interpolation with a vector gather over a cacheable texture memory. Code design with 1024 threads per multiprocessor and a low cache footprint hides the latencies of this operation exceptionally well. To put this in perspective, this is a scalar operation in most computer architectures. Each output trace is padded to fully utilize a block of threads and to avoid unnecessary conditional statements within a thread. Output block traces are enumerated and partitioned into sets of 128 traces, composing a CUDA block of threads. The output block is padded with empty traces to guarantee that each CUDA block has the same number of output traces. Careful coding aligns variables in GPU memory to speed up the computation. A simplified sketch of this thread decomposition is given below.
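The following kernel is a minimal sketch meant only to make the mapping concrete; it is not the production kernel. The travel time formula, the amplitude correction and the filter selection are placeholders, and, unlike the real code, one CUDA block covers a single output trace instead of a group of 128.

// Sketch (not the production kernel): one thread accumulates the contributions
// of one filtered input trace to a strip of consecutive samples of one output
// trace.  The filtered trace versions are bound to a 2D texture (time along x,
// filter version along y) so the texture unit does the bi-linear interpolation.
#include <cuda_runtime.h>

texture<float, 2, cudaReadModeElementType> trace_tex;   // filtered input trace

__device__ float travel_time(float dx, float dy, float tau, float v)
{
    // Placeholder for the analytical time-migration travel time approximation.
    return sqrtf(tau * tau + 4.0f * (dx * dx + dy * dy) / (v * v));
}

__global__ void migrate_trace(float trace_x, float trace_y,   // input trace position
                              const float2 *out_xy,           // (x,y) of each output trace
                              const float  *vel,              // velocity per output sample
                              float *out_block,               // output block, resident on GPU
                              int ns, int strip, float dt)
{
    int otrace = blockIdx.x;                 // simplified: one block per output trace
    int first  = threadIdx.x * strip;        // this thread's strip of samples
    float dx   = out_xy[otrace].x - trace_x;
    float dy   = out_xy[otrace].y - trace_y;
    float *out = out_block + (size_t)otrace * ns;

    for (int s = first; s < min(first + strip, ns); ++s) {
        float tau  = s * dt;                 // vertical time of output sample s
        float t    = travel_time(dx, dy, tau, vel[(size_t)otrace * ns + s]);
        float beta = 1.0f;                   // amplitude correction omitted
        float filt = 0.5f;                   // filter selection omitted
        // tex2D interpolates between consecutive time samples (and filter
        // versions) of the input trace: the alpha/(1-alpha) sum of the paper.
        out[s] += beta * tex2D(trace_tex, t / dt + 0.5f, filt);
    }
}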
We repeated the single output block migration experiment to measure the performance gains brought by the GPU. In this control run experiment, CPU and GPU computations execute synchronously, without overlapping in time, for a detailed timing account. Table 2 contains the execution time profile of the control run on the CPU core - GPU pair.

Table 2: Synchronous Execution of CPU and GPU
Component        Execution Time (s)
Filter                      636
Migration Loop              466
Other                        48
Total                     1,150
By comparing Table 1 with Table 2 we observe that the migration loop executes 89 times faster on the GPU than on the CPU, and that total execution time was reduced by a factor of 36.
Due to the large reduction of the migration loop execution time, the filter computation becomes the execution time hot spot, increasing its share of total execution time from the original 1.5% to 55%. So far, we followed the canonical porting strategy of accelerating hot spots. Under this strategy, the natural next step would be to move the filter computation from the CPU to the GPU. It turns out that this step does not consider the fact that there are four CPU cores for each GPU. Consequently, it may overload the GPU while keeping four CPU cores idle.
4.2. Overlapping CPU and GPU Executions
An interesting way to reduce total execution time is to overlap CPU and GPU computations: the GPU computes the migration loop of a previously filtered input trace, while the CPU anticipates the next filter computation. To avoid race conditions, a double buffer for filtered input traces is established in GPU memory, as sketched below.
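The following host-side sketch illustrates this double-buffered overlap under stated assumptions; it is not the production driver. read_trace() and filter_trace() stand in for the CPU-side IO and anti-alias filter, and migrate_trace stands for the migration kernel of Section 4.1 (signature simplified here).

#include <cuda_runtime.h>

extern void read_trace(int i, float *trace, int ns);    // placeholder: CPU IO
extern void filter_trace(float *trace, int ns);         // placeholder: CPU filter
__global__ void migrate_trace(const float *trace, float *out_block, int ns);

void migrate_block(int n_traces, int ns, float *d_out_block, dim3 grid, dim3 block)
{
    float *h_trace[2], *d_trace[2];
    cudaEvent_t copied[2];
    for (int b = 0; b < 2; ++b) {
        cudaHostAlloc((void **)&h_trace[b], ns * sizeof(float), cudaHostAllocDefault);
        cudaMalloc((void **)&d_trace[b], ns * sizeof(float));
        cudaEventCreate(&copied[b]);
        cudaEventRecord(copied[b], 0);
    }
    for (int i = 0; i < n_traces; ++i) {
        int buf = i & 1;
        cudaEventSynchronize(copied[buf]);    // host buffer no longer in flight
        read_trace(i, h_trace[buf], ns);      // CPU: IO ...
        filter_trace(h_trace[buf], ns);       // ... and filter, overlapped with the
                                              // kernel still migrating trace i-1
        // In-stream ordering queues this copy after the previous kernel, which
        // reads the other device buffer, so there is no race on d_trace[buf].
        cudaMemcpyAsync(d_trace[buf], h_trace[buf], ns * sizeof(float),
                        cudaMemcpyHostToDevice, 0);
        cudaEventRecord(copied[buf], 0);
        migrate_trace<<<grid, block>>>(d_trace[buf], d_out_block, ns);
    }
    cudaDeviceSynchronize();    // all contributions accumulated; only now is the
                                // output block copied back to the CPU (not shown)
}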
Table 3 contains the control run execution time with simultaneous CPU and GPU computation. Since execution times are measured on the CPU, the filter and migration loop execution times are collapsed.

Table 3: Overlapping CPU - GPU Execution
Component            Execution Time (s)
Filter + Migration                  659
Other                                48
Total                               707

The comparison between Table 2 and Table 3 shows that overlapped CPU and GPU computation reduced total execution time by 63% with respect to non-overlapped execution. Comparing Table 1 with Table 3, overlapped computation of CPU and GPU reduced the original CPU execution time by a factor of 59.
4.3. Overloading the GPU
A cluster node has eight CPU cores and two GPUs. Overlapped CPU and GPU execution uses one CPU core and one GPU, leaving the remaining CPU cores unused. A natural way to use all CPU cores is to overload the GPUs, pretending that there is a logical GPU for each CPU core, assigning the computation of one output block to each pair of a physical CPU core and a logical GPU, while mapping the eight logical GPUs onto the two physical ones. This is easily done by assigning multiple MPI processes to the same node, with no code modification; a possible rank-to-GPU mapping is sketched below. A side effect of this strategy is to enhance the overlapping of CPU-to-GPU data movements with GPU computations.
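One way to realize the logical-to-physical GPU mapping (an illustrative assumption; the paper does not describe the actual mechanism) is to derive the CUDA device from the node-local MPI rank:

#include <mpi.h>
#include <cuda_runtime.h>

// Hypothetical helper: each MPI process placed on a node picks one of the
// node's physical GPUs, so several "logical GPUs" share a physical one.
void select_gpu(int procs_per_node, int gpus_per_node)
{
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int local_rank = rank % procs_per_node;      // assumes block placement of ranks
    cudaSetDevice(local_rank % gpus_per_node);   // e.g. 8 processes onto 2 GPUs
}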
To measure the gain brought by GPU overloading, we varied the MPI process count per node, assigning one MPI process to each pair of CPU core and logical GPU. For a fair comparison, each MPI process receives the same load as the control run. Table 4 contains total execution time as a function of MPI process count (used cores).

Table 4: Overloading the GPU
MPI Processes and Blocks    1      2      3      4
Total Exec Time (s)       709  1,054  1,561  2,077

The execution times of Table 4 cannot be directly compared with previous results, since work increases with process count. Normalizing work by comparing execution time per output block is also misleading, since it hides the number of blocks simultaneously computed. Computing speed is a more appropriate metric, measured by the number of sample contributions accumulated per second. This metric considers the total amount of work (including parallelism) and the execution time. Since code instrumentation reports both the total number of sample contributions and the execution time, the metric is easily computed. Table 5 contains execution speed, in millions of samples per second, for control run executions with and without a single GPU, as a function of the number of CPU cores.

Table 5: Execution Speed (MSamples/s)
CPU cores         1         2         3         4
No GPU        48.49     96.98    145.47    193.97
Single GPU 2,873.10  3,865.33  3,914.86  3,923.02

As expected, execution speed increases linearly with core count on pure CPU executions. GPU introduction increases execution speed by a large factor (59) in the single CPU core case, but GPU speed shows an asymptotic behavior as CPU count increases, due to an overloaded GPU. Table 6 shows the speed gain from using the GPU (computed as the ratio of the Single GPU row to the No GPU row of Table 5) as a function of CPU core count.

Table 6: Speed Gain by including the GPU
CPU cores      1      2      3      4
Speed Gain 59.25  39.86  26.91  20.23

The acceleration provided by the GPU decreases with increasing CPU core count, indicating GPU overload. Consequently, multiple migration loop computations are enough to overload the GPU, and no gain can be achieved by moving the filter computation from the CPU to the GPU. Observing that a migration loop can be performed either on the GPU or on a CPU core, it is natural to ask which scheduling policy should be used as CPU core count increases. For a single CPU core, Table 5 shows that overlapping CPU and GPU computation produces large gains. For increasing CPU core count, Table 7 compares the gains of scheduling output blocks to the CPU only or to the CPU-GPU cooperation, computed by subtracting the speeds reported in Table 5.

Table 7: Speed Increase with the number of cores
Speed Increase    1 to 2 CPUs    2 to 3 CPUs    3 to 4 CPUs
No GPU                  48.49          48.49          48.49
Single GPU             992.23          49.52           8.17
There is a clear gain in scheduling the second output block to the CPU-GPU cooperation. The third output block may be scheduled to a CPU core alone or to the CPU-GPU cooperation, with marginal gains. There is negligible gain in scheduling the fourth output block to the GPU.

5. Multiple nodes over homogeneous load
So far, all experiments used a single GPU of a single cluster node. But a cluster node has two GPUs and we may use up to 32 identical cluster nodes. We investigate how the execution time varies with multiple nodes. We assigned the same load of previous experiments to each MPI process (control run) and varied the MPI process count. Each MPI process is executed by a cooperating pair of a CPU core and a GPU, without overloading the GPU, assigning at most two MPI processes to a node. Since each MPI process writes one output file per block, this experiment simulates real, homogeneous load. Table 8 contains total execution time as a function of MPI process count.

Table 8: Execution time (s) with multiple nodes
MPI Processes          1    2    4    8   16   32   64
Total Exec. Time (s) 707  709  715  714  712  719  722

Execution time variation is a negligible 2.1% (15 seconds out of 707 seconds), within the usual range of execution time fluctuation on a large cluster. Parallelism is close to perfect. Data also show that IO speed is high enough to feed 32 nodes.

6. Heterogeneous load
So far, all experiments considered homogeneous load, but a large survey has heterogeneous behavior. The output block loop load is heterogeneous for at least two reasons: (1) output block sizes are uneven, to accommodate an integer number of blocks in the output space, and (2) the number of input traces that contribute to an output block varies with output block position, since blocks close to the borders have fewer contributing input traces than those in the center. Heterogeneous load may change the results achieved so far. We proceed by characterizing the heterogeneous load and revisiting previous results under such a load.

6.1. Load characterization
Figure 3 shows the execution time per output block of a single offset migration with real-world data on pure CPU execution (no GPU is used). The (x,y) surface of the output space is partitioned into a grid of 26x6 output blocks. Blocks are numbered in (x,y) column-major ordering. Border effects are obvious.

Figure 3: Heterogeneous Load Characterization (total execution time per output block, in seconds)
6.2. Load balancing on CPUs
The code uses a naïve load balancing strategy. Block loads are estimated and sorted in decreasing order. A single master process dispatches a block upon slave process request, in decreasing order of load. This procedure is performed for each offset, and executions of consecutive offsets overlap in time. This algorithm is known ([14]) to be a 4/3-approximation to the optimal solution of the NP-hard load balancing problem, known as “minimum makespan scheduling”. In other words, this algorithm assigns tasks to processes in such a way that the total execution time is at most 1/3 longer than the one resulting from the optimal assignment.
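A minimal sketch of this dispatch policy is shown below (illustrative only; next_request() and send_block() are placeholders for the master's MPI receive and send):

#include <stdlib.h>

typedef struct { int id; double est_load; } Block;

extern int  next_request(void);                 /* placeholder: wait for a worker */
extern void send_block(int worker, int block);  /* placeholder: hand out one block */

static int by_load_desc(const void *a, const void *b)
{
    double d = ((const Block *)b)->est_load - ((const Block *)a)->est_load;
    return (d > 0) - (d < 0);
}

/* Master side, once per offset: sort blocks by estimated load and hand them
 * out, largest first, to whichever worker asks next (LPT list scheduling). */
void dispatch_offset(Block *blocks, int n_blocks)
{
    qsort(blocks, n_blocks, sizeof(Block), by_load_desc);
    for (int i = 0; i < n_blocks; ++i)
        send_block(next_request(), blocks[i].id);
}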
The load balancing effect is shown in Figure 4 for the single offset, 156 output block load of Figure 3 on a 52 CPU run (no GPUs). It contains the execution time per CPU (and MPI process), highlighting the execution time of each block. Observe that some processes receive just two blocks, while other processes receive up to four blocks. The first process to finish takes 32,061s (17%) less than the last process (182,785s).
Figure 4: Single offset load distribution on 52 CPUs (total execution time per MPI process, broken down by block)

In a production environment such as the one at Petrobras, it is infeasible to conduct long execution time research experiments. To overcome this limitation, we developed a simple execution time simulator that reproduces the block scheduling policy for any number of CPUs. The simulator takes as input the measured block loads of Figure 3 and outputs the execution time per CPU. The simulator was validated by comparing its reported execution speeds with those of similar production runs.
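A sketch of such a simulator is shown below, only to make the idea concrete (the real tool is not described beyond this level of detail in the paper): blocks are handed out in decreasing order of measured load to whichever simulated CPU becomes free first, and the makespan is the largest per-CPU finish time.

#include <stdlib.h>

/* Simulate the master's dispatch order on n_cpus: the next (largest remaining)
 * block always goes to the CPU that becomes idle first. */
double simulate(const double *load_desc, int n_blocks, int n_cpus)
{
    double *finish = (double *)calloc(n_cpus, sizeof(double));
    for (int i = 0; i < n_blocks; ++i) {
        int cpu = 0;                               /* earliest-available CPU */
        for (int c = 1; c < n_cpus; ++c)
            if (finish[c] < finish[cpu]) cpu = c;
        finish[cpu] += load_desc[i];               /* loads sorted in decreasing order */
    }
    double makespan = 0.0;
    for (int c = 0; c < n_cpus; ++c)
        if (finish[c] > makespan) makespan = finish[c];
    free(finish);
    return makespan;                               /* estimated total execution time */
}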
To verify how load unbalancing changes with an increasing number of offsets (allowing overlapped offset execution), the simulator was fed with 20 offsets by repeating 20 times the load of Figure 3. Execution time with 52 CPUs was estimated at 3,280,538s (37 days). Figure 5 shows how the execution time difference (from the first process that finishes an offset to the last process that finishes the same offset) varies during an overlapped offset execution. After an initial oscillation, the difference stabilizes at about 30,000 seconds, or about 0.9% of the execution time. To observe the effect of load unbalancing on speed-up, we simulated the load of 20 offsets on multiples of 32 CPU cores. Table 9 shows that load unbalancing reduces parallel efficiency by about 5% with 256 cores.

Figure 5: Load unbalancing as a function of offset (load unbalancing in seconds versus offset number)

Table 9: Load unbalancing impact on core count
CPU Cores    Execution Time (s)    Speed up    Efficiency
1               169,558,192            1.00      100.0%
32                5,302,111           31.98       99.9%
64                2,653,726           63.89       99.8%
96                1,794,580           94.48       98.4%
128               1,332,924          127.21       99.4%
160               1,088,870          155.72       97.3%
192                 901,156          188.16       98.0%
224                 781,847          216.87       96.8%
256                 690,433          245.58       95.9%

6.3. Using GPUs
To account for the gains of including GPUs, we executed the GPU-overloaded code version over the 20-offset load using 64 GPUs (32 cluster nodes), varying the number of CPU cores per GPU. Table 10 contains total execution times with 1, 2 and 3 CPU cores per GPU.

Table 10: Execution time with 64 GPUs
CPU cores                64      128      192
Execution Time (s)   44,632   33,432   33,363

Overloading GPUs with two cores per GPU reduced the execution time of the non-overloaded case (one core per GPU) by 25%. But overloading GPUs with a third core reduced execution time by a mere 0.2% (69 seconds) with respect to the two cores per GPU execution.

Table 11: Gain in introducing GPUs
                       Cores With GPU
Cores Without GPU        64     128     192
64                    59.46
128                   29.87   39.87
192                   20.19   26.96   27.01
256                   15.47   20.65   20.70
The impact of bringing GPUs into the computation is summarized in Table 11, where the simulated execution times of Table 9 (CPU cores only) are divided by the measured execution times of Table 10 (CPU cores overloading GPUs). Gains are impressive in many ways. CPU core use is optimized when two cores overload each GPU. The best execution time without GPUs occurs when using all cores. Comparing these two cases, the GPU introduction reduces the execution time by a factor of 20.
7. Conclusions and Future Work
We report the full porting of a seismic production application to a cluster of GPUs, with impressive performance gains. Using 64 GPUs and half of the available CPU cores, the execution is 20 times faster than a pure CPU execution with all available CPU cores, taking as input a real-world survey that generates an unbalanced load. CPU-GPU cooperation is crucial to achieve such a gain. Instead of continuously moving application hot spots from the CPU to the GPU, we privilege CPU-GPU cooperation, leaving to the former not only IO and loop control computations, but also meaningful computations. GPU overloading is just a consequence of this strategy. A natural continuation of this work is to attempt to use idle CPU cores, e.g. by anticipating the execution of future GPU tasks. Another interesting direction is the use of more elaborate load balancing schemes, possibly in conjunction with the use of idle CPU cores.
Acknowledgments
The authors thank Petrobras for the long-term research opportunity and for authorizing the public dissemination of research results. We also thank Carlos Lopo Varela for a thorough revision of an early version of this paper. The authors also gratefully acknowledge the continuous support of AMD, Intel, IBM and NVIDIA during this research. Finally, we thank the National Center for Supercomputing Applications (NCSA) of the University of Illinois at Urbana-Champaign for access to the NVIDIA GPU cluster of the Innovative Computing Laboratory, where early experiments on porting feasibility and achievable performance were conducted.
Bibliography
[1] M. Garland et al., “Parallel Computing Experiences with CUDA”, IEEE Micro, V 28, n 4, pp 13-27, July-Aug 2008.
[2] O. Yilmaz, Seismic Data Processing, Society of Exploration Geophysicists, Tulsa, OK, 1988.
[3] J. F. Claerbout, Imaging the Earth’s Interior, Blackwell Scientific Publications, USA, 1985.
[4] ProMAX seismic processing family, available at http://www.halliburton.com/ps/Default.aspx?navid=221&pageid=862&prodid=MSE%3a%3a1055450737429153.
[5] Omega Seismic Processing System, available at http://www.westerngeco.com/content/services/dp/omega/index.asp?
[6] S. Ryoo et al., “Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA”, Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, ACM Press, 2008.
[7] NVIDIA CUDA Programming Guide Version 2.0, available at http://www.nvidia.com/object/cuda_develop.html.
[8] V. Volkov and J. W. Demmel, “Benchmarking GPUs to Tune Dense Linear Algebra”, SC’08, ACM Press, 2008.
[9] J. Panetta et al., “Computational Characteristics of Production Seismic Migration and its Performance on Novel Processor Architectures”, Proceedings of the 19th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), IEEE Computer Society, 2007.
[10] J. Panetta et al., “Seismic Imaging on Novel Computer Architectures”, to appear in Proceedings of the 11th International Congress of the Brazilian Geophysical Society, SBGf 2009.
[11] B. Deschizeaux and J. Y. Blanc, “Imaging Earth’s Subsurface Using CUDA”, GPU Gems 3, available at http://developer.nvidia.com/object/gpu-gems-3.html.
[12] P. Micikevicius, “3D Finite Difference Computation on GPUs using CUDA”, GPGPU2 – Second Workshop on General-Purpose Computation on Graphics Processing Units, ACM, March 2009. [13] W. Jeong and R. Whitaker, “A Fast Eikonal Equation Solver for Parallel Systems”, Report SEP 138, Stanford Exploration Project, May 2009. [14] R. L. Graham, “Bounds on Multiprocessing Timing Anomalies”, SIAM Journal on Applied Mathematics, V 17, pp 416-429, 1969.