An implementation of the acoustic wave equation on FPGAs
Tamas Nemeth*, Joe Stefani, Wei Liu, Chevron; Rob Dimond, Oliver Pell, Maxeler; Ray Ergas, formerly Chevron

Summary
We utilize FPGA chips as co-processors in a PCI Express configuration to accelerate an acoustic isotropic modeling application. We have achieved speedups of two orders of magnitude compared to a single-core implementation on modern CPU chips. These results indicate that FPGA-based acceleration technology may become a viable alternative for some tasks in seismic data processing.
Introduction

High-performance computing (HPC) has been at a crossroads since 2004, when single-core CPU clock frequencies stopped increasing. Since then, several alternative technologies have become viable for HPC: (a) 'mainstream' multi-core CPU servers; (b) stream computing on graphics-card (GPU) co-processors; (c) Cell chip configurations; and (d) FPGA chips as co-processors. Hardware accelerators used as co-processors are emerging as a powerful solution to computationally intensive problems. A standard desktop PC or cluster node can be augmented with additional hardware dedicated to providing substantially increased performance for particular applications. Previous efforts have shown that FPGA-based hardware accelerators can offer order-of-magnitude greater performance than conventional CPUs, provided the target algorithm performs a large number of operations per data point. FPGAs are off-the-shelf chips with a configurable 'sea' of logic and memory that can be used to implement digital circuits. FPGAs can be attached to the compute system either through the main system bus or as PCI Express cards (or similar) and are typically configured as highly parallel stream processors. FPGA acceleration has been successfully demonstrated in a variety of application domains, including computational finance (Zhang et al., 2005), fluid dynamics (Sano et al., 2007), cryptography (Cheung et al., 2005) and seismic processing (Bean and Gray, 1997; He et al., 2005a; He et al., 2005b; Pell and Clapp, 2007). While these co-processor technologies still lie outside mainstream HPC, they offer potential speedups that cannot be ignored. The potential performance gains have large implications for seismic data processing, and it is therefore important to evaluate these technologies systematically. In this study we present acoustic isotropic modeling results using a Maxeler PCI Express x16 acceleration card based on FPGA chips. We first describe the considerations involved in producing an FPGA implementation and then compare the resulting traces with those from CPU implementations.

Code Analysis and Implementation

The acoustic forward modeling application under consideration is a 3D finite-difference scheme, 4th-order in time and 12th-order in space, using single-precision floating-point arithmetic. Input data consist of two 3D earth-model arrays (velocity and density) and a source function. The application iterates for a number of time steps with three wavefield arrays. The total memory requirement on the CPU is therefore 4 bytes per point for 5 arrays, or 20 × N³ bytes, where N is the spatial dimension. Since acceleration has the biggest impact for large projects, we assume that any solution should work for model sizes of N = 1000, i.e., a total memory of 20 GB or more.
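As a quick check of this budget, the following minimal C++ sketch (ours, not from the paper) evaluates 5 arrays × 4 bytes × N³ for a few model sizes:

```cpp
#include <cstdio>

int main() {
    // Five single-precision (4-byte) arrays: velocity, density, three wavefields.
    const unsigned long long num_arrays = 5, bytes_per_point = 4;

    for (unsigned long long n : {300ULL, 400ULL, 1000ULL}) {
        unsigned long long bytes = num_arrays * bytes_per_point * n * n * n;
        std::printf("N = %4llu : %.1f GB\n", n, bytes / 1e9);
    }
    return 0;  // N = 1000 gives 20 GB, matching the 20 x N^3 figure in the text
}
```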
The acoustic variable-density modeling code contains a kernel which consumes the majority of the compute cycles, indicating that the algorithm is a good candidate for acceleration. The finite-difference operators are calculated to minimize the relative phase-velocity error over the bandwidth. The implementation is 12th-order in space because these long, optimized operators require only about 2.4 nodes per minimum wavelength. There is a still-outstanding question regarding the optimum operator length on any particular parallel machine, where out-of-processor communications cost much more than in-processor ones. We started the effort by studying the runtime performance of the application on CPUs. For a 300³ mesh run on the AMD Opteron, the portion of runtime not contained within the kernel is only 0.04%. The kernel itself can be broken down into 5 sub-components, which can be considered separately for acceleration (Table 1).
Operation            Input Arrays   Output Arrays   % of Execution Time
2nd order operator   K, b, q        A               46.8
Vector addition      A, p           p                1.1
4th order operator   K, b, A, p     p               46.4
Update pressure      p, q           p                1.0
Boundary sponge      p, q           p, q             4.0
Remainder                                            0.7

Table 1: Proportion of kernel run-time devoted to different parts of the kernel for a 300³ mesh running on a 2.8 GHz AMD Opteron.
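This profile also shows why the whole kernel, and not just the two large convolution operators, must be moved to the accelerator. A short Amdahl's-law calculation using the Table 1 percentages (our illustration, not from the paper) makes the point:

```cpp
#include <cstdio>

// Bound on overall speedup if only the 2nd- and 4th-order operators
// (46.8% + 46.4% of kernel time, Table 1) were accelerated by a factor s,
// with 0.04% of total runtime lying outside the kernel.
int main() {
    const double outside_kernel = 0.0004;                 // 0.04% of runtime
    const double kernel = 1.0 - outside_kernel;
    const double accel_frac = kernel * (0.468 + 0.464);   // accelerated portion
    const double rest = 1.0 - accel_frac;                 // everything else

    for (double s : {10.0, 100.0, 1e9}) {
        double speedup = 1.0 / (rest + accel_frac / s);
        std::printf("operator speedup %10.0fx -> overall %.1fx\n", s, speedup);
    }
    return 0;
}
```

Even with infinitely fast convolution operators, the remaining stages cap the overall speedup at roughly 15x, far below the two orders of magnitude targeted here; hence all five sub-components are mapped onto the FPGA passes described below.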
Arrays p and q denote the current and previous pressure 3D arrays, K and b denote the earth-model arrays, and A denotes a temporary array produced after some steps. At the top level, the algorithm can be adapted to make the best use of three key resources: compute capability, memory capacity and memory bandwidth. We consider four different ways of implementing the same algorithm in software and assess their relative performance on the FPGA (Pell et al., 2008). For all implementation options we consider only a single data-access plan (all data residing on the FPGA card), since this is the solution likely to offer the maximal speedup. If on-card memory is limited, other data-access plans can be considered to trade memory capacity against speedup.

Option 1: Uni-axial
The existing software implementation is written as six 11-pt uni-axial convolutions and a set of smaller computations. On an FPGA this can be realized in six passes through the data set, combining the point computations (such as the pressure update and boundary sponge) into the main convolution passes (see the code sketch following Table 2). The memory requirements for this implementation are identical to those of the software implementation: 3 wavefield arrays and 2 earth-model arrays. The uni-axial convolutions require only minimal memory on the FPGA and thus can be implemented efficiently without domain decomposition, but they offer a relatively low computation-to-load/store ratio, which may cause the algorithm to become memory-bound sooner.

Option 2: 23-pt Tri-axial
The uni-axial code can alternatively be implemented as a 23-pt tri-axial operator. On a CPU this approach is unattractive because of the large size of the convolution operator and the number of arithmetic operations that must be performed more than once. However, this option can be implemented using only 2 passes through the data set, giving a much higher computation-to-load/store ratio. Tri-axial convolutions place high demands on the FPGA internal memory, requiring the problem to be decomposed into smaller blocks which are processed and then combined for each time step. The large size of the convolution operator makes the overhead of this approach significant, affecting both compute time and memory requirement.

Option 3: 12-pt Tri-axial
An alternative to a 2-pass 23-pt tri-axial implementation is to split each pass in two, giving a 4-pass algorithm with a 12-pt operator. This implementation requires 3 extra data arrays to store the intermediate convolution results, increasing memory consumption over Option 2, although the domain-decomposition overhead is reduced owing to the smaller operator.
Option 4: Composite Uni-axial
This is an FPGA-optimized option developed from the initial uni-axial implementation (Option 1). Multiple uni-axial convolutions are combined and re-ordered to create an implementation that requires 3 passes through the data set. The core of this approach is to take the original execution order, re-arrange the computation order, and then combine each set of two passes through the FPGA. This algorithm has a higher FPGA internal-memory requirement than the uni-axial implementation; however, it is still dramatically lower than that of the tri-axial implementations, and domain decomposition is unnecessary.

Table 2 summarizes the characteristics of the four implementation options (FLOPS estimated per data point).

Option   Operator          Passes   CPU FLOPS   FPGA FLOPS (est.)   Temp. arrays   FPGA memory req.
1        1D 11-pt          6        32          31                  1              Low
2        3D 23-pt          2        200+        93                  1              Very high
3        3D 12-pt          4        54          51                  4              High
4        Pseudo-2D 11-pt   3        -           59                  1              Med.

Table 2: Summary of the four different algorithm implementation options.
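To make the uni-axial option concrete, the following sketch shows one 11-point convolution pass along the x-axis. It is our own illustration (the coefficients c[] and the array layout are placeholders, not the paper's optimized operator), intended only to show the structure of a single pass:

```cpp
#include <cstddef>
#include <vector>

// One uni-axial 11-point convolution pass along x (illustrative only).
// 'in' and 'out' are N*N*N grids stored with x fastest; c[0..5] holds the
// symmetric operator coefficients (placeholder values, not the paper's).
void uniaxial_pass_x(const std::vector<float>& in, std::vector<float>& out,
                     int N, const float c[6]) {
    const int R = 5;  // operator half-width: 2*R + 1 = 11 points
    for (int z = 0; z < N; ++z)
        for (int y = 0; y < N; ++y)
            for (int x = R; x < N - R; ++x) {
                const std::size_t i = (std::size_t(z) * N + y) * N + x;
                float acc = c[0] * in[i];
                for (int k = 1; k <= R; ++k)   // ~11 multiply-adds...
                    acc += c[k] * (in[i + k] + in[i - k]);
                out[i] = acc;                  // ...per load/store pair
            }
}
```

Six such passes per time step keep on-chip memory needs low, but each pass streams whole arrays through memory; this is the load/store pressure that Options 2-4 trade against operator size and pass count.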
These implementation options were modeled for FPGAs using the Maxeler Parton modeling tool. The Parton speedup projections for acoustic modeling on a 400³ mesh are compared to a multi-core 2.66 GHz Intel Xeon; these chips are part of the popular 8-core HP DL140 1U servers that we use as a reference device. Here, modeling means that the original C/C++ software code was combined with Parton C++ library calls to estimate the performance of a particular partition. This is only the initial, exploratory stage of an FPGA implementation; later, the FPGA hardware models (circuits) are simulated in low-level software and a netlist file is generated. This file, in turn, is used by the Xilinx place-and-route software to generate the actual circuit configuration (bitstream) that is loaded onto the FPGA chips at runtime. The modeling was done for the capacities of the Xilinx LX110T, LX220T and LX330T chips, and also assuming unlimited chip area. A further design assumption was that the FPGA chips were placed on a PCI Express board (2 chips per board) with enough on-board memory to keep the main computations local to the board. Figure 1 shows the modeled speedups against the best single-core CPU implementation. Based on these initial modeling results, we selected the composite uni-axial option with LX330T chips for further development.
Figure 1: Acceleration of the 400³ mesh computation using different implementation options vs. a 2.66 GHz Intel Xeon (Parton projections for the LX110T, LX220T and LX330T chips and for unlimited chip area).
Figure 2 shows the revised and refined Parton speedup projections for the chosen option of the acoustic modeling with a 400³ mesh, compared to a multi-core 2.66 GHz Intel Xeon.
Figure 2: Acceleration of the computation with a 400³ mesh compared to a 2.66 GHz Intel Xeon, for (a) a single core and (b) a dual-processor, quad-core device (8 cores in total). "Minimum speedup" compares to running multiple instantiations of the sequential code; "maximum speedup" compares to an optimized OpenMP parallel software implementation.

The figure shows four different versions of the FPGA software. The initial version has four parallel compute pipelines per chip and lacks optimizations such as advanced compression schemes. Optimization 1 adds a new compression/decompression algorithm, which yields a modest immediate speedup and enables Optimizations 2 and 3. In Optimization 2, the FPGA compute capacity increases by 50% to six parallel pipelines, and in Optimization 3 the FPGA memory bandwidth increases by a further 33%. Optimization 3 provides a peak speedup of over 160x relative to a single core (Figure 2a), or a 28-48x speedup per node depending on multi-core scaling (Figure 2b). We have also observed that the FPGA speedup increases further for problem sizes beyond 400³.

Example

We have tested Maxeler FPGA boards with the forward modeling implementation described above on several models. The boards were installed in 8-core servers as co-processors. Figure 3 illustrates the tests on a two-layer 3D model with a 672 x 672 x 256 grid.
Figure 3: Experiment setup for acoustic modeling. (a) Two-layer model (Location vs. Depth, in m): upper layer Vp = 1.5 km/s, density 1; lower layer Vp = 4.0 km/s, density 4. (b) Modeling parameters: dx = 10 m, fpk = 32 Hz, nt = 643.
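The following minimal sketch shows how such a two-layer test model might be populated. It is our own illustration: the grid dimensions, layer velocities and densities come from the text and Figure 3, while the layer-interface depth and the array layout are assumptions, not values given in the paper.

```cpp
#include <cstddef>
#include <vector>

// Two-layer test model from the Example section (illustrative sketch only).
struct Model {
    int nx = 672, ny = 672, nz = 256;   // grid points (672 x 672 x 256)
    float dx = 10.0f;                   // grid spacing, m
    std::vector<float> vp, rho;         // velocity (m/s) and density arrays
};

Model build_two_layer_model(float interface_z /* m, placeholder depth */) {
    Model m;
    m.vp.resize(std::size_t(m.nx) * m.ny * m.nz);
    m.rho.resize(m.vp.size());
    for (int iz = 0; iz < m.nz; ++iz) {
        bool lower = iz * m.dx >= interface_z;
        float vel = lower ? 4000.0f : 1500.0f;   // 4.0 km/s below, 1.5 km/s above
        float den = lower ? 4.0f    : 1.0f;      // density contrast 1:4
        for (int iy = 0; iy < m.ny; ++iy)
            for (int ix = 0; ix < m.nx; ++ix) {
                std::size_t i = (std::size_t(iz) * m.ny + iy) * m.nx + ix;
                m.vp[i] = vel;
                m.rho[i] = den;
            }
    }
    return m;
}
// Source at (x, z) = (2000 m, 500 m); receiver line at 550 m depth; peak
// frequency 32 Hz; 643 time steps (parameters as given in the text).
```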
The velocities of the two layers were 1500 and 4000 m/s (Figure 3a) and the density contrast was 1:4. The source was placed at location (2000 m, 500 m) and the receiver line was placed at a depth of 550 m. The modeling parameters were a grid spacing of 10 m, a peak frequency of 32 Hz and 643 time steps.

Figure 4 shows the results of the testing. Seismograms were calculated both on the CPU, using the reference modeling code (Figure 4a), and on the FPGA boards, using the FPGA implementation (Figure 4b). The implementations are not identical in two ways: (a) the CPU implementation is compiled code in which some of the internal precision is decided implicitly at compilation time, based on the compiler and IEEE specifications, whereas the FPGA implementation specifies the precision explicitly at each step; and (b) the FPGA implementation uses compression, as detailed under Optimization 1 above. Figure 4c shows the difference between the two implementations, amplified by a factor of 10. The differences are mostly due to compression, especially of the initial source wavelet.
Figure 4: Comparison of the modeled seismograms (Location vs. Time) between (a) the CPU and (b) the FPGA implementations, and (c) their difference at 10x scale.
Figure 5 displays a single trace from the apex marked 'trace plot' in Figure 4c, together with two zoomed-in time windows. As seen from these figures, the error in the Zoom-in 1 window is around 0.1% (the maximum seismogram value is around 0.2 and the error is around 0.0002), while the error in the Zoom-in 2 window is around 1.5% with a negative drift (the maximum seismogram value is around 0.008, the error around 0.00012, and the drift about -0.00012). The character of the error is mostly consistent with truncation error caused by compression. This error is within the accepted normal range for typical seismic modeling applications and is controllable by adjusting the applied compression scheme and bit precision.
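The error figures quoted above amount to the ratio of the maximum difference to the maximum trace amplitude within each window. A minimal sketch of that comparison (our own, with hypothetical trace containers) is:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Relative error of an FPGA trace vs. the CPU reference over a time-sample
// window [i0, i1): max |difference| divided by max |reference| (illustrative).
double window_relative_error(const std::vector<float>& cpu,
                             const std::vector<float>& fpga,
                             std::size_t i0, std::size_t i1) {
    double max_ref = 0.0, max_diff = 0.0;
    for (std::size_t i = i0; i < i1 && i < cpu.size() && i < fpga.size(); ++i) {
        max_ref  = std::max(max_ref,  std::fabs(double(cpu[i])));
        max_diff = std::max(max_diff, std::fabs(double(fpga[i]) - cpu[i]));
    }
    return max_ref > 0.0 ? max_diff / max_ref : 0.0;
    // e.g. 0.0002 / 0.2 = 0.1% (Zoom-in 1); 0.00012 / 0.008 = 1.5% (Zoom-in 2)
}
```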
Figure 5: Comparison of a single modeled trace between the CPU and FPGA implementations and their difference (FPGA-CPU): (a, b) the full trace (0.2-1.0 s); (c) Zoom-in 1 (about 0.04-0.2 s); (d) Zoom-in 2 (about 0.64-0.96 s).
Conclusions

We have demonstrated an FPGA co-processor technology, applied to acoustic isotropic finite-difference modeling, that is capable of achieving speedups of around two orders of magnitude compared to a single-core implementation of the finite-difference application on modern CPU chips. The implications of these results are twofold: (a) non-standard HPC technologies, such as FPGA chips, can play a useful role in seismic data processing; and (b) the traditional notion of what is computationally cheap or expensive can be severely altered by emerging hardware and software technologies.

Acknowledgements

We thank Chevron Energy Technology Company for permission to publish this study.
References

Bean, M., and P. Gray, 1997, Development of a high-speed seismic data processing platform using reconfigurable hardware: 67th Annual International Meeting, SEG, Expanded Abstracts, 1441–1443.
He, C., G. Qin, and W. Zhao, 2005, High-order finite difference modeling on reconfigurable computing platform: 75th Annual International Meeting, SEG, Expanded Abstracts, 1755–1758.
He, C., C. Sun, M. Lu, and W. Zhao, 2005, Prestack Kirchhoff time migration on high performance reconfigurable computing platform: 75th Annual International Meeting, SEG, Expanded Abstracts, 1902–1905.
Pell, O., and R. Clapp, 2007, Accelerating subsurface offset gathers for 3D seismic applications using FPGAs: 77th Annual International Meeting, SEG, Expanded Abstracts, 2383–2387.
Pell, O., T. Nemeth, J. Stefani, and R. Ergas, 2008, Design space analysis for the acoustic wave equation implementation on FPGA circuits: 70th Annual International Conference and Exhibition, EAGE, Extended Abstracts, P057.