A Parallel Solution to Geophysical Waveform Analysis using Standard Template Adaptive Parallel Library (STAPL)

Paul Thomas, Derek E. Kurth, Nancy M. Amato†
Department of Computer Science
Texas A&M University
College Station, TX 77843-3112
{pthomas, dkurth, amato}@cs.tamu.edu

This research was supported in part by NSF CAREER Award CCR-9624315, NSF Grants IIS-9619850, ACI-9872126, EIA-9975018, EIA-0103742, EIA-9805823, ACI-0081510, ACI-0113971, CCR-0113974, EIA-9810937, EIA-0079874, and by the DOE ASCI ASAP program grant B347886.

† Corresponding Author
Abstract
The Standard Template Adaptive Parallel Library (STAPL) [6, 5] is the parallel equivalent of the ANSI C++ Standard Template Library (STL). Like STL, STAPL relieves the programmer from implementing common algorithms and data structures. As a consequence, STAPL also hides the complexities of parallel programming from the programmer. STAPL provides options for automatically parallelizing STL code and also allows the programmer to directly write parallel programs in STAPL for better performance. In this paper we study the speedups obtained by STAPL in a pre-existing STL program. The specific problem we study involves seismic ray tracing, a technique used by geoscientists to study the interior of the earth. This application is essentially embarrassingly parallel. Initially, automatic parallelization with STAPL achieved poor speedups. This was attributed to the limited use of STL in the original C++ code. After minimal re-coding to increase the use of STL, good speedups were obtained using both STAPL's automatic parallelization and manual parallelization.

1 Introduction

The scientific community has always made use of computational power to find solutions for a number of problems. Since researchers in fields like Physics, Chemistry and Biology are not necessarily trained computer scientists, tools that assist them in writing optimized code are very valuable. The Standard Template Library (STL) was developed to reduce programming overhead by providing efficient implementations of data structures and algorithms. This paper discusses a similar tool for parallel processing, STAPL, the Standard Template Adaptive Parallel Library [6, 5], which is being developed in the Computer Science Department at Texas A&M University. STAPL contains parallel equivalents of the sequential containers, algorithms and iterators present in STL. STAPL can replace the STL code automatically by invoking an automatic preprocessing translation phase. STAPL also provides functionality to further optimize the code manually and achieve additional performance gains.

In this paper we study the speedup obtained by the manual and automatic parallelization of a seismic ray tracing code using STAPL. Seismic ray tracing [1, 2] is a key method for earth modeling and data analysis in the fields of Geophysics and Geosciences. Since the models dealt with can become very large, performing ray tracing computations in parallel can improve total execution time. Figure 1 gives an overview of ray tracing. To generate ray information for a particular earth model, a source of seismic waves is first placed at a known location on the surface of the earth. The waves emitted from the source travel through the earth, bending as they travel due to the varying types of media they pass through, and eventually bounce back to the earth's surface.
Figure 1: Ray Tracing: A seismic wave source is located on the earth's surface. Seismic rays travel through the interior of the earth, bending as they travel through various media, and eventually bounce back to the surface where they are detected by receivers (or geophones).
Figure 2: Seismic ray paths (thin lines) and wavefronts (heavy gray lines) in a medium with a linear increase in seismic wave velocity in the vertical direction. Numbers indicate the time corresponding to each wavefront in seconds. The rays propagate from the source, and the velocity gradient causes them to gradually bend and return to the surface, where the waves would be detected by receivers (geophones). Ray tracing is the numerical solution for the ray paths, the travel time of the wave along the path, and the wave amplitude to simulate wave propagation from the source to a specific receiver.
Receivers (called geophones) are placed on the surface to detect the direction and amplitude of the waves as they reach the surface. At this point, there are two interesting questions. The inverse problem [7, 18, 8, 10, 17, 14] is concerned with how to send waves into an unknown earth model and use the information the receivers detect to predict or reconstruct the geological structure of the earth. This is obviously a very important problem for oil exploration, earthquake fault analysis, etc. [19, 20, 22, 9]. Unfortunately, it is also a very difficult problem which currently cannot be solved. The forward problem [2, 13], investigated in this research, is to calculate the paths, travel times, etc., of waves propagating through a known earth model from a known source. An efficient solution to the forward problem may be useful for solving the inverse problem.

The code that we study in this paper begins with a known model and ray source and, given this information, computes the ray paths. We also compute the travel time for a ray to reach a certain point and the amplitude of the waves at a particular point. An illustration of this can be seen in Figure 2. This data could be used for geophysical simulations and visualizations [13, 1]. For a non-trivial earth model, there may be hundreds or thousands of rays, and, depending on the parameters (which we will discuss further in Section 2.2), we will be solving differential equations for each of those rays frequently. The time taken by the program can increase rapidly as parameters are changed. In most real-world applications such as oil exploration and seismic plate analysis, realistic input sets are very large, and the computation may be run multiple times. In such situations, parallel computation is a natural choice to obtain results in a reasonable time.

On analyzing the problem, we found that the application is "embarrassingly parallel": the ray calculations can be performed independently, as the rays do not interact with each other in any fashion. Because the code made use of STL to begin with, STAPL was an obvious choice for parallelization.
1.1 Our Results

In this paper we report on the parallelization of a sequential C++ code for seismic ray tracing that was developed by Dr. Rick Gibson of the Department of Geosciences at Texas A&M University. As a first step we tried automatic parallelization of the code using STAPL. Automatic parallelization failed to achieve significant speedup, mainly because the code did not make full use of STL containers, algorithms and iterators. We profiled the code to identify the regions that took most of the computation time and redesigned those sections of the code to make better use of STL constructs. Automatic parallelization was performed again on the redesigned code and we were able to achieve good speedups. We also parallelized the code manually. We observed that in this particular problem manual and automatic parallelization were identical and gave similar results. If the initial code had made full use of STL constructs, this speedup could have been obtained without the redesigning phase. In short, using STAPL, we were able to achieve parallelization without the overhead required by conventional parallel programming.
2 The Application

2.1 Organization of the code

A brief overview of the seismic ray tracing code is given in Figure 3. Starting from a source on the surface of the earth, a set of rays propagates through the earth model. At any point in time, the current location of all of the waves propagating through the earth model constitutes the wavefront [21, 15, 16]. This can be thought of as a mesh of points in three dimensions identifying the wave locations at that time. An important requirement is that the rays present in the system should be sufficient to accurately describe the wavefront. For example, we may begin with a small number (e.g., 10) of very close rays. As they travel through the earth, their paths may diverge so that it is difficult to say with accuracy what rays between them would look like. Before the situation reaches this point, we interpolate, adding rays to the system to describe it more accurately. To test whether interpolation is necessary, we compute the paraxial correction to travel time [13] for a point midway between two adjacent rays in the wavefront. If this value exceeds a minimum threshold, we add a ray at the midpoint of the two rays using interpolation. The test is performed for all sets of adjacent rays (it is actually performed on "patches" of four rays at a time, as shown in Figure 4) in the wavefront before proceeding to the next time-step to calculate the next wavefront.
    01: Get Information about the Earth Model
    02: Get Input Parameters
    03: RAYFIELD CONSTRUCTOR
    04:     Set up the initial WaveFront
    05:     Set up the initial Rays
    06:     Trace the initial Rays
    07: while ( Inside the Earth Model )
    08:     Step Front by one time step
    09:     INTERPOLATE_FRONT()
    10:         for (i = 0; i < Number_Patches; ++i)
    11:             if ( PatchTest(i) ) then
    12:                 Interpolate(i)
    13:         end for
    14: end while
    15: Print Results and Timing Statistics
    16: End Of Program

Figure 3: Pseudo-code for the Ray Tracing Algorithm
2.2 Profiling the code

The first step in parallelizing any code is to identify the portions which take the most computation time. The profiling software 'cxperf', provided by HP on their HP-UX platform, reported how much CPU time each function took, including and/or excluding its child functions. It was also used to study the behavior of the program as the parameters change. The main operations the code performs are to compute where the rays will be at the next time step given their positions at the previous time step (that is, to compute the next wavefront), and to interpolate when necessary. This corresponds to lines 07-14 of the pseudo-code. There are two important parameters involved in this process: frontDTau and dTestTau. The value of frontDTau defines the timestep between two wavefronts, i.e., how often a wavefront should be calculated. Calculating the positions of all the rays in a wavefront for the next timestep can be a computationally intensive process, so it is important to choose a value of frontDTau that is small enough that the wavefronts describe the model well, but large enough that these calculations are not performed unnecessarily. We also note that the Runge-Kutta method used to calculate the positions of rays in the next wavefront is more accurate as frontDTau is decreased.
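To give a sense of the work done per ray in each time step, the sketch below shows one classical fourth-order Runge-Kutta step applied to a generic ray state (for example, position and slowness components). The 6-component state layout and the derivative function, which would encode the ray equations for the chosen earth model, are placeholders for illustration only and are not the actual interfaces of the ray tracing code.

    // Illustrative sketch: one classical 4th-order Runge-Kutta step for a
    // single ray.  The state layout and derivative function are placeholders.
    #include <array>
    #include <cstddef>
    #include <functional>

    using RayState = std::array<double, 6>;
    using RayDerivative = std::function<RayState(double, const RayState&)>;

    RayState rk4Step(const RayState& y, double t, double frontDTau,
                     const RayDerivative& f) {
        // Helper: r = a + s * b, component-wise.
        auto add = [](const RayState& a, double s, const RayState& b) {
            RayState r;
            for (std::size_t i = 0; i < r.size(); ++i) r[i] = a[i] + s * b[i];
            return r;
        };
        const double h = frontDTau;              // time step between wavefronts
        RayState k1 = f(t, y);
        RayState k2 = f(t + 0.5 * h, add(y, 0.5 * h, k1));
        RayState k3 = f(t + 0.5 * h, add(y, 0.5 * h, k2));
        RayState k4 = f(t + h,       add(y, h, k3));
        RayState next;
        for (std::size_t i = 0; i < next.size(); ++i)
            next[i] = y[i] + (h / 6.0) * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]);
        return next;                             // ray state on the next wavefront
    }

Each such step requires several evaluations of the ray equations per ray, so decreasing frontDTau increases the total number of evaluations proportionally, which is the cost trade-off illustrated in Figure 5(a).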
Figure 4: (A) Schematic illustration of a portion of a wavefront mesh, with one interpolated cell (gray lines). (B) Logical topology of the mesh in the algorithm, where each ray and wavefront surface element is uniquely related to a pair of take-off angles. The gray lines indicate the new boundaries of the interpolated patch on the wavefront surface.
Figure 5(a) illustrates the increase in computation time as frontDTau is made smaller (that is, the computation occurs more frequently). The value of dTestTau defines the maximum error in travel-time prediction allowed before interpolation should occur. When the program tests whether to interpolate, it looks at two adjacent points in the wavefront and computes the paraxial time [13] difference for a point between those two. If the error in this computation is greater than dTestTau, interpolation occurs as shown in Figure 4.

Figure 5: (a) Increasing the value of frontDTau causes the wavefront to be computed less frequently, so the time taken by the CPU decreases. Increasing this value too much, however, will lead to an inaccurate model of the earth structure due to lack of information about the wavefront. (b) Increasing the value of dTestTau increases the amount of error allowed before interpolation will occur, so this also decreases CPU time. Clearly, increasing this value too much will introduce undesirable error.

As is clear from the previous discussion, the most important and CPU-intensive calculations performed by this code involve:

1. Setting up the initial rays and the ray field (lines 03-06 of the pseudo-code).

2. Testing for interpolation and performing interpolation if the test is positive (lines 08-14 of the pseudo-code).

After profiling the code with cxperf, we noticed that these are the two main time sinks in the computation, with the first step taking approximately 40% of the time and the second step taking approximately 55% of the time. These two steps were implemented in two functions of the RayField class, the class which contains the wavefronts at the various times of the simulation. Setting up the new rays is done in the RayField constructor. Testing of the wavefronts and interpolation are done in the InterpolateFront() function of the RayField class.
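The decision made for each patch reduces to comparing the predicted travel-time error against dTestTau. The sketch below shows the shape of that test; WaveFrontPatch, paraxialTimeError and the commented-out interpolation call are illustrative placeholders rather than the actual members of the RayField class, whose InterpolateFront() routine contains the real logic.

    // Illustrative sketch of the per-patch interpolation test (placeholder
    // names; the real logic lives in RayField::InterpolateFront()).
    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct WaveFrontPatch {
        double paraxialTimeError;  // paraxial travel-time error at the patch midpoint [13]
        // ... ray data for the four corner rays of the patch
    };

    void interpolateFront(std::vector<WaveFrontPatch>& patches, double dTestTau) {
        // Lines 10-13 of the pseudo-code: test every patch of the current
        // wavefront and refine the ones whose error exceeds the tolerance.
        for (std::size_t i = 0; i < patches.size(); ++i) {
            if (std::fabs(patches[i].paraxialTimeError) > dTestTau) {
                // addInterpolatedRay(patches[i]);  // new ray at the patch midpoint
            }
        }
    }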
3 Parallelization with STAPL
Code written using STL can very easily be translated to its parallel analogue. STL is composed of three basic units: containers, algorithms and iterators. STL containers are converted to pContainers in STAPL, algorithms are converted to pAlgorithms, and iterators are converted to a new STAPL construct called pRanges. STAPL provides two forms of parallelization, automatic and manual. Automatic translation is a pre-processing phase in which the STL constructs are directly replaced by their parallel equivalents. It has been observed that in many cases the
automatic translation gives speedups very close to those obtained by manual translation. Manual parallelization takes full advantage of the power of STAPL and reduces the run-time overhead involved with automatic parallelization. STAPL exhibits the best performance gain when used directly through manual modification.
3.1 Automatic Parallelization

Automatic parallelization is a preprocessing phase which converts the STL constructs to their parallel equivalents. Only small modifications need to be made to the sequential STL code to parallelize it. To parallelize a single function or a portion of a function, two lines (pre-processing tags) need to be added to the program: an #include directive supplied by STAPL immediately before the portion of code that should be parallelized, and a matching #include directive immediately after it. These include statements cause any STL function between the two statements to be replaced by the equivalent STAPL function. An illustration of STL to STAPL conversion is shown in Figure 6.

As a first step towards parallelization we added these pre-processing tags to the ray tracing code and observed the behavior of the code. We ran the auto-parallelized code on multiple machines and were surprised to find little improvement: the auto-parallelized code gave less than 5% speedup when we ran it on eight processors. To investigate this poor speedup we went through the parts of the code that took the largest portion of the computational time. One of the most computationally intensive parts of the program was the InterpolateFront() function. This function dealt with testing and interpolating the patches and took 55% of the total execution time of the program. The parts of the code that deal with this are discussed below; they correspond to Figure 7.

As previously mentioned, STAPL replaces the standard STL constructs with their parallel equivalents. To make full use of the capabilities of STAPL and to assist in the automatic parallelization, the sequential code should make maximum use of STL constructs. The loop shown in the first column of Figure 7 does not make full use of STL functions. This section of the code goes through all the patches in the wavefront and checks whether or not each should be interpolated. It could easily be modified to make use of the STL construct for_each, which iterates through each element in the vector. The modified code is given in the second column of Figure 7. The first and second arguments specify the iterators which determine the boundaries between which the test should be done; the third argument is a pointer to the function that should be executed on each element in the vector. A number of loops (especially those that took a large amount of computational time) were converted using the for_each construct in STL. After redesigning the code to make use of more STL functions, we tested auto-parallelization by STAPL again. This time, the parallelized code gave good speedups. The results of this parallelization are described in detail in Section 4.

3.2 Manual Parallelization

Manual parallelization deals with making use of the STAPL objects to directly write parallel programs. As a first step, the STL header files like vector.h and algorithm.h should be replaced by the STAPL header files p_vector.h and p_algorithm.h. Instead of making use of the STL container types like vector, list and tree, their parallel equivalents such as p_vector, p_list and p_tree should be used. Corresponding to the iterators in STL, pRanges are used to access the member data in a container. By explicitly allowing the user to make use of the STAPL code, the run-time work done by STAPL is reduced. Generally the best performance is expected with manual parallelization. Nevertheless, previous experimental results have shown that in some cases automatic translation was able to achieve performance very close to that attained by hand-parallelized code (sometimes within 5%).

For our seismic ray tracing application, automatic translation from STL to STAPL and direct coding in STAPL turned out to be virtually identical. This is mainly because the computational effort is concentrated in one portion of the program.
    Before Translation (STL)                  After Translation (STAPL)

    #include <...>
    accumulate(x.begin(), x.end(), 0);   -->  pi_accumulate(x.begin(), x.end(), 0);
    for_each(x.begin(), x.end(), foo);   -->  pi_for_each(x.begin(), x.end(), foo);
    #include <...>

Figure 6: Automatic parallelization of the code: before and after STAPL translation
    Original Code                             Redesigned STL Code

    Vector<WaveFrontElement> iFront;          Vector<WaveFrontElement> iFront;
    for (i = 0; i < iFront.size(); ++i)  -->  for_each(iFront.begin(), iFront.end(),
        PatchTest(iFront[i]);                          PatchTest);

Figure 7: Example of redesigning the code to make use of STL
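Note that for_each requires its third argument to be callable on a single element of the container. A minimal, self-contained sketch of that shape is given below; the definitions of WaveFrontElement and PatchTest here are placeholders rather than the actual application code.

    // Minimal sketch of the for_each pattern used in Figure 7 (placeholder
    // definitions; the real WaveFrontElement and PatchTest are more involved).
    #include <algorithm>
    #include <vector>

    struct WaveFrontElement {
        // ... ray data for one element of the wavefront mesh
    };

    // for_each calls PatchTest once for every element in the range.
    void PatchTest(WaveFrontElement& element) {
        // ... test the patch associated with this element and interpolate
        //     if the paraxial travel-time error exceeds dTestTau
        (void)element;
    }

    void testAllPatches(std::vector<WaveFrontElement>& iFront) {
        std::for_each(iFront.begin(), iFront.end(), PatchTest);
    }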
After the pre-processing phase, the automatic parallelization converted the for_each loops (where most of the computation was conducted) to their STAPL equivalent, pi_for_each. In this problem the manual parallelization meant removing the pre-processing tags and replacing the for_each function with the STAPL function pi_for_each directly. As a result, the behavior of the compiled program was the same for automatic and manual parallelization.
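In code, the manual version of the hot loop therefore looks essentially like the redesigned STL version with the pAlgorithm substituted directly. The sketch below is only indicative: p_vector, p_algorithm.h and pi_for_each are the names used in this paper, but the exact headers and call signatures of the STAPL release are assumptions here and may differ.

    // Indicative sketch of the manually parallelized patch test; the STAPL
    // header names and the pi_for_each signature are assumed from the
    // description above and may differ from the actual release.
    #include <p_algorithm.h>    // replaces <algorithm>
    #include <p_vector.h>       // replaces <vector>

    void testAllPatches(p_vector<WaveFrontElement>& iFront) {
        // Sequential STL version (Figure 7):
        //     for_each(iFront.begin(), iFront.end(), PatchTest);
        // STAPL version: the independent patch tests are distributed over
        // the available threads.
        pi_for_each(iFront.begin(), iFront.end(), PatchTest);
    }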
Figure 8: Speedup of RayField code with STAPL for an input size of 1600 Rays
4 Results and Discussion

To study the speedup obtained by STAPL, we varied the number of processors (threads were allocated one per processor) and recorded the running time of the program. The program was also run for different values of the input parameters. The experiments were done in dedicated mode on an HP V2200 machine with 16 PA-8200 CPUs and 4 GB of physical memory in the PARASOL Laboratory in the Department of Computer Science at Texas A&M University. The results of the experiments are summarized in Table 1. The speedup obtained for some input sizes is shown in Tables 1 and 2 and in Figures 8-11. All times reported are the minimum over five executions of the experiment.

We note that STAPL was able to attain scalable speedups. As seen in Table 2 and Figures 8-10, when 8 processors were used we were able to attain a speedup of approximately 5.50. Performance gains remained consistent for various input parameters. STAPL obtained better performance improvements for larger input sizes. The performance improvement with eight processors for various input sizes (number of rays) is given in Figure 11. The speedup is highest when 7225 rays are used. As the number of rays increases, the computation required at each time step increases, and the higher the computation done in each time step, the lower the fractional overhead involved in parallelization.
    CPU Threads   1600 Rays   2500 Rays   3600 Rays   4900 Rays   6400 Rays   7225 Rays
         1          282.3       497.8       555.3       780.4      1089.5      1289.5
         2          165.4       287.8       315.4       440.7       365.6       663.7
         3          145.6       265.3       279.5       390.2       512.1       604.7
         4          102.3       190.7       196.7       277.2       384.6       421.2
         6           67.8       114.3       127.6       171.1       230.1       265.8
         8           56.0        98.2       106.5       145.4       194.6       227.6

Table 1: The time (in seconds) taken by the program for different number of processors used (rows), for various input sizes (columns)
    CPU Threads   1600 Rays   2500 Rays   3600 Rays   4900 Rays   6400 Rays   7225 Rays
         1           1.00        1.00        1.00        1.00        1.00        1.00
         2           1.71        1.73        1.76        1.77        1.71        1.94
         3           1.94        1.87        1.98        2.00        2.13        2.13
         4           2.76        2.61        2.82        2.81        2.83        3.06
         6           4.16        4.35        4.35        4.56        4.73        4.85
         8           5.04        5.07        5.21        5.37        5.60        5.66

Table 2: Speedup obtained by STAPL for different number of processors used (rows), for various input sizes (columns)
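For reference, each entry of Table 2 is the ratio of the single-thread time to the corresponding multi-thread time in Table 1. For example, for the 1600-ray input on eight threads:

    speedup(8) = T_1 / T_8 = 282.3 s / 56.0 s ≈ 5.04,

which is the value reported in Table 2.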
Figure 9: Speedup of RayField code with STAPL for an input size of 4900 Rays

Figure 10: Speedup of RayField code with STAPL for an input size of 7225 Rays
Presently we are working on porting the ray tracing code to the SGI platform. Once this phase is complete we will run the parallel version of the code on the SGI supercomputers (Origin 2000 and Origin 3800) available at Texas A&M University. Previous experimental results have shown that speedups obtained on the HP V-Class are comparable to those obtained on the SGI supercomputers. Thus, we expect the RayField code to show similar speedups on the SGI machines.

Parallel computing is often viewed as a field with cumbersome coding, and it usually takes a long time for a novice to become accustomed to it. Issues like communication and synchronization can become nightmares even for an expert programmer. STAPL brings parallel computing within the reach of any programmer who has a basic knowledge of STL. It should also be mentioned here that STAPL is a product under development; the speedup obtained is likely to improve as more portions of STAPL get optimized. Presently, STAPL has been ported to the HP V-Class, SGI Origin 3800, SGI Origin 2000 and SGI Power Challenge systems. Future releases of STAPL will be compatible with a wide range of platforms. The performance gains obtained here could easily be replicated on other machines with no additional programming effort besides porting the (sequential) application code itself. Another advantage of using STAPL is that the user is isolated from the underlying communication mechanism: these details are all encoded in a generic STAPL communication interface. As standards and technologies change, only the STAPL libraries need to be modified. To summarize, STAPL is a good tool both for a novice programmer who does not want to get involved with the complexity of parallel computing and for a veteran programmer who does not want to re-program his code every time an advance is made in parallel computing or when a new machine becomes available.
Figure 11: Comparison of speedup obtained for various input sizes using 8 CPU threads.
5 Future Work

As STAPL becomes compatible with more platforms we plan to study the behavior of the code in those environments as well. Presently STAPL is compatible with HP Unix and SGI platforms. Once the porting of the sequential RayField code to SGI machines is complete we will study the speedup obtained there with STAPL; these results will be reported in the final paper. We are also exploring making the program adaptive. An adaptive program should be able to detect the resources available on the machine on which it is running and choose parameters depending on these resources [3, 4, 12, 11]. There are also aspects of this application which lend themselves to adaptive optimization. For example, presently we are dealing with homogeneous-anisotropic earth models. We intend to study the behavior of the code with more complex earth models, which will both increase the computational cost of each ray tracing and will require load balancing.

6 Acknowledgement

We would like to thank the STAPL, PARASOL and DSMFT groups in the Computer Science Department at Texas A&M. We also thank the research group of Dr. Gibson of the Geosciences Department at Texas A&M.

References

[1] K. Aki and P. Richards. Quantitative Seismology, Vol. 1. W.H. Freeman, San Francisco, 1980.

[2] K. Aki and P. Richards. Quantitative Seismology, Vol. 2. W.H. Freeman, San Francisco, 1980.

[3] N. M. Amato, J. Perdue, A. Pietracaprina, G. Pucci, and M. Mathis. Predicting performance on SMPs. A case study: The SGI Power Challenge. In Proc. International Parallel and Distributed Processing Symposium (IPDPS), pages 729-737, 2000.

[4] N. M. Amato, A. Pietracaprina, G. Pucci, L. K. Dale, and J. Perdue. A cost model for communication on a symmetric multiprocessor. Technical Report 98-004, Dept. of Computer Science, Texas A&M University, 1998. A preliminary version of this work was presented at the SPAA'98 Revue.

[5] P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Amato, and L. Rauchwerger. STAPL: An adaptive, generic parallel programming library for C++. In Proc. of the International Workshop on Advanced Compiler Technology for High Performance and Embedded Processors (IWACT), Bucharest, Romania, July 2000.

[6] P. An, A. Jula, S. Rus, S. Saunders, T. Smith, G. Tanase, N. Amato, and L. Rauchwerger. STAPL: An adaptive, generic parallel programming library for C++. In Proc. of the 14th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Cumberland Falls, Kentucky, Aug 2001.

[7] W. Beydoun and M. Mendes. Elastic ray-Born l2-migration/inversion. Geophys. J., 97:151-160, 1989.

[8] G. Beylkin and R. Burridge. Linearized inverse scattering problems in acoustics and elasticity. Wave Motion, 12:15-52, 1990.

[9] S. Crampin, R. McGonigle, and M. Ando. Extensive-dilatancy anisotropy beneath Mt. Hood, Oregon and the effect of aspect ratio on seismic velocities through aligned cracks. J. Geophys. Res., 91:12703-12710, 1986.

[10] E. Forgues, G. Lambaré, P. de Beukelaar, F. Coppens, and V. Richard. An application of Ray + Born inversion on real data. Pages 1004-1007. Soc. Expl. Geophys., 1994.

[11] M. Frigo. A fast Fourier transform compiler. ACM, 1999.

[12] M. Frigo and S.G. Johnson. FFTW: An adaptive software architecture for the FFT. Pages 1381-1384. IEEE, 1998.

[13] R. L. Gibson, Jr., A. G. Sena, and M. N. Toksöz. Paraxial ray tracing in 3-D inhomogeneous, anisotropic media. Geophys. Prosp., 39:473-504, 1991.

[14] H.H. Jaramillo and N. Bleistein. The link of Kirchhoff migration and demigration to Kirchhoff and Born modelling. Geophysics, 64:1793-1805, 1999.

[15] G. Lambaré, P.S. Lucio, and A. Hanyga. Two-dimensional multivalued traveltime and amplitude maps by uniform sampling of a ray field. Geophys. J. Int., 125:584-598, 1996.

[16] A. Leidenfrost, N. Ettrich, D. Gajewski, and D. Kosloff. Comparison of six different methods for calculating traveltimes. Geophys. Prosp., 47:269-298, 1999.

[17] D. Lumley. Angle-dependent reflectivity estimation. Pages 746-749. Soc. Expl. Geophys., 1993.

[18] D.E. Miller, M. Oristaglio, and G. Beylkin. A new slant on seismic imaging: migration and integral geometry. Geophysics, 52:943-964, 1989.

[19] X. Song. Anisotropy of the earth's inner core. Reviews of Geophys., 35:297-313, 1997.

[20] A. Souriau. New seismological constraints on differential rotation of the inner core from Novaya Zemlya events recorded at DRV, Antarctica. Geophys. J. Int., 134:F1-F5, 1998.

[21] V. Vinje, E. Iversen, and H. Gjøystdal. Traveltime and amplitude estimation using wavefront construction. Geophysics, 58:1157-1166, 1993.

[22] L. Vinnik, V. Farra, and B. Romanowicz. Azimuthal anisotropy in the earth from observations of SKS at GEOSCOPE and NARS broadband stations. Bull. Seis. Soc. Amer., 79:1542-1558, 1989.