Numerical Calculations for Geophysics Inversion Problem Using Apache Hadoop Technology

Łukasz Krauzowicz, Kamil Szostek, Maciej Dwornik, Paweł Oleksik, and Adam Piórkowski

AGH University of Science and Technology, Faculty of Geology, Geophysics and Environment Protection, Department of Geoinformatics and Applied Computer Science, al. A. Mickiewicza 30, 30-059 Kraków, Poland
{szostek,oleksik,pioro}@agh.edu.pl, [email protected]
http://www.geoinf.agh.edu.pl

Abstract. This article considers the problem of time-consuming calculations, which arises in many areas of the earth sciences. Distributed computing allows such calculations to be performed in reasonable time, but it requires a cluster architecture. The authors propose using the Apache Hadoop technology to solve a geophysical inversion problem. Although this technology is designed primarily for data analysis, it can also be used to perform computations. An architecture for the solution is proposed, and real tests were carried out to determine the performance of the method.

Key words: parallel computing, distributed computing, cluster, numerical computing

1 Introduction

The earth sciences, and especially geophysics, are an area of intense research. This research is accompanied by a great deal of modeling and data analysis, tasks that require high computational power. A single computer is often not able to perform these calculations in a reasonable time, and in such cases it is necessary to use clusters [1]. In the paper [2] the problem of ground vibration modeling is presented. The authors used a very large model, and the numerical calculations were time-consuming, therefore they were made in parallel on an effective computer cluster.

This is the accepted version of: Krauzowicz Ł., Szostek K., Dwornik M., Oleksik P., Piórkowski A.: Numerical Calculations for Geophysics Inversion Problem Using Apache Hadoop Technology. Computer Networks, CCIS Vol. 291, Springer 2012, pp. 440-447. The original publication is available at www.springerlink.com.


Seismic wave field modeling is the topic of [3, 4]. This is another time-consuming problem in geophysics, and it can reveal the nature of an analyzed wave phenomenon. Such modeling is often a part of complex and extremely time-consuming methods with almost unlimited needs for computational resources, therefore the computations are usually carried out in academic centers, often with support from oil and gas companies. A GPU-PC cluster and a cluster based on component environments were tested in those works. Another geophysical phenomenon is the geothermal field [5, 6]. Heat transfer modeling is very important in solving physical problems in the earth sciences, such as volcanoes, intrusions, earthquakes, mountain building or metamorphism. These calculations require computational power that exceeds the capabilities of a single PC, so a high-performance cluster can be used. A solution based on component technologies was set up, but it was not fault-tolerant and did not support load balancing.

2 Inverse Problem for Vertical Transverse Isotropy Geological Medium

Knowledge of the velocity distribution in a geological medium is one of the most important things in mining exploration. The process of reconstructing the velocity distribution is called an inverse problem. Solving the inverse problem, especially in an anisotropic medium, is a difficult process because of the non-linear relationship between the distribution of the elastic parameters and the received travel times of wave propagation. This relationship implies that deterministic inversion methods are useless. One of the methods to obtain the velocity distribution is stochastic inversion. In this paper the Monte Carlo method was used to obtain the parameter values. Stochastic inversion is based on generating a huge set of models, calculating theoretical travel times for each seismic ray and evaluating the solutions. To compare the estimated and received travel times, the following misfit function was used:

L = \frac{1}{N} \sum_{i=1}^{N} \left| T_i^{est} - T_i^{rec} \right|    (1)

where N is the number of seismic rays, T_i^{est} is the estimated travel time and T_i^{rec} is the received travel time for the i-th ray.
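As an illustration, a minimal Java sketch of the misfit in Eq. (1) could look as follows; the method name and the array representation of the travel times are our assumptions, not part of the original implementation:

/**
 * Mean absolute difference between estimated and received travel times (Eq. 1).
 * tEst and tRec hold the travel times of the N seismic rays.
 */
public static double misfit(double[] tEst, double[] tRec) {
    double sum = 0.0;
    for (int i = 0; i < tEst.length; i++) {
        sum += Math.abs(tEst[i] - tRec[i]);
    }
    return sum / tEst.length;
}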

2.1 Seismic Anisotropy

Travel times were estimated using the shortest time method (e.g. [7], [8], [9]). In this method the geological medium is divided into several velocity cells described by the Thomsen parameters [10]:

v_{P0} = \sqrt{\frac{c_{33}}{\varrho}}    (2a)

v_{S0} = \sqrt{\frac{c_{44}}{\varrho}}    (2b)

\gamma \equiv \frac{c_{66} - c_{44}}{2 c_{44}}    (2c)

\varepsilon \equiv \frac{c_{11} - c_{33}}{2 c_{33}}    (2d)

\delta \approx \frac{(c_{13} + c_{44})^2 - (c_{33} - c_{44})^2}{2 c_{33} (c_{33} - c_{44})}    (2e)

where c_{11}, c_{13}, c_{33}, c_{44}, c_{66} are coefficients of the stiffness tensor and \varrho is the density. Using these parameters, the velocity of a seismic wave propagating in direction θ in a vertical transverse isotropy medium is described by the following equations [10]:

v_P(\theta) \approx v_{P0} \left[ 1 + \delta \sin^2\theta \cos^2\theta + \varepsilon \sin^4\theta \right]    (3a)

v_{SV}(\theta) \approx v_{S0} \left[ 1 + \left( \frac{v_{P0}}{v_{S0}} \right)^2 (\varepsilon - \delta) \sin^2\theta \cos^2\theta \right]    (3b)

v_{SH}(\theta) = v_{S0} \left[ 1 + \gamma \sin^2\theta \right]    (3c)
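A short Java sketch of equations (3a)-(3c) might look like the following; the class and method names are illustrative assumptions, since the paper does not show its implementation:

/** Phase velocities in a weakly anisotropic VTI medium (Eqs. 3a-3c, after Thomsen [10]). */
public class ThomsenVelocity {
    private final double vP0, vS0, epsilon, delta, gamma; // Thomsen parameters of one velocity cell

    public ThomsenVelocity(double vP0, double vS0, double epsilon, double delta, double gamma) {
        this.vP0 = vP0; this.vS0 = vS0;
        this.epsilon = epsilon; this.delta = delta; this.gamma = gamma;
    }

    public double vP(double theta) {   // Eq. (3a)
        double s2 = Math.sin(theta) * Math.sin(theta);
        double c2 = Math.cos(theta) * Math.cos(theta);
        return vP0 * (1.0 + delta * s2 * c2 + epsilon * s2 * s2);
    }

    public double vSV(double theta) {  // Eq. (3b)
        double s2 = Math.sin(theta) * Math.sin(theta);
        double c2 = Math.cos(theta) * Math.cos(theta);
        double r = vP0 / vS0;
        return vS0 * (1.0 + r * r * (epsilon - delta) * s2 * c2);
    }

    public double vSH(double theta) {  // Eq. (3c)
        return vS0 * (1.0 + gamma * Math.sin(theta) * Math.sin(theta));
    }
}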

2.2 Seismic Inversion

Crosswell tomography is a method for reconstructing the distribution of the Thomsen parameters. A seismic wave is generated in one well and received by geophones in the next well. To reconstruct the velocity field, 31 shot points and 76 receiver points were used. In this case only the P-wave was used for the reconstruction of the Thomsen parameters, so it was impossible to obtain the γ and vS0 values. The geological medium was divided into 24 velocity cells, which gives 72 independent values to determine. The size of this hyperspace excludes regular sampling. A typical approach is to generate a huge number of models and remember only a few of the best solutions. This method gives no information about under-sampled areas or about perspective areas (regions with a small error value, which could be sampled more densely). The first problem will not be discussed in this work; the second one can be partially solved by sampling the space near the best solutions.
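A bare-bones Monte Carlo search of this kind is sketched below in Java; the forwardTravelTimes placeholder (standing in for the shortest time method), the uniform sampling of parameters in [0, 1) and the number of retained models are all assumptions made for illustration and are not taken from the paper:

import java.util.Random;
import java.util.TreeMap;

/** Minimal Monte Carlo sampler: generate random models, keep only a few of the best ones. */
public class StochasticInversion {
    static final int CELLS = 24, PARAMS_PER_CELL = 3; // vP0, epsilon, delta for each cell -> 72 values
    static final Random RND = new Random();

    /** Placeholder for the forward problem (shortest time method); not implemented here. */
    static double[] forwardTravelTimes(double[] model) { return new double[31 * 76]; }

    /** Misfit from Eq. (1). */
    static double misfit(double[] tEst, double[] tRec) {
        double sum = 0.0;
        for (int i = 0; i < tEst.length; i++) sum += Math.abs(tEst[i] - tRec[i]);
        return sum / tEst.length;
    }

    public static void main(String[] args) {
        double[] tRec = new double[31 * 76];                  // received travel times (read from data in practice)
        TreeMap<Double, double[]> best = new TreeMap<Double, double[]>(); // misfit -> model, sorted ascending
        for (int n = 0; n < 1000000; n++) {
            double[] model = new double[CELLS * PARAMS_PER_CELL];
            for (int j = 0; j < model.length; j++) {
                model[j] = RND.nextDouble();                  // scale to physical parameter bounds in practice
            }
            double err = misfit(forwardTravelTimes(model), tRec);
            best.put(err, model);                             // models with identical misfit overwrite each other
            if (best.size() > 10) best.pollLastEntry();       // remember only the 10 best solutions
        }
    }
}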

3 The Apache Hadoop

The Apache(TM) Hadoop(TM) technology provides a framework for distributed processing of data sets. It is meant to run on clusters of computers, from single servers up to thousands of nodes, and to process millions of megabytes. The library is designed to be independent of hardware failures: it implements malfunction detection at the application layer [11, 12]. The Apache Hadoop technology is widely used in cloud computing systems directed at processing large amounts of heterogeneous data in a distributed environment [13]. The Apache Hadoop project consists of three main subprojects:

- Hadoop Common, which supports the other subprojects,
- Hadoop Distributed File System (HDFS), a distributed file system providing high-throughput data access,
- Hadoop MapReduce, a framework for processing large data sets, dedicated to clusters.

The Hadoop technology was invented to process large amounts of data. In 2009 this framework won the one-minute sort benchmark: 500 GB was sorted in 59 seconds on 1406 nodes, and a 100-terabyte sort was then performed in 173 minutes on 3400 nodes [14]. Many companies and organizations use Apache Hadoop for research and production, e.g. Amazon, Yahoo!, Google, Facebook and more [15].

3.1 Hadoop Distributed File System and MapReduce

The Apache Hadoop technology takes advantage of the MapReduce programming model. It allows users to write their own Map and Reduce tasks in Java or, through wrappers, in other programming languages such as C++, Ruby or Python. As Apache Hadoop itself is written in Java, using this language for MapReduce tasks is the fastest option; otherwise Hadoop Streaming or Hadoop Pipes have to be used, e.g. for C++ these wrappers execute the C++ Map or Reduce class and communicate with it through sockets (Fig. 1). The second strong point of the Hadoop technology is the highly failure-tolerant HDFS, designed to cope with large amounts of data. As the data is distributed over all cluster nodes in HDFS, MapReduce can easily process it in parallel.

Fig. 1. Conventional job submission diagram in the Apache Hadoop framework with a C++ wrapper.

3.2 Apache Hadoop Job Workflow

Figure 1 shows the default Apache Hadoop job workflow, which is as follows:

1. The program is executed by the client user. In this paper the Job Tracker, which is the main node, is used.
2. The client receives the job ID from the Job Tracker.
3. The client copies the data, that is the program and the input data, to HDFS.
4. The client submits the job.
5. The Job Tracker initializes the job.
6. The Job Tracker retrieves the input data from HDFS, splits it into parts and sends the splits back to HDFS.
7. Each Task Tracker periodically sends a heartbeat signal to the Job Tracker to inform it about its presence and readiness to work.
8. All working Task Trackers retrieve their data splits from HDFS.
9. The Task Trackers start their tasks.
10. The job is executed in a child JVM. If a C++ program was submitted, a wrapper is used to run the C++ MapReduce classes and communicates with them through a socket; otherwise the Java classes are used directly.
11. The Task Trackers return their results to HDFS.
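Steps 1-5 of this workflow are triggered by a driver class that configures and submits the job. A minimal Java driver is sketched below; the class names InversionDriver, InversionMapper and InversionReducer are illustrative assumptions and do not reproduce the authors' code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

/** Configures and submits a MapReduce job (workflow steps 1-5), then waits for its completion. */
public class InversionDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "geophysics-inversion");        // constructor used in the Hadoop 0.20/1.x API
        job.setJarByClass(InversionDriver.class);
        job.setMapperClass(InversionMapper.class);              // assumed class, sketched in Section 4
        job.setReducerClass(InversionReducer.class);            // assumed class, sketched in Section 4
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input data already copied to HDFS (step 3)
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // results are written back to HDFS (step 11)
        System.exit(job.waitForCompletion(true) ? 0 : 1);       // submit the job and wait (steps 4-5)
    }
}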

4 Tests

The main idea of using Apache Hadoop here is to search for perspective regions by analyzing the generated models together with their evaluation values. To take full advantage of the Apache Hadoop results, they should be sent back to the inversion algorithm and tested in order to produce a more accurate solution, and the process should be repeated until satisfactory results are achieved. The algorithm used to process this data was written in Java and in C++, the latter executed by Apache Hadoop through wrappers, as mentioned before. The algorithm itself is not complicated, because most of the complex work is moved to the Apache Hadoop framework, which significantly reduces the time needed for code writing. The algorithm is performed in two stages. First, in the Mapper stage, it gathers parameter sets with a similar comparison estimator and produces key-value tuples, where the key is the rounded estimator and the value is the set of parameters. Next, in the Reducer stage, all tuples with the same key are processed: for each parameter of such a set, the average value is calculated and emitted as output. The results of this approach are sufficient, but the approach may be extended in the future for more accurate results. A sketch of these two stages is given below.

Tests were run on clusters with a varying number of nodes: 1, 2, 4 and 8. Each node was a low-cost PC running openSUSE 11.4 (Linux kernel 2.6.37), with an Intel(R) Pentium(R) 4 CPU 2.8 GHz, 1 GB of RAM and a 100 Mbit/s Ethernet connection. The impact of network speed on cluster computations is discussed in [16]. The nodes were connected into a star network using a 100 Mbit/s switch. The Job Tracker was also a Task Tracker and was exactly the same machine as all the other Task Trackers (Fig. 2).
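The following Java sketch illustrates the two stages described above under our own assumptions about the record format: each input line is taken to hold the comparison estimator followed by the 72 parameter values, separated by whitespace, and rounding to two decimal places is used as the grouping key; neither detail is specified in the paper.

import java.io.IOException;
import java.util.Locale;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

/** Mapper stage: key = rounded estimator, value = the parameter values of one model. */
class InversionMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        String[] fields = line.toString().trim().split("\\s+");
        double estimator = Double.parseDouble(fields[0]);
        String key = String.format(Locale.US, "%.2f", estimator);  // assumed rounding precision
        StringBuilder params = new StringBuilder();
        for (int i = 1; i < fields.length; i++) params.append(fields[i]).append(' ');
        ctx.write(new Text(key), new Text(params.toString().trim()));
    }
}

/** Reducer stage: average each parameter over all models that share the same rounded estimator. */
class InversionReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        double[] sums = null;
        long count = 0;
        for (Text value : values) {
            String[] fields = value.toString().split("\\s+");
            if (sums == null) sums = new double[fields.length];
            for (int i = 0; i < fields.length; i++) sums[i] += Double.parseDouble(fields[i]);
            count++;
        }
        StringBuilder out = new StringBuilder();
        for (double s : sums) out.append(s / count).append(' ');
        ctx.write(key, new Text(out.toString().trim()));
    }
}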


Fig. 2. The network configuration used in the tests. The number of nodes varies from 1 to 8.

The input data was collected in 8 files that consist of the comparison estimator and the 72 parameter values, 1.5 GB in total. Every test was repeated 30 times, measuring the times of three stages: copying the input data to HDFS, performing the calculations and copying the results back from HDFS.
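Copying the data to and from HDFS, the first and third measured stages, can be done either from the command line or programmatically; a minimal Java sketch using the standard FileSystem API is shown below (the local and HDFS paths are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Copies the input files into HDFS before the job and fetches the results afterwards. */
public class HdfsStaging {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);   // connects to the HDFS instance configured for the cluster
        // Stage 1: copy the local input files (assumed location) to HDFS.
        fs.copyFromLocalFile(new Path("/local/input"), new Path("/user/inversion/input"));
        // ... the MapReduce job runs here ...
        // Stage 3: copy the results back from HDFS to the local file system.
        fs.copyToLocalFile(new Path("/user/inversion/output"), new Path("/local/output"));
        fs.close();
    }
}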

5 Results

The tests show that increasing the number of nodes affects the computation time, but the results are not impressive (Fig. 4). The time required for distributing the data to all nodes increases with the number of nodes, while the computation time decreases only slightly (Fig. 3). Moreover, the time needed for copying the data back from HDFS is negligible, as it takes less than 1% of the whole processing time because of the small size of the results. For more accurate results the Reducer task should generate more data, which might significantly increase the processing and copying times, but on the other hand would take more advantage of all of Apache Hadoop's features.

Fig. 3. Times of data analysis for different numbers of nodes (2, 4 and 8) and input data sizes (182 MB, 364 MB, 728 MB and 1456 MB), broken down into copying to HDFS, calculations and copying from HDFS; time in seconds.

Fig. 4. Times of data analysis for different numbers of nodes for the C++ and Java implementations; time in seconds.

6 Conclusion and Future Works

In this paper the Apache Hadoop technology was used to perform geophysical numerical analysis. As this technology is directed at processing large data sets, an

algorithm to reconstruct the distribution of the Thomsen parameters was implemented and tested. The main advantages of using the Apache Hadoop technology appear to be its high scalability and the ease of writing MapReduce code. Unfortunately, to benefit fully from this framework it is necessary to use a larger number of more powerful computers as well as a faster network configuration. Future work will focus on increasing the number of nodes and the amount of data; the inverse problem will be extended to more accurate models, which in consequence produce more data and are therefore more suitable for the Apache Hadoop framework. Moreover, to increase the speed and accuracy of the inverse problem, the forward solution will be implemented as a part of the MapReduce classes. This will enable the results of the MapReduce tasks to be used in the next estimation of the Thomsen parameters in a more convenient way. As the small-file problem might be significant for the analysis speed in the presented configuration, certain optimizations should be considered in future work [17]. Furthermore, the next tests should be performed using a slightly different configuration: as mentioned in the Apache Hadoop documentation, when the network consists of more than four nodes it is better to set up the Job Tracker and the Name Node on separate machines.

Acknowledgments. The study was financed in part by the statutory research project No. 11.11.140.561 of the Department of Geoinformatics and Applied Computer Science, AGH UST, and by grant No. N N525 256040 from the Ministry of Science and Higher Education. This work was co-financed by the AGH University of Science and Technology, Faculty of Geology, Geophysics and Environmental Protection, Department of Geoinformatics and Applied Computer Science, as a part of the statutory project.

References

1. Onderka, Z.: Stochastic Control of the Scalable High Performance Distributed Computations. Parallel Processing and Applied Mathematics Conference (PPAM 2011), Lecture Notes in Computer Science (in press), (2012)
2. Pięta, A., Danek, T., Leśniak, A.: Numerical modeling of ground vibration caused by underground tremors in the LGOM mining area. Gospodarka Surowcami Mineralnymi - Mineral Resources Management, Vol. 25, No. 3, pp. 261-271 (2009)
3. Danek, T.: Parallel and distributed seismic wave field modeling with combined Linux clusters and graphics processing units. IEEE International Symposium on Geoscience and Remote Sensing IGARSS, pp. 2588-2591 (2009)
4. Kowal, A., Piórkowski, A., Danek, T., Pięta, A.: Analysis of selected component technologies efficiency for parallel and distributed seismic wave field modeling. Proceedings of the 2008 International Conference on Systems, Computing Sciences and Software Engineering (SCSS), part of the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 2008), Bridgeport, Connecticut, USA. In: Innovations and Advances in Computer Sciences and Engineering, Springer, pp. 359-362 (2010)
5. Piórkowski, A., Pięta, A., Kowal, A., Danek, T.: The Performance of Geothermal Field Modeling in Distributed Component Environment. Proceedings of the 2009 International Conference on Systems, Computing Sciences and Software Engineering (SCSS), part of the International Joint Conferences on Computer, Information, and Systems Sciences, and Engineering (CISSE 09), Bridgeport, Connecticut, December 4-12, 2009. In: Sobh, T. (ed.) et al., Innovations in Computing Sciences and Software Engineering, Springer, pp. 279-283 (2010)
6. Kowal, A., Piórkowski, A., Pięta, A., Danek, T.: Efficiency of selected component technologies for parallel and distributed heat transfer modeling. Mineralia Slovaca, ISSN 0369-2086, Vol. 41, No. 3 supl. Geovestnik, pp. 361 (2009)
7. Moser, T.J.: Shortest path calculation of seismic rays. Geophysics, 56, pp. 59-67 (1991)
8. Fischer, R., Lees, J.L.: Shortest path ray tracing with sparse graphs. Geophysics, 58, pp. 987-996 (1993)
9. Dwornik, M., Pięta, A.: Efficient algorithm for 3D ray tracing in 3D anisotropic medium. 71st EAGE Conference & Exhibition incorporating SPE EUROPEC 2009, Extended Abstracts, Amsterdam, The Netherlands, P138 (2009)
10. Thomsen, L.: Weak elastic anisotropy. Geophysics, 51, pp. 1954-1966 (1986)
11. Apache Hadoop, http://hadoop.apache.org/
12. White, T.: Hadoop: The Definitive Guide, Second Edition. O'Reilly Media, ISBN: 978-1-449-38973-4 (2010)
13. Kim, H., Kim, W., Lee, K., Kim, Y.: A Data Processing Framework for Cloud Environment Based on Hadoop and Grid Middleware. In: Grid and Distributed Computing, CCIS, vol. 261, pp. 515-524. Springer, Heidelberg (2011)
14. Sort Benchmark Home Page, http://sortbenchmark.org/
15. Hadoop PoweredBy Wiki, http://wiki.apache.org/hadoop/PoweredBy#G
16. Wrzuszczak-Noga, J., Borzemski, L.: Comparison of MPI Benchmarks for Different Ethernet Connection Bandwidths in a Computer Cluster. Computer Networks, Communications in Computer and Information Science, Vol. 79, Springer-Verlag Berlin Heidelberg, pp. 342-348 (2010)
17. Mohandas, N., Thampi, S.M.: Improving Hadoop Performance in Handling Small Files. In: Advances in Computing and Communications, CCIS, vol. 193, pp. 187-194. Springer, Heidelberg (2011)
