

Computers & Geosciences 31 (2005) 599–605 www.elsevier.com/locate/cageo

An open source, web based, simple solution for seismic data dissemination and collaborative research

Paolo Diviacco

Istituto Nazionale di Oceanografia e di Geofisica Sperimentale, Borgo Grotta Gigante 42/C, 34010 Sgonico, Trieste, Italy
Tel.: +39 0402140380; fax: +39 040327521. E-mail address: [email protected] (P. Diviacco).

Received 27 April 2004; received in revised form 12 November 2004; accepted 15 November 2004

Abstract

Collaborative research and data dissemination in the field of geophysical exploration need network tools that can access large amounts of data from anywhere, using any PC or workstation. Simple solutions based on a combination of Open Source software can be developed to address such requests, exploiting the possibilities offered by web technologies while avoiding the costs and inflexibility of commercial systems. A viable solution consists of MySQL for data storage and retrieval, CWP/SU and GMT for data visualisation, and a scripting layer driven by PHP that allows users to access the system via an Apache web server. In the light of the experience of building the on-line archive of seismic data of the Istituto Nazionale di Oceanografia e di Geofisica Sperimentale (OGS), we describe the solutions and methods adopted, with a view to stimulating both network collaborative research at institutions similar to ours and the development of different applications.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Data dissemination; Seismic data; Web-based database; Web-based data visualisation; Apache; CWP/Seismic Unix; Generic Mapping Tools (GMT); MySQL; PHP

1. Introduction

Whenever a critical mass is exceeded in collecting very large amounts of data, the need for specific tools for analysis arises. Moreover, if collaborative analysis is planned among several groups of scientists, information technology (IT) network-based dissemination strategies have to be considered. Commercial systems are available, but they mainly provide tools for work carried out in the enclosed environment of a large private company and are, of course, expensive. What about the rest of the world?


Universities, public research centres and smaller firms cannot afford to purchase these systems. Moreover, the needs of such institutions are generally different from those of a private company, which implies customisation that is not viable from a commercial point of view. On the other hand, developing a complete solution from scratch would mean devoting a huge amount of time and effort to a long-term project without any guarantee of success. An easier solution is to use existing tools that, even if not specifically developed for these applications, can be creatively combined. An important issue to consider in this perspective is the phenomenon of the Open Source community.




This community has released an impressive amount of software, from scientific and business applications to operating systems and graphics-processing packages. Besides being free (although certain restrictions apply under the GNU General Public License; Open Source Initiative (OSI): http://www.opensource.org/licenses/gpl-license.php), it is very important to underline that software developed in this framework is highly flexible and reliable, being tested by a huge community of users (or used by a huge community of testers).

Over the last several years we, at the Istituto Nazionale di Oceanografia e di Geofisica Sperimentale (OGS), have been facing the problem of improving the accessibility of our ever-increasing archive of geophysical data, for both the "in-house" scientists and our network of collaborations with other researchers and institutions. Typically, subsets of the available seismic surveys need to be located within a geographical area that bounds an interesting seismic feature. In this situation, a tool is required that can be used from anywhere to efficiently search, format and visualise seismic lines based on their geographic location. The selected data sets can then be exported to the client workstation, where they can be analysed, reprocessed and, if necessary, uploaded back into the database itself. Having considered the requirements, resources and costs, we decided to develop SNAP (Seismic database Network Access Point, http://snap.inogs.it), a web based dynamic tool for searching and analysing geophysical data, based exclusively on Open Source software.

2. The context

Among the main activities of OGS are seismic data acquisition, processing and interpretation. If not properly managed, the archive containing data from the projects in which the institute has participated for several decades could easily become a problem rather than a resource. Physically searching through thousands of tapes of different formats and vintages, paper sheets or CDs can be quite a time-consuming job. Retrieving data, or even just the sampling rate of a seismic line, can mean delays for a whole group of researchers.

A solution to this is to move the archive online and to provide users with search, visualisation and download facilities. Commercial applications exist that allow this but, as already stated, they imply investments that are far beyond the available budget. On the other hand, the storage capacity of even low-cost computers is rapidly coming to a point where the idea can be considered feasible. In this perspective, we decided to focus investments on the hardware, using only Open Source software.

In this way, without having to accommodate the rigid framework of a commercial application, it is possible to create a system that is simple enough to be developed and managed easily and, at the same time, powerful enough to achieve the required goals.

3. The network side

Collaborative research, in general, is based on shared resources. This raises the need for a network tool that can be used easily from anywhere, on any PC or workstation. The natural and easiest solution is to exploit the TCP/IP-HTTP world, without introducing exotic protocols or services that would need particular care from the users and the system management. Coherent with this goal of simplicity, a good approach is to avoid non-standard formats, in order to maintain full compatibility with any web browser. As a web server we adopted Apache (http://www.apache.org/). This popular Open Source solution is very flexible and safe, and provides a lot of excellent online documentation.

The rapid evolution of network technology allows very high data transfer speeds that make, for example, audio and video data sharing very easy. Unfortunately, seismic data files are at least an order of magnitude larger, so that manipulating them over the network is not advisable. For example, a local look-up of a remote seismic line file via "standard" network connections can be quite a time-consuming task. This means that, in an efficient system, most of the work should be done on the server and only the results sent over the network. This is called the "server side" paradigm. If most of the activities are centralised, there is only one object to keep updated, which of course eases the management of the system.

A survey of the web-based data sets available on the Net shows that most sites are "static", in the sense that the web pages are manually prepared and updated. This means a lot of work but, even worse, it implies a previously prepared path through the information that can force users in a direction they do not want to follow. A step further can be taken by switching to a "dynamic" site, where the published contents are created "on the fly" in response to a specific request. To accomplish this task using software not specifically developed for the purpose, two new layers have to be added to the system (Fig. 1). The first is a database layer, which manages the search for information; the second is a scripting layer, which allows the database and the web server to communicate.


We adopted MySQL (http://www.mysql.com/) as the former and PHP (http://www.php.net/) as the latter. Although this may seem to introduce difficulties, in our experience these tools prove easy to use after a very quick training period. The main advantage of a dynamic web system like this is that every time something is changed in the database, the whole site automatically reconfigures itself.

Security in general is a very important concern, especially if the system has to be accessible via the Internet, but even intranet calls have to be controlled to prevent improper use. Firewall protection against hackers has to be planned carefully, but simple prevention of mistakes that could generate data corruption is also important.

Fig. 1. System structure built from three layers. Users interact with the web server, which passes queries to the database via a scripting layer; this layer also collects the results, formats them and sends them back over the Internet.


In this sense, providing web pages with a graphical interface to the system grants access also to people not usually involved in seismic data processing and, by driving user requests, avoids the complexity and hidden dangers of, for example, a wrong SQL query or plotting-parameter setting. The disadvantage of server side systems is essentially the overload of the server but, considering the ever-increasing speed of modern computers, this becomes less worrying every day.

4. The database layer

The most important step in building and managing a database is defining the reality it has to represent. When insufficient analysis of the problem is performed, trying to compensate for errors can take a lot of time. Moreover, if requirements not previously considered arise when the database is already populated, the task of adding fields and methods can be very tedious. The kernel of our problem is the accessibility of an increasing number of files uploaded from the original media. The data in its basic form is therefore the file itself, while the metadata, that is, the data describing the data, is information such as sampling rate, data length or format. Trying to be as simple as possible, we created a database table called "Seismics" (Fig. 2) that holds all the relevant information about the file and the seismic data contained in it (Barry et al., 1975).

In this framework, metadata are the criteria on which queries are sent to the database. For example, it is possible to retrieve a list of seismic lines acquired in a specific year, or a list of deep lines with a recording length longer than 8 s. As in the following example (Fig. 3), a simple SQL query can be sent to the database to get a list of lines with a sampling rate of 4 ms. An important key on which such a database has to be searchable is the geographic location of the data, which can also be considered a form of metadata.
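To make the structure concrete, a table of this kind might be declared as follows. This is a sketch only: apart from "RecId" and "LineName", which appear in the text and in the PHP examples below, the field names and types are assumptions, not the actual OGS schema.

```sql
-- Hypothetical sketch of the "Seismics" metadata table.
-- Only RecId and LineName are named in the paper; the other
-- fields and all types are illustrative assumptions.
CREATE TABLE Seismics (
    RecId        INT AUTO_INCREMENT PRIMARY KEY,
    LineName     VARCHAR(64) NOT NULL,  -- name of the seismic line
    RecIdProject INT,                   -- link to the "Projects" table (see below)
    FileName     VARCHAR(255),          -- file uploaded from the original media
    SampleRate   FLOAT,                 -- sampling rate (ms)
    DataLength   FLOAT,                 -- recording length (s)
    DataFormat   VARCHAR(32),           -- e.g. SEG-Y
    AcqYear      INT,                   -- year of acquisition
    Busy         TINYINT DEFAULT 0      -- concurrency flag (see Section 6)
);
```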

Fig. 2. Database structure and links.
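Fig. 2 itself cannot be reproduced here, so, as an orientation, the linked tables discussed in the following sections might be sketched as below. Only the table names, the record ID links and the "Path" field come from the text; every other field name and type is an assumption.

```sql
-- Hypothetical sketch of the remaining tables of Fig. 2 and their
-- links; field names other than RecId, RecIDSeismics and Path are
-- illustrative assumptions.
CREATE TABLE Regions (
    RecId      INT AUTO_INCREMENT PRIMARY KEY,
    RegionName VARCHAR(64),
    GeoLimits  VARCHAR(64),        -- geographic limits for high-resolution plots
    PlotParams VARCHAR(255),       -- plotting parameters
    Snapshot   VARCHAR(255)        -- path to the low-resolution snapshot map
);

CREATE TABLE Projects (
    RecId       INT AUTO_INCREMENT PRIMARY KEY,
    ProjectName VARCHAR(64),
    RecIdRegion INT,               -- link to "Regions"
    RecIdUser   INT,               -- owner (link to "Users", see Section 6)
    Institution VARCHAR(64),       -- acquiring institution
    Path        VARCHAR(255)       -- directory holding the project files
);

CREATE TABLE Navigation (
    RecId         INT AUTO_INCREMENT PRIMARY KEY,
    RecIDSeismics INT,             -- link to the corresponding seismic file
    FileName      VARCHAR(255),
    MinLat FLOAT, MaxLat FLOAT,    -- bounding box used for geographic selection
    MinLon FLOAT, MaxLon FLOAT
);

CREATE TABLE Users (
    RecId    INT AUTO_INCREMENT PRIMARY KEY,
    UserName VARCHAR(64),
    Password VARCHAR(64)
);
```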



Fig. 3. Example of an SQL query to the database, asking for a list of lines with a sampling rate of 4 ms.
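Fig. 3 itself is not reproduced here; a query of the kind the caption describes might read as follows (a minimal sketch, assuming the "SampleRate" field of the hypothetical schema above):

```sql
-- Hypothetical form of the Fig. 3 query: all lines recorded
-- with a sampling rate of 4 ms.
SELECT LineName FROM Seismics WHERE SampleRate = 4;
```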

Different strategies for interactive geographic searches can be implemented. We decided to apply a policy that manages queries using two "resolutions". The lower one provides previously prepared snapshots, while the higher one specifically creates the requested object. In fact, testing the prototype, we realised that most of the calls to the database asked for similar things. For example, requests were made for a map of all the seismic lines in a research project, or in an area that corresponds to a geographic region such as the Antarctic Peninsula or the Adriatic Sea. Having them computed on the fly each time would overload the system, whilst retrieving a previously prepared object, even if it does not exactly correspond to the query, can be an acceptable compromise for further searching and processing. When updating the database, it is very important that procedures are provided to synchronise the low-resolution objects with the updated data. The efficiency of this policy is proportional to how well the compromise fits the requests. In this sense, we identified three cases where snapshots can be useful:

- Regions,
- Projects,
- Files.

These are concentric, in the sense that Regions contain Projects, which contain Files. Each entity provides previously prepared objects that describe its content. In this way, subsequent zooming can be performed quickly to get an answer close to the original request; from there, further, more exact querying can be performed.

To manage this policy, a simple solution is to create two new database tables called "Regions" and "Projects" (Fig. 2). Assuming that every seismic line has been acquired within the scope of just one project, in a relational database it is sufficient to link the record ID field of the "Projects" table with the corresponding field of the "Seismics" table. This allows any record of the latter to access all the information of the former, thus avoiding redundancy. The same scheme can be used, assuming that every project exists in just one geographic region: a relation is established between the record ID of the "Regions" table and the corresponding field of the "Projects" table. Within this framework, SQL selections based on the Join clause can retrieve any metadata relative to a file. For example, the following query (Fig. 4) produces a list of seismic line names, the regions explored and the institutions that acquired the lines within the framework of the "test" project. Note that the data is taken from three different tables: "Seismics", "Projects" and "Regions".

To produce location maps, in our case for marine seismic lines, we simply used the navigation files (UKOOA Exploration Committee, 1991). These introduce the need for another layer of metadata where information about their characteristics can be stored. To avoid confusing navigation and seismic files, another database table, called "Navigation", is introduced. Each navigation file is linked to the corresponding seismic file through a relation between the "RecId" field of the "Seismics" table and the "RecIDSeismics" field of the "Navigation" table. The path to the navigation files is the same as for the seismic files and is defined in the "Projects" table.

Additional fields can be introduced in the database structure to ease geographic line selection and plotting. Considering the low-resolution policy, the "Regions", "Projects" and "Seismics" tables need to hold the path to the corresponding snapshots, while for the "high resolution" policy the "Regions" table needs somewhere to define geographic limits and a field where plotting parameters can be written.

Fig. 4. Example of an SQL query to the database, asking for a list of lines acquired within the "test" project. Line name, project and region name are taken from different database tables.
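Fig. 4 itself is not reproduced here; under the same assumed field names as above, a join of the kind it describes might read:

```sql
-- Hypothetical reconstruction of the Fig. 4 query: line name,
-- region and acquiring institution for the "test" project,
-- joining three tables on their record IDs.
SELECT s.LineName, r.RegionName, p.Institution
FROM Seismics s
JOIN Projects p ON s.RecIdProject = p.RecId
JOIN Regions  r ON p.RecIdRegion  = r.RecId
WHERE p.ProjectName = 'test';
```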


If exact geographic selection is needed, it is necessary, during map creation, to check the position of every seismic line. This can be accomplished by reading the whole navigation file but, of course, this takes a lot of time. A better approach is to use additional metadata fields in the "Navigation" table to store the minimum and maximum latitude and longitude of each seismic line in the database.

4.1. Data repository

To ease data access, we decided to partially mimic the concentric structure of the database in the data storage as well. The directory tree consists of folders that correspond to projects and contain the related files. Every time a new project has to be loaded, a directory named after it is automatically created and its location is stored in the "Path" field of the "Projects" table. Each project folder contains not only all the corresponding seismic and navigation files, but also all the related low-resolution seismic and map snapshots. This structure could be extended to the "Regions" table but, since a region does not correspond to any physical object except its snapshot map file, we preferred to concentrate all region images in a separate directory.
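A project-loading step of this kind might be sketched in PHP as follows, using the same era-appropriate mysql_* API as the examples in the next section. The repository root and the "ProjectName" field are assumptions; only the "Path" field comes from the text.

```php
<?php
// Hypothetical sketch: create the folder for a new project and
// record its location in the "Path" field of the "Projects" table.
// The repository root and the ProjectName field are assumptions.
$project = "test";
$basedir = "/data/projects";             // assumed repository root
$path    = $basedir."/".$project;

if (!is_dir($path)) {
    mkdir($path, 0775);                  // directory named after the project
}

mysql_connect("mysql_host", "mysql_user", "mysql_password");
mysql_select_db("my_database");
mysql_query("INSERT INTO Projects (ProjectName, Path) VALUES ('"
          . mysql_real_escape_string($project)."', '"
          . mysql_real_escape_string($path)."')");
?>
```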

5. Web server and database communication

The scripting layer can be thought of as a kind of glue between the database and the web server. Neither of them knows that the other exists, but at the same time they need each other. Scripting is also the hinge of flexibility, the means by which we can tailor the system to our needs. PHP, the scripting language we used, is particularly suitable for this task since, with a simple syntax, it interacts efficiently with both MySQL and Apache. The requirement of this layer is to create dynamic web pages built "on the fly" from the results of querying the database.

PHP code is normally included in HTML files, in sections delimited by two tags: the opening <?php and the closing ?>. What is written in between is executed and cannot be read from the browser. When results are computed, they are sent back to the web page. PHP manages variables in a smart and easy way: they do not need to be declared and can be recognised by the dollar character before the name, like $variable. It is possible to send HTML tags via PHP code. For example, to print an HTML link to a web page stored in the variable $weblink, we can use a simple syntax as follows:

<?php print "<a href=".$weblink."> link to a web site </a>"; ?>

Allowing the database and the web server to communicate is also very easy.


Five lines of PHP code are sufficient. After a connection is established and a database is selected, as in the following commands:

mysql_connect("mysql_host", "mysql_user", "mysql_password");
mysql_select_db("my_database");

it is possible, for example, to send an SQL query where all data from the "Seismics" table are retrieved:

$results = mysql_query("SELECT * FROM Seismics");

To print the line name from every record, a while loop is required:

while ($line = mysql_fetch_array($results)) {
    print "Line Name ".$line['LineName']."<BR>";
}

Of course, this fits the easiest case. Complicated situations can need some care, but it is more a matter of focusing on the problem than of unravelling language syntax.

A very important issue in the development of such a system is data representation. Here, PHP can become a bridge to any command line visualisation package: the "system" command can run, and pass arguments to, any executable file. Considering seismic data plotting, good results can be achieved using the CWP/SU Open Source package (Stockwell, 1997, 1999; CWP/SU Seismic Unix: http://www.cwp.mines.edu/cwpcodes/). This can produce Postscript image files for further publishing, as in the following example:

system("segyread tape=".$SgyFile." | segyclean | supsimage perc=98 >".$PSfile);

Unfortunately, standard browsers cannot interpret Postscript, so server side translation to supported formats is needed. To produce a compatible JPG image file, the Ghostscript or ImageMagick packages can be used.

An optimal solution for mapping is to use the GMT Open Source package (http://gmt.soest.hawaii.edu). Like CWP/SU, this can manipulate any geographic data set to produce a Postscript image file. A PHP script can be written to produce location maps of a list of seismic lines retrieved from the database, as in the following example:

system("pscoast –options– >".$PSfile);
system("psxy ".$NavigationFile." –options– >>".$PSfile);

The method we outline can be employed for both low and high-resolution interactions with the system, and for both seismic and map plotting. In the low-resolution case, snapshots are prepared automatically once, on data loading or editing, while in the high-resolution case images are produced "on the fly" from interaction with a web form (Figs. 5 and 6). The images produced are included in the web page output as a simple HTML image tag embedded in a PHP "print" command, like:

print "<img src=".$imagetoshow.">";

Since web browsers cache files, to avoid showing outdated images when the high-resolution policy is used, it is necessary to use a different file name each time a picture is generated. This can be accomplished easily using a random number as a name suffix.
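Putting these pieces together, the high-resolution mapping path might look like the following sketch. It assumes GMT's pscoast/psxy and ImageMagick's convert are on the server path, that the web root is /var/www/html, and that the region and projection options stand in for the stored plotting parameters; the unique file name implements the cache-busting suffix just described.

```php
<?php
// Hypothetical sketch of the high-resolution mapping path: plot a
// coastline and a navigation track with GMT, convert the Postscript
// to JPG with ImageMagick, and embed the image in the page.
// Paths, region and projection options are placeholders.
$NavigationFile = "/data/projects/test/line01.nav";  // assumed location
$suffix  = uniqid();                                 // cache-busting suffix
$PSfile  = "/tmp/map_".$suffix.".ps";
$JPGfile = "/var/www/html/tmp/map_".$suffix.".jpg";  // under the web root

// Coastline first (-K keeps the plot open), then the track (-O appends).
system("pscoast -R-70/-55/-68/-60 -JM15c -Df -W0.5p -K > ".$PSfile);
system("psxy ".$NavigationFile." -R -JM15c -W1p -O >> ".$PSfile);

// Translate Postscript into a browser-compatible format.
system("convert ".$PSfile." ".$JPGfile);

// Embed the unique file name so cached copies are never shown.
print "<img src=\"/tmp/map_".$suffix.".jpg\">";
?>
```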



6. User management and data property

Fig. 5. Location map and listing of seismic lines that occur in a selected area.

Whereas, for the first part, the implementation was fairly straightforward, there are a couple of issues that are more complicated. Protection of data ownership is one of them. Different projects can have different ownership: some can be public, while others are private, and there should be a means to prevent unauthorised network access to the latter. This can be done using PHP "sessions". The idea is similar to the well-known "cookies", with the difference that sessions are server side and independent of browser settings. Using them, it is possible to authenticate users and allow or deny data access. Various strategies can be implemented but, to be as simple as possible, we developed a policy based on the assumption that a single user owns the data acquired within a project. Public data has no owner and can therefore be seen by anyone. Moreover, an administrative user is also needed to manage system updates. User information and passwords are stored in a separate database table called "Users". This is linked to the "Projects" table via the user record ID, so that each time data from a particular project is requested, consistent authentication must be provided by the session parameters.

Testing the prototypes, we realised that a type of queuing is often needed in accessing the system. When different users request the same thing at the same time within the scope of PHP, such as for a database query, the problem is managed automatically, but when concurrent requests are sent to external software, problems can arise. To avoid this, each record in the "Seismics" database table has a "busy" field that protects files from concurrent access. When necessary, users are assigned to a queue and wait for their turn.

Another very interesting opportunity offered by PHP is output buffering. This allows streams to the browser to be captured and analysed. For example, if something goes wrong during the processing of a system request, warnings can be captured and specific solutions adopted, such as suggesting other paths or simply waiting for a while and retrying.
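A minimal sketch of the session-based access policy described above follows. The session key, field names and ownership test are assumptions (only the link between "Users" and "Projects" via the user record ID comes from the text), and a database connection is assumed to be already open.

```php
<?php
// Hypothetical session-based access control: unauthenticated
// visitors see only public (ownerless) projects; authenticated
// users also see their own. Field names are assumptions.
session_start();

if (!isset($_SESSION['user_id'])) {
    // Not authenticated: only public projects are visible.
    $where = "Projects.RecIdUser IS NULL";
} else {
    // Authenticated: public projects plus the user's own.
    $uid   = (int) $_SESSION['user_id'];
    $where = "(Projects.RecIdUser IS NULL OR Projects.RecIdUser = ".$uid.")";
}

$results = mysql_query("SELECT LineName FROM Seismics "
                     . "JOIN Projects ON Seismics.RecIdProject = Projects.RecId "
                     . "WHERE ".$where);
while ($line = mysql_fetch_array($results)) {
    print $line['LineName']."<BR>";
}
?>
```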

Fig. 6. Once selected, seismic lines can be plotted interactively.

print "oimg src ¼ ".$imagetoshow."4"; Since web browsers cache files, to avoid showing nonupdated images when the high-resolution policy is used, it is necessary to use different file names each time the picture is generated. This can be accomplished easily using a random number used as name suffix.

A simple web based solution for seismic data dissemination and collaborative research can be developed using only Open Source tools: Linux, Apache, MySQL and PHP. This avoids the many difficulties involved in building a completely new system from scratch but, at the same time, also avoids the inflexibility and costs of a commercial solution.


In this way, dynamic web sites can be created that, by searching a database on specific criteria, can retrieve and plot selected seismic data and metadata. Controlled access to the system can be granted to anyone, anywhere, to browse the archives whilst, at the same time, protecting data ownership.

Acknowledgement

The author thanks Ruben Levi, Giuliano Brancolini and Nigel Wardell for their contributions to this project.


References

Barry, K.M., Cavers, D.A., Kneale, C.W., 1975. Report on recommended standards for digital tape formats. Geophysics 40 (2), 344–352.
Stockwell Jr., J.W., 1997. Free software in education: a case study of CWP/SU: Seismic Un*x. The Leading Edge 16 (7), 1045–1050.
Stockwell Jr., J.W., 1999. The CWP/SU: Seismic Un*x package. Computers & Geosciences 25 (4), 415–419.
UKOOA Exploration Committee, 1991. UKOOA P1/90 post plot positioning data format. First Break 9 (4), 457–466.
