NEURAL NETS FOR FIRST BREAK DETECTION IN SEISMIC REFLECTION DATA

C. H. Dimitropoulos and J. F. Boyce
Wheatstone Laboratory, King's College London, UK
Abstract

We present a comparative study of the performance of reported neural net algorithms for the detection of first breaks in seismic reflection data, with regard to accuracy, learning rate and generalisability.
INTRODUCTION

Objective

Neural nets are being applied to a variety of problems in geophysical exploration, including seismic trace editing and first-break picking [1], seismic horizon location [2], seismic event tracing [3] and velocity field inversion [4]. As they have been claimed to be highly successful both for trace editing, where 'dead' or noise-corrupted traces are identified and removed from subsequent processing, and for first-break prediction, which is necessary for static correction of the seismic data, our objective has been to compare and assess the proposed methods as regards efficacy with respect to traditional methods, optimization of the neural net architecture, and the capacity both for learning and for generalization. We have treated the identification of dead traces as a preprocessing operation and have concentrated on the design of nets for first-break picking.
Statics Correction

The objective of statics correction is to transform a set of shot records into that which would have been recorded on a horizontal surface, the datum; the transformed data is consequently free of elevation effects. The early part of each shot record, a collection of time-traces from a linear array of geophones, is composed of signals due to three modes of energy transfer from source to geophone: direct transmission, reflection from the shallowest seismic horizon, and partial transmission as a surface wave along that horizon. For small shot-geophone distances the direct mode yields the earliest energy, the first break. The significant information is the displacement at which one of the other modes begins to dominate, as evidenced by a change in slope of the first-break time as the shot-geophone displacement increases. As its name implies, first-break picking is the accurate location of the leading energy pulse received by a geophone in response to a seismic shot. First-break picking is performed for every trace. Current methods are interactive: an operator is presented with suggested picks based on the application of energy-based convolution operators to the individual traces. The parameters of the operators require tuning to the characteristics of the data, and retuning as these change along a seismic line. Even on data of high quality they are rarely more than 75% accurate.
Reported Methods

First-break picking by neural nets has been reported by McCormack [1], Veezhinathan et al. [5], Kusuma and Brown [6], and by Murat and Rudman [7]. The main differences between the approaches are the preprocessing applied to the data before presentation to the net and the architecture of the net. Apart from McCormack, all of the authors extracted a few features from the traces for input to the net. Feedforward multilayer nets, trained by back-error-propagation, were employed, except by Kusuma and Brown, who utilized a cascade-correlation net. Our objective was to compare the different approaches with regard to accuracy, rate of learning, and generalisability.
NET SIMULATIONS

The nets were implemented on a Meiko transputer surface which combines transputers and i860s. Most of the nets were trained using both back-propagation and cascade-correlation algorithms. Cascade-correlation was developed by Fahlman and Lebiere [8] to overcome some of the limitations of learning algorithms designed for multi-layered networks, such as the back-propagation algorithm. The main difference with cascade-correlation is that the topology of the network is not fixed: it starts with a minimal net and trains automatically, adding new hidden units one by one as they are needed. Each new unit, which forms a single-node hidden layer, receives a connection from each of the network's inputs and also from all pre-existing hidden units. Once a hidden unit has been created its input weights are frozen and only the output connections are trained. In this way, powerful high-order feature detectors are created. It is not unusual for very deep networks to be created, with a high fan-in to the hidden units.

[Figure 1: Net Architectures. Left: multi-layer perceptron (input layer, hidden layers, output layer). Right: cascade-correlation net.]

Figure 1 compares the architectures of multi-layer perceptron and cascade-correlation nets for a 5-input, 2-output problem. The multi-layer perceptron shown has only one hidden layer, with 3 nodes. The cascade-correlation net has created three single-node hidden layers. The main advantages of the cascade-correlation architecture are that it learns very quickly, that the network determines its own topology and size, and that it retains the structures it has built even if the training set changes; it is therefore useful for incremental learning, in which new information is added to an already-trained net.

Two different types of seismic reflection data were considered: dynamite, a representative example of which is shown in Figure 3, and Vibroseis (a trade mark of Conoco), as shown in Figure 4. The dynamite data, as is generally the case with impulsive-source data, has well-defined first breaks, unlike the Vibroseis data with its emergent or noisy first breaks, typical of non-impulsive-source data. We investigated different approaches using the dynamite data and then applied the optimum to the Vibroseis data.
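The following minimal sketch illustrates the growth scheme; the tanh units, the least-squares output fit, the learning rate and the toy problem are illustrative assumptions of ours, not details of [8] or of our simulations.

```python
import numpy as np

# Minimal sketch of cascade-correlation growth (after Fahlman & Lebiere [8]).
# The tanh units, least-squares output fit, learning rate and toy problem are
# illustrative assumptions, not details of [8] or of our simulations.

def forward_hidden(x, hidden):
    """Feed input x through the frozen hidden units in creation order; each
    unit sees the raw inputs plus the outputs of all earlier units."""
    acts = []
    for w in hidden:
        z = np.concatenate([x, np.asarray(acts, dtype=float)])
        acts.append(np.tanh(w @ z))                 # fan-in grows per layer
    return np.asarray(acts, dtype=float)

def train_outputs(X, T, hidden):
    """Hidden weights stay frozen; only output connections are (re)trained,
    here by a linear least-squares fit to keep the sketch short."""
    H = np.array([np.concatenate([x, forward_hidden(x, hidden)]) for x in X])
    H = np.hstack([H, np.ones((len(H), 1))])        # bias column
    W, *_ = np.linalg.lstsq(H, T, rcond=None)
    return W, H @ W - T                             # weights, residual errors

def add_unit(X, E, hidden, steps=300, lr=0.05):
    """Train one candidate to maximize the magnitude of the correlation of
    its activation with the residual errors, then freeze it as a new layer."""
    Z = np.array([np.concatenate([x, forward_hidden(x, hidden)]) for x in X])
    w = np.random.randn(Z.shape[1]) * 0.1
    Ec = E - E.mean(axis=0)                         # centred residual errors
    for _ in range(steps):
        v = np.tanh(Z @ w)
        cov = (v - v.mean()) @ Ec                   # correlation per output
        g = (1.0 - v ** 2) * (Ec @ np.sign(cov))    # ascent on sum_o |cov_o|
        w += lr * (Z.T @ g) / len(X)
    hidden.append(w)

# Usage on a toy 5-input, 2-output problem: grow until the error is small.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
T = np.stack([X[:, 0] * X[:, 1] > 0, X[:, 0] * X[:, 1] <= 0], 1).astype(float)
hidden = []
W, E = train_outputs(X, T, hidden)
while np.mean(E ** 2) > 1e-3 and len(hidden) < 20:
    add_unit(X, E, hidden)                          # one new single-node layer
    W, E = train_outputs(X, T, hidden)
```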
Dynamite Data Sets

A sliding time window was applied to each trace in all of the methods used for training the nets. The window is scanned down each trace and the data within the window can either be fed into the net directly (Amplitude Method), or can be pre-processed to extract attributes (Attribute Method). The window can be moved down the trace in steps of one sample at a time, hence processing all the samples in the trace, as in the amplitude method. Alternatively, only windows centered around peaks in the trace are considered, which is the way windows are used in the peak amplitude and attribute methods.
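The following sketch illustrates the two windowing schemes; the default window length, the peak test and the threshold handling are illustrative assumptions.

```python
# Sketch of the two windowing schemes described above; the peak test and
# threshold handling are illustrative assumptions.

def sample_windows(trace, length=75):
    """Amplitude method: slide the window one sample at a time, so every
    sample in the trace appears at the center of one window."""
    half = length // 2
    for c in range(half, len(trace) - half):
        yield c, trace[c - half : c + half + 1]

def peak_windows(trace, length=75, threshold=0.0):
    """Peak amplitude / attribute methods: only windows centered on peaks;
    peaks below the threshold are treated as noisy spikes and skipped."""
    half = length // 2
    for c in range(half, len(trace) - half):
        if (trace[c] > threshold
                and trace[c] >= trace[c - 1] and trace[c] > trace[c + 1]):
            yield c, trace[c - half : c + half + 1]
```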
Amplitude Method. The simplest way to form the input to the net is to use the amplitudes of the samples of the trace within the window directly, the output being whether the sample in the center of the window is above or below the first break. Two output nodes are used: if an output of (0,1) is produced, the true first break is below the current sample; if the output vector is (1,0), the first break is above the current sample. The predicted first break for a given trace is the point at which the output vector changes from (0,1) to (1,0). The only preprocessing applied before feeding the amplitudes to the net is to scale them to the range [0,255], unlike McCormack [1], who feeds a pixel image of the amplitudes to the net and therefore requires about 40 times more neurons than our nets. The number of input nodes is determined by the size of the window. Various window lengths were tested. The dynamite data comprised 250 samples per trace with a sampling interval of 4 msec; a window of 75 samples was found to produce the best results. This window allowed approximately 6 full cycles of the seismic trace to be fed to the net. As the net looks at the sample in the center of the window to decide whether it is before or after a first break, the network was effectively looking at 3 full cycles of the trace before the point of interest, and 3 full cycles after.

Back-Propagation Nets. For the networks trained with back-propagation, we used only one layer of hidden nodes. The standard algorithm of Rumelhart et al. [9] was used for training. The best results were produced by a net with 8 hidden nodes. With fewer nodes the performance dropped, and with more, overtraining could occur; moreover, the higher the number of hidden nodes, the more training time is needed. To avoid overtraining we split our data into two sets: a training set and a validation set. The validation set is used for monitoring the network's generalization while it trains, and is not used by the net for learning. As the net is training we can monitor the overall error on the two sets. In a typical learning session the training error decreases constantly. Although the validation error initially follows the training error in its decreasing path, it may at some point start increasing, as the training patterns are being overfitted. That is the point at which the learning process has to be stopped and the weights stored. Given a large number of training patterns compared to the number of weights in the network, this problem does not occur, in which case the learning process is stopped whenever the training error is considered sufficiently low.
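The early-stopping scheme may be sketched as follows; train_epoch, error and the net's weights()/set_weights() routines are hypothetical stand-ins, not our implementation.

```python
# Early stopping by validation monitoring, as described above. train_epoch()
# and the net's weights()/set_weights() are hypothetical stand-ins for one
# back-propagation pass and weight access; error() is the mean error on a set.

def train_with_validation(net, train_set, val_set, max_epochs=6000, patience=200):
    best_val, best_weights, since_best = float("inf"), net.weights(), 0
    for epoch in range(max_epochs):
        train_epoch(net, train_set)          # training set: used for learning
        val_err = error(net, val_set)        # validation set: monitored only
        if val_err < best_val:
            best_val, best_weights, since_best = val_err, net.weights(), 0
        else:
            since_best += 1                  # validation error rising:
            if since_best >= patience:       # the patterns are being overfitted
                break
    net.set_weights(best_weights)            # restore the stored weights
    return net
```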
For each network configuration a few runs were made with different initial random weights, and only the best of these nets was considered. Training on 36 traces from a shot record, while validating on the remaining 12 traces of that shot, for 5000 to 6000 iterations (epochs), takes about 3-5 hours on a Meiko surface containing two i860 processors. After training, the net was tested on two 48-trace shots from the same survey, together with the validation set. The best net could pick the first breaks in 102 of the 108 traces (94.4%) within a deviation of one sample; that is to say, if the network's answer was within 4 msec of the actual first break, it was considered a correct pick. About 70% of the traces were actually picked at the true sample. Only in 3 traces did the network pick a break more than 7 samples (28 msec) away from the true location. In 19% of the traces there were one or more further picks, but the one occurring first was selected as the prediction of the net. It is worth mentioning that all 12 traces forming the validation set were correctly picked within a one-sample deviation, making the network's performance 100% for traces from the same shot as the training data. For the other two shot records of the test data, the performances were 91.6% and 95.9% respectively. Halving the window size to 37 samples gave similar performance, but the accuracy of the network dropped slightly: fewer traces had their first breaks picked at exactly the right times, and a couple of traces were picked quite far from their actual positions. Nevertheless, if accuracy is not as important as minimizing training times, then a smaller window size might be preferred.

Cascade-Correlation Nets. A cascade-correlation training algorithm was then applied to the data with the same preprocessing in operation. The number of output nodes was reduced to one, with (0) and (1) indicating signals before and after the first break respectively. With 75 input neurons, the best net had successfully learned all training patterns in 2300 epochs, in about 4 hours of training, after having added 17 single-node hidden layers. More traces were correctly picked within a one-sample margin of error than with the previous (back-propagation) net: 104 out of 108 (96.3%), but only 42% of the traces were picked exactly on the true sample. It was concluded that for the window amplitude approach back-propagation was to be preferred, due to its more accurate performance.

Peak Amplitude Method. Trying to pick the real first break in the presence of noise is a difficult task. It can be simplified by selecting a fixed phase, such as the first peak of the seismic pulse. Hence, with this method, a first break is identified with a peak location and the network now examines only windows of data which are centered about peaks. The first break is defined as the peak at which the output vector first changes from (0) to (1). Peaks with amplitudes lower than a certain threshold are considered to be noisy spikes and are not used for centering windows. With this method the number of training patterns is drastically reduced, and hence so is the training time. Cascade-correlation out-performs back-propagation by far using this method. In under 35 seconds, and with the creation of only one hidden unit, the net learnt to pick correctly in all 48 traces of a shot; when tested on another two shots from the same survey it picked 94 out of the 96 traces correctly (97.9%). Note that no deviations are tolerated with this method, since a first break picked on a peak further along may be more than 100 msec late. The optimum window length was again found to be 75 samples.
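The picking rule shared by these methods reduces to scanning the net's decisions down the trace, as in the following sketch; classify is a hypothetical stand-in for a forward pass of the trained net.

```python
# First-break pick as the first window whose output flips from (0) to (1);
# classify() is a hypothetical stand-in for a forward pass of the trained net.

def pick_first_break(windows, classify):
    """windows yields (center_sample, window) pairs, e.g. from peak_windows()
    above; returns the center of the first window judged 'after the break'."""
    for center, window in windows:
        if classify(window) == 1:
            return center
    return None                  # no pick: the net never crossed the break
```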
Various output activation functions were tested for the neurons. A gaussian function yielded superior performance to a sigmoid function. Back-propagation gives similar performance, but takes ten to forty times longer to train, and the user has to find the optimal number of hidden units by trial and error. There is also the problem of becoming trapped in local minima, which is endemic to such nets.

A small variation of this method is to change the meaning of the output, so that an output of (1) points to a first-break peak, with (0) for all other, non-first-break peaks. This variation gave inferior results compared to the previous method, perhaps because almost all of the training patterns were of the (0) output class and hence the energy surface was ill-conditioned.

Attribute Method. This method also employs windows centered about peaks, but instead of presenting the amplitudes to the net, they are pre-processed by calculating a set of attributes for the samples within each window; the attributes are then input to the net. This reduces the number of input nodes from 75 to 4. Four attributes were determined for each window: peak amplitude, peak-to-(following)-trough amplitude difference, root-mean-square (rms) amplitude ratio, and rms amplitude ratio on adjacent traces. The rms ratio is the ratio of the rms amplitude in a window of n samples before the peak to that after the peak. For each window,
\mathrm{rms\ amp} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2} \qquad (1)
where $x_i$ is the $i$-th data sample. This attribute gives a regional amplitude variation. For our data we selected n = 75. The fourth attribute is calculated by adding the two rms amplitude ratios of the adjacent traces. A smaller set of windows is used in the subsidiary traces: two n = 15 windows per trace. This provides the net with a check for spatial coherence in the occurrence of first breaks. The best network was a cascade-correlation-trained net, which created 7 hidden layers in under 15 seconds. The performance of the net was identical to that of the Peak Amplitude net, i.e. only 2 out of the 96 traces were incorrectly picked, giving a performance of 97.9%. For dynamite data the Peak Amplitude method with cascade-correlation is to be preferred, since it is simple and yields good results.
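The attribute computation may be sketched as follows; the trough search and the adjacent-trace bookkeeping are illustrative assumptions.

```python
import numpy as np

# Sketch of the four window attributes described above; the trough search and
# the adjacent-trace bookkeeping are illustrative assumptions. Assumes the
# peak lies at least n samples from either end of the trace.

def rms(x):
    return np.sqrt(np.mean(np.asarray(x, dtype=float) ** 2))   # equation (1)

def attributes(trace, peak, left, right, n=75, m=15):
    """left/right are the adjacent traces; the window is centered on `peak`."""
    trough = peak + 1 + int(np.argmin(trace[peak + 1 : peak + n]))
    a1 = trace[peak]                                   # peak amplitude
    a2 = trace[peak] - trace[trough]                   # peak-to-trough difference
    a3 = rms(trace[peak - n : peak]) / rms(trace[peak : peak + n])
    a4 = sum(rms(t[peak - m : peak]) / rms(t[peak : peak + m])
             for t in (left, right))                   # spatial-coherence check
    return np.array([a1, a2, a3, a4])
```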
Vibroseis Data Sets

The Vibroseis data set utilized consists of 240-trace shots, each trace having 250 samples with a sampling interval of 4 msec. Compared to the good quality of the dynamite data set, the Vibroseis data is of fairly poor quality. Only a small number of the 240 traces in each shot are clean; a large number are of extremely poor quality, with high levels of noise. Examples of both clean and noisy traces are shown in Figure 2.
[Figure 2: Vibroseis Traces — clean and noisy example traces; amplitude (×10³) against sample number, 0-250.]
Amplitude Method. The amplitude method is totally impractical for this type of data, since massive training sets need to be created and the training times are infeasible with back-propagation. Training times are an important consideration, assuming that this neural network could be part of a program used for seismic interpretation: as the networks need to be re-trained for each different survey, it is necessary to do so within a feasible time. With cascade-correlation the situation is very similar, as the net creates a high number of hidden layers before even approaching a solution to the problem. Hence one of the peak methods had to be used.

Peak Amplitude and Attribute Methods. Geophysical interpreters consider the Vibroseis data set difficult to pick, compared to the dynamite set, which is easy. In order to assist the network with its training, a gating window in the vicinity of potential first-break peaks is selected on the trace, allowing eight peaks per trace to be analyzed; a sketch of this candidate selection follows the results below. A window of 75 samples is used for data input to the net, equivalent to 0.3 seconds of seismic data. The nets were trained on 120 traces of a shot and tested on the remaining 120. With the peak amplitude method, the best cascade-correlation net needed about 4 to 5 minutes of training and created 19 hidden nodes. The performance of the net was 52%. When using the attribute method, only 2 minutes of training was required, but 42 hidden nodes were created. The net classified 72% of the patterns correctly, but 56% of the traces were absolutely correctly picked. On some of the traces the nets did not pick a first break at all. Considering the poor quality of the data, we might prefer a no-pick response from the net to erroneous picks on a trace; in this case, the picking success ratio of the net using the attributes was 60%. It should be noted that by using a narrow gate window of four candidate peaks per trace, a 75% performance was attained.
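A possible form of the gating step is sketched below; since the text does not specify how the gate is positioned, its placement here is an illustrative assumption.

```python
# Sketch of the gating window: only the first k above-threshold peaks inside
# a gate around the expected first-break time are kept as candidates. The
# gate placement is an illustrative assumption; the text states only that
# eight (or, for the narrow gate, four) candidate peaks per trace are allowed.

def gated_peaks(trace, gate_start, gate_end, k=8, threshold=0.0):
    lo = max(1, gate_start)
    hi = min(len(trace) - 1, gate_end)
    peaks = [c for c in range(lo, hi)
             if trace[c] > threshold
             and trace[c] >= trace[c - 1] and trace[c] > trace[c + 1]]
    return peaks[:k]
```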
CONCLUSIONS
In most methods, back-propagation and cascade-correlation achieved very similar performances. Only in the amplitude method used for the dynamite data is back-propagation more accurate. For all other methods cascade-correlation is to be preferred, since there is no need to guess the number of hidden nodes or layers. Cascade-correlation also trains extremely fast, up to 40 times quicker than back-propagation. With cascade-correlation, incremental learning is also feasible, in which a trained net can be retrained on additional data while retaining about 90% of its previous knowledge.
For the dynamite data, both the peak amplitude and the attribute method give the same performance of about 98%. The peak amplitude method is simpler to implement, since no attributes need to be calculated. The amplitude method gives nets with a performance of about 96%, but that includes traces picked within a 4 msec deviation of the true first break. For the Vibroseis data, the amplitude method was considered impractical, due to very long training times. Performances of 52% were achieved with the peak amplitude method, and 56% with the attribute method. Generalization is much poorer than with the nets used for the dynamite data, and care has to be taken to avoid overtraining. For large data sets of poor quality the attribute method is superior, as it requires a small number of input neurons, hence reducing the training times, although deeper net structures are created. Further attributes may be added, hopefully giving nets with higher performance. In any case, the use of neural nets for first-break picking is a feasible application, especially when using cascade-correlation, due to its short training times.

The authors wish to thank Simon Horizon Ltd. for supplying the data for the investigation, and Mr. P. Haskey, Dr. R. L. Silva and Mr. R. Holden for many useful discussions.
Figure 3: Neural net's picks on a dynamite shot, using the amplitude method.

Figure 4: Neural net's picks on a Vibroseis shot, using the attribute method.

REFERENCES

[1] M.D. McCormack. Seismic trace editing and first-break picking. In 60th Ann. Internat. Mtg., Soc. Expl. Geophys., Expanded Abstracts, pages 321-324, 1990.

[2] K.Y. Huang, W.R. Chang, and H.T. Yen. Self-organising neural networks for picking seismic horizons. In 60th Ann. Internat. Mtg., Soc. Expl. Geophys., Expanded Abstracts, pages 313-316, 1990.

[3] X. Liu, P. Yue, and L. Li. Neural network method for tracing seismic events. In 59th Ann. Internat. Mtg., Soc. Expl. Geophys., Expanded Abstracts, pages 716-718, 1989.

[4] S.Y. Lu and J.G. Berryman. Inverse scattering, seismic traveltime and neural networks. Technical Report UCRL-JC-104358, Lawrence Livermore Natl. Lab., 1990.

[5] J. Veezhinathan, D. Wagner, and J. Ehlers. First break picking using a neural network. In F. Aminzadeh and M. Simaan, editors, Expert Systems in Exploration, chapter 8, pages 179-202. Society of Exploration Geophysicists, 1991. ISBN 0-56080-023-2.
[6] T. Kusuma and M.M. Brown. First break picking and trace editing using cascade-correlation learning architecture. Int. Conf. on Petroleum Exploration and Production, 1992.

[7] M.E. Murat and A.J. Rudman. Automated first arrival picking: a neural network approach. Geophysical Prospecting, 40:587-604, 1992.

[8] S.E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D.S. Touretzky, editor, Advances in Neural Information Processing Systems 2, pages 524-532. Morgan Kaufmann, San Mateo, USA, 1990.

[9] D. Rumelhart, G. Hinton, and R. Williams. Learning internal representations by error propagation. In Parallel Distributed Processing, volume 1, chapter 8. MIT Press, 1986.