Data-driven modeling for groundwater exploration in fractured crystalline terrain, northeast Brazil

Michael James Friedel & Oderson Antônio de Souza Filho & Fabio Iwashita & Adalene Moreira Silva & Sueli Yoshinaga

Abstract It is not possible, using numerical methods, to model groundwater flow and transport in the fractured crystalline rock of northeastern Brazil. As an alternative, the usefulness of self-organizing map (SOM), k-means clustering, and Davies-Bouldin techniques to conceptualize the hydrogeology was evaluated. Well yield and groundwater quality were also estimated across the Juá region. This process relies on relations in the underlying multivariate density function associated with a sparse local set of hydrogeologic (electrical conductivity, geology, temperature, and well yield) and a complete regional set of airborne geophysical (electromagnetic, magnetic, and radiometric) and satellite spectrometric measurements. Resampling of the regional well yield and electrical conductivity estimates provides sufficient resolution to construct variograms for stochastic modeling of the hydrogeologic variables. The combination of these stochastic maps provides a way to identify potential drilling targets for future groundwater development. The data-driven estimation approach, when applied to available airborne electromagnetic and water-well hydrogeologic measurements, provides a low-cost alternative to numerical groundwater flow modeling. In addition to fractured rock environments, the alternative modeling framework can provide spatial parameter estimates and associated variograms for constraints to improve the traditional calibration of equivalent groundwater-porous-media models.

Keywords Brazil . Fractured rocks . Geophysical methods . Self-organizing map . Well yield

Received: 14 July 2011 / Accepted: 29 March 2012
Published online: 15 May 2012
© Springer-Verlag (outside the USA) 2012

M. J. Friedel (✉)
Crustal Imaging & Geochemistry Science Center and Center for Computational and Mathematical Biology, US Geological Survey and University of Colorado, Box 25046, Denver Federal Center Mail Stop 964, Denver, CO 80225, USA
e-mail: [email protected]
Tel.: +1-303-2367790
Fax: +1-303-2361425

O. A. Souza Filho
Geological Survey of Brazil and University of Campinas, Rua Voluntários da Pátria, 475, sala 10, Curitiba CEP: 80.020-926 PR, Brazil
e-mail: [email protected]

F. Iwashita
Nevada System of Higher Education, Desert Research Institute, 2215 Raggio Parkway, Reno, NV 89512, USA
e-mail: [email protected]

A. M. Silva
Applied Geophysics Laboratory, Institute of Geosciences, University of Brasilia, Campus Darcy Ribeiro, Asa Norte, Brasília, CEP: 70.910-900 DF, Brazil
e-mail: [email protected]

S. Yoshinaga
Institute of Geosciences, University of Campinas, 13083-970, Campinas, SP Caixa-Postal: 6152, Brazil
e-mail: [email protected]

Hydrogeology Journal (2012) 20: 1061–1080
DOI 10.1007/s10040-012-0855-1

Introduction

Northeast Brazil is home to about 25 million people who live in a 1 million km² area, one of the most populated semi-arid regions on Earth. Every decade, this region is subject to one or more El Niño-related droughts lasting on average about 2 years (Glantz 1993). These droughts are devastating for the rural population because they promote disintegration of subsistence agriculture. To date, groundwater exploration and development of resources in this region have been only marginally successful (Coriolano et al. 2000), and each year there are huge federal and local investments in drilling campaigns and reservoir construction (Cirilo 2008).

Perhaps the greatest challenge to water-resource development is that most of the region (>65 %) is underlain by fractured Precambrian crystalline bedrock (Souza Filho 1998). Because the groundwater in these fractures is brackish or saline, the fractures are thought to behave as conductors in an induced electromagnetic field. Under this hypothesis, water-resource prospectors traditionally apply surface electromagnetic and electrical methods for locating water-bearing drill sites. Of these methods, the very low frequency technique is the most successful (Pinéo 2005; Neves 2005). However, given subjectivity in interpreting the associated measurements and high drilling cost, fewer than 100 wells have actually been completed in this region.

For these reasons, the government of Brazil encourages research that can help locate groundwater resources to augment the long-term rural water supply. As part of this initiative, a series of helicopter-borne (airborne) electromagnetic and magnetic surveys was flown in three states of northeastern Brazil under an international cooperative project (PROASNE 2007). These airborne surveys covered three areas of hundreds of square kilometers, with acquisition periods of only weeks, and allow local evaluation of groundwater resources rather than being restricted to household properties, as generally occurs with surface geophysics for these applications. Despite the widespread use of these data types in Australia, China, Europe, and North America (Hildenbrand et al. 1990; Speed 2002; Meng et al. 2006), the usefulness of airborne data for water-resource applications in Brazil has been evaluated only over the past decade (Pinéo 2005; Silva 2005; Souza Filho et al. 2006, 2007a, b, 2009, 2010).

Like geophysical studies, groundwater studies produce data requiring a model to assist with their interpretation. The traditional groundwater modeling paradigm involves interpreting hydrogeologic processes by mining knowledge for data. That is, an investigator's knowledge is used to formulate and test a groundwater hypothesis using the numerical model conceptualization, parameterization, and calibration process (Silva 2005). As part of this process, the governing equations describing the phenomena of interest are assumed to be known, and appropriate data are assumed to be available to constrain the estimation of model parameter values.
Unfortunately, the disparate and sparsely populated hydrogeologic data sets from northeastern Brazil render it virtually impossible to conceptualize, develop, and use conventional equivalent porous media or discontinuum models. One alternative modeling paradigm is to mine data for knowledge using a self-organizing map (SOM) technique (Kohonen 2001). The SOM can be combined with k-means clustering and other multivariate statistical techniques to evaluate and exploit complex nonlinear relations among sparse and noisy variables. Hydrogeologic applications involving the SOM include diagnosing the effect of storm-water infiltration on groundwater quality (Hong and Rosen 2001), identifying distinct classes of chemical composition associated with various aquifers (Peeters et al. 2006), evaluating hillslope chemical weathering relations (Iwashita et al. 2011a), and estimating hydrogeologic properties of soils (Iwashita et al. 2011b). The k-means clustering technique (McQueen 1969) has been used with numerical groundwater modeling applications to group geophysical (Dietrich and Tronicke 2009) and hydrogeologic (Brauchler et al. 2011) variables.

In this paper, the efficacy of using the alternative modeling paradigm to assist in the location of probable well drilling sites in fractured crystalline terrain is explored. The objectives of the study are to use the SOM to: (1) devise a conceptual hydrogeologic framework, and (2) estimate probable well yield and electrical conductivity under different climatic episodes for the Juá region of northeastern Brazil. To meet these objectives, it is hypothesized that it is possible to construct a multivariate hydrogeologic-geophysical density function that characterizes the regional hydrogeology. To the authors' knowledge, this is the first attempt at applying the SOM to disparate and sparse hydrogeologic-geophysical measurements. This study extends the work of Souza Filho et al. (2010), who sought to map groundwater electrical conductivity in the Juá region, and the work of Friedel (2011), who sought to conceptualize post-fire hydrologic models by k-means clustering of SOM neurons, and it relies on a dataset compiled by Souza Filho (2008).
Study area

The Juá study area (132 km²) is part of the Irauçuba municipality in the semi-arid State of Ceará (Fig. 1) and one of the PROASNE pilot areas (PROASNE 2007). The landscape and drainage are influenced by the aridity that characterizes northeastern Brazil. The geomorphology of the area is best described as undulating surfaces with altitudes (above sea level) in the range of 140 to 190 m and hills that rise to 300 m over the plains. Drainages are intermittent and structurally controlled by fractures, faults, and foliation.

The Geological Survey of Brazil publishes lithologic, structural, and hydrologic data about this area (Souza Filho 1998; Souza Filho et al. 2003; Veríssimo and Feitosa 2002). Based on this information, about 25 wells are known to penetrate the surficial sediment and fractured crystalline bedrock at depths ranging from 7 to 80 m. Despite the relatively low yield (less than 7 m³/hr) that is characteristic of wells in this region, the groundwater flow system appears to be dynamic. This assertion is based on observations of seasonal fluctuations in electrical conductivity measurements, from 112.6 to 375 millisiemens per meter (mS/m), during 1997 and 2002 (Silva et al. 2001) in wells J66 and J67 (Fig. 2). Although these wells are only 40 m apart, the differences in borehole electrical conductivity and temperature profiles emphasize that groundwater flow is heterogeneous and fracture controlled (Silva et al. 2003).
Geology

The geology of the study area is available as mapped by Souza Filho (1998). Mapping at the 1:70,000 scale characterizes the region as Neoproterozoic supracrustal units with sillimanite-biotite gneisses that include layers of calc-silicate gneisses, marbles, deformed granitic sheetings, and amphibolites (Fig. 2). A colluvium cover, less than 0.5 m thick, also exists at the eastern and northwestern parts of the study area. Along the São Gabriel creek near Juá village, the Quaternary alluvium (composed of conglomeratic to fine-grained sand) reaches about 2.5 m in thickness, of which the uppermost part is characterized by a 0.2-m-thick clay layer. The weathered bedrock usually is less than 5 m thick, but at some well locations (for example, MAN2, SIT1 and CD1) it continues to depths of 20 m (Souza Filho et al. 2006).

Fig. 1 Location of the Juá study area in Brazil

Fig. 2 Geology and location of drilled wells in the Juá study area (after Souza Filho 1998)

According to Souza Filho (1998), faulting and fracturing in the area are complex. The main extensional NNW–SSE and E–W fault directions are related to the last episodes of the Braziliano Orogeny that occurred during the Neoproterozoic to Early Cambrian periods. A set of NW–SE trending mafic dikes marks the Mesozoic era, and normal faults oriented along the NNE–SSW direction that are filled with clay and silt indicate the influence of neotectonics. At the regional scale, a WNW–ESE trending lineament is recognized as a transtensional fault that is filled with 1-m-thick granite sheeting. Outcrops are highly fractured on this lineament, especially near Juá village where joint frequencies are about 100 joints per square meter. Because this lineament coincides with the São Gabriel River (Fig. 2), it is thought to be an important regional hydraulic feature. Other secondary hydraulic features may be associated with weathering of foliation.
Climate and water balance

Key climate factors affecting the water balance of northeastern Brazil are rainfall, temperature, and evapotranspiration. To quantify these factors, the Juá village maintains a weather station that is operated by the Ceará State Foundation for Meteorology and Hydric Resources. Data collected over the period 1999–2007 indicate that annual rainfall is less than 800 mm, most of which falls within the first four months (January–April). Rainfall conditions also vary by year: dry (250–400 mm), average (400–600 mm), and wet (600–800 mm). The respective annual dry and wet conditions are generally associated with El Niño and La Niña events. In the past decade, strong El Niño events of 2001, 2005, and 2007 resulted in significant reductions in annual rainfall from the previous year. For example, rainfall amounts went from 773 mm to 346 mm over the period 2000–2001, from 683 mm to 359 mm over the period 2004–2005, and from 489 mm to 272 mm over the period 2006–2007. Over this period, a strong La Niña event is associated with 2000 (National Oceanic and Atmospheric Administration 2011).
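The rainfall figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original study: it classifies each quoted annual total using the dry/average/wet ranges given in the text and computes the year-to-year reduction. The function names and the handling of the shared class boundaries (400 mm and 600 mm are assigned to the wetter class, and anything below 400 mm is treated as dry) are assumptions made for illustration.

```python
# Annual rainfall totals (mm) quoted in the text for the years bracketing El Niño events.
RAIN_MM = {2000: 773, 2001: 346, 2004: 683, 2005: 359, 2006: 489, 2007: 272}


def classify(rain_mm):
    """Rainfall class from the text: dry 250-400, average 400-600, wet 600-800 mm.

    Assumption: boundary values go to the wetter class; values below 400 mm are 'dry'.
    """
    if rain_mm < 400:
        return "dry"
    if rain_mm < 600:
        return "average"
    return "wet"


def percent_drop(before_mm, after_mm):
    """Relative reduction in annual rainfall, as a percentage of the earlier year."""
    return 100.0 * (before_mm - after_mm) / before_mm


for y0, y1 in [(2000, 2001), (2004, 2005), (2006, 2007)]:
    drop = percent_drop(RAIN_MM[y0], RAIN_MM[y1])
    print(f"{y0} ({classify(RAIN_MM[y0])}) -> {y1} ({classify(RAIN_MM[y1])}): "
          f"{drop:.1f} % less rain")
```

Under these assumptions, each of the three El Niño years falls in the dry class, with reductions of roughly 44 to 55 % relative to the preceding year.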
Data

Unprocessed field measurements

Unprocessed (no digital processing) measurements from Juá field investigations (Table 1) provide spatial variables for this study (Souza Filho 2008). Well yield (Yield) information (Table 2) is mostly obtained from drilling reports, owner knowledge, and anecdotal information during the dry period of the 2000 wet year (La Niña event). For these reasons, there exists some amount of uncertainty in the relative production values, which also depend on the pumping system: submersible pump, injection pump, manual pump, or windmill. The hydrogeologic field surveys are used to determine the distribution of electrical conductivity (EC2000) and temperature (TEMP) during 2000 and electrical conductivity during 2005 (EC2005) at 23 wells (Fig. 2) during the respective wet (La Niña event) and dry (El Niño event) years. For these measurements, groundwater characteristics reflect collection directly from the well or at the end of a water-supply line coming from the well. Measurements of water samples are made using a portable and calibrated multi-parameter meter with measurement uncertainty of about ±3 %.

Table 1 Summary of unprocessed Juá study measurement variables (°C degrees Celsius; cps counts per second; Hz hertz; mS/m millisiemens per meter; ppm parts per million of primary response; in-phase the real part of the frequency signal; quadrature the imaginary part of the frequency signal)

| Variable (units) | Unprocessed measurement description |
|---|---|
| Hydrogeologic | |
| CoordX (m) | UTM coordinate in easting direction |
| CoordY (m) | UTM coordinate in northing direction |
| DEM (m) | Digital elevation model |
| EC2000 (mS/m) | Electrical conductivity of groundwater^a |
| EC2005 (mS/m) | Electrical conductivity of groundwater^b |
| TEMP (°C) | Temperature of groundwater measured^a |
| Yield (m³/hr) | Well yield from pump test or owner information^b |
| Airborne geophysical | |
| K (cps) | Gamma-ray spectrometer measurement of potassium^c |
| Th (cps) | Gamma-ray spectrometer measurement of thorium^c |
| U (cps) | Gamma-ray spectrometer measurement of uranium^c |
| L1cxip (ppm) | 918.5 Hz, coaxial configuration, in-phase component^d |
| L1cxq (ppm) | 918.5 Hz, coaxial configuration, quadrature component^d |
| L1cpip (ppm) | 874.3 Hz, coplanar configuration, in-phase component^d |
| L1cpq (ppm) | 874.3 Hz, coplanar configuration, quadrature component^d |
| L2cxip (ppm) | 4443 Hz, coaxial configuration, in-phase component^d |
| L2cxq (ppm) | 4443 Hz, coaxial configuration, quadrature component^d |
| L2cpip (ppm) | 4865 Hz, coplanar configuration, in-phase component^d |
| L2cpq (ppm) | 4865 Hz, coplanar configuration, quadrature component^d |
| L3cpip (ppm) | 33645 Hz, coplanar configuration, in-phase component^d |
| L3cpq (ppm) | 33645 Hz, coplanar configuration, quadrature component^d |

a Measurements made during April (dry period) 2000 (wet year; La Niña event)
b Measurements made during April (dry period) 2005 (dry year; El Niño event)
c Measurements made during October (dry period) 1977
d Measurements made during April (dry period) 2001 (dry year; El Niño event)

Table 2 Measured water yield at wells in the Juá study area (A alluvium; B gneiss and granite-gneiss; C biotite-gneiss and calc-silicate rock or amphibolite; D biotite-gneiss and marble)

| Sample number | Well identifier | Rock type | Well yield (m³/hr) | Information source |
|---|---|---|---|---|
| 1 | J65 | B-gneiss and granite-gneiss | 6 | Construction report |
| 2 | J66 | B-gneiss and granite-gneiss | 6 | Construction report |
| 3 | J67 | B-gneiss and granite-gneiss | 5 | Construction report |
| 4 | CAIB2 | B-gneiss with quartz lens | 4.5 | Given by owner |
| 5 | CD1 | C-biotite-gneiss and calc-silicate rock | 2 | Construction report^a |
| 6 | ST1 | B-biotite-gneiss | 1.25 | Construction report |
| 7 | CAI3 | B-biotite-gneiss and granitic sheeting | 1.2 | Given by owner |
| 8 | CAIB1 | B-gneiss with quartz lens | 1.1 | Given by owner |
| 9 | RM1 | B-garnet-biotite-gneiss | 1 | Construction report^b |
| 10 | JC1 | B-biotite-gneiss | 0.7 | Construction report |
| 11 | BR2 | B-biotite-gneiss and ultra-cataclasite | 0.7 | Approximated^c |
| 12 | CB1p | B-biotite-gneiss | 0.7 | Given by owner |
| 13 | SL1 | D-biotite-gneiss and marble | 0.3 | Given by owner |
| 14 | SP2 | C-biotite-gneiss and amphibolite | 0.3 | Approximated^c |
| 15 | SP4 | B-biotite-gneiss with quartz lens | 0.3 | Approximated^c |
| 16 | UB1 | B-feldspar-quartz lens and garnet-biotite-gneiss | 0.3 | Given by owner^b |
| 17 | MAN2 | D-biotite-gneiss and marble | 0.3 | Construction report^a |
| 18 | MAN6 | B-gneiss | 0.1 | Given by owner |
| 19 | UB4 | D-biotite-gneiss and marble | 0.03 | Approximated^b |
| 20 | J1 | B-biotite-gneiss | 0.03 | Approximated^c |
| 21 | RM1c | B-garnet-biotite-gneiss | Not available | Dug well^c |
| 22 | CD1c | Alluvium | Not available | Dug well^b |
| 23 | CB1c | Alluvium | Not available | Dug well |

a Measurements gathered during October (dry period) 2003 (wet year; La Niña event)
b Measurements gathered during October (dry period) 2005 (dry year; El Niño event)
c Measurements gathered during October (dry period) 2000 (wet year; La Niña event)

In addition to water analyses, three types of airborne geophysical measurements are available from surveys carried out across the region. These data include regional radiometric measurements (conducted during the dry period of 1977), and local electromagnetic plus magnetic measurements (conducted during the transition month between wet and dry periods of 2001).

The radiometric data, including potassium (K), thorium (Th) and uranium (U), are from a gamma-ray spectrometer (Exploranium DIGRS-3001 with a volume of 1017.87 cubic inches (in³) of thallium-activated sodium iodide detectors) placed inside an aircraft (Lasa and Prospecções 2001). The spectrometer measures the rate of gamma radiation emitted by radioactive minerals present in soil and rock to depths of about 35 cm (Pine and Minty 2005). The sampling rate is equivalent to one measurement every 55 m while flying at a 150-m elevation above the ground. The north–south flight lines are spaced 500 m apart with orthogonal tie lines flown every 20,000 m. The radiometric data are corrected for leveling variations and initially interpolated to a regular grid comprising 250-m cells. However, to match the unit area of the local airborne data and facilitate comparisons between the models, these data were resampled to a 25-m cell-size grid. Individual gamma-ray spectrometry data are discriminated with reference to the total energy and separated by channels for potassium, thorium and uranium. All radiometric data are corrected for dead-time, energy variations, altitude variations, and Compton scattering effects.

The magnetic and electromagnetic data were collected simultaneously inside a so-called bird towed 30 m above ground surface using an Esquilo helicopter that flew along east–west lines spaced 100 m apart with tie lines every 500 m (Lasa and Prospecções 2001). Given the 10-Hz system sampling rates and the airborne velocity, measurements are recorded along flight lines approximately every 4 m. The magnetic data were acquired using a Geometrics G-822 Cesium sensor magnetometer and refer to intensity measurements of the total magnetic field. The electromagnetic data were acquired using an Aerodat DSP-99 system whose measurements refer to in-phase (ip) and quadrature (q) components of the secondary magnetic field strength collected using vertical coaxial (cx) and horizontal coplanar (cp) configurations at five frequencies: L1cxip (918.5 Hz); L1cxq (918.5 Hz); L1cpip (874.3 Hz); L1cpq (874.3 Hz); L2cxip (4443 Hz); L2cxq (4443 Hz); L2cpip (4865 Hz); L2cpq (4865 Hz); L3cpip (33645 Hz); L3cpq (33645 Hz).
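Two pieces of survey arithmetic described above can be made concrete with a short sketch. The ground speed of about 40 m/s (144 km/h) is an inference from the stated 10-Hz sampling rate and 4-m sample spacing, not a value reported in the source. The source also does not say how the radiometric grid was resampled from 250-m to 25-m cells; nearest-neighbor replication is assumed here purely for illustration, and the function names and the 2 x 2 input grid are hypothetical.

```python
import numpy as np


def implied_ground_speed(spacing_m, rate_hz):
    """Ground speed (m/s) implied by along-line sample spacing and sampling rate."""
    return spacing_m * rate_hz


def resample_nearest(grid, factor):
    """Refine a grid by replicating each cell factor x factor times.

    Assumption: nearest-neighbor replication; the source does not state the method.
    """
    grid = np.asarray(grid)
    return np.repeat(np.repeat(grid, factor, axis=0), factor, axis=1)


# 4-m spacing at 10 Hz implies about 40 m/s (inferred, not reported in the source).
speed = implied_ground_speed(4.0, 10.0)

# Going from 250-m to 25-m cells refines each axis by a factor of 10.
factor = 250 // 25
coarse = np.array([[1.0, 2.0], [3.0, 4.0]])  # hypothetical 2 x 2 grid of 250-m cells
fine = resample_nearest(coarse, factor)      # 20 x 20 grid of 25-m cells

print(speed, factor, fine.shape)
```

Replication adds no information to the radiometric maps; it only places them on the same 25-m support as the local electromagnetic and magnetic grids so that variables can be compared cell by cell.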
Processed field measurements

The unprocessed hydrogeologic and geophysical measurements are difficult to interpret individually. Therefore, digitally processed geophysical and satellite data from a previous study (Souza Filho 2008) are used for well sites where hydrogeologic information is available. These data types include hydrogeologic measurements (bedrock geology, structural lineaments, topography, and groundwater electrical conductivity) and airborne geophysical measurements (satellite, radiometric, magnetic, and electromagnetic attributes). The digital processing is summarized together with corresponding variables in Table 3.

Table 3 Summary of processed Juá study measurement variables (°C degrees Celsius; cps counts per second; Hz hertz; mS/m millisiemens per meter; ppm parts per million of primary response; in-phase the real part of the frequency signal; quadrature the imaginary part of the frequency signal)

| Variable (units) | Processed measurement description |
|---|---|
| Hydrogeologic | |
| Rock types (name) | Geology |
| Raspect (degrees) | Basin aspect |
| Rb100lin (dimensionless) | Class of lineament azimuth up to 100 m from well |
| Yield class (dimensionless) | Class of well yield |
| KDE (mS/m) | Electrical conductivity estimated by Kriging with external drift |
| STD (mS/m) | Standard deviation of electrical conductivity estimated from 100 Monte Carlo trials |
| Q75 (mS/m) | Seventy-fifth percentile of electrical conductivity estimated from 100 Monte Carlo trials |
| IQR (mS/m) | Interquartile range of electrical conductivity estimated from 100 Monte Carlo trials |
| Airborne geophysical | |
| brdt | ETM+/Landsat-7 images processed using the band-ratio technique^a |
| Cond900cx (mS/m) | Reciprocal of layer resistivity (conductivity) derived from Marquardt inversion constrained by 918.5 Hz, in-phase and quadrature, coaxial measurements^b |
| Cond900cp (mS/m) | Reciprocal of layer resistivity (conductivity) derived from Marquardt inversion^a constrained by 918.5 Hz, in-phase and quadrature, coplanar measurements^b |
| Cond4500cx (mS/m) | Reciprocal of layer resistivity (conductivity) derived from Marquardt inversion^a constrained by 4,443 Hz, in-phase and quadrature, coaxial measurements^b |
| Cond33000cp (mS/m) | Reciprocal of layer resistivity (conductivity) derived from Marquardt inversion^a constrained by 33,645 Hz, in-phase and quadrature, coplanar measurements^b |
| Dif4500_900 (mS/m) | Difference between Cond4500cx and Cond900cx |
| MAG60 (nT) | Magnetic data reduced to magnetic north pole and matched to emphasize shallow magnetic sources^b |
| TerMag (nT) | Terrace-shape function derived using the second vertical derivative^b |

a Measurements recorded during October (dry period) of 1999
b Processing details provided in Fraser (1978)

The airborne geophysical data were digitally processed with the objective of enhancing shallow-source anomalies and gradient trends related to hydraulic structures. After leveling, the magnetic data are reduced to the magnetic north pole to locate anomalies over the causative source; they are then matched filtered in the frequency domain (Phillips 1997) to separate frequencies related to shallow (above 60 m deep) from deep magnetic sources (up to 2,000 m depth). The grid of the shallow magnetic source data (MAG60) is used because it correlates with the depth interval (7–100 m) of wells in the Juá area. The reduced-to-the-pole magnetic data were also transformed to physical property maps of magnetization using a technique (Cordell and McCafferty 1989) that transforms the continuous potential field data into discrete terrace-shape distributions (TerMag).

At each sounding location, the electromagnetic data were processed assuming a one-dimensional (1D) layered-earth model; that is, a single resistive layer over a more conductive halfspace. The Fraser (1978) inverse algorithm was used to estimate layer resistivities that were constrained using electromagnetic measurements recorded at 900 Hz (Cond900cp; Cond900cx), 4500 Hz (Cond4500cx), and 33000 Hz (Cond33000cp) and two coil configurations (coaxial and coplanar). Gridded maps of electrical conductivity were obtained by taking the reciprocal of estimated layer resistivity values. Because higher electromagnetic frequencies generally have lower exploration depths, another map was produced based on differences between conductivities for the 4,500 and 900 Hz maps (Dif4500_900), reflecting the variation of conductivity with depth. The more positive differences indicate higher conductivities at shallow depths, and negative differences indicate increasing conductivity at depth.

The satellite images from the ETM+/Landsat-7 sensor were chosen to represent a dry hydrologic condition (October 1997). The processing followed the band ratio technique (Sabins 1997) to map weathering minerals and vegetation. Each image result is low-pass filtered to remove high-frequency noise and then classified into low, medium, and high content (BDRT). The results are available in Souza Filho et al. (2007a).

The hydrogeologic data processing involves converting previously mapped geology to raster-type geographical information system files. Another type of hydrogeologic data refers to values of groundwater electrical conductivity that are estimated using the Kriging and stochastic simulation methods (Souza Filho et al. 2010). In both methods, the processed 4500 Hz electromagnetic layer conductivity is used as a secondary external drift function.
The spatial distribution of groundwater electrical conductivity is then estimated by Kriging this external drift (KDE) and available well data. To evaluate the relative uncertainty in the set of 100 Sequential Gaussian simulations, the standard deviation (STD), 75th percentile (Q75), and interquartile range (IQR) are calculated as study variables.
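The derived variables described in this section (conductivity as the reciprocal of resistivity, the Dif4500_900 difference map, and the STD, Q75, and IQR summaries of the 100 simulations) can be sketched with a few array operations. This is a minimal illustration rather than the authors' processing chain: it assumes resistivity is expressed in ohm-meters (so that conductivity in mS/m is 1000 divided by resistivity), uses the sample standard deviation (`ddof=1`), and substitutes synthetic lognormal realizations for the actual Sequential Gaussian simulations. All function names and input values are hypothetical.

```python
import numpy as np


def resistivity_to_conductivity_ms_m(rho_ohm_m):
    """Conductivity in mS/m as the reciprocal of resistivity (assumed to be in ohm-m)."""
    return 1000.0 / np.asarray(rho_ohm_m, dtype=float)


def uncertainty_summaries(realizations):
    """Per-cell STD, Q75 and IQR across a stack of realizations (axis 0)."""
    sims = np.asarray(realizations, dtype=float)
    q25, q75 = np.percentile(sims, [25, 75], axis=0)
    return {"STD": sims.std(axis=0, ddof=1), "Q75": q75, "IQR": q75 - q25}


# Synthetic stand-in for 100 simulated conductivity maps on a 4 x 5 grid.
rng = np.random.default_rng(0)
sims = rng.lognormal(mean=5.0, sigma=0.3, size=(100, 4, 5))
stats = uncertainty_summaries(sims)

# Hypothetical layer resistivities (ohm-m) at two cells for the two frequencies.
cond4500 = resistivity_to_conductivity_ms_m([[50.0, 200.0]])
cond900 = resistivity_to_conductivity_ms_m([[100.0, 100.0]])
dif4500_900 = cond4500 - cond900  # positive: more conductive at shallow depth

print(stats["STD"].shape, dif4500_900)
```

In this toy case the first cell has a positive difference (more conductive near the surface) and the second a negative one (conductivity increasing with depth), matching the sign convention stated in the text.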
Methods

Training the self-organizing map

During training, the SOM learns to project, in a nonlinear manner, from the high-dimensional data input layer to a low-dimensional discrete lattice of competitive neurons called the output layer (Kohonen 2001). In this study, each input data item $x \in X$ is considered a vector in the set $X = [x_1, x_2, \ldots, x_N]^T$, with $M$ being the dimension of the input data space. A fixed number of $k$ neurons indexed by $i$ is arranged on a regular grid $G$, with each neuron associated with a weight vector $w_i$ in the set $W = [w_1, w_2, \ldots, w_N]$, which has the same dimensionality as the input vectors. These weight vectors connect each input vector $x$ in parallel to all neurons of $G$. The neurons are connected to each other, and in this case this interconnection is defined using a toroid topology. The process of mapping is facilitated using a cost function $E(G, X)$ defined as the Euclidean norm given by

$$E(G, X) = \frac{1}{N} \sum_{i \in G} \sum_{j=1}^{M} h_{i,I} \, \lVert x_j - w_i \rVert^2, \qquad (1)$$

where $h$ is the unnormalized Gaussian neighborhood function, $N$ is the number of neurons, $x$ denotes the input vectors, $w$ denotes the weight vectors, and $i$ and $j$ index the vectors. Evaluating the Euclidean norm for each input vector $x_j$ requires calculating the distances $\lVert x_j - w_i \rVert$ over the grid to determine which weight vector is the closest to the input vector. The neuron with the closest weight vector is the winning neuron, called the best matching unit (BMU), given by

$$I = \arg\min_{i,j \in G} \lVert x(i) - w(j) \rVert, \qquad (2)$$

which defines the central position of a neighborhood function $h_{i,I}$; $I$ is the best matching vector and min is the minimization operator. The neighborhood function determines the contribution of each vector difference $x_j - w_i$ to the total cost function and is measured with respect to the current BMU. In this study, the neighborhood function is defined as a Gaussian function given by

$$h_{i,I}(n) = \exp\left[ -\frac{\lVert r_i - r_I \rVert^2}{\sigma(n)^2} \right], \qquad (3)$$

where $\lVert r_i - r_I \rVert$ corresponds to the distance between neuron $r_i$ and the BMU in the map grid, and $\sigma(n)$ defines the width of the neighborhood function, a monotonically decreasing function of the iteration (also called epoch) number $n$. The position of the BMU also determines the neighborhood center and hence which subsets of neurons learn the most about the input vector. In the second step, an updated weight is determined as a function of the distance to the current BMU expressed through the neighborhood function. The weight update vector is given by

$$w_i(n+1) = w_i(n) + \alpha(n) \, h_{i,I}(n) \left[ x_j(n) - w_i(n) \right], \qquad (4)$$

where $\alpha(n)$ is a scalar value called the learning rate, bounded on the interval [0,1]. The BMU ensures that the largest weight correction ($h_{I,I}(n) = 1$) is adjusted in the direction of the input vector. The association effect takes place at the neighboring nodes but to a lesser degree because of the Gaussian shape. This adaptation procedure stretches the weight vectors of the BMU and its topological neighbors towards the input vector. Presenting similar input vectors to the map provides further activations in the same neighborhood and thereby tends to produce clustering of data in the feature space. Association between neurons decreases during the learning process (the width of the neighborhood function $\sigma(n)$ is forced to decrease with $n$), preserving large clusters of data while enabling the separation of clusters that are closely spaced.

The SOM learning method that is used is the stochastic gradient described by Kohonen (1984). It consists of a two-step process that is performed each time an input pattern is presented to the map: competition to determine the BMU and cooperative learning (spreading information contained in the current input vector across the map). At the beginning of the unsupervised training phase, the weight vectors are initialized to small random numbers. The input data vectors $X$ are presented to the map grid in a random fashion to generate data clusters without introducing bias for a specific class. Each time that an input vector is presented to the map grid, the cost function is calculated and the number of times each neuron becomes a BMU is recorded. At the end of each iteration, the average cost function $R(E_n, X)$ is calculated using

$$R(E_n, X) = \frac{1}{M} \sum_{n=1}^{M} E_n(G, x), \qquad (5)$$
where En refers to n cost function realizations. The training process stops when R (En , X) is smaller than a prescribed fraction of its initial value; for example R (En , X) 0.7 or