Environ Monit Assess (2016) 188:44 DOI 10.1007/s10661-015-5049-6

GIS-based groundwater potential mapping using boosted regression tree, classification and regression tree, and random forest machine learning models in Iran

Seyed Amir Naghibi & Hamid Reza Pourghasemi & Barnali Dixon

Received: 16 March 2015 / Accepted: 10 December 2015 © Springer International Publishing Switzerland 2015

Abstract Groundwater is considered one of the most valuable fresh water resources. The main objective of this study was to produce groundwater spring potential maps in the Koohrang Watershed, Chaharmahal-e-Bakhtiari Province, Iran, using three machine learning models: boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF). Thirteen hydrological-geological-physiographical (HGP) factors that influence the locations of springs were considered in this research. These factors include slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density. Subsequently, groundwater spring potential was modeled and mapped using the CART, RF, and BRT algorithms. The predicted results from the three models were validated using the receiver operating characteristic (ROC) curve. Of the 864 springs identified, 605 (≈70 %) locations were used for the spring potential mapping, while the remaining 259 (≈30 %) springs were used for the model validation. The area under the curve (AUC) for the BRT model was calculated as 0.8103, and for the CART and RF models the AUC values were 0.7870 and 0.7119, respectively. Therefore, it was concluded that the BRT model produced the best prediction results for the locations of springs, followed by the CART and RF models, respectively. Geospatially integrated BRT, CART, and RF methods proved to be useful in generating the spring potential map (SPM) with reasonable accuracy.

Keywords Spring potential mapping · Boosted regression tree · Classification and regression tree · Random forest · GIS · Iran

S. A. Naghibi
Department of Watershed Management Engineering, College of Natural Resources, Tarbiat Modares University, Noor, Mazandaran, Iran

H. R. Pourghasemi (*)
Department of Natural Resources and Environmental Engineering, College of Agriculture, Shiraz University, Shiraz, Iran
e-mail: [email protected]
H. R. Pourghasemi e-mail: [email protected]

B. Dixon
Department of Environmental Science, Policy and Geography, University of South Florida, Saint Petersburg, FL, USA

Introduction

Groundwater is considered one of the most valuable natural resources (Todd and Mays 2005) due to several qualities such as consistent temperature, widespread availability, limited vulnerability to contamination, low development cost, and drought reliability (Jha et al. 2007). The rapid increase in human population has increased the demand for groundwater supplies for drinking, agricultural, and industrial purposes (Lee et al. 2012). As demand for fresh groundwater increases, delineation of groundwater spring potential zones becomes an essential tool for implementing a successful groundwater determination, protection, and management program (Ozdemir 2011a). Many studies have used integrated geographic information system (GIS), remote sensing (RS), and geostatistics methods for groundwater mapping (Jaiswal et al. 2003; Solomon and Quiel 2006; Jha et al. 2007; Ganapuram et al. 2009; Saha et al. 2010; Chung and Rogers 2012; Bhat et al. 2015; Varouchakis 2015). Many others have evaluated groundwater potential using probabilistic models (Srivastava and Bhattacharya 2006; Arthur et al. 2007; Ghayoumian et al. 2007; Chowdhury et al. 2009; Murthy and Mamo 2009; Chenini et al. 2010; Chenini and Ben Mammou 2010; Corsini et al. 2009; Gupta and Srivastava 2010; Oh et al. 2011). Further, many recent studies used GIS methods integrated with frequency ratio (FR), logistic regression (LR), weights-of-evidence (WofE), and Shannon’s entropy (SE) models (Dixon 2009; Oh and Lee 2010; Oh et al. 2011; Ozdemir 2011a, b; Manap et al. 2012; Lee et al. 2012; Davoodi Moghaddam et al. 2015; Pourtaghi and Pourghasemi 2014; Naghibi et al. 2015).

Although some machine learning models such as boosted regression tree (BRT), classification and regression tree (CART), and random forest (RF) techniques have been successfully applied in landslide susceptibility and hazard mapping (Stumpf and Kerle 2011a, b; Vorpahl et al. 2012; Trigila et al. 2013), gully susceptibility mapping (Bou Kheir et al. 2007; Geissen et al. 2007; Gutiérrez et al. 2009a, b), wildfire (Oliveira et al. 2012; Leuenberger et al. 2013), ecology (Cutler et al. 2007; Elith et al. 2008; Aertsen et al. 2010; Aertsen et al. 2011), environmental modeling (Prasad et al. 2006; Strobl et al. 2008), and groundwater studies (Demsar 2007; Baudron et al. 2013; Rodriguez-Galiano et al. 2014), applications of these methods are relatively new to prediction of spring location potential. Therefore, this paper uses BRT, CART, and RF integrated in a GIS to predict groundwater spring locations for the study area.
The predicted output will be referred to as the spring potential map (SPM). Furthermore, the FR model was applied to illustrate the quantitative relationship between the distribution of groundwater spring occurrences and the predictor factors. The combined approaches of BRT, CART, and RF will hereafter be referred to as machine learning (ML) models in this paper. ML is a rapidly growing area of predictive modeling that is concerned with identifying structure in complex, often nonlinear, data and generating accurate predictive models (Olden et al. 2008). ML approaches often exhibit greater power for resolving complex relationships (i.e., nonlinear, nonmonotonic, multimodal relationships common in landscape and ecological applications) because they are not restricted to the traditional assumptions about data characteristics commonly used with conventional and parametric approaches (Olden et al. 2008). Traditional modeling approaches are commonly based on stricter statistical assumptions and data requirements and frequently utilize linear or additive modeling approaches that are not consistent with natural processes that occur in the landscape (Clapcott et al. 2013). Traditional bivariate and multivariate models, namely frequency ratio, weights-of-evidence, logistic regression, index of entropy, evidential belief function, and analytical hierarchy process, have been used in groundwater potential mapping (Oh et al. 2011; Ozdemir 2011a, b; Manap et al. 2012; Davoodi Moghaddam et al. 2015; Pourtaghi and Pourghasemi 2014; Rahmati et al. 2014; Naghibi et al. 2015; Razandi et al. 2015). BRT, CART, and RF are examples of ML techniques that offer alternative approaches to these traditional methods of prediction. Furthermore, advantages of BRT, CART, and RF include their (i) ability to accommodate different types of predictor variables and missing values and (ii) ability to facilitate “fitting” interactions between predictors (Friedman and Meulman 2003). Despite the obvious advantages and flexibility of these machine learning techniques to account for nonlinear relationships and handle uncertainty in data, they remain susceptible to overfitting the data, i.e., fitting noise resulting in unstable regression coefficients (Harrell et al. 1996; Guisan and Thuiller 2005). Hence, comparative sensitivity analyses of prediction accuracies among the model results (BRT, CART, and RF) are needed.
The ability to predict groundwater spring potential in spatially explicit ways will allow for planning and zoning of areas where (i) groundwater can be reached with minimal effort, (ii) potentials for new springs are high, and (iii) new springs can be identified during drought to provide a water reserve (Corsini et al. 2009). The main objective of this study was to predict groundwater spring potential maps using BRT, CART, and RF techniques in the Koohrang Watershed, Iran. The output of this research will provide a methodology to develop SPM that can be used by government agencies and operations, as well as private sectors, for groundwater exploration, assessment, and protection.

Study area

The Koohrang Watershed is located in the west of Chaharmahal-e-Bakhtiari Province, Iran. The study area is located between upper left (49° 54′ E, 32° 36′ N) and lower right (50° 38′ E, 32° 0′ N) with an area of 1239 km² (Fig. 1). Land-surface elevation in the study area ranges from 1660 to 4200 m above sea level, with an average of 2658 m. The mean annual precipitation is recorded as 1425 mm (Mojiri and Zarei 2006). The dominant land use of the study area is rangeland, which covers almost 60 % of the Koohrang Basin, followed by field crops, orchards, and forests. Based on the Geological Survey of Iran (GSI 1997), 44 % of the lithology covering the study area falls within the units represented as group 3 in Table 1, which includes the Undivided Bangestan Group, composed mainly of limestone and shale, which serves as suitable lithology for groundwater abundance. Additionally, this study area is endowed with favorable topographical, geological, hydrogeological, geomorphologic, and environmental characteristics that lead to the abundance of springs. Exploitation of groundwater resources in the study area includes use of qanats, springs, and deep and semi-deep wells. The most important springs in the study area are Rostam-Abad, Cheshmeh-Mola, Morvarid, Mar-Boran, Sardab-Marboran, Koohrang, Kochak-Koohrang, Cal-Gachi, Chel-Cheshmeh, and Khak-Dalon. The average spring discharge is approximately 4 gal./s in the study area. The study area also contains 27 wells where water is withdrawn from the alluvial fan, and the well depths range between 7 and 20 m. The general trend of groundwater flow is from the north of the basin to the south of the plain, and the general topographic gradient of the plain is north to east. The relatively uneven topography of the study area leads to a range of water table depths, from 2 to 230 m in different regions.

Fig. 1 Spring locations with digital elevation model (DEM) map of the study area

Table 1 Lithology characteristics of study area
Name | Lithology | Formation
Group 1 | Undivided Khami Group, consisting of massive thin-bedded limestone | Surmeh, Hith Anhydrite, Fahlian, Gadvan, and Darian
Group 2 | Thin- to medium-bedded, dark gray dolomite; thin-bedded dolomite; greenish shale; and thin-bedded argillaceous limestone | Khaneshkat and Neyriz
Group 3 | Undivided Bangestan Group, mainly limestone and shale | Kazhdumi, Sarvak, Surgah, and Ilam
Group 4 | Dark red, medium-grained arkosic to sub-arkosic sandstone and micaceous siltstone | Lalun
Group 5 | Limestone, dolomite, dolomitic limestone, and thick layers of anhydrite in alternation with dolomite in middle part | Dalan
Group 6 | Undivided Eocene rock | -
Group 7 | Dolomite, platy and flaggy limestone containing trilobite, sandstone, and shale | Mila
Group 8 | Low weathering gray marls alternating with bands of more resistant shelly limestone | Mishan
Group 9 | Low-level piedmont fan and valley terrace deposits | -
Group 10 | Cream to brown-weathering, feature-forming, well-jointed limestone with intercalations of shale | Asmari

Materials and methods

Spring characteristics

The spring locations potential map was developed for the study area using national reports (Iranian Department of Water Resources Management) and extensive field surveys at 1:50,000 scale. First, a base layer of the spring inventory map was created where 864 springs were identified. A random partition algorithm was used to separate training springs from the validation springs (Lee et al. 2012; Oh et al. 2011). Of the 864 spring locations, 605 springs (70 %) were selected for the training dataset and the remaining 259 springs (30 %) were used for the validation dataset (Fig. 1).
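The random partition described above can be illustrated with a minimal sketch. It is not the authors' procedure, which this section does not specify beyond "a random partition algorithm"; the function name `partition_springs`, the integer spring IDs, and the fixed seed are all assumptions made for illustration.

```python
import random


def partition_springs(spring_ids, train_fraction=0.7, seed=0):
    """Randomly split spring IDs into training and validation subsets.

    Hypothetical helper: the seed and the rounding rule are assumptions,
    chosen so that 864 springs yield the 605/259 split reported in the text.
    """
    rng = random.Random(seed)          # fixed seed only for reproducibility
    shuffled = list(spring_ids)
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical IDs standing in for the 864 inventoried springs.
train, valid = partition_springs(range(864))
print(len(train), len(valid))
```

With 864 springs and a 0.7 training fraction, rounding gives 605 training and 259 validation springs, matching the counts in the text; which individual springs land in each subset depends entirely on the assumed seed.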

Table 2 Groundwater database of study area
Source of data | Data layers | Data format | Scale
Topographic maps and field surveys | Spring locations map | Point | 1:50,000
National Cartographic Center (NCC) | Topographic map | Line and point | 1:50,000
Geology Survey of Iran (GSI) | Geological map | Polygon and line | 1:100,000
National Geographic Organization (NGO) | Landuse map | Polygon | Landsat 7/ETM+ (30 × 30 m)

Data development for critical factors related to groundwater springs

Identification and development of appropriate spatial data layers that lead to favorable conditions for springs to occur were completed first. Based on the literature review, the following hydrological-geological-physiographical (HGP) factors (Table 2) were identified as critically important layers in identifying potential locations of springs: slope degree, slope aspect, altitude, topographic wetness index (TWI), slope length (LS), plan curvature, profile curvature, distance to rivers, distance to faults, lithology, land use, drainage density, and fault density (also known as predictor factors) (Ozdemir 2011a, b; Oh et al. 2011; Manap et al. 2012; Davoodi Moghaddam et al. 2015; Pourtaghi and Pourghasemi 2014; Naghibi et al. 2015).

The digital elevation model (DEM) was created using 20-m interval contours and survey base points representing the elevation values, which were extracted from the 1:50,000-scale topographical maps. Altitude, slope degree, slope aspect, TWI, LS, plan curvature, and profile curvature were produced using the DEM and are shown respectively in Fig. 2a–g. Topography plays a critical role in the spatial variation of hydrological conditions such as groundwater flow and soil moisture. Therefore, the secondary topographic indices have been used for characterizing spatial patterns of soil moisture distribution (Moore et al. 1991).

Fig. 2 Topographical parameter maps of the study area: a slope degree, b slope aspect, c altitude, d topographic wetness index (TWI), e slope length (LS), f plan curvature, g profile curvature

The altitude map was created from the 20-m digital elevation model based on the topographic maps and classified into five categories (3700 m) according to an equal-interval classification scheme (Fig. 2a). Slope aspect is another effective factor that was produced using the mentioned DEM and grouped into nine classes: flat (−1°), north (337.5°–360°, 0°–22.5°), northeast (22.5°–67.5°), east (67.5°–112.5°), southeast (112.5°–157.5°), south (157.5°–202.5°), southwest (202.5°–247.5°), west (247.5°–292.5°), and northwest (292.5°–337.5°) (Fig. 2b). Also, the slope degree map was prepared from the DEM and classified into four classes: (1) 0°–5°, (2) 5°–15°, (3) 15°–30°, and (4) >30° (Fig. 2c).

Additionally, water infiltration, matrix flow, and evaporation depend on the landscape and soil properties such as pore water pressure, permeability, and soil moisture as they affect soil resistance; thus, TWI was considered as a critical factor in this study. In addition, TWI has been extensively used to describe the effect of topography on the location and size of saturated source areas of runoff generation (Beven 1997; Beven and Freer 2001). The TWI is a secondary topographic factor within the runoff model, which is defined according to the following equation (Moore et al. 1991) (Fig. 2d):

$$\mathrm{TWI} = \ln\left(\frac{\alpha}{\tan\beta}\right) \tag{1}$$

where α is the cumulative upslope area draining through a point and β is the slope angle at the point. LS is the combination of the slope length (L) and slope steepness (S), which is used to indicate soil loss potential from the combined slope properties (Fig. 2e). The combined LS factor can be calculated by the equation suggested by Moore and Burch (1986) (Eq. 2):

$$\mathrm{LS} = \left(\frac{B_s}{22.13}\right)^{0.6} \left(\frac{\sin\alpha}{0.0896}\right)^{1.3} \tag{2}$$

where B_s is the specific catchment area (m²) and α in Eq. 2 denotes the local slope angle (the quantity written as β in Eq. 1). Additionally, using the system for automated geoscientific analyses (SAGA-GIS), the profile curvature and plan curvature maps were prepared (Fig. 2f, g) for the study area. These topographical factors can be estimated quickly from a DEM and applied in an integrated land and water information system (ILWIS 3.3) (ITC 2005) to account for the influence of terrain characteristics on hydrology.

Fig. 3 a Distance to rivers (buffer); b distance to faults (buffer) maps

Fig. 4 Drainage density map of the study area

Fig. 5 Fault density map of the study area

Fig. 6 Lithology map of the study area

Fig. 7 Land use map of the study area

The distance to rivers map (Fig. 3a) was created using a topographic map, whereas the distance to faults map (Fig. 3b) was created using a geological map of the study area. Also, drainage density (Fig. 4) and fault density (Fig. 5) maps were prepared according to topographical and geological maps, respectively. The resultant drainage and fault density maps were reclassified using the natural break method, since natural break classification is effective when obvious breaks are present in the data (Ruff and Czurda 2008). The lithology map was obtained using a 1:100,000-scale geological map, and the lithological units were classified into ten groups (Fig. 6, Table 1). Landsat 7/ETM+ images for 2010 were used to derive the land use map for the study area, using a supervised maximum likelihood algorithm. The resultant land use map showed four land use classes: rangeland, field crops, orchard, and forest types (Fig. 7). Finally, for application of the mentioned models, all the conditioning factors were converted to a raster grid with 20 m × 20 m pixel size in the ArcGIS 9.3 software. All the maps are in the Universal Transverse Mercator (UTM) coordinate system with the WGS84 spatial reference (WGS84-UTM-Zone39N).
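As an illustration of the topographic factors defined in this section, the following sketch evaluates Eqs. 1 and 2 for a single cell and assigns the slope aspect and slope degree classes listed in the text. It is a hedged example only: the study derived these layers with GIS software, the function names are hypothetical, and the treatment of values that fall exactly on a class boundary is an assumption, since the text does not state which class receives them.

```python
import math


def twi(upslope_area, slope_deg):
    """Eq. 1: TWI = ln(alpha / tan(beta)); assumes slope_deg > 0."""
    return math.log(upslope_area / math.tan(math.radians(slope_deg)))


def ls_factor(specific_catchment_area, slope_deg):
    """Eq. 2: LS = (Bs / 22.13)^0.6 * (sin(slope) / 0.0896)^1.3."""
    area_term = (specific_catchment_area / 22.13) ** 0.6
    slope_term = (math.sin(math.radians(slope_deg)) / 0.0896) ** 1.3
    return area_term * slope_term


ASPECT_NAMES = ["north", "northeast", "east", "southeast",
                "south", "southwest", "west", "northwest"]


def aspect_class(aspect_deg):
    """Map an aspect in degrees to one of the nine classes; -1 means flat."""
    if aspect_deg < 0:
        return "flat"
    # Shift by 22.5 degrees so each 45-degree sector maps to one index.
    # Assumption: a boundary value such as 22.5 goes to the next sector.
    return ASPECT_NAMES[int(((aspect_deg % 360) + 22.5) // 45) % 8]


def slope_class(slope_deg):
    """Four slope classes; upper bounds are treated as inclusive (assumption)."""
    if slope_deg <= 5:
        return 1
    if slope_deg <= 15:
        return 2
    if slope_deg <= 30:
        return 3
    return 4


# Hypothetical cell: 500 m2 upslope area, 12 degree slope, aspect 200 degrees.
print(twi(500.0, 12.0), ls_factor(500.0, 12.0))
print(aspect_class(200.0), slope_class(12.0))
```

A perfectly flat cell has tan β = 0, so Eq. 1 is undefined there; a production workflow would typically substitute a small minimum slope, which this sketch deliberately leaves out.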

Methods

All three machine learning algorithms were fitted in R (R Development Core Team 2006) version 3.0.2, using the gbm, dismo, rpart, and randomForest packages (Ridgeway 2006). Figure 8 shows the spring potential mapping analysis and methodology flowchart used in this study. In addition, 605 non-spring points were produced in ArcGIS to be used in the machine learning techniques (Fig. 9). First, the values of the 13 groundwater effective factors were extracted for every training spring. Then, the data were stored and imported into the R 3.0.2 software, where the CART, BRT, and RF models were run using the rpart, gbm, dismo, and randomForest packages. Subsequently, groundwater potential was calculated in R for every pixel of the study area. Finally, these values were exported in text format to SPSS and then converted to dBASE format to produce the maps in ArcGIS 9.3.

Fig. 8 Flowchart of the used methodology (input data: springs as the dependent factor, randomly partitioned into training springs (605, or 70 %) and validation springs (259, or 30 %), with spring and non-spring points (1210); effective factors derived from the digital elevation model (DEM), topographic factors, geology factors, and land use; application of the CART, RF, and BRT models; spring potential maps in the Koohrang Watershed, Iran; ROC curve and validation of models; selection of the best model)
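The validation step named in the flowchart relies on the area under the ROC curve (AUC). As a rough illustration of what that number measures, the sketch below computes AUC as the fraction of (spring, non-spring) pairs in which the spring location receives the higher predicted potential, with ties counted as one half. This is a standard equivalent formulation and not necessarily how the authors computed it; the function name and the scores are hypothetical.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the share of positive/negative pairs ranked correctly.

    scores_pos: predicted potential at known spring locations (hypothetical).
    scores_neg: predicted potential at non-spring locations (hypothetical).
    Ties contribute 0.5, as in the rank-based (Mann-Whitney) formulation.
    """
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Made-up scores for three validation springs and three non-spring points.
spring_scores = [0.9, 0.8, 0.4]
non_spring_scores = [0.7, 0.3, 0.2]
print(roc_auc(spring_scores, non_spring_scores))
```

An AUC of 1.0 means every spring outranks every non-spring point, while 0.5 corresponds to random ordering; the values reported in the abstract (0.7119 to 0.8103) fall between these two extremes.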

Frequency ratio

In this research, the FR model was applied to illustrate the quantitative relationship between the distribution of groundwater spring occurrences and the predictor factors (i.e., HGP factors that play a critical role in the occurrence/locations of springs). An FR model can provide a simple geospatial assessment tool to calculate the probabilistic relationship between dependent and independent factors, including multiclassified maps (Oh et al. 2011). According to Bonham-Carter (1994), the frequency ratio is the probability of occurrence of a certain attribute. Specifically in this case, the frequency ratio approach is based on the observed relationships between the distribution of groundwater spring locations and each groundwater-related factor, to reveal the correlation between groundwater spring locations and the factors in the study area. The FR for a class of a spring-affecting factor is calculated as follows (Eq. 3):

$$\mathrm{FR} = \frac{A/B}{C/D} \tag{3}$$

where A is the area of a class for the factor (for example