Effects of positional error on modeling species distributions: a perspective using presettlement land survey records Stephen J. Tulowiecki, Chris P. S. Larsen & Yi-Chen Wang
Plant Ecology An International Journal ISSN 1385-0237 Plant Ecol DOI 10.1007/s11258-014-0417-9
Received: 18 May 2014 / Accepted: 4 October 2014 © Springer Science+Business Media Dordrecht 2014
Abstract Presettlement land survey records (PLSRs) are the records of early land surveys in North America, and contain data regarding vegetation conditions prior to widespread European-American settlement. Researchers have used the data within PLSRs to develop species distribution models (SDMs), in order to generate predictions of the historical distributions of tree species. Despite their value for SDMs, PLSRs contain positional error, which may hinder their usefulness for modeling species distributions at fine spatial resolution. Using data from the Holland Land Company (HLC) township survey (1797–1799 CE) of Western New York, USA, this study examines the positional error associated with different approaches for georeferencing vegetation data within PLSRs. The study then examines the impact of positional error upon the predictive performance of SDMs that utilize PLSRs. Our study indicates that the magnitude of positional error within PLSRs varies with georeferencing approach, and that more accurate georeferencing approaches produce better-performing SDMs. The study also indicates that the effects of positional error upon SDMs vary with the niche characteristics of species. Overall, this study affirms the importance of accurately georeferencing species data prior to developing SDMs, including applications that involve PLSRs.

Keywords Species distribution models · Positional error · Georeferencing · Presettlement land survey records · Forest composition · Historical ecology

Communicated by J. P. Messina.

S. J. Tulowiecki (corresponding author) · C. P. S. Larsen, Department of Geography, University at Buffalo, Buffalo, NY 14261, USA

Y.-C. Wang, Department of Geography, National University of Singapore, Singapore 117570, Singapore
Introduction

Presettlement land survey records (PLSRs) provide one of the most spatially extensive data sources available for reconstructing past vegetated landscapes (Whitney 1996). PLSRs were created during the first European-American land surveys in North America (circa 17th–19th centuries CE), and contain distance and bearing measurements of original survey lines. PLSRs facilitated Euro-American settlement in the New World by delineating township and lot boundaries, and also by recording descriptions of land characteristics and soil quality. Most important to those who study past forest conditions, PLSRs contain descriptions of forest composition along survey lines
("line-descriptions" or "species lists"), as well as records of trees that surveyors used to mark important survey locations ("witness-trees" or "bearing-trees"). These vegetation data have long been valuable for mapping past tree species and community distributions, and for understanding the environmental variables and processes that shaped forested landscapes (Bourdo 1956; Wang 2005).

Researchers have recently utilized the vegetation data within PLSRs, in conjunction with geographical information systems (GIS) and species distribution models (SDMs, Franklin & Miller 2009), in order to predict the past distributions of tree species (Fagin & Hoagland 2011; Fahey et al. 2012; Hanberry et al. 2012; He et al. 2007). SDMs relate species records to environmental variables via statistical or machine-learning methods, in order to predict the geographic distribution of a species. The number of studies that employ SDMs has increased exponentially over the past several decades (Peterson & Soberón 2012), but only recently have researchers used vegetation data from PLSRs to develop SDMs of presettlement forest composition.

Although the vegetation data in PLSRs have been assessed for various forms of error, ambiguity, uncertainty, and bias (Black & Abrams 2001; Kronenfeld & Wang 2007; Manies et al. 2001), less attention has been given to the issue of positional error in PLSRs. Positional error refers to the difference between actual and estimated locations of spatial data. Though positional error persists within all spatial data (Shi 2009), additional sources of positional error are unique to PLSRs. If care is not exercised while georeferencing PLSRs, then positional error may be exacerbated by these sources. When using PLSR data to train SDMs, positional error becomes important, because error in species occurrence records has been shown to degrade the predictive ability of SDMs, potentially for species with narrow environmental tolerances (Fernandez et al. 2009; Graham et al. 2008; Johnson & Gillingham 2008). Positional error in PLSR data is also important, because an increase in positional accuracy would also allow finer spatial resolutions of analysis in SDMs.

Causes of positional error in PLSRs

Three unique aspects of PLSRs make their data susceptible to positional error: measurement errors, loss of original survey features such as "reference
points," and reliance on "relative points" (Fig. 1). First, systematic and random measurement errors plagued the distance and bearing measurements of PLSRs throughout the 17th and 18th centuries CE, due to surveyors' use of rudimentary measuring devices, such as wooden measuring compasses and metal chains (Ellicott 1937; Estopinal 2008; Johnson 1889). Factors such as the wearing of surveying equipment, human error, and landscape ruggedness also contributed to measurement errors. Measurement error in surveyed distance measurements, as a percentage of actual distance measurement, has been estimated by expert opinion to be 1 % in the 17th–18th centuries, and 0.2 % by the mid-19th century (Estopinal 2008), though errors of over 6 % in the late 18th century have been noted (White 1983).

Second, the increasing loss over time of PLSR survey locations, such as original survey lines and "reference points" (or "reference objects;" cf. Guo et al. 2008), also influences positional error. Reference points are locations where point features from PLSRs (e.g. township or lot corners) can be accurately matched with point features upon modern landscapes or within existing spatial data, because the points continue to delineate modern features such as civil boundaries (e.g. town or county boundaries) or cadastral ("real property") boundaries (Fig. 1). Generally, the number of locatable survey lines and reference points decreases with the increasing age of PLSRs, due to factors such as the "obliterating" (cf. Robillard et al. 2009) of these features through modern development (Thrower 1966). Where extant, the positional accuracy of reference points has been reported as 15 m in PLSR-based studies in Ohio (Dyer 2001) and Michigan (Barrett et al. 1995).

The third unique aspect of PLSRs is that as the number of original survey lines and reference points decreases, remaining vegetation data at "relative points" must be georeferenced using PLSR-specified measurements. Relative points are objects whose locations may be estimated using only PLSR-specified distances and/or bearings from reference points (Fig. 1). Because relative points must be georeferenced in this manner, the positional uncertainty of relative points increases with both an increase in measurement error, and a decrease in the number of locatable original survey features in modern geometry. Like reference points, relative points are important for vegetation studies, because they include survey monumentation with associated bearing-trees, as well as the start- and end-points of vegetation line-descriptions.

Fig. 1 An example illustrating the three causes of positional error in vegetation data within presettlement land survey records (PLSRs): measurement errors, loss of original survey features, and reliance on relative points. Some reference points and survey lines from the PLSR in (a) can be located in present times in (b), because their geometry has been preserved in features such as cadastral boundaries. The remaining PLSR features (particularly relative points) may only be estimated using other methods, because the measurement error within the original PLSR distance measurements in (a) precludes the accurate locating of the remaining PLSR features
Approaches towards georeferencing PLSRs

Previous researchers have taken various approaches towards georeferencing PLSRs, which are generally classifiable by whether the PLSRs were constructed by private (e.g. by a land company) or public (e.g. by a federal government) land survey entities. Older PLSRs, such as those constructed by private land companies, have required various approaches to georeferencing, largely due to the lack of readily available resources for locating survey locations, such as GIS base layers and metadata. Approaches to georeferencing these PLSRs have included strict adherence to the "surveyor geometry" provided by distance and bearing measurements in PLSRs (Wang 2007), tracing historical maps onto modern maps to plot witness-tree locations (Abrams & Ruffner 1995), and aggregating PLSR data by town to mitigate positional error issues (Cogbill et al. 2002).

Even for public surveys of the well-documented Public Land Survey System, which possess readily available GIS base layers from governmental websites for georeferencing (BLM 2011), procedures are still required to georeference additional PLSR features along survey lines (e.g. "line trees"). For example, GIS-based scaling (Batek 1994) of PLSR measurements has been applied to transform distance measurements to fit between two reference points, in order to estimate additional survey locations such as relative points (a method that will later be referred to as "linear referencing"). However, no studies have quantitatively examined the magnitude of positional error resulting from these approaches, nor have they explored how this positional error can subsequently impact the performance of SDMs.
Purpose

This study has two main purposes. The first is to ascertain how approaches towards georeferencing PLSRs may affect the positional error of their vegetation data. We assess the positional error of PLSRs resulting from georeferencing that uses surveyor geometry, and from two additional approaches that use linear referencing tools (Esri 2012) within a GIS. As opposed to the surveyor geometry approach, the two linear referencing approaches also use historical map collections, cadastral data, legal records, and GIS data to georeference PLSRs. The second purpose is to explore differences in variable selection and predictions made by SDMs that are trained with vegetation data georeferenced using different approaches. As part of this second purpose, we also explore whether differences in SDMs are related to the species characteristics of abundance, niche position, and niche breadth.
Study area and surveys

The study area is the approximately 13,000 km² Holland Land Company (HLC) Purchase of Western New York, USA (Fig. 2), containing portions of the Erie-Ontario Lowland and the Allegheny Plateau (Fenneman 1938). The HLC township survey delineated townships under Chief Surveyor Joseph Ellicott, and was conducted from 1797 to 1799 CE (Ellicott 1937). The HLC typically surveyed townships of 9.7 × 9.7 km (6 × 6 mi) or other rectangular polygons. HLC survey crews removed and blazed trees to delineate township lines (Ellicott 1937), and erected half-mile posts (or stone markers) along lines at typically 0.8 km (0.5 mi) intervals. Survey crews blazed and recorded two to four bearing-trees adjacent to township half-mile and township corner posts. It is believed that the HLC township surveyors selected bearing-trees for blazing with only slight bias for and against species. For instance, a previous study indicated that surveyors may have slightly avoided beech (Fagus grandifolia) and slightly favored sugar maple (Acer saccharum), but that most other species were selected with little or no bias (Kronenfeld & Wang 2007). For this study, township posts (mainly half-mile posts) are the survey features whose positional error is assessed, whereas the bearing-tree data are used to train SDMs.
HLC lot surveys subdivided townships into 1.21 × 1.21 km (0.75 × 0.75 mi) or 1.21 × 0.40 km (0.75 × 0.25 mi) lots for sale and settlement, mostly from 1799 until 1810 CE (Wyckoff 1988). Unlike the Public Land Survey System (White 1983), HLC lot surveys did not routinely re-use monuments created during the township survey (i.e. the half-mile posts) during subdivision, but instead established new monuments as lot corners. Only occasionally, surveyors constructed lot corners from the half-mile posts of the township survey, in which cases surveyors re-recorded the locations of these posts. Where these coincidences between lot corners and half-mile posts occur, property descriptions within deeds can be used to find half-mile post locations from the township survey ("Linear referencing approaches" section), because they continue to delineate the corners of cadastral boundaries in these instances. However, because the vast majority of half-mile posts from the township survey did not serve any delineation purpose in subsequent surveys, most half-mile posts can be georeferenced only by using PLSR distance measurements from known reference points, and are routinely up to 4.8 km (3 mi) from township corners.
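To give a sense of scale, the relative distance errors quoted earlier (about 1 % for the 17th–18th centuries, 0.2 % by the mid-19th century, and more than 6 % in some late 18th-century cases) can be applied to the 4.8 km maximum distance between a half-mile post and a township corner. The following Python sketch does only this arithmetic. The function name and the assumption that error grows in direct proportion to the measured distance are illustrative and are not taken from the study.

```python
def along_line_error_m(distance_km, relative_error):
    """Along-line offset in metres, ASSUMING error is proportional to measured distance."""
    return distance_km * 1000.0 * relative_error


# Relative distance errors quoted in the text (the 6 % case is a reported lower bound).
RELATIVE_ERRORS = {
    "17th-18th century estimate": 0.01,
    "mid-19th century estimate": 0.002,
    "late 18th century extreme": 0.06,
}

# 4.8 km is the maximum post-to-corner distance stated for the HLC township survey.
MAX_DISTANCE_KM = 4.8

for label, rel in RELATIVE_ERRORS.items():
    offset = along_line_error_m(MAX_DISTANCE_KM, rel)
    print(f"{label}: about {offset:.1f} m")
```

Under these assumptions, a 1 % error over 4.8 km amounts to roughly 48 m, which is about half the width of the 100 m grid cells used for the SDMs in this study.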
Methodology

Assessing the positional accuracy of georeferencing approaches

We used three approaches to georeference data from the HLC township survey: an approach that predominantly utilized surveyor geometry, and two approaches that used linear referencing to transform PLSR distance measurements to fit between reference points. To estimate positional error, we then compared the locations of half-mile posts in each georeferencing approach, to the more precise locations of half-mile posts georeferenced using an independent dataset (i.e. the lot surveys; described in "Study area and surveys" section above, and "Measuring and comparing positional error" section below). All GIS operations were performed with ArcGIS 10.1 (Esri 2012).

The surveyor geometry approach

To perform the "surveyor geometry" approach (Fig. 3), we utilized a previously-georeferenced version of the HLC township survey, which was based upon strict interpretations of HLC bearing and distance measurements recorded in the bearing-tree data of HLC "Range Books." Wang (2007) transcribed and digitized bearing-tree data from the microfilmed Range Books from Reed Library at SUNY Fredonia. In this approach, the only reference points for georeferencing the HLC township survey were the southwestern corner of New York State, and additional points along the New York-Pennsylvania border; remaining data were georeferenced using the PLSR-specified distance and bearing measurements. This version of the georeferenced HLC dataset has been used in previous studies (e.g. Kronenfeld et al. 2010; Wang 2007).

Fig. 2 The Holland Land Company (HLC) township survey. Township corners, and the half-mile posts used for measuring positional error, were located using methods outlined in "Linear referencing approaches" section

Linear referencing approaches

Both linear referencing approaches first involved confirming township corners as reference points. The
two approaches then georeferenced half-mile posts by scaling PLSR distance measurements to fit between township corners, using linear referencing tools within a GIS. We utilized GIS data, historical maps, and property descriptions within deed records, in order to locate township corners (n = 333) within modern cadastral boundaries. Street centerline layers and cadastral layers were obtained from the New York State GIS Clearinghouse (NYS ITS GIS Program Office 2013) and county governments, respectively. Historical maps by Julius Bien & Company (Bien 1895), which symbolized HLC township and lot lines and were useful in estimating the locations of original township corners, were downloaded from the Rumsey Map Collection website (Cartography Associates 2013). Where modern cadastral boundaries coincided with the probable locations of township corners, we examined property descriptions within deed records at County Clerk’s offices, which continue to use the language of HLC surveys. Township corners were then digitized upon cadastral and street centerline geometry as indicated in property descriptions.

Fig. 3 The differences in locations of half-mile posts, resulting from the three different georeferencing approaches: surveyor geometry, Linear Referencing-Traced (LR-T), and Linear Referencing-Euclidean (LR-E). The locations of half-mile posts are shown at three different spatial scales: (a) the entire Holland Land Company (HLC) Purchase, (b) a single township, and (c) a single township half-mile post

ArcGIS linear referencing tools (Esri 2012) were used to estimate the locations of half-mile posts between township corners (Fig. 3). Linear referencing establishes locations of point or linear features along a line, by using distance measurements recorded along the line relative to a starting location. Using linear referencing, the start and end distance measurements of the digitized township lines were specified using the HLC distance measurements, irrespective of the actual length of the digitized line. In this way, half-mile posts were georeferenced using distance measurements that were proportional to the HLC measurements, such that an entire digitized township line was scaled to fit between two adjacent township corners. Linear referencing is analogous to the scaling (Batek 1994) approach described previously ("Approaches towards georeferencing PLSRs" section).
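The proportional scaling described above can be sketched in a few lines of Python. This is a minimal illustration and not the ArcGIS linear referencing implementation: the function names, the planar coordinates in metres, and the example distances in chains are all hypothetical. A recorded distance is converted to a fraction of the recorded line length, and that fraction is then located along the digitized line, whether the line is a single straight segment or a multi-vertex trace.

```python
import math


def polyline_length(vertices):
    """Total length of a digitized line given as a list of (x, y) vertices."""
    return sum(math.dist(a, b) for a, b in zip(vertices, vertices[1:]))


def locate_along(vertices, fraction):
    """Return the point lying at `fraction` (0 to 1) of the line's total length."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie between 0 and 1")
    target = fraction * polyline_length(vertices)
    walked = 0.0
    for a, b in zip(vertices, vertices[1:]):
        seg = math.dist(a, b)
        if seg > 0.0 and walked + seg >= target:
            t = (target - walked) / seg  # position within this segment
            return (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        walked += seg
    return tuple(vertices[-1])


def georeference_posts(vertices, recorded_distances, recorded_length):
    """Scale recorded distances so the recorded line length spans the digitized line."""
    return [locate_along(vertices, d / recorded_length) for d in recorded_distances]


# Hypothetical township line recorded as 480 chains between two known corners.
straight = [(0.0, 0.0), (9600.0, 0.0)]                   # one straight segment
traced = [(0.0, 0.0), (3000.0, 0.0), (3000.0, 4000.0)]   # line that follows a bend
print(georeference_posts(straight, [40.0, 240.0], 480.0))
print(georeference_posts(traced, [240.0], 480.0))
```

Because only the ratio of recorded distance to recorded line length is used, the units of the recorded measurements do not need to match the units of the digitized coordinates.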
The final two georeferencing approaches involved different versions of digitized lines to represent the township lines: a "Linear Referencing-Euclidean" (LR-E) approach and a "Linear Referencing-Traced" (LR-T) approach. LR-E involved digitizing each township line as a straight line between adjacent township corners; the length of each township line was thus the Euclidean distance between the township corners. Other research has utilized the LR-E approach to digitize PLSR survey lines that were typically 1.21 km (0.75 mi) in length between known reference points (Tulowiecki 2014). The LR-T approach involved digitizing a township line as the combination of all digitized cadastral boundaries and street centerlines that connected two township corners, which presumably coincided with the true township line created by the HLC surveyors. Township lines were evident within civil and cadastral boundaries, and have been observed as such by other HLC researchers (e.g. Wyckoff 1988). Once township lines were digitized in these approaches, linear referencing was applied to georeference the half-mile posts.

Measuring and comparing positional error

We estimated the positional error of the georeferencing approaches by comparing a subset of half-mile post locations derived from the HLC township surveys, to the more accurate locations of the same half-mile posts that were re-recorded in the HLC lot surveys ("Study area and surveys" section). We used the lot surveys, and the methods presented in "Linear referencing approaches" section, to digitize half-mile post locations where they coincided with lot corners (Fig. 2). Because half-mile post locations from the lot surveys could be established using these methods, they were presumed to possess higher positional accuracy. Due to surveyor inconsistency in using half-mile posts as lot corners, a systematic collection of half-mile posts was not achieved.

We quantified positional error within a GIS as the distance between 80 half-mile posts that were located using lot surveys, and the corresponding half-mile posts that were located using each of the three georeferencing methods. To provide another estimate of positional error, the locations of township corners georeferenced using the surveyor geometry approach were compared to the locations that were georeferenced using property descriptions in "Linear referencing approaches" section.
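The error measure described above reduces to a point-to-point distance for each half-mile post, and three such error series (one per georeferencing approach) form the repeated measures compared by the Friedman test used in this study. The Python sketch below is illustrative only: the function names and the example values are hypothetical, coordinates are assumed to be planar and in metres, the Friedman statistic omits the correction for ties, and the p value relies on the chi-square approximation with 2 degrees of freedom, which applies only when exactly three approaches are compared.

```python
import math


def positional_errors(reference_pts, estimated_pts):
    """Distance between each reference location and its estimated counterpart."""
    return [math.dist(r, e) for r, e in zip(reference_pts, estimated_pts)]


def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def friedman(*samples):
    """Friedman statistic for k related samples (no tie correction).

    Returns (Q, p); p is computed only for k == 3, where the chi-square
    survival function with 2 degrees of freedom is exp(-Q / 2).
    """
    k, n = len(samples), len(samples[0])
    rank_sums = [0.0] * k
    for block in zip(*samples):  # one block = one half-mile post
        for j, r in enumerate(average_ranks(list(block))):
            rank_sums[j] += r
    q = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    p = math.exp(-q / 2.0) if k == 3 else None
    return q, p


# Hypothetical errors (m) at four posts for three georeferencing approaches.
approach_a = [10.0, 12.0, 8.0, 15.0]
approach_b = [14.0, 20.0, 9.0, 30.0]
approach_c = [120.0, 95.0, 60.0, 200.0]
print(friedman(approach_a, approach_b, approach_c))
```

The pairwise Wilcoxon signed-rank follow-up and the Bonferroni adjustment are not shown here.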
We used the Friedman test for repeated measures, to assess whether there were significant differences in positional error among georeferencing approaches. If the Friedman test revealed significant differences, then Wilcoxon signed-rank tests explored differences in positional error between georeferencing approaches. A Bonferroni correction was applied to adjust the p values, because multiple comparisons were performed between georeferencing approaches.

Comparing SDM performance

We utilized bearing-tree data to develop and compare SDMs from each georeferencing approach. SDMs were implemented using the VisTrails Software for Assisted Habitat Modeling (VisTrails SAHM; Morisette et al. 2013), and were developed at a grid cell resolution of 100 m. Changes in model performance and variable selection were then plotted against species abundance, niche position, and niche breadth for each species (Hirzel et al. 2002). The correlations between the prediction surfaces generated by SDMs, associated with the different georeferencing approaches, were also calculated and compared.

Vegetation data

The 12 most abundant of the 38 taxa in the bearing-tree data (Wang 2007) were modeled for this study, all of which were distinguishable to the species level, except for elm (Ulmus spp.). For each georeferencing approach, the surveyor-recorded bearings and distances (μ = 5 m) from survey posts (i.e. township half-mile or township corner posts) were used to georeference each bearing-tree. This bearing-tree sample (n = 8394) represented the total after 398 bearing-trees along the "New York Reservation" were excluded from the original dataset, because of concerns that the high density of bearing-trees in this area would have unduly influenced SDMs (Black & Abrams 2001). A species was designated as "present" if at least one bearing-tree of that species was located within a 100 m grid cell, and "absent" if bearing-trees not of that species were located within a 100 m grid cell.
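The two steps described above, offsetting each bearing-tree from its post and then scoring presence or absence by grid cell, might be sketched as follows. The code assumes planar coordinates in metres, bearings already converted to degrees clockwise from north, and a grid anchored at the coordinate origin; these choices and all function names are illustrative assumptions, not details reported for the study.

```python
import math
from collections import defaultdict

CELL_SIZE_M = 100.0  # grid cell resolution stated in the text


def tree_location(post_xy, bearing_deg, distance_m):
    """Offset a post by a bearing (degrees clockwise from north) and a distance."""
    theta = math.radians(bearing_deg)
    return (post_xy[0] + distance_m * math.sin(theta),
            post_xy[1] + distance_m * math.cos(theta))


def cell_index(xy, cell_size=CELL_SIZE_M):
    """Column and row of the grid cell containing a point."""
    return (math.floor(xy[0] / cell_size), math.floor(xy[1] / cell_size))


def presence_absence(trees, species):
    """Map each occupied cell to 1 (species recorded) or 0 (only other taxa recorded)."""
    cells = defaultdict(set)
    for xy, taxon in trees:
        cells[cell_index(xy)].add(taxon)
    return {cell: int(species in taxa) for cell, taxa in cells.items()}


# Hypothetical post placed 10 m apart by two georeferencing approaches.
for post in [(95.0, 50.0), (105.0, 50.0)]:
    trees = [
        (tree_location(post, 0.0, 3.0), "beech"),
        (tree_location(post, 180.0, 4.0), "sugar maple"),
    ]
    print(post, presence_absence(trees, "beech"))
```

In this toy case the 10 m shift moves both trees into a neighbouring cell, which is the mechanism by which different georeferencing approaches can yield different presence and absence totals.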
Because georeferencing approaches produced different post locations, SDMs utilized slightly different presence and absence totals per species for each georeferencing approach, when presence or absence was determined by grid cell (Table 1). With this approach to determining presence and absence, we interpreted SDM prediction outputs as the probability that a given tree species would be selected as a bearing-tree using the HLC surveying protocols.

Table 1 The sample sizes of bearing-trees for species distribution models (SDMs) in this study

| Taxon | Taxonomic equivalent | LR-T n_presence | LR-T n_absence | LR-E n_presence | LR-E n_absence | Surveyor geometry n_presence | Surveyor geometry n_absence |
|---|---|---|---|---|---|---|---|
| Beech | Fagus grandifolia Ehrh. | 2,008 | 1,633 | 2,020 | 1,656 | 2,013 | 1,603 |
| Sugar maple | Acer saccharum Marsh. | 1,339 | 2,302 | 1,333 | 2,343 | 1,343 | 2,273 |
| Hemlock | Tsuga canadensis (L.) Carr. | 509 | 3,132 | 510 | 3,166 | 506 | 3,110 |
| Basswood | Tilia americana L. | 377 | 3,264 | 378 | 3,298 | 374 | 3,242 |
| Elm | Ulmus spp. | 309 | 3,332 | 312 | 3,364 | 308 | 3,308 |
| Black ash | Fraxinus nigra Marsh. | 179 | 3,462 | 182 | 3,494 | 178 | 3,438 |
| Yellow birch | Betula alleghaniensis Britton | 186 | 3,455 | 186 | 3,490 | 183 | 3,433 |
| Red maple | Acer rubrum (L.) | 167 | 3,474 | 166 | 3,510 | 165 | 3,451 |
| White oak | Quercus alba L. | 146 | 3,495 | 146 | 3,530 | 136 | 3,480 |
| White pine | Pinus strobus L. | 142 | 3,499 | 141 | 3,535 | 140 | 3,476 |
| White ash | Fraxinus americana L. | 159 | 3,482 | 158 | 3,518 | 156 | 3,460 |
| Chestnut | Castanea dentata (Marsh.) Borkh. | 88 | 3,553 | 88 | 3,588 | 84 | 3,532 |

Presence or absence of a taxon (all discernible as species, except for the elm genus, Ulmus spp.) was determined for each grid cell that contained at least one bearing-tree. Determinations of the taxonomic equivalents, based on the surveyor notes of the HLC township survey, are the same as those shown in Wang (2007). In keeping with Wang (2007), "ash," "oak," "birch," and "pine" were assigned to "black ash," "white oak," "yellow birch," and "white pine," respectively. LR-T: linear referencing-traced; LR-E: linear referencing-Euclidean

SDM algorithms

Three SDM algorithms were used to model each of the 12 species, all of which make use of presence-absence species data: Generalized Linear Models (GLM), Boosted Regression Trees (BRT), and Multivariate Adaptive Regression Splines (MARS). Model variables were selected in the GLM algorithm using a stepwise procedure based upon the Akaike information criterion. For the BRT algorithm, the selection of variables and optimal model parameters was performed using cross-validation techniques proposed by Elith et al. (2008). Model fitting and variable selection for the MARS algorithm were performed using techniques presented by Leathwick et al. (2006). All other default options in VisTrails SAHM were used.

Eleven environmental variables were included for possible selection in SDM algorithms (Table 2), to represent various environmental and climatic conditions. All variables were mean-aggregated to a grid cell resolution of 100 m from initial raster layers of
finer resolutions, with the two following exceptions. Temperature and precipitation variables (PRISM Climate Group 2013) were resampled to 100 m from an approximately 800 m grid cell resolution, using cubic convolution resampling (Keys 1981). Soil variables (Natural Resources Conservation Service 2013) were initially downloaded as vector polygons that were originally digitized using 1:12,000 to 1:63,360 scale maps (Natural Resources Conservation Service 2014); these vector polygons were first converted to 10 m grid cell resolution, and then mean-aggregated to 100 m.

Table 2 The environmental variables considered in species distribution models (SDMs)

| Predictor variable | Key sources for data and methods |
|---|---|
| Actual evapotranspiration over potential evapotranspiration (AET/PET), May–September | Dyer (2009, 2013); Esri (2012); PRISM Climate Group (2013); USGS (2013) |
| Average temperature, January | PRISM Climate Group (2013) |
| Average temperature, May–September | PRISM Climate Group (2013) |
| Compound topographic index (CTI) | USGS (2013); Beven and Kirkby (1979) |
| Potential evapotranspiration (PET), May–September | Dyer (2009, 2013); Esri (2012); PRISM Climate Group (2013); USGS (2013) |
| Precipitation, May–September | PRISM Climate Group (2013) |
| Soil drainage class, ranked (1 = excessively well-drained, 7 = very poorly drained) | SSURGO (Natural Resources Conservation Service 2013) |
| Soil percent clay | SSURGO (Natural Resources Conservation Service 2013) |
| Soil percent sand | SSURGO (Natural Resources Conservation Service 2013) |
| Soil pH | SSURGO (Natural Resources Conservation Service 2013) |
| Solar radiation, May–September | USGS (2013); Esri (2012) |

Comparing SDMs among georeferencing approaches

We evaluated the predictive performance of SDMs using the area under the receiver operating characteristic curve statistic (AUC). AUC is a threshold-independent measure of model performance, which measures a model’s ability to discriminate presence locations from absence locations (Fawcett 2006). AUC ranges from 0 to 1, with 1.0 indicating perfect discriminating ability and 0.5 indicating that the model is equivalent to a random guess. Following other studies that investigated positional error and SDM performance (Osborne & Leitao 2009; Segurado & Araujo 2004), the training data were used to calculate AUC for each SDM.

We performed a one-way analysis of variance (ANOVA) for repeated measures, to test whether AUC values differed among SDMs associated with the three georeferencing approaches. For some ANOVA tests, the p values were adjusted using the Greenhouse-Geisser correction (Girden 1992), due to the violation of the assumption of sphericity. Differences were tested among georeferencing approaches when including all SDM algorithms together (12 species × 3 SDM algorithms = 36 SDMs per georeferencing method), and among georeferencing approaches for each of the three SDM algorithms in isolation (12 species = 12 SDMs per georeferencing method). If ANOVA tests revealed significant differences, we used paired t-tests to explore the differences between the AUC values of the georeferencing approaches.
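For reference, AUC as defined above equals the proportion of presence-absence pairs in which the presence location receives the higher predicted value, with ties counted as one half. The short Python sketch below computes it directly from that definition. It is an illustration with hypothetical names and example values, not the routine used by VisTrails SAHM.

```python
def auc(scores_presence, scores_absence):
    """Share of presence-absence pairs ranked correctly (ties count one half)."""
    wins = 0.0
    for p in scores_presence:
        for a in scores_absence:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    return wins / (len(scores_presence) * len(scores_absence))


# Hypothetical predicted probabilities at presence and absence cells.
presence_scores = [0.9, 0.4]
absence_scores = [0.6, 0.2]
print(auc(presence_scores, absence_scores))
```

The pairwise loop is quadratic in the number of cells, which is acceptable for a few thousand occupied cells; a rank-based formulation would scale better for larger samples.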
A Bonferroni correction was applied to adjust p values for multiple comparisons.

Because all SDM algorithms utilized some variable selection procedure, we also compared SDMs (i.e. of the same species and algorithm) trained from different georeferencing approaches, to examine to what degree they selected the same variables. This comparison allowed another means of assessing differences in SDMs. To compare variables selected, we used Cohen’s kappa statistic to calculate the agreement in "selected" and "unselected" variables between SDMs of the same species and algorithm, but associated with different georeferencing approaches. Cohen’s kappa ranges from -1 to 1, with 1 indicating total agreement, and 0 indicating agreement that is equal to chance (Viera & Garrett 2005).

Comparing changes in SDM performance to species characteristics

Changes in SDM performance and variable selection were explored, in relation to three characteristics of the modeled species: abundance, niche position, and niche breadth. Specifically, changes in SDM performance and variable selection were examined between the two most different (i.e. LR-T and surveyor geometry) and the two most similar (i.e. LR-T and LR-E) georeferencing approaches, with respect to positional error. The abundance of each tree species was calculated using the bearing-tree totals from the HLC township survey. The environmental niche indices of "marginality" and "tolerance" for each species were calculated to represent niche position and niche breadth, respectively, using the Biomapper 4.0 software (Hirzel et al. 2002). Marginality and tolerance have been used to understand the relationships between SDM performance and the habitat characteristics of species (Hernandez et al. 2006; Segurado & Araujo 2004).

Marginality is an index of a species’ niche position (Hirzel et al. 2002), ranging from 0 (preference for more average environmental conditions) to around 1 (preference for more marginal or extreme conditions). Tolerance is an index representing niche breadth, also ranging from 0 (a habitat-specialist) to 1 (a habitat-generalist). Both marginality and tolerance are calculated using species presence records, and a set of environmental variables. To calculate marginality and tolerance for each species, the 11 environmental
variables used in SDMs (Table 2) were inputted into the Biomapper 4.0 software (Hirzel et al. 2002). Because the LR-T approach produced the most accurate species locations (“Positional error of the georeferencing approaches” section), species presence locations from this approach were used to calculate marginality and tolerance. Changes in the AUC values of SDMs were plotted against abundance, marginality, and tolerance, to examine whether SDMs exhibited greater differences in performance between georeferencing approaches, depending upon species characteristics. Agreement in variable selection, calculated using Cohen’s kappa, was also plotted against each species characteristic. Linear regression was used to assess whether correlations existed between the species characteristics and change in AUC values, as well as between the species characteristics and kappa values.

Comparing the prediction surfaces of SDMs

To further investigate the differences in SDMs associated with the different georeferencing approaches, the prediction surfaces of SDMs (“Vegetation data” section) were compared, using Pearson correlation coefficients. Previous research has utilized Pearson correlation coefficients to assess SDM performance in a variety of ways (Franklin & Miller 2009), such as to compare SDM prediction surfaces with the surfaces that represented the “actual” distribution of simulated species (Elith & Graham 2009). In this study, three steps were performed to compare the prediction surfaces. First, a sample of the predicted probability values that were generated from all SDMs were collected at the township post locations, where all predictor variables were available (i.e. township corner and township half-mile posts; n = 3534). Second, the Pearson correlation coefficient between predicted probability values was calculated for each pair of SDMs.
For example, the correlation in predicted probability values for BRT models of black ash was calculated, between SDMs associated with LR-T versus surveyor geometry approaches. Third, ANOVA and paired t-tests were utilized, in order to investigate differences in correlations between prediction surfaces. These tests provided an understanding, for instance, of whether prediction surfaces associated with the surveyor geometry approach were significantly less correlated with those of the LR-T or LR-E approaches, in comparison to correlations between the prediction surfaces of the LR-T versus LR-E approaches. In the absence of meaningful independent evaluation data of high positional accuracy, this analysis provided some assessment of how different approaches towards georeferencing training data might lead to SDMs with different prediction surfaces.

Fig. 4 The positional error resulting from the Linear Referencing-Traced (LR-T), Linear Referencing-Euclidean (LR-E), and surveyor geometry georeferencing approaches. Whiskers indicate the most extreme values in positional error. Locations of half-mile posts (n = 80) located within Holland Land Company (HLC) lot surveys were compared to corresponding posts within each georeferencing approach. The final box shows the positional error of township corners (n = 333), resulting from the surveyor geometry approach. Median positional error values are labeled next to each box; maximum positional error values extending above the y-axis range values are labeled
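The two agreement measures used in the methods above, Cohen's kappa for variable selection and the Pearson correlation coefficient for prediction surfaces, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `cohens_kappa` and `pearson_r` are hypothetical, and the selection vectors and predicted probabilities are made-up values chosen only to show the calculations (the vector length of 11 mirrors the 11 environmental variables).

```python
import math


def cohens_kappa(selected_a, selected_b):
    """Cohen's kappa for two equal-length 0/1 vectors (1 = variable selected)."""
    if len(selected_a) != len(selected_b) or not selected_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(selected_a)
    # Observed agreement: share of variables labeled identically by both SDMs.
    p_observed = sum(a == b for a, b in zip(selected_a, selected_b)) / n
    # Chance agreement, from each SDM's marginal selection rate.
    rate_a = sum(selected_a) / n
    rate_b = sum(selected_b) / n
    p_chance = rate_a * rate_b + (1 - rate_a) * (1 - rate_b)
    if p_chance == 1.0:
        # Undefined when both vectors are constant and identical.
        return float("nan")
    return (p_observed - p_chance) / (1 - p_chance)


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dev_x = [v - mean_x for v in x]
    dev_y = [v - mean_y for v in y]
    denominator = math.sqrt(sum(d * d for d in dev_x) * sum(d * d for d in dev_y))
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sum(a * b for a, b in zip(dev_x, dev_y)) / denominator


# Hypothetical example: which of 11 variables each SDM selected (1 = selected).
lrt_selected = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
geometry_selected = [1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]
print("kappa:", cohens_kappa(lrt_selected, geometry_selected))

# Hypothetical predicted probabilities sampled at five post locations.
lrt_predictions = [0.10, 0.35, 0.62, 0.80, 0.44]
geometry_predictions = [0.15, 0.30, 0.55, 0.70, 0.50]
print("Pearson r:", pearson_r(lrt_predictions, geometry_predictions))
```

In practice, one kappa value and one correlation coefficient would be computed for each pairing of georeferencing approaches within each species and algorithm, and those values would then feed the ANOVA and paired t-tests.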
Results

Positional error of the georeferencing approaches

Significant differences in the positional error of vegetation data were revealed, particularly between the surveyor geometry approach, and the two linear referencing approaches (Figs. 3, 4). Using the Friedman test, the three georeferencing approaches differed significantly with regards to positional error (n = 80, p < 0.001; Fig. 4). After the Bonferroni correction, Wilcoxon tests revealed significant differences in positional error between all pairs of georeferencing approaches (p < 0.001). When using the half-mile posts (n = 80) to quantify positional error, median positional error was modest for the LR-T (5.88 m) and LR-E (15.64 m) approaches, but greater for the surveyor geometry approach (407.99 m). When positional error was quantified using the confirmed township corner locations (n = 333, “Linear referencing approaches” section), the median positional error of the surveyor geometry approach was 339.47 m, with a maximum positional error of 897.80 m.

Fig. 5 Area under the receiver operating characteristic curve (AUC) values associated with Linear Referencing-Traced (LR-T; n = 36) or Linear Referencing-Euclidean (LR-E; n = 36) approaches, versus AUC values associated with the surveyor geometry approach. Each point represents the AUC values of models that share the same modeled species and model algorithm, but differ only in the manner in which the bearing-trees were georeferenced. Any point below the 1:1 line indicates a species distribution model (SDM) that performed more poorly when using surveyor geometry to georeference bearing-trees

Comparing SDMs

Overall comparisons among SDM performance

The LR-T and LR-E approaches produced SDMs with higher AUC values (Fig. 5). Considering all model algorithms together, repeated-measures ANOVA tests revealed significant differences among AUC values (Table 3), for models developed from different georeferencing approaches (n = 36, p < 0.001). LR-T (median = 0.77) and LR-E (median = 0.76) produced models with slightly higher AUC values, in comparison to the surveyor geometry approach (median = 0.75). Though LR-T and LR-E approaches generally yielded models with higher AUC values, the differences in AUC values between SDMs (paired by algorithm and species) were also modest, when comparing LR-T to surveyor geometry (median difference = 0.01) and LR-E to surveyor geometry (median difference = 0.02). When performing repeated-measures ANOVA tests upon the AUC values for each algorithm in isolation, significant differences in AUC were still apparent for two algorithms (GLM, n = 12, p = 0.016; and MARS, n = 12, p = 0.006), but non-significant for the other (BRT, n = 12, p = 0.407). When examining all model algorithms together, paired t-tests revealed significant improvements in the AUC values of models using LR-T versus surveyor
Table 3 ANOVA tests and t-tests, exploring differences in area under the receiver operating characteristic curve (AUC) measures for species distribution models (SDMs) associated with different georeferencing approaches n
p values from ANOVA
p values from paired t-tests LR-T versus LR-E
LR-T versus surveyor geometry
LR-E versus surveyor geometry
All SDMs (BRT, GLM, MARS)
36
<0.001
0.198
0.005